Your text
Anything you'd describe a photo as — a scene, a mood, a colour palette, a verb. The query goes in raw, no preprocessing beyond tokenization.
A natural-language search over the Unsplash Lite dataset. SigLIP 2 maps your query into the same 768-dim space as the images, then Vespa returns the closest neighbours; no captions are required.
The text tower of SigLIP 2 base patch16/224 maps your query to a single 768-dim vector — the same space the images live in. We L2-normalise so cosine and dot-product agree.
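Why normalising matters can be shown in a few lines. This is a minimal sketch in plain Python, using toy 3-dim vectors in place of the real 768-dim embeddings; the helper names are illustrative and not taken from the demo's code.

```python
import math

def l2_normalise(vec):
    """Scale vec to unit length; the zero vector has no direction, so reject it."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for a text embedding and an image embedding.
text_vec = [3.0, 4.0, 0.0]
image_vec = [1.0, 2.0, 2.0]

t = l2_normalise(text_vec)
i = l2_normalise(image_vec)

# On unit vectors the plain dot product equals the cosine of the raw vectors.
assert math.isclose(dot(t, i), cosine(text_vec, image_vec))
```

Once every stored vector has unit length, the index only needs a dot product per comparison, with no per-query division by norms.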
Vespa walks an HNSW graph over the image embeddings using the prenormalized-angular distance metric. Latency is sub-second at 25k photos, and the same design scales to billions of vectors.
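A request for this kind of search might look roughly like the sketch below, which builds a JSON body for Vespa's query API around the `nearestNeighbor` operator. The field name `embedding`, the rank profile `semantic`, and the query tensor name `q` are assumptions for illustration; they would have to match whatever the application's schema actually declares.

```python
import json

def build_nn_query(query_vector, target_hits=10,
                   field="embedding", rank_profile="semantic"):
    """Build a Vespa query body for approximate nearest-neighbour search.

    `field`, `rank_profile` and the tensor name `q` are hypothetical names,
    not the demo's real schema.
    """
    yql = (
        "select * from sources * where "
        f"{{targetHits:{target_hits}}}nearestNeighbor({field}, q)"
    )
    return {
        "yql": yql,
        "input.query(q)": list(query_vector),  # the L2-normalised text embedding
        "ranking": rank_profile,
        "hits": target_hits,
    }

# Toy 3-dim vector; a real query would carry 768 floats.
body = build_nn_query([0.6, 0.8, 0.0], target_hits=5)
print(json.dumps(body, indent=2))
```

The `targetHits` annotation tells Vespa how many neighbours the graph search should aim to expose to ranking, while `hits` caps how many results are returned.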
The closest image embeddings come back, ranked by similarity to the text query. Switch to lexical or hybrid mode to compare the results against BM25 over the photographer-written descriptions.
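For intuition, the ranking can be reproduced exactly on a tiny collection by scoring every image against the query. This brute-force sketch is not what Vespa does internally (HNSW approximates it without scanning everything); the photo ids and vectors are made up, and all vectors are assumed to be L2-normalised already.

```python
import heapq

def top_k(query, images, k=3):
    """Return the k (photo_id, score) pairs with the highest dot product.

    Assumes the query and every image vector have unit length, so the
    dot product is the cosine similarity.
    """
    scored = (
        (photo_id, sum(q * x for q, x in zip(query, vec)))
        for photo_id, vec in images.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical unit-length image embeddings.
images = {
    "beach": [1.0, 0.0, 0.0],
    "forest": [0.0, 1.0, 0.0],
    "dunes": [0.8, 0.6, 0.0],
}
query = [0.6, 0.8, 0.0]

results = top_k(query, images, k=2)
for photo_id, score in results:
    print(photo_id, round(score, 2))
```

An exact scan like this costs one dot product per stored image, which is the cost the graph index avoids as the collection grows.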