Vespa · Text-to-image search

Search by vibe,
not by tag.

Natural-language search over the Unsplash Lite dataset. SigLIP 2 maps your query into the same 768-dim space as the images, then Vespa returns the closest neighbours, no caption required.

SigLIP 2 base · 768d · float32 · Vespa HNSW · cosine similarity



Under the hood: how a text-to-image search runs

Your text

Anything you'd describe a photo as — a scene, a mood, a colour palette, a verb. The query goes in raw, no preprocessing beyond tokenization.

processor(text="moody forest at sunrise", padding="max_length")

SigLIP 2 text encoder

The text tower of SigLIP 2 base patch16/224 maps your query to a single 768-dim vector — the same space the images live in. We L2-normalise so cosine and dot-product agree.

model.text_model(**batch).pooler_output
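Why L2-normalising makes cosine and dot-product agree can be checked on toy vectors. The sketch below is illustrative only: it uses plain Python lists and 3-dim stand-in vectors rather than real 768-dim SigLIP 2 outputs, and the helper names (`l2_normalize`, `dot`, `cosine`) are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for a text embedding and an image embedding (assumed values).
text_vec = [3.0, 4.0, 0.0]
image_vec = [1.0, 2.0, 2.0]

text_unit = l2_normalize(text_vec)
image_unit = l2_normalize(image_vec)

raw_cosine = cosine(text_vec, image_vec)   # cosine on the raw vectors
unit_dot = dot(text_unit, image_unit)      # dot product on the unit vectors

print(raw_cosine, unit_dot)
```

Once both sides are unit length, the index can score with a plain dot product and still rank by cosine similarity.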

Vespa nearest-neighbor

Vespa walks an HNSW graph over the image embeddings using prenormalized-angular distance. Latency is sub-second at 25k photos, and the same query shape scales to billions of vectors.

{targetHits:200}nearestNeighbor(embedding, q)
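As a rough sketch, the nearest-neighbor clause above could be wrapped into a JSON body for Vespa's query API like this. The `select * from sources *` prefix, the `hits` value, and the helper name `build_nn_query` are assumptions for illustration; the field name `embedding`, the query tensor `q`, and the `semantic` rank profile are taken from this page. The schema is assumed to declare `query(q)` as an input of that rank profile.

```python
import json

def build_nn_query(query_vector, target_hits=200, hits=10, rank_profile="semantic"):
    """Build a Vespa query body for an approximate nearest-neighbor search."""
    # Doubled braces in the f-string emit the literal { } of the YQL annotation.
    yql = (
        "select * from sources * where "
        f"{{targetHits:{target_hits}}}nearestNeighbor(embedding, q)"
    )
    return {
        "yql": yql,
        # The query tensor is passed under the name the YQL refers to.
        "input.query(q)": list(query_vector),
        "ranking": rank_profile,
        "hits": hits,
    }

# A 3-dim placeholder; the real query vector would have 768 values.
body = build_nn_query([0.1, 0.2, 0.3])
payload = json.dumps(body)
print(payload)
```

`targetHits` controls how many candidates the HNSW search exposes to ranking, while `hits` controls how many results come back to the caller.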

Top-K photos

The closest image embeddings come back, ranked by similarity to the text query. Switch to lexical or hybrid to see how it compares to BM25 on the photographer-written descriptions.

rank-profile semantic · first-phase
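For intuition, the ranking step can be mimicked with an exact brute-force search: score every image by its dot product with the query and keep the best K. This is a reference sketch under assumptions, not the rank expression used by the `semantic` profile (which this page does not show). The photo ids and 2-dim unit vectors are made up, and HNSW only approximates this result.

```python
import heapq

def top_k(query, images, k=3):
    """Return the k (score, photo_id) pairs with the highest dot product."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), photo_id)
        for photo_id, vec in images.items()
    )
    return heapq.nlargest(k, scored)

# Hypothetical unit-length image embeddings keyed by photo id.
images = {
    "forest": [1.0, 0.0],
    "beach": [0.0, 1.0],
    "meadow": [0.6, 0.8],
}

# Hypothetical unit-length text-query embedding.
query = [0.8, 0.6]

results = top_k(query, images, k=2)
for score, photo_id in results:
    print(photo_id, round(score, 2))
```

Exact search costs one dot product per image, which is why an approximate index such as HNSW takes over as the collection grows.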
