Text-to-image search: when “what it looks like” beats “what it says”

25,000 Unsplash photos, one SigLIP 2 model, three rank profiles. A practical look at why semantic image search picks up where keyword search runs out of vocabulary — and how Vespa puts both behind one query.

TL;DR - why this matters

Most photo libraries have great images and patchy text. If you ship BM25 over photographer-written tags, half your catalogue is dark: the photo of a moody forest at sunrise can't be retrieved by anyone typing “moody”, “sunrise”, or “forest” unless the photographer happened to mention those exact words. Semantic search bypasses the text entirely — it asks the model “does this image look like the description?”

We run all three rank profiles side-by-side on the same corpus so the difference is literally a colour swap on the page:

Semantic (SigLIP 2 embeddings) — the model knows what photos look like.
Lexical (BM25 over text fields) — the classic baseline.
Hybrid (Vespa global-phase blend with a live α slider) — the union, reranked.

All on a single Vespa app, one schema, one rank-profile family. The model is google/siglip2-base-patch16-224; the index is standard HNSW over a 768-d float vector.

The setup

The page shows two columns running the same query through different rank profiles:

Semantic (SigLIP 2) — encode the query with SigLIP 2's text tower, find the closest image embeddings. No keyword overlap required.
Lexical (BM25) — match query terms against the photographer's description and the Unsplash AI-generated alt text. Classic information retrieval.
Hybrid — retrieve from both, then rerank in Vespa's global-phase with normalize_linear. The α slider blends the two scores live.

Behind it: the 25k photo Unsplash Lite research dataset, embedded once with google/siglip2-base-patch16-224 (768-d, L2-normalised), indexed in Vespa with HNSW + BM25 on the same documents.

Why not just BM25?

BM25 is the keyword search every team already has. It's fast, well-understood, and when your data has rich captions it's often enough. The catch for photo libraries: photographer-written descriptions are mostly absent or terse, and the auto-generated alt text is good at objects (“person, tree, snow”) but bad at moods (“moody”, “cozy”), styles (“vintage”, “minimalist”), or compositions (“close-up”, “wide shot”, “rule of thirds”).
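To make the vocabulary gap concrete, here is a minimal sketch of textbook Okapi BM25 in plain Python. It is not Vespa's implementation, and the three alt texts are invented for illustration. The point is only that a query made of mood words scores zero against text that lists objects.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised doc against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # a term absent from the doc contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Invented alt text for three photos; the first is the "moody forest at sunrise".
alt_texts = [
    "trees covered in fog with sunlight",
    "person walking on snow near tree",
    "cup of coffee on wooden table",
]
docs = [text.split() for text in alt_texts]

moody = bm25_scores("moody forest sunrise".split(), docs)
literal = bm25_scores("fog trees".split(), docs)
```

Here `moody` is all zeros because none of the query words appear in any alt text, while `literal` gives the first photo a positive score. The photo is only findable if the searcher guesses the vocabulary of the alt text.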

Below: rank of the photo a human evaluator would pick as the top result, by mode. Lexical-only buries half the queries past position 10; semantic puts almost all of them at #1 or #2.

Rank of the “correct” photo for each query (1 = top hit, >20 = missed the top-20).

Illustrative rows — these come from manual relevance judgements over the demo's seed queries. The eval harness in eval/ will replace them with numbers from the dataset's conversions.csv (real query → click events) once that lands.

Why SigLIP 2 over plain CLIP

CLIP (OpenAI, 2021) is the model most teams reach for when they want a shared text-image embedding space. It works. SigLIP 2 (Google, 2025) instead trains with the sigmoid loss introduced by the original SigLIP, in place of CLIP's softmax contrastive loss. The sigmoid loss scores every image-text pair independently, so it needs no normalisation across the whole batch, it is reported to be more tolerant of noisy pairs, and each score can be read as a per-pair match probability rather than only as a ranking relative to the rest of the batch. In practice for retrieval that means:

Better reported top-K accuracy on standard benchmarks (COCO, Flickr30k) at the same parameter count.
Multilingual out of the box: the same model accepts English, Spanish, or Japanese queries with no separate encoder.
Apache 2.0 license. OpenAI's CLIP code is MIT-licensed, but its model card scopes the released weights to research use.
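The difference between the two losses is easier to see side by side. The sketch below is a toy, pure-Python rendering of both objectives over a small batch of already-normalised embeddings. The temperature and bias values are illustrative, and the function names are ours rather than taken from any training codebase.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_contrastive_loss(img, txt, temperature=10.0):
    """CLIP-style loss: each row and column is normalised over the whole batch."""
    n = len(img)
    logits = [[temperature * dot(img[i], txt[j]) for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        row = logits[i]                          # image i against every text
        col = [logits[j][i] for j in range(n)]   # text i against every image
        loss -= row[i] - math.log(sum(math.exp(z) for z in row))
        loss -= col[i] - math.log(sum(math.exp(z) for z in col))
    return loss / (2 * n)

def sigmoid_pairwise_loss(img, txt, temperature=10.0, bias=-10.0):
    """SigLIP-style loss: every (image, text) pair is an independent binary decision."""
    n = len(img)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0      # +1 for matching pairs, -1 otherwise
            z = label * (temperature * dot(img[i], txt[j]) + bias)
            loss += math.log1p(math.exp(-z))     # equals -log(sigmoid(z))
    return loss / n

# Two unit vectors; texts are either aligned with the images or swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
```

Both losses are lower for `aligned_texts` than for `swapped_texts`. The structural difference is that the softmax version needs the full row and column of logits before it can score any single pair, while the sigmoid version sums independent per-pair terms.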

Drop-in shape: 768-d output, L2-normalised before indexing, cosine-similarity-friendly. Vespa's schema doesn't care which encoder you ship as long as the dimension matches; the model choice lives entirely in the encoder.

field embedding type tensor<float>(x[768]) {
    indexing: attribute | index
    attribute { distance-metric: prenormalized-angular }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}
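Why L2-normalising at index time pairs with prenormalized-angular: once both vectors have unit length, the dot product is the cosine, so the index can skip the norm computation. The sketch below checks that in plain Python. The closeness formula in it (distance = 1 − dot, closeness = 1 / (1 + distance)) is our reading of Vespa's documented behaviour and should be treated as an assumption, not a copy of Vespa's code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity for vectors of arbitrary length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def closeness_prenormalized(u, v):
    """Assumed mapping: distance = 1 - dot(u, v), closeness = 1 / (1 + distance)."""
    distance = 1.0 - dot(u, v)
    return 1.0 / (1.0 + distance)

raw_query = [3.0, 4.0, 0.0]
raw_image = [1.0, 2.0, 2.0]
q, d = l2_normalise(raw_query), l2_normalise(raw_image)

# On unit vectors the plain dot product already equals the cosine.
same = abs(dot(q, d) - cosine(raw_query, raw_image)) < 1e-12
```

Under that assumed mapping, identical vectors get closeness 1, orthogonal vectors 0.5, and opposite vectors 1/3, so the semantic score always lands in a narrow positive band.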

The three rank profiles

Three strategies, all in schemas/photo.sd:

profile	retrieval	ranking	role
`semantic`	`nearestNeighbor(embedding, q)`	cosine similarity	SigLIP-only baseline
`lexical`	`userQuery()` (weakAnd)	BM25 over description + ai_description	keyword-only baseline
`hybrid`	both, unioned	α·normalize_linear(semantic) + (1−α)·normalize_linear(lexical)	recommended
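On the client side, picking a profile comes down to changing the query body. The following is a rough sketch of how the three request bodies could be assembled. The helper name `build_query`, the `sources *` clause, and the targetHits value are our assumptions; only the field name `embedding`, the input name `q`, and the profile names come from the schema described here.

```python
import json

def build_query(mode, text, embedding=None, alpha=0.5, hits=20, target_hits=200):
    """Build a Vespa query body for one of the three rank profiles (hypothetical helper)."""
    nn = f"({{targetHits:{target_hits}}}nearestNeighbor(embedding, q))"
    where = {
        "semantic": nn,
        "lexical": "userQuery()",
        "hybrid": f"{nn} or userQuery()",   # union of both candidate sets
    }[mode]
    body = {
        "yql": f"select * from sources * where {where}",
        "ranking": mode,                     # rank profile name
        "hits": hits,
    }
    if mode != "semantic":
        body["query"] = text                 # consumed by userQuery()
    if mode != "lexical":
        if embedding is None:
            raise ValueError("semantic and hybrid modes need a query embedding")
        body["input.query(q)"] = list(embedding)
    if mode == "hybrid":
        body["input.query(alpha)"] = alpha   # blend weight, tunable per request
    return body

# Placeholder vector; a real one would come from the SigLIP 2 text tower.
body = build_query("hybrid", "moody forest at sunrise", embedding=[0.0] * 768, alpha=0.7)
payload = json.dumps(body)  # ready to POST to the search endpoint
```

Keeping the mode switch in one place like this means the UI toggle only has to pass a string and a number; nothing about the schema or deployment changes between modes.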

Wiring the global-phase blend

The hybrid retrieval unions the nearestNeighbor candidates with the BM25 matches. The two scores live on different scales (cosine similarity ∈ [-1, 1], which closeness() maps into a positive range capped at 1; BM25 ∈ [0, ∞)), so we normalise both to [0, 1] across the rerank window before mixing:

rank-profile hybrid inherits semantic {
    inputs {
        query(q)     tensor<float>(x[768])
        query(alpha) double
    }
    function semantic_score() { expression: closeness(field, embedding) }
    function lexical_score()  { expression: bm25(description) + bm25(ai_description) }

    first-phase  { expression: semantic_score }
    global-phase {
        rerank-count: 200
        expression {
            query(alpha) * normalize_linear(semantic_score)
            + (1 - query(alpha)) * normalize_linear(lexical_score)
        }
    }
}

Two things to call out:

normalize_linear rescales a feature to [0, 1] using min/max across the rerank set. It's the simplest "make these two scores comparable" primitive Vespa offers, and it works without any per-query calibration.
query(alpha) is a query input declared in the same inputs block as the query embedding (here a plain double rather than a tensor), the same mechanism used for query embeddings, user vectors, MRL slice sizes. Push it from the client; no redeploy needed to tune the blend.
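To see what the blend does to an ordering, here is a small Python simulation of min/max normalisation followed by the α mix over a three-document rerank window. The scores and document ids are invented, and returning 0 when every score in the window is equal is our choice for the sketch, not a statement about how Vespa handles that case.

```python
def normalize_linear(scores):
    """Min/max rescale to [0, 1] across the rerank window."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate window; handling is an assumption
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(candidates, alpha):
    """candidates: list of (doc_id, semantic_score, lexical_score) tuples."""
    sem = normalize_linear([c[1] for c in candidates])
    lex = normalize_linear([c[2] for c in candidates])
    blended = [
        (alpha * s + (1 - alpha) * l, c[0])
        for c, s, l in zip(candidates, sem, lex)
    ]
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in blended]

# Invented window: closeness-like semantic scores, BM25-like lexical scores.
window = [
    ("forest_fog", 0.62, 0.0),    # looks right, shares no words with the query
    ("pine_closeup", 0.55, 7.1),  # decent on both signals
    ("city_night", 0.41, 9.4),    # strong keyword match, looks wrong
]

semantic_only = hybrid_rank(window, alpha=1.0)
lexical_only = hybrid_rank(window, alpha=0.0)
balanced = hybrid_rank(window, alpha=0.5)
```

At α = 1 the order follows the semantic score, at α = 0 it follows BM25, and at α = 0.5 the document that does reasonably well on both signals moves to the top. That is the behaviour the slider exposes.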

Storage at 25k photos

The footprint is dominated by the inlined JPEGs: about 3.7 GB once you sum the full-resolution image bytes Vespa serves directly. The first cut of this demo hot-linked images.unsplash.com from the browser, and the result grid felt slow because every thumbnail needed its own CDN round-trip. Backfilling the images into a raw field (full_image on t2i_photo, served through a named with-image document-summary so search responses don't ship 150 KB × hits) made Vespa the image origin, and the latency went away. Everything else is rounding error next to the JPEGs.
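A back-of-the-envelope check of those proportions, using only the figures quoted in this post (25k photos, roughly 150 KB per JPEG, 768 floats per embedding). HNSW graph links and the text index are left out, so treat the vector number as a lower bound.

```python
N_PHOTOS = 25_000
DIMS = 768
BYTES_PER_FLOAT = 4
JPEG_BYTES = 150_000  # ~150 KB per photo, the per-hit figure quoted above

jpeg_total_gb = N_PHOTOS * JPEG_BYTES / 1e9                    # inlined images
float_vectors_mb = N_PHOTOS * DIMS * BYTES_PER_FLOAT / 1e6     # raw embeddings

ratio = (jpeg_total_gb * 1000) / float_vectors_mb              # JPEG MB per vector MB
```

That works out to about 3.75 GB of images against about 77 MB of float vectors, so the images outweigh the embeddings by a factor of roughly 50.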

At 25k photos the whole thing — JPEGs, embeddings, indices — fits comfortably on one t3-sized content node. At 1M+ photos you'd either keep hot-linking the CDN (the original design, cheaper at scale) or pair the inlining with the quantisation trick the RIS demo uses — sign-bit quantise the embedding to a 96-byte binary, use Hamming HNSW for first-phase, full-precision float for rerank. The schema pattern is identical; only the indexing pipeline changes.
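The sign-bit trick mentioned above can be sketched in a few lines of Python: keep one bit per dimension, pack eight dimensions per byte, and compare vectors by counting differing bits. The bit order (most-significant bit first) and the strict `> 0` threshold are assumptions made for this sketch; an actual pipeline would need to match whatever the indexing side does.

```python
def sign_bit_pack(vec):
    """Pack one bit per dimension (1 if the component is > 0) into bytes."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 0x80 >> (i % 8)  # most-significant bit first
    return bytes(out)

def hamming(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# A toy 768-d vector with alternating signs, and its negation.
vec = [0.5 if i % 2 == 0 else -0.5 for i in range(768)]
neg = [-x for x in vec]

packed = sign_bit_pack(vec)
packed_neg = sign_bit_pack(neg)
```

A 768-d float vector (3,072 bytes) shrinks to 96 bytes, a 32× reduction. The coarse Hamming distance is only used to pick candidates; the full-precision floats still decide the final order in the rerank step.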

Methodology

Corpus: 25,000 photos from the Unsplash Lite research dataset, using the photo metadata plus the 755k-row keyword association table.
Model: google/siglip2-base-patch16-224, 768-d output, L2-normalised at index time.
Image fetch at feed time: pulled from the Unsplash CDN at 384px (resized via the CDN's w= param). The model's preprocessing resizes inputs to 224×224, so a 384px source is more than enough.
Image delivery: a separate 1200px JPEG per photo is stored inline on the doc as a raw field; the backend's /images/{doc_id} endpoint reads it via the with-image summary and serves it with a long browser-cache header. Un-backfilled docs fall back to a 302 redirect to the original CDN URL.
Eval (planned): conversions.csv from the Unsplash dataset is a real query → click event log. The harness in eval/evaluate.py will treat each clicked photo as gold-positive for its query and score recall@K + MRR per profile.
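The planned metrics are simple enough to sketch ahead of the harness. The version below assumes one clicked photo per query and takes ranked result lists as plain Python dicts. The real eval/evaluate.py may handle multiple clicks per query or different input formats, so this is an illustration of the metric definitions only, with made-up ids.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of queries whose clicked photo appears in the top k results."""
    hits = sum(1 for query, docs in ranked.items() if gold[query] in docs[:k])
    return hits / len(ranked)

def mrr(ranked, gold):
    """Mean reciprocal rank of the clicked photo; a miss contributes 0."""
    total = 0.0
    for query, docs in ranked.items():
        if gold[query] in docs:
            total += 1.0 / (docs.index(gold[query]) + 1)
    return total / len(ranked)

# Made-up results for one rank profile: query -> ranked photo ids.
ranked = {
    "moody forest": ["p1", "p2", "p3"],
    "cozy cafe": ["p7", "p8", "p9"],
}
# Made-up click log: query -> the photo the user clicked.
gold = {
    "moody forest": "p2",
    "cozy cafe": "p4",   # never retrieved, so it counts as a miss
}

r1 = recall_at_k(ranked, gold, 1)
r3 = recall_at_k(ranked, gold, 3)
score = mrr(ranked, gold)
```

With these toy inputs recall@1 is 0, recall@3 is 0.5, and MRR is 0.25. Running the same functions once per rank profile gives the per-profile comparison the eval is meant to produce.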

Takeaways

Semantic search is BM25's upgrade path, not a replacement. The two answer slightly different questions: BM25 finds documents whose text matches the query; semantic finds documents whose content matches the query intent. Photo libraries are the canonical case where the second matters more.
The schema doesn't pick winners. One photo.sd defines three rank profiles. Switching modes is a query parameter, not a redeploy. The UI's mode toggle and α slider drive ranking entirely; no backend changes required.
Vespa's global-phase + normalize_linear is the cleanest hybrid primitive we know. Per-query min/max normalisation, no learned calibration, no offline tuning. Mix two unrelated signals in one expression.
SigLIP 2 is a better default than OpenAI CLIP in 2026 for new builds: same shape, better quality, friendlier license, multilingual out of the box. If you're prototyping a text-to-image search today, start here.

Vespa Cloud · SigLIP 2 · Unsplash Lite · Dataset on GitHub