Text-to-image search: when “what it looks like” beats “what it says”
25,000 Unsplash photos, one SigLIP 2 model, three rank profiles. A practical look at why semantic image search picks up where keyword search runs out of vocabulary — and how Vespa puts both behind one query.
The setup
The page shows two columns running the same query through different rank profiles:
- Semantic (SigLIP 2): encode the query with SigLIP 2's text tower, find the closest image embeddings. No keyword overlap required.
- Lexical (BM25): match query terms against the photographer's description and the Unsplash AI-generated alt text. Classic information retrieval.
- Hybrid: retrieve from both, then rerank in Vespa's global-phase with normalize_linear. The α slider blends the two scores live.
Behind it: the 25k photo Unsplash Lite research dataset, embedded once with google/siglip2-base-patch16-224 (768-d, L2-normalised), indexed in Vespa with HNSW + BM25 on the same documents.
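To make the semantic column concrete, here is a minimal brute-force sketch of what nearest-neighbour retrieval over L2-normalised embeddings computes. The 3-d vectors and photo ids are made up for illustration (the real embeddings are 768-d), and Vespa answers this with an approximate HNSW index rather than the exhaustive scan shown here.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, image_vecs, k=2):
    """Exhaustively rank images by cosine similarity to the query."""
    q = l2_normalise(query_vec)
    scored = []
    for photo_id, vec in image_vecs.items():
        v = l2_normalise(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), photo_id))
    scored.sort(reverse=True)  # highest similarity first
    return [photo_id for _, photo_id in scored[:k]]

# Hypothetical toy embeddings standing in for encoder output.
images = {
    "foggy_forest": [0.9, 0.1, 0.2],
    "sunny_beach": [0.1, 0.9, 0.1],
    "misty_lake": [0.8, 0.2, 0.3],
}
query = [1.0, 0.0, 0.25]  # pretend this came from the text tower
nearest = top_k(query, images, k=2)
```

Note that no text from the photo is consulted at any point: the ranking depends only on where the query lands in the embedding space.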
Why not just BM25?
BM25 is the keyword search every team already has. It's fast, well-understood, and when your data has rich captions it's often enough. The catch for photo libraries: photographer-written descriptions are mostly absent or terse, and the auto-generated alt text is good at objects (“person, tree, snow”) but bad at styles (“minimalist”, “vintage”), moods (“moody”, “cozy”), or compositions (“close-up”, “wide shot”, “rule of thirds”).
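The vocabulary gap is easy to reproduce with a textbook BM25 scorer. This is a sketch, not Vespa's implementation: the tokeniser is a plain whitespace split, the `k1` and `b` defaults are conventional values, and the two alt-text strings are hypothetical examples in the object-listing style described above.

```python
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each doc against the query with a textbook BM25 formula."""
    tokenised = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised.values()) / n_docs
    scores = {}
    for doc_id, toks in tokenised.items():
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenised.values() if term in other)
            if df == 0:
                continue  # term appears nowhere in the corpus: contributes nothing
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = toks.count(term)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores

# Hypothetical alt text that lists objects but never styles or moods.
alt_text = {
    "photo_a": "person tree snow",
    "photo_b": "dog beach sand",
}
object_query = bm25_scores("snow", alt_text)
style_query = bm25_scores("moody minimalist", alt_text)
```

The object query separates the two photos, while the style query scores every photo at zero because none of its terms occur in the text. No amount of BM25 tuning recovers a term that was never written down.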
Below: rank of the photo a human evaluator would pick as the top result, by mode. Lexical-only buries half the queries past position 10; semantic puts almost all of them at #1 or #2.
Illustrative rows — these come from manual relevance judgements over the demo's seed queries. The eval harness in eval/ will replace them with numbers from the dataset's conversions.csv (real query → click events) once that lands.
Why SigLIP 2 over plain CLIP
CLIP (OpenAI, 2021) is the model most teams reach for when they want a shared text-image embedding space. It works. SigLIP 2 (Google, 2025) builds on the SigLIP recipe, which replaces CLIP's batch-wide softmax contrastive loss with a pairwise sigmoid loss. Each image-text pair is scored as an independent binary decision, so the loss needs no normalisation across the whole batch. That tends to make training more robust to noisy pairs, and it gives each similarity score a per-pair probability reading instead of one that only means something relative to the rest of the batch. In practice for retrieval that means:
- Better reported top-K retrieval accuracy on standard benchmarks (COCO, Flickr30k) at the same parameter count.
- Multilingual out of the box: the same checkpoint accepts English, Spanish, and Japanese queries, though quality can vary by language.
- Apache 2.0 license on the released checkpoints. OpenAI's CLIP code is MIT-licensed, but its model card scopes the intended use to research.
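The difference between the two losses is easier to see side by side. The sketch below follows the published formulations as I understand them, on a plain similarity matrix where entry `[i][j]` compares image `i` with text `j`. The temperature and bias defaults are placeholders, not the values either model trains with.

```python
import math

def softmax_contrastive_loss(sim, temperature=1.0):
    """CLIP-style loss: each row and column is normalised over the whole batch."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        row = [math.exp(temperature * sim[i][j]) for j in range(n)]  # image i vs all texts
        col = [math.exp(temperature * sim[j][i]) for j in range(n)]  # text i vs all images
        total += -math.log(row[i] / sum(row)) - math.log(col[i] / sum(col))
    return total / (2 * n)

def sigmoid_pairwise_loss(sim, temperature=1.0, bias=0.0):
    """SigLIP-style loss: every (image, text) pair is an independent binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # matching pairs sit on the diagonal
            logit = label * (temperature * sim[i][j] + bias)
            total += math.log(1.0 + math.exp(-logit))  # equals -log(sigmoid(logit))
    return total / n

# Toy 2x2 batches: one well aligned, one with the pairs swapped.
aligned = [[1.0, -1.0], [-1.0, 1.0]]
swapped = [[-1.0, 1.0], [1.0, -1.0]]
losses = {
    "softmax": (softmax_contrastive_loss(aligned), softmax_contrastive_loss(swapped)),
    "sigmoid": (sigmoid_pairwise_loss(aligned), sigmoid_pairwise_loss(swapped)),
}
```

In the softmax version every term divides by a sum over the batch, so a pair's contribution depends on its neighbours. In the sigmoid version each pair's term depends only on that pair.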
Drop-in shape: 768-d output, prenormalized, cosine-similarity-friendly. Vespa's schema doesn't care which encoder you ship, as long as the tensor dimension matches its output; the model choice lives entirely in the encoder.
field embedding type tensor<float>(x[768]) {
    indexing: attribute | index
    attribute { distance-metric: prenormalized-angular }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}
The three rank profiles
Three strategies, all in schemas/photo.sd:
| profile | retrieval | ranking | purpose |
|---|---|---|---|
| semantic | nearestNeighbor(embedding, q) | cosine similarity | SigLIP-only baseline |
| lexical | userQuery() (weakAnd) | BM25 over description + ai_description | keyword-only baseline |
| hybrid | both, unioned | α·normalize_linear(semantic) + (1−α)·normalize_linear(lexical) | recommended |
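For orientation, the three profiles could be driven from a client roughly as follows. This is a sketch of request bodies for Vespa's query API, not the demo's actual client code: the `build_query` helper, the `targetHits` value, and the `sources *` clause are assumptions, and passing the embedding as a plain JSON array is one of several tensor formats you might use.

```python
import json

TARGET_HITS = 100  # assumed candidate count for the ANN operator

def build_query(text, profile, embedding=None, alpha=0.5, hits=10):
    """Return a JSON-serialisable request body for one of the three profiles."""
    ann = "({targetHits: %d}nearestNeighbor(embedding, q))" % TARGET_HITS
    where = {
        "semantic": ann,
        "lexical": "userQuery()",
        "hybrid": ann + " or userQuery()",  # union of both retrievers
    }[profile]
    body = {
        "yql": "select * from sources * where " + where,
        "ranking.profile": profile,
        "hits": hits,
    }
    if profile in ("lexical", "hybrid"):
        body["query"] = text  # consumed by userQuery()
    if profile in ("semantic", "hybrid"):
        body["input.query(q)"] = list(embedding)  # 768 floats from the text encoder
    if profile == "hybrid":
        body["input.query(alpha)"] = alpha  # blend weight read by the rank profile
    return body

# Placeholder zero vector; a real client would encode the text with SigLIP 2.
hybrid_body = build_query("moody forest", "hybrid", embedding=[0.0] * 768, alpha=0.7)
payload = json.dumps(hybrid_body)
```

Only the `where` clause, the profile name, and the inputs change between modes, which is what lets a UI toggle switch strategies per request.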
Wiring the global-phase blend
The hybrid retrieval unions the nearestNeighbor candidates with the BM25 matches. The two scores live on different scales (Vespa's closeness ∈ [0, 1], BM25 ∈ [0, ∞)), so we normalise both to [0, 1] across the rerank window before mixing:
rank-profile hybrid inherits semantic {
    inputs {
        query(q) tensor<float>(x[768])
        query(alpha) double
    }
    function semantic_score() { expression: closeness(field, embedding) }
    function lexical_score() { expression: bm25(description) + bm25(ai_description) }
    first-phase { expression: semantic_score }
    global-phase {
        rerank-count: 200
        expression {
            query(alpha) * normalize_linear(semantic_score)
            + (1 - query(alpha)) * normalize_linear(lexical_score)
        }
    }
}
Two things to call out:
- normalize_linear rescales a feature to [0, 1] using min/max across the rerank set. It's the simplest "make these two scores comparable" primitive Vespa offers, and it works without any learned calibration.
- query(alpha) is a query input, declared here as a double and passed through the same mechanism used for query embeddings, user vectors, and MRL slice sizes. Push it from the client; no redeploy needed to tune the blend.
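A worked example shows how the blend behaves as α moves. The sketch below mimics the min/max rescaling in plain Python over a three-document rerank window. The scores are invented, and mapping a window where all values are equal to 0 is my assumption, not documented Vespa behaviour.

```python
def normalize_linear(values):
    """Min/max rescale a list of scores to [0, 1] across the rerank window."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a flat window carries no signal
    return [(v - lo) / (hi - lo) for v in values]

def blend(semantic, lexical, alpha):
    """alpha * normalised semantic + (1 - alpha) * normalised lexical, per document."""
    sem = normalize_linear(semantic)
    lex = normalize_linear(lexical)
    return [alpha * s + (1 - alpha) * l for s, l in zip(sem, lex)]

def best(scores):
    """Index of the top-scoring document."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Invented scores for three candidate photos in one rerank window.
semantic_scores = [0.62, 0.58, 0.40]  # closeness values
lexical_scores = [0.0, 7.5, 15.0]     # summed BM25 values

winner_semantic = best(blend(semantic_scores, lexical_scores, alpha=1.0))
winner_even = best(blend(semantic_scores, lexical_scores, alpha=0.5))
winner_lexical = best(blend(semantic_scores, lexical_scores, alpha=0.0))
```

With these numbers the first photo wins on semantic alone, the third on lexical alone, and the second, which is strong on both signals, wins at α = 0.5. Because the normalisation is relative to the window, the same raw score can map to different values on different queries.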
Storage at 25k photos
The footprint is dominated by the inlined JPEGs — about 3.7 GB once you sum the full-resolution image bytes Vespa serves directly. The first cut of this demo hot-linked images.unsplash.com from the browser; the result-grid felt slow because every hit fired a fresh CDN round-trip per thumbnail. Backfilling the images into a raw field (full_image on t2i_photo, served through a named with-image document-summary so search responses don't ship 150 KB × hits) flipped Vespa into the role of image origin and the latency went away. Everything else is rounding error next to the JPEGs:
At 25k photos the whole thing — JPEGs, embeddings, indices — fits comfortably on one t3-sized content node. At 1M+ photos you'd either keep hot-linking the CDN (the original design, cheaper at scale) or pair the inlining with the quantisation trick the RIS demo uses — sign-bit quantise the embedding to a 96-byte binary, use Hamming HNSW for first-phase, full-precision float for rerank. The schema pattern is identical; only the indexing pipeline changes.
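The 96-byte figure follows from keeping one bit per dimension: 768 bits / 8 = 96 bytes, against 768 × 4 = 3,072 bytes for float32. A minimal sketch of sign-bit quantisation and Hamming distance is below. The bit order and the "positive means 1" convention are assumptions, since the actual pipeline is not shown here.

```python
def sign_bit_quantise(vec):
    """Pack one bit per dimension (1 if the component is positive) into bytes."""
    if len(vec) % 8:
        raise ValueError("dimension must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(vec), 8):
        byte = 0
        for x in vec[start:start + 8]:
            byte = (byte << 1) | (1 if x > 0 else 0)  # most significant bit first
        out.append(byte)
    return bytes(out)

def hamming(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Toy 768-d vector with alternating signs, and its negation.
vec = [0.1, -0.2] * 384
code = sign_bit_quantise(vec)
flipped = sign_bit_quantise([-x for x in vec])

float_bytes = len(vec) * 4   # float32 storage per embedding
binary_bytes = len(code)     # packed sign bits per embedding
```

The binary code is 32× smaller than the float vector, which is why it suits the first-phase index while the full-precision floats are kept for reranking.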
Methodology
- Corpus: 25,000 photos from the Unsplash Lite research dataset: photo metadata plus the 755k-row keyword association table.
- Model: google/siglip2-base-patch16-224, 768-d output, L2-normalised at index time.
- Image fetch at feed time: pulled from the Unsplash CDN at 384px (resized via the CDN's w= param). SigLIP 2 downsamples to 224 internally, so a 384px source is more than enough.
- Image delivery: a separate 1200px JPEG per photo is stored inline on the doc as a raw field; the backend's /images/{doc_id} endpoint reads it via the with-image summary and serves it with a long browser-cache header. Un-backfilled docs fall back to a 302 redirect to the original CDN URL.
- Eval (planned): conversions.csv from the Unsplash dataset is a real query → click event log. The harness in eval/evaluate.py will treat each clicked photo as gold-positive for its query and score recall@K + MRR per profile.
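The planned metrics are simple enough to sketch ahead of the harness. The code below is an illustration only, not the contents of eval/evaluate.py: the function names, the in-memory dictionaries, and the photo ids are hypothetical, and loading conversions.csv is left out because its column layout is not described here.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of gold-positive photos that appear in the top k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)

def reciprocal_rank(ranked, gold):
    """1 / position of the first gold-positive photo, or 0 if none is returned."""
    for position, photo_id in enumerate(ranked, start=1):
        if photo_id in gold:
            return 1.0 / position
    return 0.0

def evaluate(runs, clicks, k=10):
    """Average recall@k and MRR over queries present in both inputs."""
    queries = [q for q in clicks if q in runs]
    recall = sum(recall_at_k(runs[q], clicks[q], k) for q in queries) / len(queries)
    mrr = sum(reciprocal_rank(runs[q], clicks[q]) for q in queries) / len(queries)
    return {"recall@%d" % k: recall, "mrr": mrr}

# Hypothetical click log (query -> clicked photo ids) and one profile's rankings.
clicks = {"moody forest": {"p3"}, "snow": {"p1", "p9"}}
runs = {"moody forest": ["p7", "p3", "p2"], "snow": ["p1", "p4", "p5"]}
report = evaluate(runs, clicks, k=2)
```

Running `evaluate` once per rank profile on the same click log would give the per-profile comparison the methodology describes.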
Takeaways
- Semantic search is BM25's upgrade path, not a replacement. The two answer slightly different questions: BM25 finds documents whose text matches the query; semantic finds documents whose content matches the query intent. Photo libraries are the canonical case where the second matters more.
- The schema doesn't pick winners. One photo.sd defines three rank profiles. Switching modes is a query parameter, not a redeploy. The UI's mode toggle and α slider drive ranking entirely; no backend changes required.
- Vespa's global-phase + normalize_linear is the cleanest hybrid primitive we know. Per-query min/max normalisation, no learned calibration, no offline tuning. Mix two unrelated signals in one expression.
- SigLIP 2 is a better default than OpenAI CLIP in 2026 for new builds: same shape, better quality, a permissive Apache 2.0 license, multilingual out of the box. If you're prototyping a text-to-image search today, start here.