Text-to-image search: when “what it looks like” beats “what it says”

25,000 Unsplash photos, one SigLIP 2 model, three rank profiles. A practical look at why semantic image search picks up where keyword search runs out of vocabulary — and how Vespa puts both behind one query.

The setup

The page shows two columns running the same query through different rank profiles:

Behind it: the 25k-photo Unsplash Lite research dataset, embedded once with google/siglip2-base-patch16-224 (768-d, L2-normalised) and indexed in Vespa with HNSW and BM25 over the same documents.

Why not just BM25?

BM25 is the keyword search every team already has. It's fast, well-understood, and when your data has rich captions it's often enough. The catch for photo libraries: photographer-written descriptions are mostly absent or terse, and the auto-generated alt text is good at objects (“person, tree, snow”) but bad at styles (“moody”, “minimalist”), moods (“vintage”, “cozy”), or compositions (“close-up”, “wide shot”, “rule of thirds”).

Below: rank of the photo a human evaluator would pick as the top result, by mode. Lexical-only buries half the queries past position 10; semantic puts almost all of them at #1 or #2.

| query | lexical (BM25) | semantic (SigLIP) |
|---|---|---|
| moody forest at sunrise | >20 | 1 |
| minimalist desk setup | 14 | 2 |
| vintage film camera | 3 | 1 |
| rainy window with neon | >20 | 1 |
| cat sleeping in a sunbeam | 8 | 1 |
| barista pouring latte art | 1 | 2 |
Rank of the “correct” photo for each query (1 = top hit, >20 = missed the top-20).

Illustrative rows — these come from manual relevance judgements over the demo's seed queries. The eval harness in eval/ will replace them with numbers from the dataset's conversions.csv (real query → click events) once that lands.
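
As a rough idea of what an eval harness could report from per-query ranks like these, here is a minimal sketch. It is not the code in eval/: the function names are hypothetical, the rank lists are the illustrative values quoted in this section (read as lexical first, semantic second), and a ">20" miss is represented as None.

```python
from typing import Optional, Sequence

# Illustrative ranks from this section, in query order.
# None stands for ">20", i.e. the correct photo missed the top 20.
SEMANTIC_RANKS = [1, 2, 1, 1, 1, 2]
LEXICAL_RANKS = [None, 14, 3, None, 8, 1]


def hit_rate_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose correct photo appears at rank <= k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries, counting a miss as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


if __name__ == "__main__":
    for name, ranks in (("semantic", SEMANTIC_RANKS), ("lexical", LEXICAL_RANKS)):
        print(name, hit_rate_at_k(ranks, 10), round(mean_reciprocal_rank(ranks), 3))
```

With only six hand-judged queries these figures are a sanity check, not a benchmark; the same two functions would apply unchanged to ranks derived from click data.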

Why SigLIP 2 over plain CLIP

CLIP (OpenAI, 2021) is the model most teams reach for when they want a shared text-image embedding space. It works. SigLIP 2 (Google, 2025) instead trains with the pairwise sigmoid loss introduced by the original SigLIP, in place of CLIP's softmax contrastive loss. The sigmoid loss scores every image-text pair independently, so it needs no normalisation across the whole batch. That helps the model train on noisy data, and it means each pair's score can be read as a match probability rather than only as a relative ranking. In practice for retrieval that means:

Drop-in shape: 768-d output, prenormalized, cosine-similarity-friendly. Vespa's schema doesn't care which one you ship — the model choice lives entirely in the encoder.

field embedding type tensor<float>(x[768]) {
    indexing: attribute | index
    attribute { distance-metric: prenormalized-angular }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}
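
The reason a prenormalized metric is enough can be shown in a few lines: for unit-length vectors, cosine similarity is just the dot product. The sketch below is illustrative only. It uses toy 3-d vectors rather than real 768-d SigLIP 2 embeddings, and exact_top_k is a hypothetical brute-force baseline, i.e. the exact ordering that an HNSW index approximates.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs have unit length."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query: Sequence[float],
                docs: Dict[str, Sequence[float]],
                k: int) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours, highest similarity first."""
    scored = [(doc_id, dot(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings; real ones would come from the encoder.
DOCS = {
    "a": l2_normalize([0.9, 0.1, 0.0]),
    "b": l2_normalize([0.0, 1.0, 0.0]),
    "c": l2_normalize([-1.0, 0.0, 0.0]),
}
QUERY = l2_normalize([1.0, 0.0, 0.0])
TOP2 = exact_top_k(QUERY, DOCS, 2)  # ids in order: "a", then "b"
```

A brute-force pass like this is also a handy way to spot-check HNSW recall on a small sample of queries.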

The three rank profiles

Three strategies, all in schemas/photo.sd:

| profile | retrieval | ranking | tests |
|---|---|---|---|
| semantic | nearestNeighbor(embedding, q) | cosine similarity | SigLIP-only baseline |
| lexical | userQuery() (weakAnd) | BM25 over description + ai_description | keyword-only baseline |
| hybrid | both, unioned | α·normalize_linear(semantic) + (1−α)·normalize_linear(lexical) | recommended |
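
To make the three retrieval strategies concrete, the sketch below builds a query request body for each profile. Treat it as an assumption-laden illustration rather than the demo's actual client code: build_query is a hypothetical helper, the schema name photo is inferred from the file name, targetHits of 100 is a guess, and passing the query tensor as a plain list of floats is assumed to be accepted by the deployment.

```python
from typing import Any, Dict, List

SCHEMA = "photo"    # assumption: schema name inferred from schemas/photo.sd
TARGET_HITS = 100   # assumption: the article does not state a targetHits value


def build_query(mode: str, text: str, embedding: List[float],
                alpha: float = 0.5) -> Dict[str, Any]:
    """Return a query body for one of: 'semantic', 'lexical', 'hybrid'."""
    nn = f"({{targetHits:{TARGET_HITS}}}nearestNeighbor(embedding, q))"
    where = {
        "semantic": nn,
        "lexical": "userQuery()",
        "hybrid": f"{nn} or userQuery()",
    }[mode]
    body: Dict[str, Any] = {
        "yql": f"select * from {SCHEMA} where {where}",
        "ranking.profile": mode,
    }
    if mode != "semantic":
        body["query"] = text                 # consumed by userQuery()
    if mode != "lexical":
        body["input.query(q)"] = embedding   # 768-d text embedding
    if mode == "hybrid":
        body["input.query(alpha)"] = alpha   # blend weight
    return body


if __name__ == "__main__":
    fake_embedding = [0.0] * 768  # placeholder, not a real SigLIP 2 vector
    print(build_query("hybrid", "moody forest at sunrise", fake_embedding)["yql"])
```

Only the where clause, the profile name, and the inputs differ between modes.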

Wiring the global-phase blend

The hybrid retrieval unions the nearestNeighbor candidates with the BM25 matches. The two scores live on different scales (cosine ∈ [-1, 1], BM25 ∈ [0, ∞)), so we normalise both to [0, 1] across the rerank window before mixing:

rank-profile hybrid inherits semantic {
    inputs {
        query(q)     tensor<float>(x[768])
        query(alpha) double
    }
    function semantic_score() { expression: closeness(field, embedding) }
    function lexical_score()  { expression: bm25(description) + bm25(ai_description) }

    first-phase  { expression: semantic_score }
    global-phase {
        rerank-count: 200
        expression {
            query(alpha) * normalize_linear(semantic_score)
            + (1 - query(alpha)) * normalize_linear(lexical_score)
        }
    }
}
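
The arithmetic behind the blend is small enough to sketch outside Vespa. The code below is a plain-Python approximation of min/max normalisation over a rerank window followed by the α mix. It is not Vespa's implementation, the function names are only borrowed for readability, and returning 0.0 when every score in the window is equal is an assumption made here.

```python
from typing import List, Sequence


def normalize_linear(scores: Sequence[float]) -> List[float]:
    """Min/max-scale scores to [0, 1] across the rerank window."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: a flat window carries no ranking signal, so map it to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(semantic: Sequence[float], lexical: Sequence[float],
          alpha: float) -> List[float]:
    """alpha * normalised semantic + (1 - alpha) * normalised lexical, per hit."""
    sem_n = normalize_linear(semantic)
    lex_n = normalize_linear(lexical)
    return [alpha * s + (1 - alpha) * l for s, l in zip(sem_n, lex_n)]


# Three hits in one rerank window: closeness-like scores and raw BM25 sums.
SEMANTIC = [0.75, 0.5, 0.25]
LEXICAL = [0.0, 12.0, 6.0]
BLENDED = blend(SEMANTIC, LEXICAL, alpha=0.5)  # -> [0.5, 0.75, 0.25]
```

At α = 0.5 the second hit overtakes the first, because its strong keyword match outweighs its weaker embedding score. At α = 1.0 the order falls back to pure semantic ranking.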

Two things to call out:

Storage at 25k photos

The footprint is dominated by the inlined JPEGs: about 3.75 GB once you sum the full-resolution image bytes Vespa serves directly. The first cut of this demo hot-linked images.unsplash.com from the browser, and the result grid felt slow because every thumbnail fired its own CDN round-trip. Backfilling the images into a raw field (full_image on t2i_photo, served through a document-summary named with-image so that ordinary search responses don't ship 150 KB × hits) made Vespa the image origin, and the latency went away. Everything else is rounding error next to the JPEGs:

| component | size |
|---|---|
| image bytes (inlined JPEG, ~150 KB avg) | 3.75 GB |
| embedding (float32) | 76.8 MB |
| BM25 indices (approx.) | 6.4 MB |
| HNSW links (m=16) | 1.6 MB |
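
Most of these figures follow from simple multiplication, sketched below as a back-of-envelope check. The 4 bytes per HNSW link (base layer only) is an assumption chosen because it reproduces the 1.6 MB figure. The BM25 index size is left out because it depends on the text and cannot be derived this way. Units are decimal (1 MB = 10^6 bytes).

```python
N_PHOTOS = 25_000
AVG_JPEG_BYTES = 150_000   # ~150 KB average per inlined image
DIMS = 768                 # embedding dimensions
FLOAT32_BYTES = 4
HNSW_M = 16                # max-links-per-node
LINK_BYTES = 4             # assumption: one 4-byte id per link, base layer only


def footprint_bytes(n_photos: int = N_PHOTOS) -> dict:
    """Rough per-component storage estimate in bytes."""
    return {
        "images": n_photos * AVG_JPEG_BYTES,
        "embeddings": n_photos * DIMS * FLOAT32_BYTES,
        "hnsw_links": n_photos * HNSW_M * LINK_BYTES,
    }


if __name__ == "__main__":
    for name, size in footprint_bytes().items():
        print(f"{name}: {size / 1e6:.1f} MB")
```

Calling footprint_bytes(1_000_000) gives the same estimate for the 1M-photo case, where the image bytes alone reach 150 GB.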

At 25k photos the whole thing — JPEGs, embeddings, indices — fits comfortably on one t3-sized content node. At 1M+ photos you'd either keep hot-linking the CDN (the original design, cheaper at scale) or pair the inlining with the quantisation trick the RIS demo uses — sign-bit quantise the embedding to a 96-byte binary, use Hamming HNSW for first-phase, full-precision float for rerank. The schema pattern is identical; only the indexing pipeline changes.
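
For readers who have not seen the quantisation trick, here is a minimal sketch of the sign-bit step and the Hamming distance it enables. This is not the RIS demo's code: the function names are hypothetical, the bit order (most significant bit first) is an arbitrary choice, and the output is unsigned bytes, which would need converting to signed values before being fed to an int8 tensor field.

```python
from typing import Sequence


def sign_bit_quantize(vec: Sequence[float]) -> bytes:
    """Pack one bit per dimension (1 if the value is > 0), MSB first."""
    if len(vec) % 8 != 0:
        raise ValueError("dimension must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(vec), 8):
        byte = 0
        for x in vec[i:i + 8]:
            byte = (byte << 1) | (1 if x > 0 else 0)
        out.append(byte)
    return bytes(out)


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length codes."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# 768 dimensions -> 768 bits -> 96 bytes, versus 3,072 bytes as float32.
ALL_POSITIVE = sign_bit_quantize([1.0] * 768)
FIRST_EIGHT_FLIPPED = sign_bit_quantize([-1.0] * 8 + [1.0] * 760)
```

The two-stage pattern then uses hamming over the 96-byte codes to shortlist candidates cheaply and re-scores only that shortlist with the full-precision vectors.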

Methodology

Takeaways

  1. Semantic search is BM25's upgrade path, not a replacement. The two answer slightly different questions: BM25 finds documents whose text matches the query; semantic finds documents whose content matches the query intent. Photo libraries are the canonical case where the second matters more.
  2. The schema doesn't pick winners. One photo.sd defines three rank profiles. Switching modes is a query parameter, not a redeploy. The UI's mode toggle and α slider drive ranking entirely; no backend changes required.
  3. Vespa's global-phase + normalize_linear is the cleanest hybrid primitive we know. Per-query min/max normalisation, no learned calibration, no offline tuning. Mix two unrelated signals in one expression.
  4. SigLIP 2 is a better default than OpenAI CLIP in 2026 for new builds: same shape, better quality, friendlier license, multilingual out of the box. If you're prototyping a text-to-image search today, start here.

Vespa Cloud · SigLIP 2 · Unsplash Lite · Dataset on GitHub