Bi-Encoder

Also known as: dual encoder, two-tower model, siamese encoder

TL;DR

A bi-encoder embeds the query and the document separately into vectors, then compares them with a dot product or cosine similarity. Document vectors are computed once and cached, so query time stays fast and cheap, which is why bi-encoders are the basis of dense retrieval systems.

A bi-encoder encodes the query and the document into two separate vectors, then ranks documents by their similarity. The encoding happens in isolation: the model never sees both at once. A document’s vector doesn’t depend on the query, so you can compute it once and cache it until the document or the encoder changes; at query time you embed only the query and run a similarity search against the prebuilt index.
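The offline/online split can be sketched in a few lines. This is a minimal illustration, not a real model: `embed` is a hypothetical stand-in (a normalized bag-of-words over a vocabulary built from the corpus) for a trained encoder, and the brute-force scan stands in for an approximate-nearest-neighbor index.

```python
import math

def build_vocab(texts):
    """Assign each distinct token an index (toy replacement for a learned encoder)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def embed(text, vocab):
    """Hypothetical encoder: L2-normalized token counts. Unknown tokens are ignored."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: embed every document once and keep the vectors.
docs = [
    "paris is the capital of france",
    "the eiffel tower is in paris",
    "bananas are rich in potassium",
]
vocab = build_vocab(docs)
index = [embed(d, vocab) for d in docs]  # the cached document vectors

# Online: embed only the query, then compare against the cached vectors.
def search(query, k=2):
    q = embed(query, vocab)
    scored = sorted(enumerate(index), key=lambda pair: dot(q, pair[1]), reverse=True)
    return [(docs[i], dot(q, v)) for i, v in scored[:k]]

top_doc, top_score = search("capital of france", k=1)[0]
```

Note that `index` never changes when a new query arrives; only `embed(query, vocab)` runs at query time.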

This is what makes bi-encoders the backbone of first-pass retrieval. You can index a billion documents offline once, then serve millions of queries against that index for the cost of one query embedding plus an approximate-nearest-neighbor lookup. A cross-encoder couldn’t possibly do that — it would need a fresh forward pass per (query, document) pair.
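A rough count of encoder forward passes makes the gap concrete. The corpus and query volumes below are the round numbers from the paragraph above; the ANN lookup cost is left out because it is not a model forward pass.

```python
def forward_passes(num_docs, num_queries):
    """Count model forward passes for each approach (ANN lookups not included)."""
    # Bi-encoder: one pass per document (offline, once) plus one per query.
    bi_encoder = num_docs + num_queries
    # Cross-encoder over the whole corpus: one pass per (query, document) pair.
    cross_encoder = num_docs * num_queries
    return bi_encoder, cross_encoder

bi, cross = forward_passes(num_docs=1_000_000_000, num_queries=1_000_000)
ratio = cross / bi  # how many times more passes the cross-encoder needs
```

With these numbers the bi-encoder needs about 10^9 passes in total, while scoring every pair with a cross-encoder would need 10^15.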

The accuracy/latency tradeoff

Bi-encoders are fast and cheap; cross-encoders are slow and accurate. Production retrieval pipelines combine them: a bi-encoder fetches a few hundred candidates per query in milliseconds, then a cross-encoder (reranker) reorders that candidate set in ~150 ms. You get most of the cross-encoder’s accuracy at a fraction of its cost.
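The two-stage shape looks roughly like this. Both scorers are toy assumptions: `overlap_score` stands in for a bi-encoder similarity (in a real system the document side would be a cached vector), and `toy_reranker` stands in for a cross-encoder by looking at query and document together, here only to catch a negation mismatch.

```python
import math

def overlap_score(query, doc):
    """Cheap first-stage stand-in: normalized token overlap."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / math.sqrt(len(q) * len(d))

def toy_reranker(query, doc):
    """Second-stage stand-in: sees both texts, penalizes a negation mismatch."""
    score = overlap_score(query, doc)
    query_negated = "not" in query.lower().split()
    doc_negated = "not" in doc.lower().split()
    if query_negated != doc_negated:
        score -= 1.0
    return score

def retrieve_then_rerank(query, docs, k_retrieve=100, k_final=10):
    # Stage 1: cheap scoring over the whole corpus, keep the top k_retrieve.
    candidates = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: expensive scoring over the candidates only.
    return sorted(candidates, key=lambda d: toy_reranker(query, d), reverse=True)[:k_final]

docs = [
    "the policy does not cover dental",
    "the policy does cover dental and vision",
    "dental floss is sold in pharmacies",
    "the weather is nice today",
]
results = retrieve_then_rerank("does the policy cover dental", docs, k_retrieve=2, k_final=2)
```

The first stage ranks the "does not cover" document highest because it shares the most tokens with the query; the second stage only has to fix the order of the two survivors.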

A bi-encoder is a cross-encoder with the query-document attention surgically removed for the sake of caching.

The bi-encoder produces a single fixed-size vector for the query and a single fixed-size vector for the document, then compares them with a dot product. Whatever signal the encoder couldn’t compress into that vector is gone. A cross-encoder, by contrast, runs full self-attention over the concatenation of query and document — every query token can attend to every document token.

The signals that compress poorly into a single vector are the ones that involve joint analysis. Negation: a document that says “the policy does not cover dental” should score low for “does the policy cover dental?”, but the bi-encoder vector for the document encodes both “policy” and “dental” prominently, and the dot product can’t distinguish “covers” from “does not cover”. Quantifier scoping: “all” vs “some” vs “no” creates very different relevance, but a bi-encoder’s compressed representation is dominated by content words. Phrase-level grounding: a long document with one sentence answering the query and 99 sentences of unrelated text gets averaged toward “unrelated” in the embedding.
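The phrase-level grounding case can be illustrated with toy vectors. This sketch assumes the document embedding behaves like a mean over sentence vectors, which is a simplification of how real encoders pool; the 2-D vectors are made up for the example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

query = [1.0, 0.0]            # direction of "what the query asks about" (assumed)
answer_sentence = [1.0, 0.0]  # the one sentence that answers the query
unrelated = [0.0, 1.0]        # everything else in the document

# One answering sentence plus 99 unrelated ones, pooled into a single vector.
document = mean_pool([answer_sentence] + [unrelated] * 99)

sentence_similarity = cosine(query, answer_sentence)
document_similarity = cosine(query, document)
```

Here the answering sentence alone matches the query perfectly, but the pooled document vector is almost orthogonal to it: the single relevant sentence is drowned out by the other 99.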

This is the entire reason rerankers exist. The bi-encoder gets most of the relevant documents into the top-100 candidates (on the order of 95-99% recall); the cross-encoder then reorders that top-100 with full joint attention and recovers the precision the bi-encoder couldn’t carry.

One way to narrow the gap is distillation. Train the bi-encoder so its dot-product output matches the cross-encoder’s relevance score on the same (query, document) pairs. The bi-encoder still can’t represent everything the cross-encoder represents (a single vector versus joint attention), but it can be pushed much closer to the cross-encoder’s ordering than naive contrastive training gets you.

The training shape: take a query corpus, for each query produce candidate documents (positives, hard negatives, easy negatives), score each (query, document) pair with the teacher cross-encoder, and train the student bi-encoder to reproduce those scores via a regression or KL loss. The bi-encoder learns to allocate its embedding capacity toward the dimensions that matter for the teacher’s judgments.
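As a sketch of the two loss options mentioned above, here are a regression (MSE) loss and a listwise KL loss for one query. The vectors and teacher scores are invented for illustration, and the exact loss used by any particular model is not specified in the text, so treat this as one plausible formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mse_loss(student_scores, teacher_scores):
    """Regression: match the teacher's score on each (query, document) pair."""
    pairs = list(zip(student_scores, teacher_scores))
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)

def kl_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the candidate list: match the teacher's ordering."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Student: dot products between one query vector and three candidate vectors.
query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # positive, hard negative, easy negative
student_scores = [dot(query_vec, d) for d in doc_vecs]

# Teacher: hypothetical cross-encoder relevance scores for the same three pairs.
teacher_scores = [0.95, 0.2, 0.05]

regression = mse_loss(student_scores, teacher_scores)
listwise = kl_loss(student_scores, teacher_scores)
```

In training, the gradient of either loss with respect to the encoder weights is what moves the student's vectors; the teacher's scores are computed once and treated as fixed targets.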

zembed-1 is exactly this — a bi-encoder distilled from zerank-2’s pointwise relevance scores. The result is a bi-encoder that tops graded retrieval benchmarks because its vector space is shaped by a cross-encoder’s relevance judgments, not by the standard “pull positives close, push negatives far” objective. This is the embedding-training analog of the rerankers-as-teachers pattern that’s increasingly dominant in production retrieval.

What “siamese” / “two-tower” mean

You’ll see “siamese encoder” or “two-tower model” used as synonyms. Both phrases describe the same architecture: two parallel encoders that produce vectors which then meet at a similarity step. Strictly speaking, “siamese” implies the two encoders share weights, while “two-tower” makes no such promise. Some bi-encoders use separate weights for the query encoder and document encoder (asymmetric), since “what is the capital of France?” and “Paris is the capital of France” differ in length and form, and each can benefit from its own encoder.
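The difference between shared and separate weights comes down to whether both sides call the same encoder object. The sketch below uses a hypothetical `Tower` class (a single random linear layer) purely to show the wiring; real towers are full transformer encoders.

```python
import random

class Tower:
    """Hypothetical encoder: one random linear layer mapping features to a vector."""
    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(self, features):
        return [sum(w * x for w, x in zip(row, features)) for row in self.weights]

class BiEncoder:
    def __init__(self, in_dim, out_dim, share_weights):
        self.query_tower = Tower(in_dim, out_dim, seed=0)
        # Shared weights: both sides use the same tower object.
        # Asymmetric: the document side gets its own parameters.
        self.doc_tower = self.query_tower if share_weights else Tower(in_dim, out_dim, seed=1)

    def score(self, query_features, doc_features):
        q = self.query_tower.encode(query_features)
        d = self.doc_tower.encode(doc_features)
        return sum(a * b for a, b in zip(q, d))

siamese = BiEncoder(in_dim=4, out_dim=3, share_weights=True)
asymmetric = BiEncoder(in_dim=4, out_dim=3, share_weights=False)

features = [1.0, 0.5, -0.5, 2.0]
self_score = siamese.score(features, features)  # squared norm of the shared encoding
```

Either way, the two sides only meet at the final dot product, which is what keeps document vectors cacheable.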

ZeroEntropy’s bi-encoder

zembed-1 is a bi-encoder distilled from zerank-2 — the dense vectors carry the cross-encoder’s relevance signal compressed into a fixed-size embedding. See the Embeddings product page.

Go further

Where exactly does a bi-encoder lose accuracy vs a cross-encoder?

The query and document never get to attend to each other. Subtle relevance — phrase-level grounding, negation, scoped quantifiers — shows up in cross-attention but is hard to compress into a single fixed vector. That's the entire reason rerankers exist.

Related: Cross-encoder, Reranker, Cosine similarity

Why bother with a bi-encoder at all if cross-encoders are more accurate?

Cost. A bi-encoder lets you embed billions of docs once and run a fresh query for the cost of one embed plus an ANN lookup. A cross-encoder needs a forward pass per (query, doc) pair — fine for the top-100, infeasible across the whole corpus.

Related: First-pass retrieval, Recall@K, Hybrid search end-to-end (playbook)

How can a bi-encoder approximate cross-encoder quality?

Distill it from one. Train the bi-encoder so its dot-product output matches the cross-encoder's relevance score on the same (query, doc) pairs. zembed-1 is exactly this — a bi-encoder distilled from zerank-2's pointwise scores, which is why it tops graded retrieval benchmarks.

Related: Knowledge distillation, zELO training methodology, Embedding
