ColBERT

Also known as: late-interaction, ColBERTv2, MaxSim, PLAID

TL;DR

A late-interaction retrieval architecture: encode each token of the query and the document into its own vector, then score each query-document pair with MaxSim. It sits between the bi-encoder (one vector per text, fast) and the cross-encoder (full attention, accurate but slow).

ColBERT (Contextualized Late Interaction over BERT) is a retrieval architecture that occupies the middle ground between bi-encoders and cross-encoders. Where a bi-encoder pools every token into one vector and a cross-encoder runs full attention across query and document jointly, ColBERT does something in between: it encodes each token of the query and the document into its own vector, then aggregates via MaxSim. For every query token, find its most similar document token, and sum those maxima.

score(q, d) = Σ_{q_i ∈ q} max_{d_j ∈ d} sim(q_i, d_j)

Each sim is a dot product. The aggregation is “late” because the interaction between query and document happens after both have been encoded, not inside the transformer layers. This preserves bi-encoder-style cacheability (document tokens can be embedded offline) while restoring most of the cross-encoder’s per-token discrimination.
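As a minimal sketch of the formula above, assuming the token embeddings are already available as NumPy arrays (the function name maxsim_score and the toy 2-dim vectors are made up for illustration, not taken from any ColBERT codebase):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    # sim[i, j] is the dot product between query token i and document token j.
    sim = query_vecs @ doc_vecs.T
    # Take the best-matching document token per query token, then sum the maxima.
    return float(sim.max(axis=1).sum())

# Toy embeddings: 2 query tokens, documents with 3 and 2 tokens.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
doc_b = np.array([[0.5, 0.5], [0.6, 0.0]])

score_a = maxsim_score(query, doc_a)  # 1.0 + 1.0
score_b = maxsim_score(query, doc_b)  # 0.6 + 0.5
```

Only the final matrix product depends on the query, so the document-side array can be computed and stored ahead of time.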

Why MaxSim works

The aggregation has a clean intuition. Every word in the query gets to “look for” its best match in the document. If the query is “how do I reset my password”, then “reset” finds “reset”, “password” finds “password”, and “how do I” finds something, usually a generic high-frequency token. The total score sums those soft matches.

This is closer to how human relevance judgments actually work — readers scan for the cues, not for an aggregate “vibe”. A bi-encoder forces every token’s contribution into a single 1024-dim vector and inevitably blurs them; ColBERT’s per-token vectors keep them distinct.
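A toy illustration of the blurring effect, under a strong simplifying assumption: mean pooling over fixed token vectors stands in for the bi-encoder's single vector (real bi-encoders pool contextualized states and are trained end to end). The names pooled_score and maxsim are hypothetical helpers.

```python
import numpy as np

def maxsim(q: np.ndarray, d: np.ndarray) -> float:
    # Late interaction: best document token per query token, summed.
    return float((q @ d.T).max(axis=1).sum())

def pooled_score(q: np.ndarray, d: np.ndarray) -> float:
    # Assumption: mean pooling as a stand-in for a bi-encoder's single vector.
    return float(q.mean(axis=0) @ d.mean(axis=0))

query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_specific = np.array([[1.0, 0.0], [0.0, 1.0]])  # one exact match per query token
doc_generic = np.array([[0.5, 0.5], [0.5, 0.5]])   # two "average" tokens

pooled = (pooled_score(query, doc_specific), pooled_score(query, doc_generic))
late = (maxsim(query, doc_specific), maxsim(query, doc_generic))
```

Both documents pool to the same vector, so the pooled scores tie, while MaxSim separates the document with exact per-token matches from the generic one.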

Per-token embeddings + MaxSim aggregation = bi-encoder cacheability with most of cross-encoder accuracy. The middle of the latency-quality tradeoff curve.

The cost: storage

For a document of N tokens, ColBERT stores N vectors. At ~128 dims each (standard) and 4-byte floats, a 200-token document needs ~100KB versus a bi-encoder’s 4KB — about 25x. For a 100M-document corpus that’s the difference between 400GB and 10TB.
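The arithmetic behind those figures can be checked with a short sketch. The 1024-dim bi-encoder vector is inferred from the 4KB figure; the function names are illustrative only.

```python
BYTES_PER_FLOAT = 4

def colbert_bytes(tokens_per_doc: int, dim: int = 128) -> int:
    # One vector per token, uncompressed.
    return tokens_per_doc * dim * BYTES_PER_FLOAT

def bi_encoder_bytes(dim: int = 1024) -> int:
    # One vector per document.
    return dim * BYTES_PER_FLOAT

per_doc_colbert = colbert_bytes(200)    # 102_400 bytes, about 100KB
per_doc_bi = bi_encoder_bytes()         # 4_096 bytes, about 4KB
ratio = per_doc_colbert / per_doc_bi    # 25.0

corpus_size = 100_000_000
colbert_tb = per_doc_colbert * corpus_size / 1e12  # 10.24 TB
bi_encoder_gb = per_doc_bi * corpus_size / 1e9     # 409.6 GB
```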

ColBERTv2 (Santhanam et al., 2022) attacked this directly. Token vectors are clustered into centroids; each token-vector is then represented as (centroid_id, residual_quantized), with the residual stored at 1 or 2 bits per dimension. This compresses each 128-dim token-vector from 512 bytes to roughly 20 to 36 bytes, bringing storage back to manageable territory.
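A rough sketch of the (centroid_id, residual_quantized) idea follows. It is a simplification under stated assumptions, not the paper's exact quantizer: the centroids are random rather than learned by clustering, the residual buckets are uniform over a hand-picked range [lo, hi], and the codes are left unpacked as one uint8 per dimension for readability. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_CENTROIDS, BITS = 128, 16, 2
LEVELS = 2 ** BITS

# Hypothetical centroids; a real index would learn these by clustering token vectors.
centroids = rng.normal(size=(N_CENTROIDS, DIM))

def compress(vec, lo, hi):
    """Return (centroid_id, per-dimension bucket codes) for one token vector."""
    cid = int(np.argmin(((centroids - vec) ** 2).sum(axis=1)))
    residual = vec - centroids[cid]
    # Map each residual dimension to one of LEVELS uniform buckets over [lo, hi].
    scaled = (residual - lo) / (hi - lo) * LEVELS
    codes = np.clip(np.floor(scaled), 0, LEVELS - 1).astype(np.uint8)
    return cid, codes

def decompress(cid, codes, lo, hi):
    """Approximate the vector as centroid plus the midpoint of each bucket."""
    step = (hi - lo) / LEVELS
    return centroids[cid] + lo + (codes + 0.5) * step

# A token vector close to centroid 3, with noise kept inside the bucket range.
noise = np.clip(rng.normal(scale=0.05, size=DIM), -0.19, 0.19)
vec = centroids[3] + noise
lo, hi = -0.2, 0.2

cid, codes = compress(vec, lo, hi)
approx = decompress(cid, codes, lo, hi)
max_err = float(np.abs(approx - vec).max())  # bounded by half a bucket width
```

With 2 bits per dimension, the 128 codes would pack into 32 bytes; adding a 4-byte centroid ID gives 36 bytes per token-vector.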

How retrieval actually works at scale

Naively, computing MaxSim against every document in a corpus is too slow. PLAID (Santhanam et al., 2022) makes it tractable with three stages:

1. Centroid lookup. Each query token’s nearest centroids identify candidate documents (any document whose tokens map to those centroids). This narrows millions of documents down to thousands without computing a single MaxSim.
2. Approximate scoring. Compute MaxSim using only the centroid representations of document tokens. Cheap, lossy, fast.
3. Exact scoring. For the top candidates from step 2, decompress the residuals and compute exact MaxSim. This is where the final ordering comes from.
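The three stages can be sketched on a toy corpus. This is an illustration of the control flow only, not PLAID's implementation: the corpus is random, the uncompressed token vectors stand in for decompressed residuals in the last stage, and the names search, n_probe, and n_rerank are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_CENTROIDS, N_DOCS, TOKENS = 16, 8, 50, 5

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

centroids = normalize(rng.normal(size=(N_CENTROIDS, DIM)))

# Toy corpus: each document token is a noisy copy of a randomly chosen centroid.
docs = []
for _ in range(N_DOCS):
    ids = rng.integers(0, N_CENTROIDS, size=TOKENS)
    docs.append(normalize(centroids[ids] + 0.1 * rng.normal(size=(TOKENS, DIM))))

# Index time: assign each document token to its nearest centroid and
# build an inverted list from centroid id to the documents that use it.
doc_codes = [np.argmax(d @ centroids.T, axis=1) for d in docs]
inverted = {c: set() for c in range(N_CENTROIDS)}
for doc_id, codes in enumerate(doc_codes):
    for c in codes:
        inverted[int(c)].add(doc_id)

def maxsim(q, d):
    return float((q @ d.T).max(axis=1).sum())

def search(query, n_probe=2, n_rerank=5):
    # Stage 1: centroid lookup. Collect documents that share a nearby centroid.
    q_cent = query @ centroids.T
    probe = np.argsort(-q_cent, axis=1)[:, :n_probe]
    candidates = set()
    for c in probe.ravel():
        candidates |= inverted[int(c)]
    # Stage 2: approximate MaxSim, using centroids in place of document tokens.
    approx = {d: maxsim(query, centroids[doc_codes[d]]) for d in candidates}
    shortlist = sorted(approx, key=approx.get, reverse=True)[:n_rerank]
    # Stage 3: exact MaxSim on the shortlist only.
    exact = {d: maxsim(query, docs[d]) for d in shortlist}
    return sorted(exact.items(), key=lambda kv: kv[1], reverse=True)

# Query built from two token vectors of document 7.
query = docs[7][:2]
results = search(query)  # list of (doc_id, exact_score), best first
```

Exact MaxSim runs on at most n_rerank documents per query; everything before that touches only centroid ids and centroid vectors, which is where the savings come from.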

End-to-end ColBERTv2 + PLAID on 100M passages runs at ~50ms per query on a single GPU. Competitive with bi-encoder + reranker pipelines.

When to use ColBERT vs the cascade

Most production retrieval stacks pair a bi-encoder for first-pass with a cross-encoder reranker for ordering — it’s operationally simpler and storage-cheaper. ColBERT is the right answer in two specific cases:

Out-of-domain robustness. Bi-encoders trained on web data generalize poorly to legal, medical, or technical corpora; MaxSim’s token-level matching recovers a large part of that gap without retraining. This was the original ColBERT pitch.
Single-stage retrieval requirements. When operational complexity demands one model rather than a cascade — typically because reranker latency or cost is unacceptable — ColBERT delivers most of the cascade’s accuracy in a single pass.

The descendants

ColBERT spawned a family. ColPali extends late interaction to image patches for visual document retrieval (treat each patch as a token). PLAID, EMVB, and Vespa’s native ColBERT support are the production engines. JaColBERT and Italian variants are language-specific finetunes. Late interaction has survived seven years of being declared “interesting but impractical” and still shows up on the leaderboards.

Paper

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Omar Khattab, Matei Zaharia (Stanford)

Recent progress in NLP is owed to large language models like BERT, but their fine-grained relevance modeling is computationally expensive. ColBERT introduces a late-interaction architecture that independently encodes the query and the document using BERT and then employs a cheap yet powerful interaction step that models their fine-grained similarity. By delaying and yet retaining this fine-granular interaction, ColBERT can leverage the expressiveness of deep LMs while simultaneously gaining the ability to pre-compute document representations offline.

Go further

Why is late interaction more accurate than a bi-encoder?

A bi-encoder pools every token into one vector and discards positional and lexical detail. ColBERT keeps each token's vector and lets every query token attend to its best-matching document token. Phrase-level grounding, exact-token cues, and rare-word matches survive — exactly what a single pooled vector loses.

How does ColBERT scale to millions of documents?

Each document of N tokens needs N stored vectors, which is tens of times more storage than a bi-encoder (about 25x for a 200-token document against a single 1024-dim vector, and more for longer documents). ColBERTv2 introduced residual compression to shrink vectors aggressively (a centroid ID plus 1 or 2 bits per dimension, roughly 20 to 36 bytes per token-vector). The PLAID engine then prunes candidate documents using centroid lookups before MaxSim. Storage is the dominant cost; latency is competitive.

Should I deploy ColBERT or a bi-encoder + cross-encoder pipeline?

For most production retrieval the bi-encoder + cross-encoder cascade wins on operational simplicity and storage cost. ColBERT shines when single-stage retrieval needs higher recall than a bi-encoder gives, especially on out-of-domain corpora where bi-encoders generalize poorly but token-level matching survives.
