Contrastive Learning

Also known as: contrastive training, metric learning, representation learning

TL;DR

The training paradigm behind almost every modern embedding model. Pull positive pairs (query, relevant document) close in vector space; push negatives far apart.

Contrastive learning trains an encoder so that semantically related inputs land near each other in vector space and unrelated inputs land far apart. Concretely: take a positive pair $(q, d^+)$, a set of negatives $\{d^-_i\}$, and minimize a loss that increases when $\mathrm{sim}(q, d^+)$ is below $\mathrm{sim}(q, d^-_i)$. Run that across millions of pairs and the encoder converges on a metric space where distance encodes meaning.

[Figure: Contrastive learning. Pull positives together, push negatives apart. Queries and documents in embedding space across initialization, pull, and push stages, with the InfoNCE loss tracked by training step. At random initialization, queries and docs are scattered across the space.]

This is the dominant recipe for training embedding models: almost every text embedder you can name (E5, BGE, GTE, Voyage, OpenAI’s, Cohere’s, zembed-1) is contrastively trained, often with downstream distillation refinement.

The objective

For each query $q$ with positive document $d^+$ and negatives $\{d^-_i\}_{i=1}^{N}$, the loss pushes $\mathrm{sim}(q, d^+)$ higher than $\mathrm{sim}(q, d^-_i)$ for every $i$. The most common form is InfoNCE, which softmax-normalizes those similarities and treats the result as an $(N+1)$-way classification problem where the positive is the right answer. Other variants (triplet loss, margin-based, multiple-negatives ranking) exist, but most large-scale training has converged on InfoNCE-style losses.
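A minimal sketch of that objective for a single query, in plain Python. The function name `info_nce`, the toy vectors, and the default temperature of 0.05 are illustrative assumptions, not values from any particular model; real training would batch this on a GPU.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: cross-entropy over [positive] + negatives.

    Vectors are assumed to be L2-normalized already, so dot product == cosine.
    The temperature default is an illustrative assumption.
    """
    # Index 0 is the positive; the remaining entries are negatives.
    logits = [dot(query, d) / temperature for d in [positive] + negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability that the softmax assigns to the positive.
    return log_z - logits[0]

# Toy 2-D unit vectors, made up for illustration.
q = [1.0, 0.0]
d_pos = [0.8, 0.6]
d_negs = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce(q, d_pos, d_negs)
# Same candidates, but a dissimilar document is labeled as the positive.
loss_bad = info_nce(q, d_negs[0], [d_pos, d_negs[1]])
print(loss_good, loss_bad)
```

The loss is near zero when the positive already outscores every negative, and grows when a negative scores higher than the labeled positive. When all candidates score the same, it equals the log of the number of candidates.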

The similarity metric is whichever you’ll use at retrieval time, typically cosine similarity on L2-normalized vectors, equivalently a dot product. Train and serve in the same metric or you’ll leave performance on the floor.
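The equivalence between cosine similarity and a dot product on normalized vectors is easy to check numerically. A small sketch with made-up vectors and hypothetical helper names:

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity computed directly on unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Arbitrary example vectors.
a = [3.0, 4.0]
b = [1.0, 2.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)  # the two values agree up to floating-point error
```

Normalizing once at indexing time means the serving path only needs dot products, which is what most vector indexes compute fastest.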

Why the negatives matter more than the positives

Most engineers tune positives carefully and pull negatives randomly. That’s backwards. The encoder learns the structure of the space from the gradient, and the gradient comes from negatives. Random negatives quickly become trivially separable — the model nails them in the first epoch and then learns nothing.

The fix is sourcing negatives that resemble positives but aren’t actually relevant: same domain, same vocabulary, near-duplicate semantics. See hard negative mining and the cheap-but-decent baseline of in-batch negatives.
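One common way to source such negatives is to rank the corpus with an existing retriever and keep the top-scoring documents that are not labeled relevant. The sketch below assumes embeddings are already computed; `mine_hard_negatives`, the document ids, and the vectors are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=2):
    """Return ids of the k most similar documents that are not labeled positive."""
    scored = [
        (dot(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in positive_ids
    ]
    # Highest similarity first: these are the negatives the model finds hardest.
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy corpus with made-up 2-D embeddings.
docs = {
    "relevant": [0.9, 0.1],
    "near_miss": [0.8, 0.3],
    "same_topic": [0.6, 0.5],
    "unrelated": [-0.7, 0.2],
}
query = [1.0, 0.0]

hard = mine_hard_negatives(query, docs, positive_ids={"relevant"}, k=2)
print(hard)  # ['near_miss', 'same_topic']
```

A practical caveat: the highest-ranked unlabeled documents are sometimes relevant but unannotated, so mined negatives usually deserve a filtering step before training.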

The InfoNCE softmax has a temperature $\tau$ controlling how sharply it focuses on the hardest negative. Low $\tau$: the loss is dominated by the single hardest negative, gradient signal is concentrated, training is unstable but discriminative. High $\tau$: the loss spreads attention across all negatives, training is smoother but the embedding space ends up less separated. Most production embedders sit toward the low end. It’s one of the few hyperparameters that actually matters.
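To see the temperature effect concretely: in InfoNCE, each negative's share of the gradient is proportional to its softmax probability, so the share among negatives is a softmax over their scaled similarities. The sketch below uses made-up similarities and illustrative temperatures, not recommended settings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def negative_weights(neg_sims, temperature):
    """Relative share of gradient weight each negative receives at a given temperature."""
    return softmax([s / temperature for s in neg_sims])

# Made-up cosine similarities between one query and four negatives.
neg_sims = [0.7, 0.5, 0.1, -0.2]

for tau in (0.02, 0.1, 1.0):  # illustrative values only
    weights = negative_weights(neg_sims, tau)
    print(tau, [round(w, 3) for w in weights])
```

At the lowest temperature nearly all the weight lands on the 0.7 negative; at the highest, the weight is spread across all four.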

What contrastive learning produces (geometrically)

After enough training, the encoder carves the unit sphere into clusters: each cluster is a region of input space the model decided is “one meaning”. Within a cluster, distances are small; across clusters, distances are large. This is exactly the structure a graph-based nearest-neighbor index exploits: the graph walk heads toward the cluster the query lands in and stops.

A few emergent properties show up:

Empirical regularities
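  • Alignment: positives end up close. Measured as the expected squared distance $\mathbb{E}\left[\lVert f(q) - f(d^+) \rVert_2^2\right]$ over positive pairs $(q, d^+)$ in the data, where $f$ is the encoder.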
  • Uniformity: random points spread across the sphere rather than collapsing to one corner. Measured by the Gaussian potential of the embedding distribution.
  • Linear analogies sometimes work, e.g. $f(\text{king}) - f(\text{man}) + f(\text{woman}) \approx f(\text{queen})$. Side-effect of the loss, not designed in.

Wang & Isola (2020) showed alignment + uniformity are basically what InfoNCE optimizes for. The loss itself decomposes into those two terms in the limit of infinite negatives.
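Both quantities are cheap to estimate on a sample of embeddings. The sketch below follows the usual definitions (mean squared distance for alignment, log of the mean Gaussian potential for uniformity); the function names, the 2-D toy points, and the choice `t=2.0` are assumptions for illustration.

```python
import math
from itertools import combinations

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """Mean squared distance across positive pairs (lower means tighter positives)."""
    return sum(sq_dist(x, y) for x, y in pairs) / len(pairs)

def uniformity(points, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower means more spread)."""
    potentials = [math.exp(-t * sq_dist(x, y)) for x, y in combinations(points, 2)]
    return math.log(sum(potentials) / len(potentials))

def on_circle(angle):
    """Unit vector in 2-D at the given angle in radians."""
    return [math.cos(angle), math.sin(angle)]

# Four points spread evenly around the circle vs. four bunched in one corner.
spread = [on_circle(k * math.pi / 2) for k in range(4)]
collapsed = [on_circle(0.1 * k) for k in range(4)]

# Positive pairs that sit close together vs. far apart.
tight_pairs = [(on_circle(a), on_circle(a + 0.05)) for a in (0.0, 1.0, 2.0)]
loose_pairs = [(on_circle(a), on_circle(a + 1.5)) for a in (0.0, 1.0, 2.0)]

print(uniformity(spread), uniformity(collapsed))
print(alignment(tight_pairs), alignment(loose_pairs))
```

Tracking both numbers during training is a quick sanity check: alignment improving while uniformity degrades is a sign the space is collapsing.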

Where contrastive learning is heading

Vanilla contrastive supervision is binary, and binary signal leaves accuracy on the table. The frontier is two-stage: contrastive pretraining for the geometry, then distillation from a reranker for graded relevance. The student inherits the smooth, continuous relevance score the teacher learned from human or LLM judgments. zembed-1 follows this recipe: contrastive bootstrap, then distill from zerank-2.
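One common way to express the distillation stage is a KL divergence between the teacher's score distribution and the student's similarity distribution over each query's candidate list. This is a generic sketch under that assumption, not a description of how zembed-1 or zerank-2 are actually trained; the function names, scores, and temperatures are hypothetical.

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable softmax of xs / temperature."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_scores, student_sims,
                         teacher_temp=1.0, student_temp=0.05):
    """KL(teacher || student) over one query's candidate documents."""
    p = softmax(teacher_scores, teacher_temp)  # graded relevance from the reranker
    q = softmax(student_sims, student_temp)    # embedding similarities from the student
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up reranker scores for three candidates, most relevant first.
teacher = [2.0, 1.0, 0.0]

# A student whose scaled similarities reproduce the teacher's ranking and gaps.
matched_student = [0.10, 0.05, 0.0]
# A student that ranks the same candidates in reverse.
reversed_student = [0.0, 0.05, 0.10]

loss_matched = kl_distillation_loss(teacher, matched_student)
loss_reversed = kl_distillation_loss(teacher, reversed_student)
print(loss_matched, loss_reversed)
```

Unlike a binary label, the teacher distribution tells the student how much more relevant the first candidate is than the second, which is the graded signal contrastive training discards.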

Go further

Why does contrastive learning produce useful retrieval embeddings?

The objective directly aligns with the inference task. At training, the loss pushes positives closer than negatives in the same metric (cosine or dot product) you'll use at search time. There is no train-test mismatch — the geometry the loss optimizes is the geometry the index queries against.

What kinds of pairs work as positives?

Anything where the model should treat the two as semantically equivalent or related: (query, relevant passage), (caption, image), (paraphrase A, paraphrase B), (title, abstract), augmented (text, augmented-text). The supervision can be human-labeled, click-derived, or LLM-synthesized. Quality of positives sets the ceiling on the model.

Where does contrastive training fall short?

Binary positive/negative supervision throws away graded relevance — the model learns 'related vs unrelated' but not 'how related'. Distilling from a cross-encoder reranker recovers the continuous signal. That's why distilled embedders dominate NDCG-style benchmarks against vanilla contrastive ones.
