Contrastive Learning

Also known as: contrastive training, metric learning, representation learning

TL;DR

The training paradigm behind almost every modern embedding model. Pull positive pairs (query, relevant document) close in vector space; push negatives far apart.

Contrastive learning trains an encoder so that semantically related inputs land near each other in vector space and unrelated inputs land far apart. Concretely: take a positive pair $(q, d^+)$, a set of negatives $\{d_i^-\}$, and minimize a loss that increases when $\mathrm{sim}(q, d^+)$ is below $\mathrm{sim}(q, d_i^-)$. Run that across millions of pairs and the encoder converges on a metric space where distance encodes meaning.

This is the dominant recipe for training embedding models — almost every text embedder you can name (E5, BGE, GTE, Voyage, OpenAI’s, Cohere’s, zembed-1) is contrastively trained, often with downstream distillation refinement.

The objective

For each query $q$ with positive document $d^+$ and negatives $d_1^-, \dots, d_N^-$, the loss pushes $\mathrm{sim}(q, d^+)$ higher than $\mathrm{sim}(q, d_i^-)$ for every $i$. The most common form is InfoNCE, which softmax-normalizes those similarities and treats the result as an N+1-way classification problem where the positive is the right answer. Other variants (triplet loss, margin-based, multiple-negatives ranking) exist, but most large-scale training has converged on InfoNCE-style objectives.
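
The N+1-way classification view can be sketched in a few lines of plain Python. This is a minimal illustration for a single query, not any particular library's implementation; the function name `info_nce` and the default temperature of 0.05 are assumptions made for the example.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE for one query: cross-entropy of an (N+1)-way softmax
    in which the positive is the correct class.
    The default temperature is an illustrative assumption."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Subtract the max logit before exponentiating (log-sum-exp trick).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive (index 0).
    return log_sum_exp - logits[0]

# Positive clearly ahead of every negative: small loss.
easy = info_nce(0.9, [0.1, 0.2, 0.0])
# One negative outscores the positive: large loss.
hard = info_nce(0.3, [0.8, 0.2, 0.0])
print(f"easy={easy:.4f} hard={hard:.4f}")
```

The loss only shrinks when the positive's similarity rises relative to the negatives, which is the ordering constraint described above.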

The metric is whichever you’ll use at retrieval time — typically cosine similarity on L2-normalized vectors, equivalently a dot product. Train and serve in the same metric or you’ll leave performance on the floor.
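
The equivalence between cosine similarity and a dot product on L2-normalized vectors is easy to check numerically. The helper names and the two toy vectors below are made up for the illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    # Scale the vector to unit length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = [3.0, 4.0, 0.0]   # toy query embedding (hypothetical)
d = [1.0, 2.0, 2.0]   # toy document embedding (hypothetical)

cos_raw = cosine(q, d)                               # cosine on raw vectors
dot_normed = dot(l2_normalize(q), l2_normalize(d))   # dot product after normalization
print(cos_raw, dot_normed)
```

If the vectors are normalized once at indexing time, the index can use the cheaper dot product and still rank by cosine.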

Why the negatives matter more than the positives

Most engineers tune positives carefully and pull negatives randomly. That’s backwards. The encoder learns the structure of the space from the gradient, and the gradient comes from negatives. Random negatives quickly become trivially separable — the model nails them in the first epoch and then learns nothing.

The fix is sourcing negatives that resemble positives but aren’t actually relevant: same domain, same vocabulary, near-duplicate semantics. See hard-negative mining and the cheap-but-decent baseline of in-batch negatives.
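
As a rough sketch of the in-batch baseline: for a batch of (query, document) pairs, every other document in the batch is reused as a negative for a given query. The function name `in_batch_loss`, the temperature default, and the toy vectors are assumptions for illustration; real training code would do this on tensors with a single matrix multiply.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_loss(queries, docs, temperature=0.05):
    """Mean InfoNCE over a batch where docs[i] is the positive for
    queries[i] and every other docs[j] acts as a negative.
    The default temperature is an illustrative assumption."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]   # the positive sits on the diagonal
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own document
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other document

loss_aligned = in_batch_loss(queries, aligned)
loss_swapped = in_batch_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

In-batch negatives cost nothing extra to compute, but they are effectively random, so they are usually combined with mined hard negatives rather than used alone.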

The InfoNCE softmax has a temperature $\tau$ controlling how sharply it focuses on the hardest negative. Low $\tau$: the loss is dominated by the single hardest negative, gradient signal is concentrated, training is unstable but discriminative. High $\tau$: the loss spreads attention across all negatives, training is smoother but the embedding space ends up less separated. Most production embedders sit around –. It’s one of the few hyperparameters that actually matters.
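
To make the temperature effect concrete, the sketch below computes how the gradient is shared among the negatives, which for InfoNCE is a softmax over their similarities divided by the temperature. The similarity values and the two temperatures (0.01 and 1.0) are chosen purely for illustration and are not taken from any specific model.

```python
import math

def negative_weights(sim_negs, temperature):
    """Relative share of the InfoNCE gradient each negative receives:
    a softmax over the negatives' similarities scaled by 1/temperature."""
    scaled = [s / temperature for s in sim_negs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

sim_negs = [0.8, 0.5, 0.2]   # hardest negative first (hypothetical values)

sharp = negative_weights(sim_negs, temperature=0.01)   # illustrative low value
smooth = negative_weights(sim_negs, temperature=1.0)   # illustrative high value

print("low temperature: ", [round(w, 3) for w in sharp])
print("high temperature:", [round(w, 3) for w in smooth])
```

At the low setting almost all of the weight lands on the hardest negative; at the high setting the three negatives share it far more evenly.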

What contrastive learning produces (geometrically)

After enough training, the encoder carves the unit sphere into clusters: each cluster is a region of input space the model decided is “one meaning”. Within a cluster, distances are small; across clusters, distances are large. This is exactly the structure an ANN index exploits — the graph walks toward the cluster the query lands in and stops.

Two emergent properties show up, plus an occasional side effect:

Empirical regularities

Alignment: positives end up close. Measured as the expected $\|f(x) - f(x^+)\|_2^2$ over positive pairs in the data.
Uniformity: random points spread across the sphere rather than collapsing to one corner. Measured by the Gaussian potential of the embedding distribution.
Linear analogies sometimes work: $f(\text{king}) - f(\text{man}) + f(\text{woman}) \approx f(\text{queen})$. Side-effect of the loss, not designed in.

Wang & Isola (2020) showed alignment + uniformity are basically what InfoNCE optimizes for. The loss itself decomposes into those two terms in the limit of infinite negatives.
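
Both metrics are cheap to compute on a sample of embeddings. The sketch below follows the definitions from Wang & Isola (mean squared distance between positives for alignment; log of the mean Gaussian potential over pairs, with $t = 2$, for uniformity). The function names and toy points are assumptions for the example. Lower is better for both.

```python
import math
from itertools import combinations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(positive_pairs):
    """Mean squared distance between the two embeddings of each positive pair."""
    return sum(sq_dist(a, b) for a, b in positive_pairs) / len(positive_pairs)

def uniformity(points, t=2.0):
    """Log of the mean Gaussian potential exp(-t * ||a - b||^2) over all pairs."""
    vals = [math.exp(-t * sq_dist(a, b)) for a, b in combinations(points, 2)]
    return math.log(sum(vals) / len(vals))

# Four unit vectors spread around the circle vs. four collapsed onto one point.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0]] * 4

u_spread = uniformity(spread)
u_collapsed = uniformity(collapsed)
a_perfect = alignment([([1.0, 0.0], [1.0, 0.0])])
print(u_spread, u_collapsed, a_perfect)
```

A collapsed encoder scores perfectly on alignment and worst on uniformity, which is why the two have to be read together.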

Where contrastive learning is heading

Vanilla contrastive supervision is binary, and binary signal leaves accuracy on the table. The frontier is two-stage: contrastive pretraining for the geometry, then distillation from a cross-encoder reranker for graded relevance. The student inherits the smooth, continuous relevance score the teacher learned from human or LLM judgments. zembed-1 follows this recipe — contrastive bootstrap, then distill from zerank-2.
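
One common way to set up the distillation stage is to match the student's similarity distribution over a candidate list to the teacher's score distribution with a KL divergence. The source does not say which loss zembed-1 uses, so treat the following as a generic sketch; `distill_kl`, the temperature, and the scores are all hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distill_kl(teacher_scores, student_sims, temperature=1.0):
    """KL(teacher || student) over one query's candidate documents."""
    p = softmax(teacher_scores, temperature)   # graded relevance from the reranker
    q = softmax(student_sims, temperature)     # student's embedding similarities
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 2.0, -1.0]        # hypothetical cross-encoder scores
student_good = [0.9, 0.6, 0.1]    # same ordering as the teacher
student_bad = [0.1, 0.6, 0.9]     # reversed ordering

kl_good = distill_kl(teacher, student_good)
kl_bad = distill_kl(teacher, student_bad)
print(kl_good, kl_bad)
```

Unlike a binary label, the teacher distribution tells the student how much more relevant the first document is than the second, not just that both beat the third.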

Go further

Why does contrastive learning produce useful retrieval embeddings?

The objective directly aligns with the inference task. At training, the loss pushes positives closer than negatives in the same metric (cosine or dot product) you'll use at search time. There is no train-test mismatch — the geometry the loss optimizes is the geometry the index queries against.

What kinds of pairs work as positives?

Anything where the model should treat the two as semantically equivalent or related: (query, relevant passage), (caption, image), (paraphrase A, paraphrase B), (title, abstract), (text, augmented text). The supervision can be human-labeled, click-derived, or LLM-synthesized. Quality of positives sets the ceiling on the model.

Where does contrastive training fall short?

Binary positive/negative supervision throws away graded relevance — the model learns 'related vs unrelated' but not 'how related'. Distilling from a cross-encoder reranker recovers the continuous signal. That's why distilled embedders dominate NDCG-style benchmarks against vanilla contrastive ones.
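
A small NDCG calculation shows why graded relevance matters on these benchmarks. The sketch uses the exponential-gain variant of DCG, which is one of two common formulations; the relevance grades are invented for the example.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain (2^rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the returned documents, in ranked order (hypothetical).
best_first = [3, 2, 1, 0]
best_last = [1, 2, 3, 0]

ndcg_best_first = ndcg_at_k(best_first, 4)
ndcg_best_last = ndcg_at_k(best_last, 4)
print(ndcg_best_first, ndcg_best_last)
```

Both rankings place the same three relevant documents in the top three, so a binary metric such as recall@3 cannot tell them apart, while NDCG rewards the one that puts the most relevant document first.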
