InfoNCE Loss

Also known as: NCE loss, noise-contrastive estimation, contrastive cross-entropy

TL;DR

InfoNCE is the contrastive objective behind almost every modern embedder. For each positive pair, softmax-normalize the similarities of the positive and its N negatives, then treat the result as (N+1)-way classification with the positive as the correct class.

InfoNCE (Information Noise-Contrastive Estimation) is the loss function powering almost every modern embedding model. For a query q, its positive document d+, and a set of N negatives, the loss is the cross-entropy of a softmax over the similarity scores, where the positive is the correct class:

L = -log( exp(sim(q, d+) / τ) / Σ_i exp(sim(q, d_i) / τ) )

The denominator sums over all N+1 candidates d_i: the positive and every negative. sim is typically cosine similarity on L2-normalized vectors; τ is the temperature.
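
As a minimal sketch of the formula above for a single query, the following uses only the Python standard library. The function names `cosine` and `info_nce`, the toy vectors, and the default temperature of 0.05 are illustrative assumptions, not part of any particular library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE for one query: cross-entropy where the positive is the correct class."""
    # Logits: the positive first, then every negative, all scaled by 1/tau.
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, d) / tau for d in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]

# Toy 2-D embeddings (made up for illustration).
q = [1.0, 0.0]
d_pos = [0.9, 0.1]
d_negs = [[0.0, 1.0], [-1.0, 0.2]]
loss = info_nce(q, d_pos, d_negs)
print(f"loss = {loss:.6f}")
```

When every candidate is equally similar to the query, the softmax is uniform and the loss equals log(N+1), the value an untrained model would be expected to start near.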

That’s it. One line of math. Every modern open-weights embedder — E5, BGE, GTE, Snowflake Arctic, zembed-1’s bootstrap stage — minimizes some variant of it.

The three knobs that actually matter

Three things determine whether InfoNCE produces a good embedding model:

Number of negatives N. More negatives = tighter mutual-information bound = better embeddings. The free way to scale this is in-batch negatives: every other example in the batch serves as a negative at no extra encoding cost. That’s why state-of-the-art embedders use batch sizes in the thousands.
Quality of negatives. Random negatives saturate fast; the model gets them right and learns nothing. Hard-negative mining is what unlocks the second half of training.
Temperature τ. Lower τ makes the softmax sharper, so the model focuses gradient on the hardest negative: usually good, sometimes destabilizing.
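
The in-batch negatives idea from the first knob can be sketched as follows. This is an illustration under stated assumptions: the vectors are taken to be already L2-normalized (so a dot product equals cosine similarity), and `in_batch_info_nce` is a hypothetical helper name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_info_nce(queries, docs, tau=0.05):
    """Mean InfoNCE over a batch where docs[i] is the positive for queries[i].

    Every docs[j] with j != i acts as a negative for queries[i], so a batch
    of size B gives each query B - 1 negatives without encoding anything extra.
    Assumes the vectors are already L2-normalized (dot product == cosine).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / tau for d in docs]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy batch of three unit vectors (made up for illustration).
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = in_batch_info_nce(queries, docs)
# Rotating the documents breaks the query/positive pairing, so the loss rises.
shuffled = in_batch_info_nce(queries, docs[1:] + docs[:1])
print(aligned, shuffled)
```

Doubling the batch size doubles the number of negatives per query, which is why this approach scales so cheaply.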

The mutual-information story

Oord, Li & Vinyals (2018) showed that the InfoNCE loss gives a lower bound on the mutual information between query and document representations:

I(q; d) ≥ log(N) - L_InfoNCE

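So minimizing the loss maximizes a lower bound on I(q; d), pushing the encoder toward representations that preserve as much of the dependence between matched pairs as possible. Because the loss is never negative, the bound can certify at most log(N) nats, so it can only become tight when N is large. That is the formal reason batch size matters.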
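
A small worked calculation may make the role of N concrete. The sketch below evaluates the bound from the formula above; `mi_lower_bound` is a hypothetical helper, and the values of n are arbitrary examples.

```python
import math

def mi_lower_bound(loss, n):
    """Lower bound on I(q; d) in nats: log(n) - loss."""
    return math.log(n) - loss

# The loss is non-negative, so even a perfect model (loss = 0) can
# certify at most log(n) nats of mutual information.
for n in (32, 1024, 32768):
    ceiling = mi_lower_bound(0.0, n)
    print(f"n={n:>6}: bound is at most {ceiling:.2f} nats")
```

The ceiling grows only logarithmically, so each additional nat the bound can certify requires multiplying N by roughly e.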

Bigger batches yield better embeddings because they make the InfoNCE bound tighter. The math demands it.

Where InfoNCE breaks down

One failure mode is the false-negative problem, and it’s brutal at scale. With in-batch negatives, every other document in the batch is treated as a negative, but if two queries in the batch share a relevant document, the loss actively pushes each query away from a document that is actually correct for it. The fixes: (a) deduplicate batches by topic during construction, (b) use a hard mask that excludes known positives from the denominator, (c) accept that in very large batches the few false negatives are statistically diluted. Production embedder training pipelines do all three.
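
Fix (b), the hard mask, can be sketched as below. This assumes relevance labels are available as a per-query set of document indices; the `positives` argument and the function name are hypothetical, and real pipelines may source such labels differently.

```python
import math

def masked_in_batch_info_nce(sim, positives, tau=0.05):
    """In-batch InfoNCE that skips known false negatives.

    sim[i][j]    : similarity between query i and document j.
    positives[i] : set of document indices known to be relevant to query i;
                   document i itself is the training positive.
    """
    total = 0.0
    for i, row in enumerate(sim):
        # Keep the training positive plus every document not known to be relevant.
        kept = [j for j in range(len(row)) if j == i or j not in positives[i]]
        logits = [row[j] / tau for j in kept]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - row[i] / tau
    return total / len(sim)

# Toy similarities: queries 0 and 1 share relevant documents, so doc 1 is a
# false negative for query 0 (and doc 0 for query 1).
sim = [[0.9, 0.8, 0.1],
       [0.8, 0.9, 0.1],
       [0.1, 0.2, 0.9]]
unmasked = masked_in_batch_info_nce(sim, [{0}, {1}, {2}])
masked = masked_in_batch_info_nce(sim, [{0, 1}, {0, 1}, {2}])
print(unmasked, masked)
```

Masking removes the shared documents from the denominator, so the model is no longer penalized for scoring them highly.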

Another limitation: the target is one-hot. The positive class gets probability 1, every negative gets 0. The model learns to push positives to high similarity and all negatives to similarly low similarity; there’s no signal that one negative is “more wrong” than another. To get graded relevance (which NDCG-style benchmarks reward), you have to either (a) distill from a teacher that produces graded scores, or (b) augment InfoNCE with a regression term against a continuous label. Pure InfoNCE caps out at binary discrimination.
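
One common way to implement option (a) is to replace the one-hot target with a teacher's softmax distribution. The sketch below is an assumption about how such a distillation term could look, not a description of any specific model's training recipe; the function names, the temperatures, and the teacher scores are all made up for illustration.

```python
import math

def log_softmax(scores, tau):
    """Log of the softmax over scores / tau, computed stably."""
    logits = [s / tau for s in scores]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def soft_label_loss(student_sims, teacher_scores, tau_student=0.05, tau_teacher=1.0):
    """Cross-entropy between a teacher's graded distribution and the student's softmax."""
    target = [math.exp(lp) for lp in log_softmax(teacher_scores, tau_teacher)]
    student_log_probs = log_softmax(student_sims, tau_student)
    return -sum(p * lp for p, lp in zip(target, student_log_probs))

# The teacher rates document 1 as nearly as relevant as document 0,
# and document 2 as clearly irrelevant.
student_sims = [0.9, 0.6, 0.1]
graded = soft_label_loss(student_sims, teacher_scores=[3.0, 2.0, -4.0])
print(f"graded loss = {graded:.4f}")
```

If the teacher puts all of its probability on one document, this expression reduces to ordinary InfoNCE, so the soft target is a strict generalization of the one-hot case.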

Variants you’ll see in papers

Symmetric InfoNCE (CLIP-style): average the loss over both directions, q → d and d → q (image-to-text and text-to-image in CLIP). Standard for multimodal embeddings.
Multiple-negatives ranking loss (MNRL): functionally identical to InfoNCE with in-batch negatives; the name predates InfoNCE’s adoption.
Decoupled contrastive loss: removes the positive from the denominator, claimed to be more stable. Marginal in practice.
SupCon (supervised contrastive): generalizes to multiple positives per anchor, useful when labels are class-level.
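
As an illustration of the first variant in the list, symmetric InfoNCE can be sketched by applying the same row-wise loss to a similarity matrix and to its transpose. The helper names and the temperature of 0.07 are assumptions chosen for the example.

```python
import math

def row_loss(sim, tau):
    """Mean InfoNCE over rows, with the diagonal entry as each row's positive."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)
        total += m + math.log(sum(math.exp(z - m) for z in logits)) - logits[i]
    return total / len(sim)

def symmetric_info_nce(sim, tau=0.07):
    """Average of the row-wise loss and the column-wise loss (rows of the transpose)."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (row_loss(sim, tau) + row_loss(transposed, tau))

# Toy similarity matrix: sim[i][j] compares item i of one side with item j of the other.
sim = [[0.9, 0.2, 0.1],
       [0.7, 0.8, 0.0],
       [0.1, 0.3, 0.9]]
loss = symmetric_info_nce(sim)
print(f"symmetric loss = {loss:.4f}")
```

Transposing the matrix swaps the roles of the two sides, which is all the second direction amounts to.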

The literature is full of “novel” contrastive losses that turn out to be InfoNCE in a wig. When evaluating a paper’s loss claim, check whether the math reduces to weighted log-sum-exp over similarities. Usually it does.

Go further

What's the relationship between InfoNCE and mutual information?

Oord et al. (2018) showed that the InfoNCE loss yields a tractable lower bound on the mutual information between the two views. Minimizing the loss therefore maximizes a lower bound on I(query; document). The bound tightens as the negative count grows, which is why batch size matters so much for embedding quality.

Why does temperature matter so much?

Temperature controls how sharply the softmax weighs the hardest negative. Low temperature focuses gradient on the single most-confusable negative; high temperature averages across all of them. The standard range is 0.05-0.1 for retrieval embedders — too low and training collapses, too high and the model never separates near-duplicates.
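
The sharpening effect can be made concrete with a short sketch. The gradient of InfoNCE with respect to a negative's similarity is proportional to that negative's softmax probability, so normalizing across the negatives shows how the gradient is split among them. The function name and the similarity values are made up for illustration.

```python
import math

def negative_gradient_shares(neg_sims, tau):
    """Fraction of the negatives' total gradient that lands on each negative.

    Each negative's gradient is proportional to its softmax probability, so
    the shares are a softmax over the negatives' similarities divided by tau.
    """
    logits = [s / tau for s in neg_sims]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

neg_sims = [0.6, 0.3, 0.1, -0.2]  # the first entry is the hardest negative
for tau in (1.0, 0.1, 0.05):
    share = negative_gradient_shares(neg_sims, tau)[0]
    print(f"tau={tau}: hardest negative receives {share:.1%} of the negative gradient")
```

At a temperature of 1.0 the gradient is spread fairly evenly, while at 0.05 nearly all of it lands on the single hardest negative.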

Is InfoNCE the only contrastive loss that works?

No. Triplet, margin-based, and multiple-negatives ranking losses all work. But InfoNCE has become the default because it scales naturally to many negatives, has a clean information-theoretic interpretation, and trains stably in mixed precision. Almost every recent embedder paper uses it.
