Logits

Also known as: logit, raw scores, pre-softmax

TL;DR

Logits are the raw, pre-softmax score vector a language model outputs at each position: one real-valued score per vocabulary token. They're the currency of decoding: every sampling strategy and calibration trick operates on them.

Logits are the raw output of a language model’s final linear projection — a vector of real numbers, one per vocabulary token, before any normalization. Apply a softmax and you get a probability distribution; before that, you have logits. Every decoding strategy, sampling trick, and calibration adjustment operates on logits.

[Figure: Logits. Unbounded scores, one per vocabulary token, before softmax. A hidden state $h_t \in \mathbb{R}^d$ ($d = 4096$) passes through the linear projection $W_{out}$ ($|V| \times d$, $|V| \approx 50\text{K}$) to give logits $z_t \in \mathbb{R}^{|V|}$ (mat +4.2, sofa +2.7, floor +1.6, and so on down to roof -2.6). The softmax $p_i = e^{z_i} / \sum_j e^{z_j}$ turns them into probabilities $p \in [0, 1]^{|V|}$ with $\sum_i p_i = 1$ (mat .729, sofa .163, floor .054, and so on). Logits are unbounded, signed real numbers; temperature scaling $z / T$ lives in logit space; probabilities sum to 1 and can be sampled. Every decoding trick (temperature, bias, mask, top-k) happens before the softmax.]

Where they come from

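In a transformer language model, the final transformer block produces a hidden state $h_t \in \mathbb{R}^d$ at each position $t$. A learned linear layer $W_{out} \in \mathbb{R}^{|V| \times d}$ (often the transposed input embedding matrix, “tied weights”) projects this into vocabulary space:

$$z_t = W_{out} h_t$$

$z_t \in \mathbb{R}^{|V|}$ is the logit vector at position $t$. Each entry $z_{t,i}$ is the unnormalized score for vocabulary token $i$ being the next token after position $t$. To turn this into a probability distribution:

$$p_{t,i} = \frac{e^{z_{t,i}}}{\sum_j e^{z_{t,j}}}$$

Sample from this distribution and you have the next generated token.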
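The projection, softmax, and sampling steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the vocabulary, the weight matrix `w_out`, and the hidden state `h_t` are toy values made up for the example, and the helper names `project` and `softmax` are hypothetical.

```python
import math
import random

def project(hidden, w_out):
    """Logits z = W_out h: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def softmax(logits):
    """Convert logits to probabilities (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes assumed for illustration: d = 3, |V| = 4.
vocab = ["mat", "sofa", "floor", "rug"]
w_out = [
    [1.0, 0.5, -0.2],
    [0.3, 0.8, 0.1],
    [-0.5, 0.2, 0.9],
    [0.0, -0.4, 0.6],
]
h_t = [2.0, 1.0, 0.5]

z_t = project(h_t, w_out)  # logits: unbounded, signed real scores
p_t = softmax(z_t)         # probabilities: non-negative, summing to 1

# Draw the next token from the distribution (seeded for repeatability).
rng = random.Random(0)
next_token = rng.choices(vocab, weights=p_t, k=1)[0]
```

In a real model the same three steps run over a vocabulary of tens of thousands of tokens, usually as a single matrix multiply followed by a fused softmax.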

[Figure: Softmax transform. Unbounded scores become a normalized distribution. Raw logits $x \in \mathbb{R}$ (mat +5.1, rug +3.8, sofa +2.4, floor +1.6, chair +0.2, table -0.9, bed -2.0, desk -3.2) pass through $\mathrm{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$ to give probabilities $p \in [0, 1]$ with $\sum p = 1$ (mat 0.72, rug 0.20, sofa 0.05, floor 0.02, chair 0.01, table 2e-3, bed 6e-4, desk 2e-4). A non-linear amplifier: small gaps in logit space become large gaps in probability. The token with the highest logit always has the highest probability.]

Why logits matter beyond just sampling

Several useful operations live in logit space, before softmax:

  • Temperature scaling. Divide logits by a temperature $T$. Lower $T$ sharpens the distribution; higher $T$ flattens it. It is trivially cheap (one elementwise division), and it works because softmax is invariant to additive constants but not to scale.
  • Logit bias. Add a constant to the logit of specific tokens to bias generation toward or away from them. Common in API parameters like OpenAI’s logit_bias.
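  • Token masking. Set logits of forbidden tokens to $-\infty$. After softmax, those tokens have zero probability. Used in constrained decoding (JSON-mode, regex-constrained outputs, grammar enforcement).
  • Top-k / top-p sampling. Both work by excluding all but a subset of high-logit tokens before normalizing: top-k keeps the $k$ highest-scoring tokens, top-p keeps the smallest set whose cumulative probability reaches $p$. In practice the excluded tokens have their logits set to $-\infty$.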
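The four operations in the list above can each be written as a small function from logits to logits, with a single softmax at the end. The sketch below uses only the standard library; the function names and the example logits are illustrative assumptions, not any particular library's API.

```python
import math

NEG_INF = float("-inf")

def softmax(logits):
    """Max-shifted softmax; exp(-inf) is 0.0, so masked tokens get zero mass.
    Assumes at least one logit is finite."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide every logit by T: T < 1 sharpens, T > 1 flattens."""
    return [z / temperature for z in logits]

def apply_bias(logits, bias):
    """Add a per-token constant; bias maps token index -> offset."""
    return [z + bias.get(i, 0.0) for i, z in enumerate(logits)]

def mask_tokens(logits, forbidden):
    """Set forbidden token indices to -inf."""
    return [NEG_INF if i in forbidden else z for i, z in enumerate(logits)]

def top_k(logits, k):
    """Keep the k highest logits, mask the rest."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [z if i in keep else NEG_INF for i, z in enumerate(logits)]

def top_p(logits, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [z if i in keep else NEG_INF for i, z in enumerate(logits)]

# Example logits (illustrative values only).
logits = [4.2, 2.7, 1.6, 0.9, 0.2]
base = softmax(logits)
sharp = softmax(apply_temperature(logits, 0.5))
flat = softmax(apply_temperature(logits, 2.0))
masked = softmax(mask_tokens(logits, {0}))
biased = softmax(apply_bias(logits, {1: 5.0}))
```

Because every function returns logits, they compose in any order (for example, bias, then mask, then temperature, then top-k) before one final softmax.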

If you only had probabilities, all of these would be more awkward. Logit space is cleaner because softmax is invariant to additive constants, and because summing log-probabilities is more numerically stable than multiplying probabilities.

Operations that live in logit space
  • Temperature scaling: divide by $T$ to sharpen or flatten the distribution
  • Logit bias: add per-token constants to nudge generation
  • Token masking: set forbidden tokens to $-\infty$ for hard constraints
  • Top-k / top-p truncation: keep only high-logit tokens before normalizing
  • Cross-entropy loss: log-sum-exp on logits is the numerically stable form
  • Speculative decoding: compare draft vs target logits to accept/reject

Naively computing log(sum(exp(z))) overflows for any logit above ~88 in fp32. The standard trick is to subtract the max logit $m = \max_i z_i$ before exponentiating: $\log \sum_i e^{z_i} = m + \log \sum_i e^{z_i - m}$. The shifted exponentials are all in $(0, 1]$, the sum is well-behaved, and softmax invariance under additive constants means the answer is identical. Standard training loops and inference kernels do this. If you ever see NaN losses early in training, suspect a missing log-sum-exp shift in a custom loss or sampling head.
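The shift can be demonstrated directly. The sketch below compares a naive and a max-shifted log-sum-exp; the function names are hypothetical. Note one assumption about the setting: Python floats are 64-bit, so the naive version overflows near 709 rather than near 88 as in fp32, and `math.exp` raises `OverflowError` instead of returning infinity.

```python
import math

def logsumexp_naive(logits):
    """Direct formula: fails once any exp(z) exceeds the float range."""
    return math.log(sum(math.exp(z) for z in logits))

def logsumexp_stable(logits):
    """Subtract the max first, so every exponent is <= 0 and exp() is in (0, 1]."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

small = [1.0, 2.0, 3.0]
large = [1000.0, 999.0]

agree = abs(logsumexp_naive(small) - logsumexp_stable(small))  # tiny rounding gap
stable_large = logsumexp_stable(large)  # about 1000.313

try:
    logsumexp_naive(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

On moderate logits the two versions agree to rounding error; on large logits only the shifted version returns a finite answer.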

Logits and loss

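Cross-entropy loss is computed directly from logits and the target token id, not from probabilities:

$$\mathcal{L}_t = -\log p_{t,y_t} = -z_{t,y_t} + \log \sum_j e^{z_{t,j}}$$

Here $z_t$ is the logit vector at position $t$ and $y_t$ is the target token id. This formulation is numerically stable (the log-sum-exp trick) and is what standard transformer training loops use. It’s why training computations operate directly in logit space.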
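As a minimal sketch of that formulation, the function below computes the per-token loss as log-sum-exp of the logits minus the target logit, never materializing probabilities. The name `cross_entropy_from_logits` and the example values are assumptions for illustration.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Loss = logsumexp(logits) - logits[target], with the max-shift for stability."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

# Uniform logits over 4 tokens: the loss is log(4) whatever the target.
uniform_loss = cross_entropy_from_logits([0.0, 0.0, 0.0, 0.0], target=2)

# Extreme logits that would overflow a naive softmax still give finite losses.
confident_right = cross_entropy_from_logits([1000.0, 0.0], target=0)
confident_wrong = cross_entropy_from_logits([1000.0, 0.0], target=1)
```

Taking the log of a separately computed softmax probability would hit log(0) in the last case; working from logits returns the finite value 1000.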

Calibration

A model’s softmax distribution is often described as the model’s “confidence” — but raw LLM probabilities are usually badly calibrated. A token with 80% softmax mass isn’t actually right 80% of the time. RLHF tends to make this worse, sharpening probability mass on preferred tokens beyond what their actual reliability warrants.

When the downstream system actually needs the numbers to mean something (thresholding, ranking, comparing across queries), raw LLM logits aren’t the right interface. The standard responses are to fine-tune a model that produces meaningful scores by construction or to apply a post-hoc calibration step (Platt scaling, isotonic regression) on top of the raw logits.
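To make the post-hoc option concrete, here is a rough sketch of Platt scaling: fit two scalars `a` and `b` so that `sigmoid(a * s + b)` matches observed correctness for raw scores `s`. Everything here is an assumption for illustration: the held-out scores and labels are invented, the fitting uses plain gradient descent rather than whatever optimizer a production library would choose, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def log_loss(scores, labels, a, b):
    """Mean binary cross-entropy of sigmoid(a * s + b) against 0/1 labels."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(a * s + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(scores)

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit a and b by gradient descent on the log loss."""
    a, b = 1.0, 0.0  # start from the uncalibrated mapping
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical held-out set: raw scores, and whether each prediction was correct.
scores = [4.0, 3.5, 3.0, 2.5, -3.0, -3.5, 2.0, -2.0, 3.8, -4.0]
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

a, b = fit_platt(scores, labels)
before = log_loss(scores, labels, 1.0, 0.0)
after = log_loss(scores, labels, a, b)
calibrated = [sigmoid(a * s + b) for s in scores]
```

On this toy data the raw scores are overconfident, so the fitted slope `a` comes out below 1 and the calibrated probabilities are pulled toward 0.5. In practice the fit should use data held out from training.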

Logits are the model’s raw belief; the softmax and any tricks applied on top shape what comes out. Almost every interesting decoding behavior — constraint, structure, calibration — happens at the logit layer.

Go further

Why work with logits instead of probabilities?

Two reasons. Numerical stability — log-space avoids underflow when probabilities are tiny. Flexibility — biasing logits is a clean way to enforce constraints (forbid certain tokens, prefer others, mask invalid choices). After you've shaped the logits, one softmax converts them to probabilities.

What's a 'logit bias' and when do you use it?

A constant added to the logit of specific tokens before sampling: positive values increase the chance of that token, negative values suppress it. Common uses: forbid known-bad tokens, force JSON-like output structure, gently steer style. Several LLM APIs (OpenAI's, for example) expose a logit_bias parameter for exactly this.

Are an LLM's softmax probabilities calibrated?

Usually not. A model that outputs 90% probability on a token is not actually right 90% of the time on that kind of generation. Pretraining and RLHF both distort calibration. For applications that need calibrated probabilities (entailment scoring, retrieval reranking), you typically [fine-tune](/concepts/fine-tuning/) a head specifically for it or apply post-hoc calibration.
