Knowledge Distillation

Also known as: model distillation, teacher-student training

TL;DR

Training a small (student) model to mimic the outputs of a larger (teacher) model, recovering most of the teacher's quality at a fraction of the cost. Distillation underpins most production deployments of small specialized models.

Knowledge distillation, introduced by Hinton et al. in 2015, is the technique of training a small “student” model to reproduce the outputs of a larger “teacher” model, instead of (or in addition to) training it on the original ground-truth labels. The student learns from the teacher’s soft probability distributions, which carry far more information than hard labels.

The classic example: train a tiny image classifier to mimic an ensemble of giant ones. The student often comes within a fraction of a percentage point of the teacher’s accuracy at 10% of the size and 5% of the inference cost.

The two information channels

When a teacher model says “this image is 87% cat, 10% dog, 3% wolf”, the student receives:

The hard label (cat) — the same signal it would get from human-annotated training data.
The relative magnitudes of the other classes — “this image looks somewhat like a dog and barely like a wolf”. This relational information is what makes distillation more powerful than vanilla supervised training.

For ranking models, the analogous insight: the teacher’s continuous score over (query, doc) pairs carries graded information that a binary “relevant / not” label loses entirely.

Hinton’s original recipe divides the teacher’s logits by a temperature T before softmax: $p_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$, where $z_i$ is the logit for class i. Higher T flattens the distribution, exposing the relative magnitudes between non-top classes. With T = 1 (no scaling), the teacher’s softmax is dominated by the top class and the student barely sees the secondary structure that makes distillation worthwhile. Typical values are T = 4 to 10 during distillation, T = 1 at student inference time. For ranking distillation the analogous knob is whether you regress on raw logits, sigmoid scores, or pairwise BCE; the “temperature” is implicit in the choice of target.
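
To make the effect of temperature concrete, here is a minimal pure-Python sketch of temperature-scaled softmax and the soft-target loss. The logit values and the function names (softmax_with_temperature, kl_divergence, distillation_loss) are illustrative assumptions, not taken from any particular library. The T² factor follows the original paper's suggestion for keeping gradient magnitudes comparable across temperatures.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """Soft-target loss: T^2 * KL(teacher at T || student at T)."""
    p_teacher = softmax_with_temperature(teacher_logits, T)
    p_student = softmax_with_temperature(student_logits, T)
    return (T ** 2) * kl_divergence(p_teacher, p_student)

# Hypothetical teacher logits for the classes (cat, dog, wolf).
teacher_logits = [6.0, 2.0, 1.0]

sharp = softmax_with_temperature(teacher_logits, T=1.0)  # top class dominates
soft = softmax_with_temperature(teacher_logits, T=4.0)   # secondary classes visible

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
print("loss vs. uniform student:", distillation_loss(teacher_logits, [0.0, 0.0, 0.0]))
```

At T = 4 the non-top classes receive visibly more probability mass than at T = 1, which is the secondary structure the student is meant to learn from.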

Distillation in retrieval

Two important distillations in modern retrieval:

LLM ensemble → pairwise SLM. In zELO, three frontier LLMs (Claude, GPT, Gemini) vote on pairwise document preferences. Their averaged probability is the teacher signal. A 4B-parameter pairwise model is trained to mimic that probability via BCE loss. The result is a small model that runs ~1000× faster than the ensemble at near-ensemble quality.
Cross-encoder → bi-encoder. zembed-1 is a bi-encoder distilled from zerank-2 (a cross-encoder). The student learns to produce embedding vectors whose dot product approximates the teacher’s relevance score. The student loses some accuracy (bi-encoders fundamentally can’t do everything cross-encoders can) but gains massive throughput: embeddings are cacheable and ANN-indexable, scores are not.
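
The two training objectives above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the actual zELO or zembed-1 training code: the vote values, the vectors, the function names, and the choice of squared error for the bi-encoder objective are all hypothetical.

```python
import math

def ensemble_probability(votes):
    """Average the teachers' probabilities that doc A beats doc B."""
    return sum(votes) / len(votes)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_bce(student_logit, teacher_prob, eps=1e-12):
    """Binary cross-entropy against a soft (non 0/1) teacher target."""
    q = sigmoid(student_logit)
    return -(teacher_prob * math.log(q + eps)
             + (1.0 - teacher_prob) * math.log(1.0 - q + eps))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embedding_distill_loss(query_vec, doc_vec, teacher_score):
    """Squared error between the bi-encoder dot product and the teacher score.

    Squared error is an assumption made for this sketch; other regression
    or ranking losses would fit the same description.
    """
    return (dot(query_vec, doc_vec) - teacher_score) ** 2

# Hypothetical votes from three teacher LLMs for "A is better than B".
votes = [0.9, 0.7, 0.8]
target = ensemble_probability(votes)

# The soft BCE is minimized when sigmoid(student_logit) equals the target.
best_logit = math.log(target / (1.0 - target))
print("target:", round(target, 3))
print("loss at matching logit:", round(soft_bce(best_logit, target), 4))
print("loss at logit 0:", round(soft_bce(0.0, target), 4))

# Hypothetical 2-d embeddings and teacher relevance score.
print("embedding loss:", embedding_distill_loss([1.0, 0.5], [0.6, 0.4], 0.9))
```

Because the target is a probability rather than a hard 0/1 label, the loss bottoms out at the teacher's confidence instead of pushing the student toward certainty.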

Famous distillations across the field

DistilBERT — 40 percent smaller, 60 percent faster than BERT-base at 97 percent of GLUE quality; the proof point that distillation generalized to language models.
TinyBERT and MobileBERT — pushed the size-quality frontier further; layer-wise distillation including hidden states and attention maps.
Alpaca and Vicuna — instruction-tuned LLaMA students distilled from frontier-LLM-generated supervision; not classical logit distillation but the same shape.
zerank-2 — pairwise cross-encoder distilled from a Claude plus GPT plus Gemini ensemble through a Thurstone fit.
zembed-1 — bi-encoder distilled from zerank-2; trades cross-encoder quality for ANN-indexable throughput.

Why distillation matters now

A frontier LLM is the right teacher — generalist, smart, expensive. A 0.5B-4B specialized student is the right runtime — narrow, fast, cheap. Distillation is the bridge, and it’s the reason most production deployments of small specialized models exist at all.

For custom-trained models (query rewriting, context compression, classification), the same shape repeats: gather supervision from frontier LLMs (often pairwise), fit continuous targets, distill into a small student tuned for the exact task.

Go further

Does the student need to share architecture with the teacher?

No — the student just needs a head that produces outputs of the same shape as the teacher's. zerank-2 (a 4B cross-encoder) distills from a frontier-LLM ensemble; zembed-1 (a bi-encoder) distills from zerank-2. Different architectures, same scalar relevance target.

Why are pairwise preferences a better distillation signal than absolute scores?

Pairwise judgments from frontier LLMs are far less noisy than absolute scores: judges tend to agree that 'A is better than B' even when they disagree on 'how good is A'. A Thurstone fit then converts that low-noise signal into a calibrated continuous target that the student regresses against.
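
As a rough illustration of how pairwise preferences can become per-document scores, the sketch below fits a Thurstone-style model in which P(i beats j) = Φ(s_i − s_j), with Φ the standard normal CDF. It uses plain gradient descent on cross-entropy. The preference data, function names, learning rate, and step count are assumptions chosen for the example, and the actual fitting procedure used in zELO may differ.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def fit_thurstone(num_items, prefs, lr=0.1, steps=2000):
    """Fit one latent score per item from soft pairwise preferences.

    prefs is a list of (i, j, p) tuples, where p is the teacher's
    probability that item i is preferred over item j.
    """
    scores = [0.0] * num_items
    for _ in range(steps):
        grads = [0.0] * num_items
        for i, j, p in prefs:
            d = scores[i] - scores[j]
            # Clamp the predicted probability to avoid division by zero.
            q = min(max(normal_cdf(d), 1e-9), 1.0 - 1e-9)
            # Gradient of the cross-entropy with respect to d.
            g = (q - p) / (q * (1.0 - q)) * normal_pdf(d)
            grads[i] += g
            grads[j] -= g
        scores = [s - lr * g for s, g in zip(scores, grads)]
        # Scores are only identified up to a shift, so center them at zero.
        mean = sum(scores) / num_items
        scores = [s - mean for s in scores]
    return scores

# Hypothetical averaged teacher preferences over three documents.
prefs = [(0, 1, 0.8), (1, 2, 0.7), (0, 2, 0.9)]
scores = fit_thurstone(3, prefs)
print([round(s, 3) for s in scores])
```

The fitted scores are continuous and comparable across documents, which is what lets a student regress on them directly rather than on individual pairwise votes.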

What's the practical size/quality tradeoff in retrieval?

A 4B distilled reranker matches or beats much larger proprietary rerankers on NDCG@10 while running 10-100x cheaper. The student inherits most of the teacher's domain quality; the loss is mostly in long-tail edge cases that don't move aggregate metrics.
