SFT is plain supervised learning applied to a pre-trained language model: given (input, target) pairs, train the model to produce the target. It is the umbrella term for any fine-tuning that is not preference-based, as distinct from RLHF and DPO.
Supervised fine-tuning (SFT) is the umbrella term for plain (input, target) fine-tuning of a language model. It’s distinguished from RLHF and DPO, which use preferences (input, winner, loser) instead of targets. If you’re training on labeled examples and computing cross-entropy loss against a single target, you’re doing SFT.
The mechanic
For each training example (x, y):
Concatenate input and target: [x_tokens, y_tokens].
Forward pass through the model.
Compute cross-entropy loss on the predicted distribution over target tokens — typically masking the input tokens out of the loss so they don’t contribute to the gradient.
Backprop, then update the weights with a small learning rate (typically 10× to 100× smaller than the one used in pre-training).
The math is identical to pre-training; the difference is the data distribution and the smaller learning rate, which preserves the base model’s general capability while shifting it toward the new task.
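The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not any framework's implementation: the names `build_example` and `masked_next_token_loss` are hypothetical, the logits are passed in as nested lists rather than produced by a real model, and the sentinel `-100` is borrowed from the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
import math

# Assumed sentinel for "do not supervise this position"; -100 mirrors the
# default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100

def build_example(x_tokens, y_tokens):
    """Concatenate prompt and target; only target positions carry labels."""
    input_ids = list(x_tokens) + list(y_tokens)
    labels = [IGNORE_INDEX] * len(x_tokens) + list(y_tokens)
    return input_ids, labels

def log_softmax(logits):
    """Numerically stable log-softmax over one score vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def masked_next_token_loss(logits, labels):
    """Mean cross-entropy over unmasked positions.

    logits[t] scores the token at position t + 1, so it is compared
    against labels[t + 1] (the usual next-token shift).
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # prompt position: contributes nothing to the loss
        total += -log_softmax(logits[t])[target]
        count += 1
    return total / max(count, 1)

input_ids, labels = build_example([5, 6, 7], [1, 2])
uniform_logits = [[0.0] * 8 for _ in input_ids]  # stand-in for a model forward pass
loss = masked_next_token_loss(uniform_logits, labels)
```

With uniform logits over a vocabulary of 8, the loss equals log(8) and is averaged over the two target tokens only; the three prompt tokens are skipped.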
What gets SFT’d
Common SFT applications in production
Instruction-following models. Instruction tuning is SFT on (instruction, response) pairs.
Domain-adapted models. Fine-tune a generalist on legal, medical, or financial documents to absorb domain vocabulary and style.
Task-specialized models. Rerankers, classifiers, query rewriters, PII redactors: all are SFT on task-specific labeled data.
Distilled models. Knowledge distillation is SFT where the targets come from a teacher model. zerank-2’s training is SFT against zELO-recovered Thurstone scores.
Format adapters. Train a base model to emit JSON, XML, or domain-specific output formats reliably without prompt scaffolding.
The default behavior of a language-model training loop is to compute loss on every token in the sequence — the model is supervised to predict each token from its predecessors. For pre-training this is exactly right: the corpus is one continuous stream of “natural” text, and predicting any token is meaningful supervision. For SFT, the input portion of (input, target) pairs is not a target the model should learn to generate; it’s a prompt the model needs to condition on.
Without loss masking, the model is being trained to predict its own input — which means it’s allocating gradient updates to memorizing prompt patterns rather than learning the target behavior. In practice this manifests as the model hallucinating prompt-shaped continuations during inference: instead of answering the question, it starts generating a new question. It also wastes capacity: a 128-token prompt with a 32-token target is 80% supervision-on-input if you don’t mask, and only 20% supervision-on-the-thing-you-care-about.
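The 80/20 split quoted above is simple token-count arithmetic. A minimal sketch, with `supervised_fractions` as a hypothetical helper name:

```python
def supervised_fractions(prompt_len, target_len):
    """Share of unmasked supervision that falls on the prompt vs. the target."""
    total = prompt_len + target_len
    return prompt_len / total, target_len / total

prompt_share, target_share = supervised_fractions(128, 32)
```

For a 128-token prompt and a 32-token target this gives 128/160 and 32/160, the 80% and 20% from the text.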
The mask itself is straightforward: multiply the per-token cross-entropy by 0 for input positions and 1 for target positions before reducing. It has to be applied correctly, though, across packed sequences, multi-turn conversations, and tool-use traces. Frameworks like TRL and Axolotl can handle this for you, but whether masking is on by default depends on the version, dataset format, and configuration, and a custom training loop without explicit masking is a common bug. The symptom is a model that “almost works” but produces weirdly templated outputs that mirror the training prompt.
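As a sketch of the multiply-then-reduce step on a packed sequence, the snippet below builds a 0/1 mask for several (prompt, target) pairs concatenated into one sequence and averages only over target positions. The helper names are hypothetical, and real packing also needs attention masking or position resets at example boundaries, which is not shown here.

```python
def pack_examples(pairs):
    """Concatenate (prompt, target) pairs and build the matching 0/1 loss mask."""
    input_ids, loss_mask = [], []
    for x_tokens, y_tokens in pairs:
        input_ids += list(x_tokens) + list(y_tokens)
        loss_mask += [0] * len(x_tokens) + [1] * len(y_tokens)
    return input_ids, loss_mask

def masked_mean(per_token_loss, loss_mask):
    """Zero out prompt positions, then average over target positions only."""
    weighted = sum(l * m for l, m in zip(per_token_loss, loss_mask))
    return weighted / max(sum(loss_mask), 1)

packed_ids, packed_mask = pack_examples([([1, 2], [3]), ([4], [5, 6])])
# Made-up per-token losses: large values on prompt tokens must not leak through.
mean_loss = masked_mean([9.0, 9.0, 1.0, 9.0, 2.0, 3.0], packed_mask)
```

Dividing by the number of unmasked tokens rather than the sequence length keeps the loss scale independent of how long the prompts are.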
A subtler variant: in multi-turn conversations, you typically mask all prior assistant turns plus all user turns, training only on the final assistant turn. Some recipes train on all assistant turns instead, which can amplify the assistant’s signal but also amplify any mistakes baked into intermediate turns of the training data.
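The two multi-turn recipes differ only in which assistant turns are left unmasked. A minimal sketch, assuming a conversation is already tokenized into (role, tokens) pairs; the function name and the `train_on` parameter are hypothetical:

```python
def turn_loss_mask(turns, train_on="last"):
    """Build a per-token 0/1 loss mask for a tokenized conversation.

    turns: list of (role, tokens) pairs in order.
    train_on: "last" supervises only the final assistant turn,
              "all" supervises every assistant turn.
    """
    assistant_idx = [i for i, (role, _) in enumerate(turns) if role == "assistant"]
    if train_on == "last":
        supervised = set(assistant_idx[-1:])
    elif train_on == "all":
        supervised = set(assistant_idx)
    else:
        raise ValueError("train_on must be 'last' or 'all'")
    mask = []
    for i, (_, tokens) in enumerate(turns):
        mask += [1 if i in supervised else 0] * len(tokens)
    return mask

conversation = [
    ("user", [1, 2]),
    ("assistant", [3]),
    ("user", [4]),
    ("assistant", [5, 6]),
]
last_only = turn_loss_mask(conversation, train_on="last")
all_turns = turn_loss_mask(conversation, train_on="all")
```

User turns are masked in both modes; the only difference is whether the earlier assistant turn (token 3 here) receives gradient.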
LoRA injects low-rank adapters into the attention and MLP projection matrices, freezing the base weights. The number of trainable parameters drops by 100-1000x, and gradient and optimizer-state memory shrink with it, although the frozen base weights and the activations still have to fit in memory. The resulting adapter weights are small, typically megabytes to a few hundred megabytes rather than the gigabytes of a full checkpoint, making them easy to deploy, swap, and version. For most narrow specialization tasks, LoRA is sufficient and cheaper.
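The parameter savings follow from the shapes involved: a d_out × d_in weight matrix has d_out · d_in entries, while a rank-r adapter adds two factors with r · (d_in + d_out) entries in total. The sketch below works one case; the 4096 × 4096 projection and rank 16 are illustrative assumptions, not figures from any particular model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for one projection: full matrix vs. rank-r adapter."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, adapter

full, adapter = lora_param_counts(4096, 4096, 16)
reduction = full // adapter
```

Here the full matrix has 16,777,216 parameters and the adapter 131,072, a 128x reduction for that one projection.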
Full-parameter SFT wins in a few specific regimes. First, when the target distribution is far from the base model’s pre-training distribution — a code model adapted to a wholly new programming language, or a chat model trained on a specialty domain like clinical notes. The low-rank constraint of LoRA limits how far the adapter can move the model from its base, and a sufficiently distant target distribution simply doesn’t fit in a rank-16 update. Second, when the dataset is large (millions of examples). LoRA’s effective capacity is bounded by the rank; at scale, full fine-tuning can absorb signal that LoRA leaves on the table.
The practical signal that you need full fine-tuning instead of LoRA is straightforward: train both, compare them on a held-out validation set, and scan LoRA ranks 8/16/32/64. If LoRA validation loss saturates at a higher floor than full fine-tuning even at rank 64+, the task genuinely needs the additional capacity. If LoRA at rank 16 matches full fine-tuning, ship LoRA: the deployment ergonomics are too good to give up.
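The decision rule can be written down as a small comparison over validation losses. This is a sketch only: `choose_method` is a hypothetical helper, the loss values are made up for illustration, and the `tolerance` that defines "matches full fine-tuning" is an assumption you would set for your own task.

```python
def choose_method(lora_val_loss_by_rank, full_val_loss, tolerance=0.02):
    """Return the smallest LoRA rank whose validation loss is within
    `tolerance` of full fine-tuning, or fall back to full fine-tuning."""
    for rank in sorted(lora_val_loss_by_rank):
        if lora_val_loss_by_rank[rank] - full_val_loss <= tolerance:
            return ("lora", rank)
    return ("full", None)

# Illustrative numbers only: LoRA catches up at rank 16.
matches = choose_method({8: 1.30, 16: 1.21, 32: 1.20, 64: 1.20}, full_val_loss=1.20)

# Illustrative numbers only: LoRA saturates well above the full fine-tuning floor.
saturates = choose_method({8: 1.50, 16: 1.45, 32: 1.42, 64: 1.41}, full_val_loss=1.20)
```

Returning the smallest qualifying rank reflects the preference in the text for the cheapest configuration that does the job.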
A middle ground worth knowing: DoRA decomposes the adapted weight into magnitude and direction components and applies the low-rank update to the direction. Its authors report that it outperforms vanilla LoRA at the same rank, and on many tasks DoRA at rank 16 narrows much of the gap to full fine-tuning.
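A rough sketch of the magnitude/direction idea, following the formulation W' = m · (W0 + BA) / ‖W0 + BA‖ with a per-column norm as described in the DoRA paper. The helper names are hypothetical and the matrices are tiny nested lists for illustration, not a training-ready implementation.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_weight(w0, lora_b, lora_a, magnitude):
    """Direction comes from W0 + B @ A (column-normalized); scale comes from
    the learned per-column magnitude vector."""
    delta = matmul(lora_b, lora_a)
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))] for i in range(len(w0))]
    norms = column_norms(v)
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
            for i in range(len(v))]

w0 = [[3.0, 0.0], [4.0, 2.0]]
# At initialization B is zero and the magnitude equals the column norms of W0,
# so the adapted weight reproduces the base weight.
at_init = dora_weight(w0, [[0.0], [0.0]], [[1.0, 1.0]], column_norms(w0))
# After an update, each column's norm is set by the magnitude vector alone.
updated = dora_weight(w0, [[1.0], [0.0]], [[1.0, 0.0]], [1.0, 1.0])
```

The point of the split is that the low-rank factors only steer direction, while a separate small vector controls scale.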
When SFT is enough
For most production specialization tasks, SFT is the whole game. If the task is well-defined (“rank these documents”, “classify this ticket”, “extract these entities”), there’s a single right answer per input, and labels are available — SFT on (input, label) pairs gives you the model. RLHF/DPO add value primarily in open-ended tasks where many plausible answers exist and the question is “which is preferred”, not “what’s the answer”.
Practical concerns
Catastrophic forgetting. Aggressive SFT erases capabilities the base model had. Mix in pre-training-style data, use small learning rates, and prefer LoRA over full fine-tuning to mitigate.
Overfitting on small data. With 1K examples and 7B parameters, the model can simply memorize the training set. Use validation-based early stopping and weight decay.
Format leakage. The model picks up on superficial patterns — exact token sequences in your instruction template — and fails to generalize to slightly different phrasings. Diversify input phrasings.
Loss masking. Make sure you’re not training the model to predict its own input. Frameworks such as TRL and Axolotl support this, but whether it is on by default depends on the version and configuration; verify it.
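For the overfitting concern above, validation-based early stopping can be sketched as follows. The function name, the `patience` and `min_delta` parameters, and the loss values are illustrative assumptions rather than any framework's API.

```python
def best_checkpoint(val_losses, patience=2, min_delta=0.0):
    """Scan validation losses in evaluation order.

    Returns (index of the best evaluation seen, whether training should stop).
    Training stops once `patience` evaluations pass without an improvement
    of more than `min_delta` over the best loss so far.
    """
    best_idx, best = None, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            return best_idx, True
    return best_idx, False

# Made-up curve: loss improves for three evaluations, then rises as the model memorizes.
stop_result = best_checkpoint([1.0, 0.8, 0.7, 0.75, 0.9, 1.1], patience=2)
# Made-up curve: still improving, so keep training.
keep_going = best_checkpoint([1.0, 0.9, 0.8], patience=2)
```

You would keep the checkpoint at the returned index rather than the last one, since the later checkpoints are the ones that have started to memorize.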
SFT is the workhorse. Most of what “fine-tune a model” means in production retrieval is SFT: unsexy, well-understood, and the right tool for narrow specialization tasks.
Go further
Where does SFT sit in the post-pre-training stack?
Pre-training → SFT (instruction-following) → preference optimization (RLHF/DPO). SFT is the middle stage that turns a base model into something usable. Most production fine-tuning — domain adaptation, task specialization, custom rerankers — is also SFT, just with task-specific data instead of generic instructions.
How is SFT different from pre-training mechanically?
Same loss (cross-entropy on next-token prediction), much smaller learning rate (~10× to 100× smaller), much smaller dataset (thousands to millions of examples vs trillions of tokens), often with loss masking on the input so only target tokens contribute to gradient.
When does SFT need to be replaced by something else?
When you care more about which of several plausible outputs is preferred than about producing any plausible output — that's where RLHF/DPO take over. SFT teaches what shape the answer should take; preference methods teach which answer is better.