In-Context Learning

Also known as: ICL, in-context adaptation, context-based learning

TL;DR

In-context learning (ICL) is the empirical phenomenon that LLMs can adapt to new tasks from examples in the prompt — without any weight updates.

In-context learning (ICL) is the surprising and load-bearing capability of LLMs to learn new tasks from a handful of examples in the prompt, without any weight updates, optimization, or training. Show a frozen GPT-5 three input/output pairs of a task it has never been explicitly trained on, and it usually generalizes to new inputs. This is what makes few-shot prompting work.

[Figure: In-context learning, "learning without a training step." The context window holds the instruction "Capitalize the last word.", three demonstrations ("the cat sat on the mat" → "the cat sat on the MAT", "pack my box with jugs" → "pack my box with JUGS", "the quick brown fox" → "the quick brown FOX"), and the test input "we open at dawn". A frozen LLM (weights unchanged, ∂L/∂θ = 0, no optimizer, no backward pass) outputs "we open at DAWN" in one forward pass. Tasks learned from context: capitalize last word, triple-label sentiment, reverse each word.]

What’s actually happening

Weights don’t change. The context changes, and through attention the model conditions every prediction on the entire prompt, including the examples. The transformer reads the demonstrations, infers what mapping is being demonstrated, and continues the pattern.
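To make "the context changes" concrete, here is a minimal sketch of how a few-shot prompt for a capitalize-the-last-word task might be assembled. The function name `build_few_shot_prompt` and the `Input:` / `Output:` layout are assumptions made for illustration, not a required format; nothing here calls a model or touches weights.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    instruction: str,
    demonstrations: Sequence[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble instruction, demonstrations, and the test input into one prompt string."""
    lines = [instruction, ""]
    for source, target in demonstrations:
        # Every demonstration uses the same layout so the format stays consistent.
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final block is left open so the model continues the pattern.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


demos = [
    ("the cat sat on the mat", "the cat sat on the MAT"),
    ("pack my box with jugs", "pack my box with JUGS"),
    ("the quick brown fox", "the quick brown FOX"),
]
prompt = build_few_shot_prompt("Capitalize the last word.", demos, "we open at dawn")
print(prompt)
```

The only thing that differs between tasks is the string passed to the model; swapping the demonstrations swaps the task.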

The mechanism is partially understood. Transformers can implement linear regression and gradient-descent-like updates on activations (not weights) inside a single forward pass — meaning the forward pass on a prompt with examples is mechanistically similar to running an optimizer over those examples in feature space. ICL is plausibly approximate Bayesian inference over latent task representations, implicit meta-learning, or both.
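The Bayesian reading can be written down compactly. The notation below is introduced here for illustration and is not taken from the text above: $\tau$ is a latent task, $D = \{(x_i, y_i)\}_{i=1}^{k}$ are the demonstrations, and $x$ is the query. The second expression additionally assumes the demonstrations are conditionally independent given the task.

```latex
p(y \mid x, D) = \sum_{\tau} p(y \mid x, \tau)\, p(\tau \mid D),
\qquad
p(\tau \mid D) \propto p(\tau) \prod_{i=1}^{k} p(y_i \mid x_i, \tau)
```

Under this view, each added demonstration sharpens the posterior over tasks, and the prediction is an average over the tasks that remain plausible.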

Garg et al. (2022) and Akyürek et al. (2022) studied transformers trained from scratch on linear-regression demonstrations and found that their in-context predictions closely track those of standard estimators such as least squares and a few steps of gradient descent. On this account, attention over the demonstration tokens computes an update comparable to what a least-squares solver would apply to those points. The model’s weights encode an algorithm for in-context inference, not the answer to any specific task. That’s the cleanest theory we have for why scaling demonstrations helps and why structurally similar tasks transfer: the model is running a tiny optimizer in its head, and demonstrations are its training set.
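The link between attention and gradient descent can be checked by hand in a toy setting. The sketch below is a hand-built illustration, not a trained transformer and not any paper's code. It assumes a softmax-free (linear) attention layer with keys `x_i`, values `y_i`, and the query as `x_query`, and shows that its output equals the prediction after one gradient step from zero weights on a squared-error loss. All function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def one_gd_step_prediction(xs, ys, x_query, lr):
    """Predict with the weights obtained after one gradient step from w = 0.

    Loss: L(w) = 0.5 * sum_i (w . x_i - y_i)^2, so the gradient at w = 0
    is -sum_i y_i * x_i.
    """
    dim = len(x_query)
    w = [0.0] * dim
    grad = [sum((dot(w, x) - y) * x[j] for x, y in zip(xs, ys)) for j in range(dim)]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized attention: sum of values y_i weighted by key-query dot products."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


rng = random.Random(0)
true_w = [1.5, -2.0, 0.5]  # hidden task the demonstrations come from
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
ys = [dot(true_w, x) for x in xs]
x_query = [rng.gauss(0, 1) for _ in range(3)]

gd = one_gd_step_prediction(xs, ys, x_query, lr=0.05)
attn = linear_attention_prediction(xs, ys, x_query, lr=0.05)
print(abs(gd - attn) < 1e-9)
```

The two functions agree up to floating-point rounding because both reduce to `lr * sum_i y_i * (x_i . x_query)`. Whether a trained model actually uses this circuit is an empirical question the identity does not settle.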

Why it emerged

ICL is an emergent capability of scale. Smaller language models do not exhibit it well; the ability turns on sharply around the 10B-parameter range and improves with scale and pre-training data diversity. The leading hypothesis: pre-training on heterogeneous corpora — books, code, web documents that contain implicit (instruction, examples, output) structures — exposes the model to many task-like patterns, and at sufficient scale the model learns the meta-pattern of recognizing and completing them.

The empirical quirks

ICL is real but ragged:

  • Order sensitivity. Permuting the order of demonstrations can swing accuracy by 5 to 15 points, a positional effect with no clean theoretical account.
  • Recency bias. The last example influences the prediction disproportionately. Production systems sometimes rotate or randomize the order at inference time.
  • Format dominance. The model often picks up on surface format (capitalization, whitespace, punctuation) more strongly than the underlying mapping. Inconsistent formatting in examples destroys ICL accuracy.
  • No new capabilities. ICL specializes the model toward a task the model can already approximately do; it doesn’t unlock skills the base model lacks. If the base model can’t multiply 6-digit numbers, no number of demonstrations teaches it to.
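One way to blunt order sensitivity and recency bias is to query the model under several demonstration orders and take a majority vote. The sketch below assumes a caller-supplied `predict(demonstrations, query)` function that wraps whatever model is in use. That callable, the function name `predict_with_order_ensemble`, and the stub model are all hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]


def predict_with_order_ensemble(
    predict: Callable[[Sequence[Demo], str], str],
    demonstrations: Sequence[Demo],
    query: str,
    num_orders: int = 5,
    seed: int = 0,
) -> str:
    """Query the model under several random demonstration orders and majority-vote."""
    rng = random.Random(seed)
    votes: List[str] = []
    for _ in range(num_orders):
        shuffled = list(demonstrations)  # copy so the caller's list is not mutated
        rng.shuffle(shuffled)
        votes.append(predict(shuffled, query))
    # most_common(1) returns the answer with the highest count (ties: first seen).
    return Counter(votes).most_common(1)[0][0]


def recency_biased_stub(demos: Sequence[Demo], query: str) -> str:
    """Stand-in for a model call: copies the label of the last demonstration."""
    return demos[-1][1]


demos = [("great film", "positive"), ("loved it", "positive"), ("dull plot", "negative")]
print(predict_with_order_ensemble(recency_biased_stub, demos, "a fine watch", num_orders=25))
```

The trade-off is cost: each extra ordering is another full model call, so this fits low-volume or evaluation settings better than latency-sensitive ones.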

ICL is rented specialization. Fine-tuning is owned specialization.

The relationship to fine-tuning

ICL and fine-tuning are two ways to specialize a base model:

| | ICL | Fine-tuning |
| --- | --- | --- |
| Updates | None (context only) | Weight changes |
| Persistence | Per-request | Permanent |
| Cost | Token cost on every call | One-time training, cheap inference |
| Capacity | Bounded by context window | Effectively unbounded |
| Accuracy ceiling | Lower | Higher |

ICL is right when the task is exploratory, low-volume, or evolving. Fine-tuning is right when the task is stable, high-volume, and the per-call token cost of carrying examples in context becomes a tax.
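The "tax" can be estimated with simple arithmetic. The sketch below computes the number of calls at which the cumulative cost of carrying demonstrations in every prompt exceeds a one-time fine-tuning cost. All prices and token counts are made-up placeholders, and the function `break_even_calls` is hypothetical. It also assumes fine-tuned inference costs the same per token as the base model, which may not hold in practice.

```python
import math


def break_even_calls(
    demo_tokens_per_call: int,
    price_per_1k_input_tokens: float,
    fine_tune_one_time_cost: float,
) -> int:
    """Calls after which carrying demonstrations costs at least as much as fine-tuning once."""
    extra_cost_per_call = demo_tokens_per_call / 1000 * price_per_1k_input_tokens
    if extra_cost_per_call <= 0:
        raise ValueError("demonstrations must add a positive per-call cost")
    return math.ceil(fine_tune_one_time_cost / extra_cost_per_call)


# Hypothetical numbers for illustration only, not real prices.
calls = break_even_calls(
    demo_tokens_per_call=2000,
    price_per_1k_input_tokens=0.25,
    fine_tune_one_time_cost=500.0,
)
print(calls)
```

Below the break-even volume, ICL is the cheaper option. Above it, the per-call overhead of the examples starts to dominate.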

Go further

How is in-context learning actually learning?

It's learning in the loose sense: the model's behavior changes based on what's in the context, but no weights update. The mechanism is the model performing implicit pattern-matching and approximate inference over the examples via attention. One line of research shows transformers can implement gradient-descent-like updates on the activations inside the forward pass. That work suggests ICL is closer to actual learning than naive intuition would have it, but it's still nothing like fine-tuning.

Why does ICL get better at scale?

Empirically, ICL ability emerges sharply around the 10B-parameter mark and improves with scale. The leading hypothesis is that pre-training on heterogeneous data exposes the model to many implicit (instruction, demonstration, response) structures, and at sufficient scale the model learns to recognize and complete those structures. Smaller models can do zero-shot, but few-shot generalization is a scale-emergent capability.

When does in-context learning fail?

Several known failure modes. Order sensitivity: shuffling the same examples can change accuracy by 10+ points. Recency bias: the last example often dominates the prediction. Long-context degradation: ICL accuracy drops sharply when examples sit far back in the [context window](/concepts/context-window/). And capability ceilings: ICL never teaches the model genuinely new skills the base model can't already approximate.
