Also known as: few-shot, in-context examples, demonstration prompting
TL;DR
Few-shot prompting is the technique of including 2-5 input/output examples in the prompt to demonstrate the desired behavior. It often works dramatically better than describing the rule in words because the model picks up format, edge-case handling, and tone directly from the examples.
Few-shot prompting is the practice of including a handful of input/output example pairs in the prompt before the actual query. The model, leveraging in-context learning, infers the pattern from the demonstrations and applies it to the new input. Two to five examples is the typical range; adding more rarely helps and consumes context window.
The basic shape
Extract the company name and founding year as JSON.

Input: Acme Inc. was founded in Cleveland in 1982.
Output: {"company": "Acme Inc.", "year": 1982}

Input: Founded in 2014, Beta Labs operates in Berlin.
Output: {"company": "Beta Labs", "year": 2014}

Input: Gamma Corp's IPO followed its 2009 launch.
Output: {"company": "Gamma Corp", "year": 2009}

Input: <new input here>
Output:
The model sees three demonstrations of the input format, the output schema, and the implicit rules (e.g., year as integer, not string), and reliably continues the pattern.
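A prompt of this shape is usually assembled in code rather than written by hand. The following is a minimal sketch of one way to do that; the function name `build_few_shot_prompt`, the constants, and the "Delta Systems" query are illustrative assumptions, not part of any library.

```python
import json

# Demonstrations from the prompt above: (input text, expected output object).
EXAMPLES = [
    ("Acme Inc. was founded in Cleveland in 1982.", {"company": "Acme Inc.", "year": 1982}),
    ("Founded in 2014, Beta Labs operates in Berlin.", {"company": "Beta Labs", "year": 2014}),
    ("Gamma Corp's IPO followed its 2009 launch.", {"company": "Gamma Corp", "year": 2009}),
]

INSTRUCTION = "Extract the company name and founding year as JSON."


def build_few_shot_prompt(instruction, examples, new_input):
    """Join the instruction, the demonstrations, and the open-ended final query."""
    parts = [instruction, ""]
    for text, output in examples:
        parts.append(f"Input: {text}")
        # json.dumps keeps the year as a bare integer, matching the schema.
        parts.append(f"Output: {json.dumps(output)}")
        parts.append("")
    parts.append(f"Input: {new_input}")
    parts.append("Output:")  # left open so the model continues the pattern
    return "\n".join(parts)


# Hypothetical new input, used only for illustration.
prompt = build_few_shot_prompt(INSTRUCTION, EXAMPLES, "Delta Systems opened its doors in 2001.")
print(prompt)
```

Serializing every demonstration through the same function keeps the formatting of all examples identical, which is the property the model is expected to copy.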
Why it works so well
The technique is dramatically more reliable than instruction-only prompting because:
Format is shown, not described. “Return JSON with keys company and year” leaves a dozen ambiguities (quotes around year? null when missing? trailing whitespace?). Examples answer all of them implicitly.
Edge cases are demonstrated. Including one example with a missing field shows the model how you want missing data handled — far more reliably than a sentence of prose.
Tone and register transfer. If your examples are terse and clinical, the model continues in that register; if they’re warm and conversational, it matches.
The pattern is learnable from few examples. This is the in-context learning phenomenon — the same statistical machinery that makes LLMs work as next-token predictors lets them learn new mappings from a handful of input-output pairs.
Examples beat instructions because the model is a pattern-matcher first and an instruction-follower second. Show the shape, the format, the edge case — the model continues the pattern more reliably than it interprets a sentence describing it.
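Because the model copies whatever the demonstrations show, inconsistent demonstrations teach an inconsistent format. As a rough sketch, a small check like the one below could catch that before a prompt ships. The helper name `check_demonstrations`, the rule that `year` must be an integer or null, and the sample data are assumptions made for illustration.

```python
import json


def check_demonstrations(outputs):
    """Return a list of problems found in the raw output strings of the demonstrations."""
    problems = []
    reference_keys = None
    for i, raw in enumerate(outputs):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            problems.append(f"example {i}: output is not valid JSON")
            continue
        if not isinstance(obj, dict):
            problems.append(f"example {i}: output is not a JSON object")
            continue
        keys = set(obj)
        if reference_keys is None:
            reference_keys = keys  # the first valid example defines the schema
        elif keys != reference_keys:
            problems.append(
                f"example {i}: keys {sorted(keys)} differ from {sorted(reference_keys)}"
            )
        year = obj.get("year")
        if year is not None and not isinstance(year, int):
            problems.append(f"example {i}: year should be an integer or null, got {year!r}")
    return problems


# Hypothetical demonstration outputs.
demos = [
    '{"company": "Acme Inc.", "year": 1982}',
    '{"company": "Beta Labs", "year": "2014"}',    # year given as a string
    '{"company": "Delta Systems", "year": null}',  # missing-data edge case
]
problems = check_demonstrations(demos)
print(problems)
```

The third entry is the kind of edge-case demonstration described above: it shows the model that missing data should come back as `null` rather than as an omitted key or a guess.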
LLMs are sensitive to position-in-context (recency effects, primacy effects, and the broader pattern of context rot), and few-shot examples are no exception. Two well-documented patterns: outputs tend to mimic the last example more strongly than the first (recency bias), and label imbalance in the example list creates a prior on the answer (if four out of five examples answer “yes”, the model is biased toward “yes” on the test input). Robust prompts shuffle example order across calls or balance the label distribution explicitly. The Calibrate Before Use paper (Zhao et al., 2021) showed accuracy swings of up to 30% just from reordering the examples.
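Both mitigations, balancing labels and shuffling order, fit in a few lines. The sketch below assumes a pool of labeled (text, label) pairs. The function name `balanced_shuffled` and the pool contents are hypothetical, and this is one simple approach rather than the calibration method from the paper.

```python
import random
from collections import defaultdict


def balanced_shuffled(examples, per_label, rng):
    """Pick `per_label` examples for each label, then shuffle the combined list."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    selected = []
    for label, items in by_label.items():
        if len(items) < per_label:
            raise ValueError(f"only {len(items)} examples for label {label!r}")
        selected.extend(rng.sample(items, per_label))
    rng.shuffle(selected)  # new order on each call, so no example is always last
    return selected


# Hypothetical pool with a 4:2 imbalance toward "yes".
POOL = [
    ("Is the invoice overdue? Due 1 March, today is 9 March.", "yes"),
    ("Is the invoice overdue? Due 1 May, today is 2 May.", "yes"),
    ("Is the invoice overdue? Due 3 June, today is 30 June.", "yes"),
    ("Is the invoice overdue? Due 5 July, today is 6 July.", "yes"),
    ("Is the invoice overdue? Due 9 March, today is 1 March.", "no"),
    ("Is the invoice overdue? Due 30 June, today is 3 June.", "no"),
]

demos = balanced_shuffled(POOL, per_label=2, rng=random.Random(0))
for text, label in demos:
    print(label, "|", text)
```

Passing the random generator in explicitly makes the selection reproducible in tests while still allowing a fresh order on each production call.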
Selecting examples
Static few-shot (the same fixed examples on every call) is the default and usually fine. For harder tasks the upgrade is dynamic few-shot: at query time, retrieve the most similar past (input, gold-output) pairs from a labeled corpus via embedding similarity, and inject those as the demonstrations. The prompt now adapts to each query, behaving like a tiny RAG system over your training set. This pattern routinely beats static few-shot on classification, extraction, and code-generation tasks.
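The retrieval step can be sketched as follows. To keep the example self-contained, `embed` is a toy bag-of-words stand-in for a real embedding model, which is an assumption for illustration only; in practice it would be replaced by calls to whatever embedding model you use. The corpus, labels, and function names are likewise hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(query, corpus, k=2):
    """Return the k (input, gold_output) pairs whose inputs are most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


# Hypothetical labeled corpus of (input, gold output) pairs.
CORPUS = [
    ("Refund my order, it arrived broken.", "refund"),
    ("How do I reset my password?", "account"),
    ("The package arrived broken and I want my money back.", "refund"),
    ("Can I change the email on my account?", "account"),
    ("What are your opening hours?", "general"),
]

picked = select_examples("My order arrived broken", CORPUS, k=2)
for text, label in picked:
    print(f"Input: {text}\nOutput: {label}\n")
```

The selected pairs are then formatted as demonstrations exactly as in a static prompt; only the selection step changes per query.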
When to graduate to fine-tuning
Few-shot is the right tool for medium-volume, medium-complexity tasks where you have under 100 labeled examples and the task may evolve. Once you have thousands of (input, output) pairs and the task is stable, fine-tuning usually wins on latency (no example tokens to process), cost (smaller prompts), and accuracy (the pattern is in the weights, not pasted into context). The classic trajectory: prototype with few-shot, ship with a fine-tuned small model.
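The latency and cost argument comes down to simple arithmetic: every call re-sends the same demonstrations. The sketch below makes that concrete. All numbers (5 examples, 60 tokens each, 100,000 calls per day) are hypothetical values chosen for illustration, not measurements.

```python
def example_token_overhead(num_examples, tokens_per_example, calls_per_day):
    """Tokens per day spent re-sending demonstrations with every call."""
    return num_examples * tokens_per_example * calls_per_day


# Hypothetical workload: 5 demonstrations of about 60 tokens each, 100,000 calls a day.
overhead = example_token_overhead(num_examples=5, tokens_per_example=60, calls_per_day=100_000)
print(overhead)  # 30000000
```

Under those assumed numbers, 30 million input tokens a day carry no new information. A fine-tuned model removes that overhead, at the price of a training pipeline and a less flexible task definition.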
When few-shot is the right tool
Sub-100 labeled examples and the task is still being defined
Output schema drifts week-to-week as product requirements move
Cold-start: you need something working today, and the training pipeline can come later
Long-tail tasks where each variant has too few examples for fine-tuning to converge
Explicit per-customer or per-tenant adaptation where weights aren’t shareable
Go further
Why do examples beat instructions?
Natural language is ambiguous; examples are concrete. When you write 'extract the company name', the model doesn't know if 'Acme Inc.' or 'Acme' is preferred, whether to handle missing values, or what format to return. Three examples answer all of those at once, in the format you actually want. It's the same reason API docs include curl snippets, not just OpenAPI specs.
How should I choose the examples?
For static prompts: pick examples that span the edge cases your task hits, such as diverse inputs, tricky outputs, and the failure modes you've seen. For dynamic prompts (so-called dynamic few-shot): retrieve the top-k most-similar past examples to the current input via embedding similarity. The latter behaves like a retrieval-augmented prompt and is usually 2-5x more reliable on hard tasks.
When does few-shot not help?
When the task is simple enough that zero-shot already nails it (basic summarization, Q&A on retrieved context), examples just consume tokens. When the task is hard enough that no number of examples teaches it (anything requiring genuinely new reasoning capability), you need fine-tuning. Few-shot's sweet spot is medium-complexity tasks with a learnable but hard-to-articulate pattern.