Synthetic Data Generation

Also known as: LLM-generated training data, synthetic supervision, model-generated data

TL;DR

Using a frontier LLM to generate training data for a smaller specialized model. The dominant data-creation method in 2026 — every modern open-weight instruct model and most production-tuned rerankers train on synthetic data, including zerank-2.

In 2026, synthetic data — training data generated by an LLM rather than authored by humans — is the default supervision source for almost every specialized model. Open-weight instruct models like Tülu, OpenHermes, and Llama-3-Instruct lean overwhelmingly on it. Custom-trained rerankers, classifiers, and embedders use it as the primary corpus. The reason is straightforward: human annotation is the bottleneck, and an LLM can produce 1000× more data per dollar.

The core recipes

  • Self-Instruct (Wang et al., 2022). Seed with a small set of hand-written tasks (175 in the original paper). Prompt an LLM to generate similar tasks; filter for diversity and quality; iterate. Output: 50K+ instruction-following examples.
  • Evol-Instruct (Xu et al., 2023). Same shape but each generation step evolves the instructions to be harder, more specific, more constrained. Produces sharper distributions of instruction difficulty.
  • Magpie (Xu et al., 2024). Exploit the fact that an instruct LLM, given only the chat-template prefix that opens a user turn, will generate a plausible user instruction of the kind it was trained to answer. Sample millions of these with no human-written seed prompts.
  • LLM-as-rater pipelines. Generate or sample candidate outputs; have a frontier LLM judge them; train a smaller model on the judgments. This is the shape behind RLAIF.
  • Persona-conditioned generation. Sample synthetic users with diverse personas (occupation, expertise level, language), have each generate task data. Useful for diversity in the absence of real distributional data.
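The generate-filter-iterate loop shared by these recipes can be sketched as follows. This is a minimal illustration, not the published Self-Instruct code: `self_instruct_round`, `jaccard`, and `fake_generate` are hypothetical names, the LLM call is replaced by a stub, and token-level Jaccard overlap stands in for whatever similarity filter a real pipeline would use.

```python
import random

def jaccard(a, b):
    """Token-set overlap between two instructions (a crude similarity proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def self_instruct_round(pool, generate, n_demos=3, max_overlap=0.7, rng=None):
    """One round: sample demos from the pool, ask the generator for new tasks,
    and keep only candidates that are not too similar to anything seen so far."""
    rng = rng or random.Random(0)
    demos = rng.sample(pool, min(n_demos, len(pool)))
    accepted = []
    for cand in generate(demos):
        cand = cand.strip()
        if not cand:
            continue  # drop empty generations
        if all(jaccard(cand, seen) < max_overlap for seen in pool + accepted):
            accepted.append(cand)
    return pool + accepted

def fake_generate(demos):
    """Stand-in for an LLM call (assumption: a real pipeline would prompt a model)."""
    return [
        demos[0],  # an exact repeat, which the filter should reject
        "Summarize the following paragraph in one sentence.",
        "   ",
        "Translate this sentence into French.",
    ]

seeds = [
    "Write a haiku about autumn.",
    "Explain recursion to a beginner.",
    "List three uses of a hash map.",
]
pool = self_instruct_round(seeds, fake_generate)
print(len(pool))
```

Running several rounds is a matter of feeding the returned pool back in; the filter threshold controls how quickly the pool saturates.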

The default failure mode of LLM-generated training data is mode collapse: even with carefully diverse seeds, the teacher tends to regress toward a narrow distribution of “things it likes to say”. Send the same prompt to a frontier model 1000 times and the variance is high; send 1000 different prompts and the variance is lower than you’d expect, because the model has stylistic and topical attractors it falls into.

The standard interventions are diversity filtering and persona conditioning. Diversity filtering computes pairwise embedding similarity over generated examples and rejects any example within distance epsilon of an example already kept. The threshold matters: set epsilon too small and you keep too many near-duplicates; set it too large and you reject genuinely novel examples that just share surface features. Persona conditioning prepends a randomized persona description (“You are a 45-year-old structural engineer reviewing seismic codes…”) that biases the LLM into different stylistic and topical regions for each generation.
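A greedy version of the diversity filter can be sketched in a few lines. This is an illustration under assumptions: embeddings are plain non-zero float lists, distance is cosine distance, and the function names and the toy vectors are hypothetical. A production pipeline would use a real embedding model and an approximate nearest-neighbor index rather than this quadratic scan.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def diversity_filter(embeddings, epsilon):
    """Keep an example only if it is farther than epsilon from every
    example kept so far. Returns the indices of kept examples."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_distance(emb, embeddings[j]) > epsilon for j in kept):
            kept.append(i)
    return kept

# Toy 2-d "embeddings": the second vector is a near-duplicate of the first.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
print(diversity_filter(toy, epsilon=0.05))  # small epsilon: only the near-duplicate goes
print(diversity_filter(toy, epsilon=0.5))   # large epsilon: the diagonal vector goes too
```

The two calls show the trade-off: raising epsilon removes more near-duplicates but starts discarding examples that are only loosely related to what has been kept.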

For instruction data specifically, Evol-Instruct’s evolution operations (add constraint, increase reasoning depth, broaden topic, deepen topic, concretize, increase complexity) function as forced perturbations away from the model’s central tendency. Each “evolved” instruction is, by construction, a step away from where the model wanted to go.

The hardest diversity dimension to capture synthetically is failure modes. Real users make mistakes the model wouldn’t naturally generate — typos, ambiguous references, malformed inputs. Production-quality synthetic data pipelines explicitly inject these via separate generation passes (typo perturbation, ambiguity injection) or, more reliably, by mixing in a held-aside fraction of real production traffic.
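A typo-perturbation pass can be as simple as randomly dropping, doubling, or swapping letters. The sketch below is one hypothetical way to do it; `inject_typos`, its `rate` parameter, and the three edit types are illustrative choices, not a description of any particular pipeline.

```python
import random

def inject_typos(text, rate=0.05, seed=None):
    """Randomly corrupt letters: drop one, double one, or swap it with the next character."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["drop", "double", "swap"])
            if op == "drop":
                i += 1
                continue
            if op == "double":
                out.extend([c, c])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 2
                continue
        # No edit chosen (or a swap at the last character): keep as-is.
        out.append(c)
        i += 1
    return "".join(out)

query = "how do I reset my account password"
print(inject_typos(query, rate=0.15, seed=7))
```

Fixing the seed makes a perturbed dataset reproducible; varying `rate` per example gives a mix of clean and noisy inputs.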

The naive validation — train two models, one on synthetic and one on real, compare downstream metrics — works but is expensive when “real” is the bottleneck you’re trying to avoid. The pragmatic approach is multi-tiered.

First, distributional checks. Compute embeddings of synthetic queries and real queries; visualize via UMAP or t-SNE; check that the synthetic distribution covers the real distribution (the real queries should sit inside the region the synthetic queries occupy) rather than diverging into off-manifold regions. Compute per-dimension marginal distributions for length, vocabulary novelty, and syntactic complexity (parse-tree depth, mean dependency length); flag any dimension where synthetic and real diverge by more than a couple of standard deviations.
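The marginal check can be sketched with the standard library alone. The version below reads "diverge by more than a couple of standard deviations" as: the synthetic mean of a feature lies more than `max_shift` real-data standard deviations from the real mean. That reading, the function names, the two surface features, and the toy queries are all assumptions for illustration; syntactic features would need a parser.

```python
import statistics

def marginal_shift(real_values, synthetic_values):
    """Distance between the means, measured in real-data standard deviations."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    if sigma == 0:
        # Degenerate feature: any difference at all counts as infinite shift.
        return 0.0 if statistics.mean(synthetic_values) == mu else float("inf")
    return (statistics.mean(synthetic_values) - mu) / sigma

def flag_divergent(real_queries, synthetic_queries, features, max_shift=2.0):
    """Return {feature_name: shift} for every feature that diverges too far."""
    flagged = {}
    for name, fn in features.items():
        shift = marginal_shift(
            [fn(q) for q in real_queries],
            [fn(q) for q in synthetic_queries],
        )
        if abs(shift) > max_shift:
            flagged[name] = shift
    return flagged

features = {
    "length_tokens": lambda q: len(q.split()),
    "length_chars": len,
}
real = [
    "reset password",
    "refund policy for orders",
    "api rate limit",
    "cancel my subscription now",
]
synthetic = [
    "Could you please explain in detail how I might go about resetting my account password?",
    "What is the complete refund policy that applies to orders placed during a promotional period?",
]
flagged = flag_divergent(real, synthetic, features)
print(sorted(flagged))
```

Here the synthetic queries are far longer and more formal than the real ones, a common pattern when an LLM writes "user" queries without length guidance.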

Second, judge-based quality scoring. Use a held-out frontier LLM (different from the one that generated the data) to score each synthetic example for relevance, naturalness, and difficulty. The score distribution should match what a human reviewer would produce on a small calibration sample. Discard the bottom 10-20% by judge score before training.
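Once judge scores exist, the discard step is a simple rank-and-cut. In the sketch below, `drop_lowest_scored` is a hypothetical helper and the scores are made-up numbers standing in for judge outputs; obtaining those scores from a judge model is outside what the code shows.

```python
def drop_lowest_scored(examples, scores, drop_fraction=0.15):
    """Remove the lowest-scoring fraction of examples, preserving original order."""
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    n_drop = int(len(examples) * drop_fraction)
    # Indices sorted from worst to best judge score.
    order = sorted(range(len(examples)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [ex for i, ex in enumerate(examples) if i not in dropped]

examples = [f"q{i}" for i in range(10)]
judge_scores = [5, 1, 9, 3, 7, 2, 8, 6, 10, 4]  # placeholder scores
kept = drop_lowest_scored(examples, judge_scores, drop_fraction=0.2)
print(kept)
```

Cutting by rank rather than by absolute score keeps the discard rate stable even if the judge's scale drifts between batches.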

Third, the actual training loop. Train on synthetic; evaluate on a held-out set of real production examples (this is the eval set you can never train on). If synthetic-trained validation loss on real data tracks training loss reasonably tightly, the synthetic distribution is in the same neighborhood as production. If validation loss plateaus far above training loss, the synthetic distribution has the wrong shape.

The unforgiving rule is that a held-out set of real data is required at evaluation time. Synthetic-only evaluation is circular — the teacher LLM and the student inherit the same blind spots, so synthetic-judged synthetic data passes synthetic evals trivially while failing in production.

Why it works for specialization

The teacher LLM is a generalist that can produce plausible task data across an arbitrary distribution. The student model only needs to inherit a narrow slice — the task you actually care about. Distilling that slice through synthetic data is enormously more efficient than human annotation, and the resulting student often beats the teacher on the narrow task because it can be specialized harder.

This teacher-to-student transfer is the central shape, with synthetic data as the bridge medium.

End-to-end synthetic supervision: a worked example

A representative production reranker pipeline runs end-to-end synthetic across all four stages:

  1. Queries. LLM-generated from corpus documents, conditioned on the document — “what question would this document answer?”. Realistic queries without crowdsourced annotation.
  2. Candidate documents. Hybrid retrieval over the corpus to produce candidate sets per query.
  3. Pairwise judgments. Three frontier LLMs vote on (q, doc_A, doc_B) preferences. Their averaged probability is the synthetic supervision.
  4. Recovered scores. A Thurstone-style model fit over the pairwise graph produces continuous relevance targets, which serve as the final synthetic labels.

No human annotators in the loop. The entire training corpus is model-generated, which is why the methodology scales economically across millions of (q, d) pairs.
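Stages 3 and 4 can be illustrated with a small sketch: average the judges' probabilities for each pair, then recover one score per document from the pairwise matrix. The recovery step below is the classical Thurstone Case V least-squares estimate over a complete comparison matrix, chosen here as an assumption for illustration; the source does not specify the exact fitting procedure, and real pairwise graphs are usually sparse. All names and numbers are hypothetical.

```python
from statistics import NormalDist

def ensemble_preference(votes):
    """Average the judges' probabilities that doc A beats doc B."""
    return sum(votes) / len(votes)

def thurstone_case_v(pref, eps=1e-3):
    """pref[i][j] is the averaged probability that doc i beats doc j.
    Each score is the mean probit of a document's win probabilities,
    with the diagonal treated as zero."""
    n = len(pref)
    probit = NormalDist().inv_cdf
    scores = []
    for i in range(n):
        # Clip to (eps, 1 - eps) so unanimous votes do not map to infinity.
        z = [probit(min(max(pref[i][j], eps), 1 - eps)) for j in range(n) if j != i]
        scores.append(sum(z) / n)
    return scores

# Three judges score the pair (doc_0, doc_1) for one query.
p01 = ensemble_preference([0.9, 0.7, 0.8])

# Complete averaged preference matrix for three documents (toy values).
pref = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]
scores = thurstone_case_v(pref)
print([round(s, 3) for s in scores])
```

The resulting scores are only defined up to a shift and scale, so a pipeline would typically normalize them per query before using them as regression targets.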

When synthetic data fails

  • Distribution mismatch. Synthetic queries from an LLM don’t always match real user queries. Always validate on a held-out set of real production queries.
  • Bias amplification. The student inherits the teacher’s biases and quirks. If the teacher always answers in markdown, the student will too — even when that’s wrong for the use case.
  • Long-tail blind spots. The teacher LLM has its own gaps; synthetic data can’t cover what the teacher doesn’t know. Mix in human-curated examples for hard or domain-specific cases.
  • Closed-loop collapse. Don’t train iteratively on your own model’s outputs without re-anchoring to real data. The variance shrinks; the distribution drifts.

Synthetic data is the bridge from “frontier LLMs are smart and expensive” to “deploy small specialized models that inherit their smarts at a fraction of the cost”.

Go further

Why isn't synthetic data 'cheating'?

It's distillation in another guise. The frontier LLM that generates the data has absorbed the supervision signal from billions of human-written tokens; piping its outputs into a smaller model's training is a way to compress that supervision. The information has to come from somewhere — synthetic data routes it through a teacher LLM rather than direct annotation.

What about model collapse — doesn't training on synthetic data degrade quality over generations?

Yes if you naively train one synthetic model on the previous one's output and iterate. No if you mix synthetic data with real data and use it for narrow specialization. Production recipes (Tülu, OpenHermes, Magpie) all blend; the collapse failure mode requires a closed loop with no real-data anchor.

How does ZeroEntropy use synthetic data?

zerank-2's training data is synthetic at multiple levels: queries are LLM-generated from a corpus, pairwise relevance judgments come from a frontier-LLM ensemble (Claude, GPT, Gemini), and the recovered Thurstone scores are themselves a kind of synthetic continuous label. The supervision pipeline is end-to-end model-generated.
