The naive validation — train two models, one on synthetic and one on real, compare downstream metrics — works but is expensive when “real” is the bottleneck you’re trying to avoid. The pragmatic approach is multi-tiered.
First, distributional checks. Compute embeddings of synthetic and real queries, project them with UMAP or t-SNE, and check that the synthetic points cover the regions occupied by real queries without drifting into off-manifold regions where no real query appears. Then compare the marginal distributions of simple features: length, vocabulary novelty, and syntactic complexity (parse-tree depth, mean dependency length). Flag any feature where synthetic and real diverge by more than about two standard deviations, for example where the synthetic mean sits that far from the real mean, measured in real-data standard deviations.
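The marginal comparison can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `flag_divergent_features`, the dict-of-lists input shape, and the choice to measure divergence as the mean shift in units of the real standard deviation are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_divergent_features(real, synthetic, threshold=2.0):
    """Return features whose synthetic mean is far from the real mean.

    `real` and `synthetic` map a feature name (e.g. "length") to a list
    of per-example values. Divergence is measured as the absolute mean
    shift divided by the standard deviation of the real values
    (an assumed metric; a distribution distance would also work).
    """
    flagged = {}
    for name, real_values in real.items():
        spread = stdev(real_values)
        shift = abs(mean(synthetic[name]) - mean(real_values))
        if spread == 0:
            # Real values are constant: any shift at all is a divergence.
            score = float("inf") if shift > 0 else 0.0
        else:
            score = shift / spread
        if score > threshold:
            flagged[name] = score
    return flagged


# Hypothetical feature values for illustration only.
real = {"length": [10, 12, 11, 13, 9], "parse_depth": [3, 4, 3, 5, 4]}
synthetic = {"length": [30, 32, 31, 29, 33], "parse_depth": [3, 4, 4, 5, 3]}
flagged = flag_divergent_features(real, synthetic)
```

Here only `length` would be flagged, since the synthetic queries are far longer than the real ones while parse depth has the same mean. A mean shift misses differences in spread or shape, so it is a first filter and not a substitute for looking at the histograms.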
Second, judge-based quality scoring. Use a held-out frontier LLM (different from the one that generated the data) to score each synthetic example for relevance, naturalness, and difficulty. Calibrate the judge first: on a small sample, its score distribution should match what a human reviewer produces for the same examples. Then discard the bottom 10 to 20% of examples by judge score before training.
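The filtering step, once judge scores exist, is straightforward. The sketch below assumes the scores have already been collected into a list aligned with the examples; the function name `drop_lowest_scored` and the default of 15% are illustrative choices, not values taken from any particular pipeline.

```python
def drop_lowest_scored(examples, scores, drop_fraction=0.15):
    """Remove the lowest-scoring fraction of examples, keeping input order.

    `scores[i]` is the judge score for `examples[i]`. The number dropped
    is rounded down, so small datasets may lose fewer examples than the
    nominal fraction suggests.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_drop = int(len(examples) * drop_fraction)
    # Indices sorted from lowest to highest score; ties keep input order.
    ranked = sorted(range(len(examples)), key=lambda i: scores[i])
    dropped = set(ranked[:n_drop])
    return [ex for i, ex in enumerate(examples) if i not in dropped]


# Hypothetical examples and judge scores on a 1-5 scale.
examples = list("abcdefghij")
scores = [5, 1, 4, 3, 5, 2, 4, 5, 3, 4]
kept = drop_lowest_scored(examples, scores, drop_fraction=0.2)
```

With ten examples and a 20% cut, the two lowest-scored items (`b` and `f`) are removed and the rest keep their original order. Cutting by rank, not by a fixed score threshold, keeps the retained fraction stable even if the judge's scale drifts between runs.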
Third, the actual training loop. Train on synthetic data and evaluate on a held-out set of real production examples (this is the eval set you can never train on). If the validation loss on real data tracks the training loss reasonably tightly, the synthetic distribution is in the same neighborhood as production. If the validation loss plateaus far above the training loss, the synthetic distribution has the wrong shape.
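One way to turn "tracks reasonably tightly" into a concrete check is to average the gap between real-data validation loss and training loss over the last few epochs. This is a rough sketch: the helper names and the `max_gap` value of 0.25 are assumptions for illustration, and a sensible threshold depends on the loss function and the task.

```python
def final_gap(train_losses, real_val_losses, last_k=3):
    """Mean of (real validation loss - training loss) over the last k epochs."""
    if len(train_losses) != len(real_val_losses):
        raise ValueError("loss histories must have the same length")
    if not train_losses:
        raise ValueError("loss histories must not be empty")
    k = min(last_k, len(train_losses))
    gaps = [v - t for t, v in zip(train_losses[-k:], real_val_losses[-k:])]
    return sum(gaps) / k


def tracks_production(train_losses, real_val_losses, max_gap=0.25, last_k=3):
    """True if validation loss on real data ends close to training loss.

    `max_gap` is an assumed, task-specific tolerance, not a universal value.
    """
    return final_gap(train_losses, real_val_losses, last_k) <= max_gap


# Hypothetical per-epoch loss histories.
train = [2.0, 1.5, 1.0, 0.8, 0.6]
val_close = [2.1, 1.6, 1.1, 0.9, 0.7]    # follows the training curve
val_plateau = [2.1, 1.9, 1.8, 1.8, 1.8]  # stalls far above it
```

In this example `tracks_production(train, val_close)` holds while `tracks_production(train, val_plateau)` does not, which mirrors the two outcomes described above.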
The unforgiving rule is that a held-out set of real data is required at evaluation time. Synthetic-only evaluation is circular — the teacher LLM and the student inherit the same blind spots, so synthetic-judged synthetic data passes synthetic evals trivially while failing in production.