Data Labeling

Also known as: annotation, human labeling, ground truth labeling

TL;DR

Human-in-the-loop annotation of training data — crowdsourced (Mechanical Turk, Scale, Surge), expert (domain specialists), and gold-standard sets. Distinct from RLHF preferences.

Data labeling is the human-in-the-loop production of supervision signal — pairing inputs with target outputs that a model learns to imitate. It covers everything from “is this image a cat” to “is this medical report describing pneumonia” to “rewrite this customer email in a friendlier tone”. The labeling layer is where most ML projects either succeed or fail: a 50K-example dataset with sharp, consistent labels beats a 500K-example dataset with mediocre ones at almost every task.

The three tiers

Labeling tiers and where each fits

Crowdsourced. Mechanical Turk, Scale AI, Surge HQ, Toloka. Cheap (cents per label), fast (millions per week), noisy. Best for tasks any educated adult can do — image classification, sentiment, simple span annotation. Inter-annotator agreement typically 0.6-0.8 Cohen’s kappa with three-way redundancy and majority vote.
Expert. Radiologists labeling chest X-rays, lawyers labeling contract clauses, native speakers of Yoruba labeling sentiment. 10-100x the per-label cost of crowd, but the only source of labels that crowds simply cannot produce. Agreement reaches 0.85-0.95 on well-scoped tasks.
Gold-standard. Small (50-500 examples), hand-curated, often labeled by the model team itself or trusted reviewers. The team’s calibration set — used to spot-check crowd labels, audit benchmark performance, and validate every other supervision pipeline. Per-example cost is high; total volume is low.

A real labeling stack uses all three. Gold sets calibrate; expert labels cover the hard cases the crowd can’t reach; crowd labels provide bulk volume. Matching the wrong tier to a task (using crowd labels for radiology, or paying for expert labels on basic image classification) is one of the most common labeling-budget mistakes.
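As a minimal sketch of the three-way redundancy and majority vote mentioned for the crowd tier, the snippet below aggregates several crowd labels per example and flags examples with no clear winner. The function name, the example ids, and the idea of routing ties to review are illustrative assumptions, not a description of any vendor's pipeline.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, vote_share); winning_label is None on a tie."""
    counts = Counter(labels)
    ranked = counts.most_common()
    top_label, top_count = ranked[0]
    share = top_count / len(labels)
    # A tie for first place means the example has no majority label.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, share
    return top_label, share


# Hypothetical crowd output: example id -> labels from three annotators.
crowd_labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}

aggregated = {ex: majority_vote(votes) for ex, votes in crowd_labels.items()}

# Examples without a majority are candidates for expert or gold review.
needs_review = [ex for ex, (label, _) in aggregated.items() if label is None]
```

Here the three-way split on img_003 lands in needs_review, which is one plausible way to hand hard cases from the crowd tier to a more expensive one.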

Distinguishing labeling from RLHF preferences

Labeling produces a target value for an input: the correct class, the correct span, the correct rewritten sentence. RLHF preference data produces a comparison: of two outputs A and B for a given prompt, which is better. They are both human-supplied supervision, but they feed different loss functions and different stages of the post-training stack.

Modern post-training pipelines almost always use both. Supervised fine-tuning consumes labeled (input, correct-output) pairs. DPO or PPO then consumes pairwise preference pairs to nudge the model toward outputs annotators rated higher. The two stages are complementary — SFT teaches the shape of the right answer, preferences refine the ranking among shape-correct candidates.
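The difference between the two kinds of supervision shows up in the shape of the records. The sketch below uses hypothetical dataclasses (LabeledExample, PreferencePair) and a hypothetical routing helper; the field names are assumptions chosen for illustration and do not come from any particular training library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """A labeling record: one input paired with its target output."""
    input_text: str
    target: str


@dataclass(frozen=True)
class PreferencePair:
    """A preference record: two outputs for one prompt, one judged better."""
    prompt: str
    chosen: str
    rejected: str


def training_stage(record):
    """Route a record to the stage that consumes it (illustrative names)."""
    if isinstance(record, LabeledExample):
        return "sft"
    if isinstance(record, PreferencePair):
        return "preference"
    raise TypeError(f"unsupported record type: {type(record).__name__}")


sft_record = LabeledExample(
    input_text="Rewrite in a friendlier tone: Your order is late.",
    target="Sorry for the wait! Your order is running a little late.",
)
pref_record = PreferencePair(
    prompt="Rewrite in a friendlier tone: Your order is late.",
    chosen="Sorry for the wait! Your order is running a little late.",
    rejected="Order late.",
)

stages = [training_stage(r) for r in (sft_record, pref_record)]
```

A labeled example carries an absolute target; a preference pair carries only a relative judgment, which is why the two feed different loss functions.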

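Inter-annotator agreement (IAA), measured with Cohen’s kappa for two annotators or Fleiss’ kappa for three or more, captures how often labelers pick the same answer on the same example, corrected for chance. A common rule of thumb: below 0.4 is “poor”; 0.4-0.6 “moderate”; 0.6-0.8 “substantial”; above 0.8 “near-perfect”. IAA also caps the accuracy you can measure. As a rough illustration, if labelers agree on 70% of examples and about half of the remaining 30% of disagreements are irreducible noise rather than fixable error, then roughly 15% of labels are effectively coin flips, and no model can reliably score above ~85% on this task.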
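To make the chance correction concrete, here is a minimal pure-Python sketch of Cohen’s kappa for two annotators, using the standard definition (observed agreement minus expected agreement, divided by one minus expected agreement). The annotator data is made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of examples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up sentiment labels from two annotators on ten examples.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

kappa = cohens_kappa(annotator_a, annotator_b)
```

In this example the annotators agree on 8 of 10 items (80% raw agreement), but expected chance agreement is 0.52, so kappa comes out near 0.58: “moderate” on the scale above, despite the high raw agreement.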

The fix when IAA is low is task redesign, not more labelers. That means tighter instructions with worked examples, narrower label categories (binary instead of 5-way), explicit “ambiguous” buckets, and calibration training where new labelers must hit 95% agreement with a gold set before their labels enter production. Spending more on labelers without redesigning the task is the single most common waste in labeling budgets.
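The calibration gate described above can be sketched as a simple check against the gold set. The function names, the data, and the choice to count unanswered gold examples as misses are assumptions made for this illustration; only the 95% threshold comes from the text.

```python
def gold_agreement(labeler_answers, gold_labels):
    """Fraction of gold examples where the labeler matches the gold label.

    Gold examples the labeler did not answer count as misses (an assumption).
    """
    if not gold_labels:
        raise ValueError("gold set is empty")
    hits = sum(
        labeler_answers.get(example_id) == gold
        for example_id, gold in gold_labels.items()
    )
    return hits / len(gold_labels)


def passes_calibration(labeler_answers, gold_labels, threshold=0.95):
    """True if the labeler may start contributing production labels."""
    return gold_agreement(labeler_answers, gold_labels) >= threshold


# Hypothetical 20-example gold set for a binary spam/ham task.
gold = {f"ex_{i}": ("spam" if i % 2 else "ham") for i in range(20)}

# One new labeler misses a single gold example, another misses two.
one_miss = dict(gold, ex_3="ham")
two_misses = dict(gold, ex_3="ham", ex_4="spam")

one_miss_ok = passes_calibration(one_miss, gold)
two_misses_ok = passes_calibration(two_misses, gold)
```

With 20 gold examples, one miss gives exactly 95% agreement and passes, while two misses give 90% and fail, which shows how coarse the gate is on a small gold set.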

In LLM-assisted labeling, the annotation UI wires a frontier LLM in. For each example, the LLM produces a candidate label (span boundaries, a class, a rewritten sentence), and the annotator sees the input, the draft, and accept/edit/reject affordances.

Throughput goes up 5-10x because typing is the slow step. IAA also goes up because annotators react to the same anchor instead of producing answers from scratch. The catch: if the LLM is systematically biased, human verification can fail to catch that bias because labelers anchor on the draft. Operational fix: every Nth example, suppress the draft and force the annotator to label from scratch, then compare the label distribution of those blind examples against the draft-assisted ones to surface drift.
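A minimal sketch of that operational fix, assuming a fixed audit interval and using total variation distance to compare the two label distributions. The interval of 10, the function names, and the choice of distance are illustrative assumptions; the text only specifies "every Nth example" and "compare distributions".

```python
from collections import Counter


def show_draft(example_index, audit_every=10):
    """Return False on every `audit_every`-th example so it is labeled blind."""
    return (example_index + 1) % audit_every != 0


def label_distribution(labels):
    """Map each label to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(dist_p, dist_q):
    """Total variation distance between two label distributions (0 to 1)."""
    labels = set(dist_p) | set(dist_q)
    return 0.5 * sum(abs(dist_p.get(l, 0.0) - dist_q.get(l, 0.0)) for l in labels)


# Made-up final labels: draft-assisted examples vs. blind audit examples.
assisted_labels = ["pos"] * 80 + ["neg"] * 20
blind_labels = ["pos"] * 6 + ["neg"] * 4

drift = total_variation(
    label_distribution(assisted_labels),
    label_distribution(blind_labels),
)
```

In this made-up data, draft-assisted labels are 80% positive while blind labels are 60% positive, giving a distance of about 0.2. A gap that size on a larger audit sample would suggest annotators are anchoring on a biased draft; on ten blind examples it is only a hint.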

The shape of labeling work has shifted, but humans haven’t left the loop. They’ve just moved from typing answers to auditing model-generated drafts and concentrating their effort on the cases the model gets wrong — which is, on most tasks, the long-tail few percent that determine whether the resulting model ships.

Go further

Where does labeling end and RLHF preference data begin?

Labeling produces a target value for an input — the correct class, the correct span, the correct answer. RLHF preference data produces a comparison — A is better than B for this prompt. Both are human-supplied supervision, but the loss functions consume them differently. Modern post-training pipelines use both: SFT on labeled correct outputs, then DPO or PPO on preference pairs.

What's the difference between crowd, expert, and gold labels in practice?

Crowd labels (MTurk, Scale, Surge) are cheap, fast, and noisy; they typically reach inter-annotator agreement of 0.6-0.8 Cohen's kappa and are mostly used for surface-level tasks. Expert labels (radiologists, lawyers, native speakers of low-resource languages) are 10-100x more expensive but get to 0.85-0.95 agreement on tasks crowds can't do at all. Gold sets are small (50-500 examples) hand-curated benchmarks that the team trusts absolutely and uses to calibrate everything else.

Has synthetic data made human labeling obsolete?

No, but it has changed the shape of the work. The pure 'human writes label from scratch' workflow is shrinking. The dominant pattern is now 'LLM drafts a candidate label, human verifies or corrects it', which is 5-10x faster per example and produces higher inter-annotator agreement than from-scratch labeling. Pure human labeling survives where the LLM is known to be wrong — adversarial cases, low-resource languages, expert domains.
