Precision@K is the fraction of the top-K returned documents that are relevant. It is the classical IR metric that retrieval evaluation largely moved away from in favor of NDCG, but it is still the right choice when every position in the result list carries equal weight.
Precision@K is the fraction of the top-K results that are relevant:
Precision@K = (number of relevant results in the top K) / K
For a single query, if 7 of the top 10 results are relevant, Precision@10 = 0.7. Average across queries to get a corpus-level number.
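The definition and the averaging step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the document ids, and the choice to divide by K even when fewer than K results come back are all assumptions made for the example.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k ranked ids that appear in relevant_ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    # Assumption: divide by k, not by the number of results actually returned,
    # so a system that returns fewer than k results is penalized for empty slots.
    return hits / k


def mean_precision_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Precision@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical query: 7 of the top 10 results are relevant.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d2", "d3", "d5", "d6", "d8", "d9"}
print(precision_at_k(ranked, relevant, k=10))  # 0.7
```

Some evaluation toolkits divide by the number of results returned instead of K when the list is short; whichever convention you pick, use it consistently across systems being compared.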
It’s the natural “of what I returned, how much was good?” metric, the mirror image of Recall@K, which asks “of what was good, how much did I find?”.
Why retrieval moved past precision
Classical IR papers from the 1990s lean heavily on Precision@K and MAP. Modern retrieval evaluations lean on NDCG@K instead. The reason is position weighting.
Precision@10 counts a relevant doc at position 1 and a relevant doc at position 10 equally. Users don’t. Eye-tracking and click data show attention drops sharply with rank. NDCG bakes that into the metric with a logarithmic discount; precision treats the result list as an unordered set of K items.
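A small sketch makes the difference concrete. It assumes binary relevance labels listed in rank order and the common log2(rank + 1) discount; as a simplification, the ideal ranking is built by sorting the labels in the list itself, which assumes no relevant documents exist outside the list. The function names are illustrative.

```python
import math
from typing import Sequence


def precision_at_k(rels: Sequence[int], k: int) -> float:
    """rels[i] is 1 if the result at rank i + 1 is relevant, else 0."""
    return sum(rels[:k]) / k


def dcg_at_k(rels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels: Sequence[int], k: int) -> float:
    """DCG normalized by the DCG of the same labels sorted best-first (a simplification)."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Same three relevant documents, placed at the top versus at the bottom of the list.
relevant_first = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
relevant_last = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

for name, rels in [("relevant first", relevant_first), ("relevant last", relevant_last)]:
    print(name, precision_at_k(rels, 10), round(ndcg_at_k(rels, 10), 3))
```

Both lists score the same Precision@10, while NDCG@10 drops well below 1.0 for the list that buries its relevant documents at ranks 8 to 10.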
For ranked-list consumers — users scrolling search results, RAG pipelines passing top-K to an LLM that weights them by order — NDCG is the more honest metric. Precision is only correct when every returned position is treated equally downstream.
When precision is still the right metric
Equal-weight downstream consumption. All K results get reviewed, summarized, or concatenated together. A RAG step that joins the top-5 chunks into one prompt, with no per-chunk weighting, is precision-shaped.
Human triage. A reviewer audits every flagged result. They don’t read more carefully because something is at position 1 — they spend the same time per item.
Binary relevance and small K. When K is small (3-5) and labels are binary, precision is easier to interpret than NDCG and the position-weighting difference is small.
Outside those cases, NDCG@K or MRR is usually the better choice.
Order matters in the prompt’s structure but rarely in the LLM’s attention across short contexts. With 5 chunks of ~500 tokens each, the generator reads all 2500 tokens; the position-5 chunk isn’t discounted relative to the position-1 chunk the way a lower-ranked result is for an end user scrolling Google. What matters is whether the answer text appears in any of the chunks at all, which is a precision-shaped question. Context rot (originally quantified by Liu et al. 2023 as “lost in the middle”) does introduce some position weighting at long context, but for short concatenated RAG contexts (under ~3000 tokens) the LLM treats the chunk set roughly as a bag. NDCG would punish you for ordering choices that don’t actually affect answer quality. Precision tracks what the system is actually doing.
Precision vs recall
The two answer complementary questions:
Precision — “of what we returned, how much was relevant?”
Recall — “of what was relevant, how much did we return?”
A trivial system returning every document has perfect recall and terrible precision. A system returning only its single most confident result usually has high precision and terrible recall. Tuning K (or a score threshold) sweeps the curve between them; the F1 score, the harmonic mean of precision and recall, gives a single number if you want one.
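The tradeoff is easy to see by sweeping K over one ranked list. The sketch below uses made-up document ids and an illustrative function name; it assumes binary relevance and that the full set of relevant documents for the query is known, including ones the system never returned.

```python
from typing import Sequence, Set, Tuple


def precision_recall_f1_at_k(
    ranked_ids: Sequence[str], relevant_ids: Set[str], k: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) computed over the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 when both are 0.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


# Hypothetical query: four relevant documents, one of which ("z") is never retrieved.
ranked = ["a", "b", "c", "d", "e", "f"]
relevant = {"a", "c", "f", "z"}

for k in range(1, len(ranked) + 1):
    p, r, f1 = precision_recall_f1_at_k(ranked, relevant, k)
    print(f"K={k}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

In this toy list precision starts at 1.0 for K=1 and ends at 0.5 for K=6, while recall climbs from 0.25 to 0.75 and can never reach 1.0 because one relevant document is missing from the results.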
Pick the metric that matches how your downstream consumer reads the result list. Precision when they read everything; NDCG when they read top-down.
Go further
Why did retrieval mostly move away from precision toward NDCG?
Precision@K treats position 1 and position K identically: a relevant doc anywhere in the top-K counts the same. Real users read top-down and tire fast, so position 1 is worth much more than position 10. NDCG bakes that in with a logarithmic discount; Precision doesn't. For ranked-list consumers, NDCG is the more honest metric.
When is Precision@K still the right metric?
When every position in the top-K is consumed equally: a classifier emitting K candidates that all get manual review, a retrieval audit where humans triage every result, or a multi-document RAG step where all K passages are concatenated as context. There the consumer reads them all, so position weighting is misleading.
How does Precision@K interact with the precision-recall tradeoff?
Precision tends to rise as K shrinks (you only keep the most-confident candidates) and to fall as K grows. Recall does the opposite: it can only stay flat or increase as K grows. The classical curve sweeps K (or a score threshold) and reports both; the area under it is closely related to average precision, and MAP is that value averaged over queries. Tune K based on the downstream consumer's tolerance for noise.
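The link to MAP can be sketched with the usual non-interpolated definition of average precision: take Precision@K at each rank K where a relevant document appears, then divide the sum by the total number of relevant documents. The function names and document ids below are illustrative assumptions.

```python
from typing import List, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Non-interpolated average precision for a single query."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision@rank, sampled only at ranks that hold a relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical query with relevant documents at ranks 1, 3 and 6.
ranked = ["a", "b", "c", "d", "e", "f"]
relevant = {"a", "c", "f"}
print(round(average_precision(ranked, relevant), 4))  # (1/1 + 2/3 + 3/6) / 3
```

Because each term is a Precision@K value taken at the point where recall steps up, the sum behaves like a discrete approximation of the area under the precision-recall curve.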