Mean Reciprocal Rank is the average of 1/rank across queries, where rank is the position of the first relevant document. Heavily front-loaded — only the top result really matters.
MRR (Mean Reciprocal Rank) scores a ranking by the position of its first relevant document. For each query:
Find the rank of the first relevant document.
Take the reciprocal: 1/rank.
Then average across queries. Position 1 contributes 1.0; position 2, 0.5; position 5, 0.2; no relevant doc, 0.
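The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the `runs` data are hypothetical, and each query is assumed to be a list of boolean relevance flags in ranked order.

```python
from typing import Sequence


def reciprocal_rank(ranked_relevance: Sequence[bool]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[bool]]) -> float:
    """Average the reciprocal rank over all queries (0.0 for an empty set)."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


# Hypothetical data: one list of relevance flags per query, in ranked order.
runs = [
    [True, False, False],                 # first relevant at rank 1 -> 1.0
    [False, True, False],                 # rank 2 -> 0.5
    [False, False, False, False, True],   # rank 5 -> 0.2
    [False, False, False],                # no relevant doc -> 0.0
]
mrr = mean_reciprocal_rank(runs)  # (1.0 + 0.5 + 0.2 + 0.0) / 4, about 0.425
print(round(mrr, 3))
```

Queries with no relevant document stay in the denominator, so they pull the mean down instead of being silently dropped.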
When to use MRR
MRR is most useful when the user only cares about the first relevant result. Classical use cases:
Top-1 consumers where MRR is the right metric
Question answering — one correct answer; if it’s at the top, you’re done
Navigational search — user is looking for a specific page
Customer support chatbots — one document answers the question
RAG with single-passage prompts where only the top result feeds the LLM
Tool-routing agents that pick exactly one tool per turn
For corpora where multiple documents are relevant per query and the user is browsing through them, NDCG@K is a better fit because it credits multiple relevant docs at the top, not just the first one.
For a query with exactly one relevant doc at rank r, average precision is just 1/r, the reciprocal rank. So MAP and MRR coincide query-by-query in this regime, and a benchmark with mostly single-relevant queries (like much of MS MARCO) collapses the two metrics. This is why papers that headline MRR on MS MARCO are also implicitly headlining MAP, and why arguing about which metric is “better” is moot for that specific dataset shape.
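A small check of the claim that average precision reduces to the reciprocal rank when there is exactly one relevant doc. This is a sketch under one assumption: every relevant doc appears somewhere in the ranked list, so average precision can be normalized by the number of hits found. The helper names and example rankings are hypothetical.

```python
from typing import Sequence


def reciprocal_rank(flags: Sequence[bool]) -> float:
    """1/rank of the first relevant doc, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(flags: Sequence[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant doc.

    Assumes all relevant docs are present in `flags`.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


# One relevant doc at rank 4: both metrics give 1/4.
single = [False, False, False, True]
print(average_precision(single), reciprocal_rank(single))

# Two relevant docs (ranks 1 and 4): the metrics diverge.
multi = [True, False, False, True]
print(average_precision(multi), reciprocal_rank(multi))
```

With a single relevant doc the only precision term is 1/r, divided by one hit, which is why the two values match.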
MRR vs NDCG@K vs Recall@K
MRR: only first relevant doc, very front-loaded.
NDCG@K: all relevant docs in top-K, position-discounted.
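Recall@K: fraction of the relevant docs that appear in the top-K, position-blind. (With one relevant doc per query this reduces to "is it in the top-K at all".)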
Report all three plus per-K curves. Each answers a different question; pick the one that matches your downstream consumer before optimizing.
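To make the contrast concrete, the sketch below scores two hypothetical rankings of the same five docs with all three metrics. Several choices here are assumptions for illustration: relevance is given as integer grades (0 means not relevant), DCG uses linear gain with a log2 discount, the ideal ranking is built from the grades in the list itself, and recall is the fraction of relevant docs retrieved in the top K.

```python
import math
from typing import Sequence


def reciprocal_rank(grades: Sequence[int]) -> float:
    """1/rank of the first doc with a non-zero grade, else 0.0."""
    for rank, grade in enumerate(grades, start=1):
        if grade > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(grades: Sequence[int], k: int) -> float:
    """Fraction of the relevant docs in the list that appear in the top k."""
    total_relevant = sum(1 for g in grades if g > 0)
    if total_relevant == 0:
        return 0.0
    return sum(1 for g in grades[:k] if g > 0) / total_relevant


def dcg_at_k(grades: Sequence[int], k: int) -> float:
    """Discounted cumulative gain: linear gain, log2 position discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades: Sequence[int], k: int) -> float:
    """DCG@k divided by the DCG@k of the best possible ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# Same three relevant docs, two hypothetical orderings.
front_loaded = [1, 1, 1, 0, 0]
spread_out = [1, 0, 0, 1, 1]

for name, grades in [("front_loaded", front_loaded), ("spread_out", spread_out)]:
    print(name,
          f"RR={reciprocal_rank(grades):.2f}",
          f"NDCG@3={ndcg_at_k(grades, 3):.2f}",
          f"Recall@3={recall_at_k(grades, 3):.2f}")
```

Both rankings have a reciprocal rank of 1.0 because the first hit is at position 1, while NDCG@3 and Recall@3 separate them.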
Limitations
Binary by construction — MRR doesn’t use graded relevance. A “perfectly relevant” doc at position 1 and a “marginally relevant” doc at position 1 both contribute 1.0.
Sensitive to individual queries: a query where the relevant doc lands at position 50 contributes only 0.02, almost the same as a miss, so a few such queries pull the mean down noticeably on a small test set. Trim outliers or use the median reciprocal rank for a more robust read.
Doesn’t reward depth — your model could put one perfect result first and complete garbage at positions 2-10 and MRR wouldn’t notice. NDCG@10 would.
Go further
When should I prefer MRR over NDCG@10?
When your downstream consumer reads exactly one result — a QA system pulling the top passage, a routing agent picking one tool, a navigational search where the user clicks the first hit. NDCG is for ranked-list consumers; MRR is for top-1 consumers.
Reciprocal rank is nearly flat past position 10 (0.1 at rank 10, 0.02 at rank 50), so once the relevant doc is buried the per-query contribution is near-zero either way. At the top it is steep: a single query where the doc lands at position 1 (1.0) versus position 2 (0.5) is a half-point swing in that query's score. Trim long-tail queries or report median-RR alongside MRR for stability.
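Reporting the median next to the mean takes only the standard library. The ranks below are made-up illustration data, one first-relevant rank per query.

```python
from statistics import mean, median

# Hypothetical rank of the first relevant doc for five queries.
first_relevant_ranks = [1, 1, 2, 3, 50]

reciprocal_ranks = [1.0 / rank for rank in first_relevant_ranks]

mrr = mean(reciprocal_ranks)          # pulled down by the rank-50 query
median_rr = median(reciprocal_ranks)  # middle value, ignores the tail

print(f"MRR={mrr:.3f}  median RR={median_rr:.3f}")
```

If the two numbers diverge a lot, the mean is being driven by a handful of queries at one extreme, and those queries are worth inspecting.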
MRR does not use graded labels directly, because it is binary by construction. If you have graded labels, NDCG@K is the natural fit. You can hack a graded MRR by treating only top-grade docs as 'relevant', but you're throwing away signal that NDCG would use.
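The thresholding hack can be sketched as follows. The function name, the `min_grade` parameter, and the 0 to 3 grading scale are assumptions for illustration, not an established API.

```python
from typing import Sequence


def thresholded_reciprocal_rank(grades: Sequence[int], min_grade: int) -> float:
    """1/rank of the first doc whose grade is at least `min_grade`, else 0.0."""
    for rank, grade in enumerate(grades, start=1):
        if grade >= min_grade:
            return 1.0 / rank
    return 0.0


# Hypothetical graded labels on a 0-3 scale, in ranked order.
grades = [1, 3, 0, 2]

strict = thresholded_reciprocal_rank(grades, min_grade=3)   # top grade only
lenient = thresholded_reciprocal_rank(grades, min_grade=1)  # any relevance

print(strict, lenient)
```

The strict threshold ignores the marginally relevant doc at rank 1, and the lenient one treats it the same as a perfect hit. Either way the grades collapse to a yes/no decision, which is the lost signal described above.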