Why Evaluation Metrics for Reranking Matter in Search Engine Performance
Introduction
When you search for "best headphones for running," you expect relevant results at the top. But how do we measure whether a reranking algorithm is actually improving search quality?
Evaluation metrics are quantitative measures that tell us:
Are the right results appearing in the top positions?
How many relevant results did we find?
How well does the ranking match user expectations?
This guide covers the most important metrics used in production search systems, with working Python code for each one.
What Is Reranking and Why Does It Happen?
Reranking is the process of reordering initial search results to improve overall search accuracy. Even if a search engine finds relevant pages, the initial ranking may not be perfect. Reranking steps in to reorder the results so that the most accurate, useful, and high-quality content is shown first. This is especially important in AI-powered search systems, recommendation engines, and e-commerce platforms, where showing the right item first can make all the difference.
Why Evaluation Metrics Are Needed
Without proper evaluation, there’s no way to know if reranking is actually making the results better. Evaluation metrics are used to measure how close the reordered results are to what users want. They help developers figure out whether the changes in ranking lead to improved accuracy, higher user satisfaction, and better click-through rates.
Common Metrics Used in Reranking
Some popular metrics include:
Precision and Recall: Precision checks how many of the top results are relevant, while recall checks how many relevant results were found in total.
Mean Reciprocal Rank (MRR): This measures how high the first correct result appears in the ranking.
Normalized Discounted Cumulative Gain (NDCG): A more advanced metric that gives higher value to relevant results appearing at the top of the list.
These metrics give developers a clear way to compare different reranking algorithms and choose the one that delivers the best experience for users.
Precision @ K
What It Measures
Precision@K answers: "Of the top K results I returned, what percentage are actually relevant?"
This metric focuses on quality over quantity - it only cares about whether you're showing relevant results in the top K positions.
Formula
Precision@K = (Number of Relevant Documents in Top K) / K
Python Implementation
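A minimal sketch, assuming binary relevance judgments supplied as a set of document IDs (the function name and the IDs are illustrative):

```python
def precision_at_k(relevant: set, ranked_results: list, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    top_k = ranked_results[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

# Example: 3 of the top 5 results are relevant
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(relevant, ranked, 5))  # 0.6
```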
Real-World Example
Query: "best headphones for running"
| Rank | Document | Relevant? |
|---|---|---|
| 1 | Waterproof Sport Earbuds Review | ✅ YES |
| 2 | Best Running Headphones 2024 | ✅ YES |
| 3 | Office Headphones Comparison | ❌ NO |
| 4 | Wireless Earbuds for Athletes | ✅ YES |
| 5 | Gaming Headset Guide | ❌ NO |
Calculation:
Relevant in top 5: 3 documents
Precision@5 = 3/5 = 0.60 (60%)
Key Insights
✅ Strengths:
Simple and intuitive
Focuses on top results (what users actually see)
Easy to explain to stakeholders
❌ Limitations:
Doesn't care about position (rank 1 vs rank 5 treated equally)
Ignores how many total relevant docs exist
Can be misleading if K is too small or too large
Recall @ K
What It Measures
Recall@K answers: "Of all relevant documents that exist, what percentage did I find in the top K results?"
This metric focuses on completeness - are you finding all the relevant results?
Formula
Recall@K = (Number of Relevant Documents in Top K) / (Total Relevant Documents)
Python Implementation
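A minimal sketch, assuming the set of all relevant document IDs is known (which, as noted below, is often not the case in production):

```python
def recall_at_k(relevant: set, ranked_results: list, k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked_results[:k] if doc in relevant)
    return hits / len(relevant)

# Example: 10 relevant docs exist, 3 appear in the top 5
relevant = {f"r{i}" for i in range(1, 11)}
ranked = ["r1", "x1", "r2", "x2", "r3"]
print(recall_at_k(relevant, ranked, 5))  # 0.3
```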
Real-World Example
Scenario: Database contains 10 relevant documents about "running headphones"
Your search returns:
Top 5 results: Found 3 relevant docs → Recall@5 = 3/10 = 0.30 (30%)
Top 10 results: Found 6 relevant docs → Recall@10 = 6/10 = 0.60 (60%)
Top 20 results: Found 8 relevant docs → Recall@20 = 8/10 = 0.80 (80%)
Key Insights
✅ Strengths:
Shows how complete your results are
Critical for research and legal search (need to find ALL relevant docs)
Naturally increases with K
❌ Limitations:
Requires knowing total relevant docs (often impossible in production)
Higher recall often means lower precision
Less important for user-facing search (users rarely look past page 1)
F1 Score @ K
What It Measures
F1@K is the harmonic mean of Precision@K and Recall@K. It balances both metrics, giving you a single score that accounts for both quality and completeness.
Formula
F1@K = 2 × (Precision@K × Recall@K) / (Precision@K + Recall@K)
Python Implementation
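A minimal sketch that combines the two scores already computed above (the function name is illustrative):

```python
def f1_at_k(precision: float, recall: float) -> float:
    """Harmonic mean of Precision@K and Recall@K."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(0.60, 0.30))  # ≈ 0.40
```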
Real-World Example
Given:
Precision@5 = 0.60 (3 out of 5 results are relevant)
Recall@5 = 0.30 (found 3 out of 10 total relevant docs)
Calculation:
F1@5 = 2 × (0.60 × 0.30) / (0.60 + 0.30) = 0.36 / 0.90 = 0.40 (40%)
Key Insights
✅ Strengths:
Single metric that balances precision and recall
Penalizes extreme imbalance (high precision but low recall, or vice versa)
Useful for comparing systems with different precision/recall trade-offs
❌ Limitations:
Less interpretable than precision or recall alone
Assumes precision and recall are equally important (not always true)
Still requires knowing total relevant docs
Mean Reciprocal Rank (MRR)
What It Measures
MRR answers: "How high does the first relevant result appear in my ranking?"
This metric is perfect for scenarios where users need ONE good answer (e.g., question answering, navigational search).
Formula
MRR = (1/|Q|) × Σ (1/rank_i)
where |Q| is the number of queries and rank_i is the position of the first relevant result for query i (the reciprocal rank is 0 if no relevant result is found).
Python Implementation
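A minimal sketch, assuming each query comes with a set of relevant document IDs and a ranked result list (the function names are illustrative):

```python
def reciprocal_rank(relevant: set, ranked_results: list) -> float:
    """1/rank of the first relevant result, or 0 if none is found."""
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries: list) -> float:
    """Average reciprocal rank over (relevant_set, ranked_results) pairs."""
    return sum(reciprocal_rank(rel, res) for rel, res in queries) / len(queries)

# The three queries from the example below: RR = 1.0, 0.5, 0.333
queries = [
    ({"paris"}, ["paris"]),
    ({"python-docs"}, ["ruby-guide", "python-docs"]),
    ({"nyc-pizza"}, ["la-restaurants", "chicago-pizza", "nyc-pizza"]),
]
print(round(mean_reciprocal_rank(queries), 3))  # 0.611
```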
Real-World Example
Query 1: "capital of France"
Rank 1: Paris → RR = 1/1 = 1.0 ✨
Query 2: "python tutorial"
Rank 1: Ruby guide (irrelevant)
Rank 2: Python docs → RR = 1/2 = 0.5
Query 3: "best pizza NYC"
Rank 1: LA restaurants (irrelevant)
Rank 2: Chicago pizza (irrelevant)
Rank 3: NYC pizza guide → RR = 1/3 = 0.333
MRR = (1.0 + 0.5 + 0.333) / 3 = 0.611
Key Insights
✅ Strengths:
Perfect for "single answer" scenarios (QA, navigation)
Heavily weights top position (1st place much better than 2nd)
Easy to interpret: higher = relevant results appear earlier
❌ Limitations:
Only considers first relevant result (ignores all others)
Not suitable when users need multiple relevant results
Can be misleading if only one relevant doc exists per query
Normalized Discounted Cumulative Gain (NDCG)
What It Measures
NDCG is the gold standard for ranking evaluation. Unlike previous metrics that treat all relevant documents equally, NDCG allows for graded relevance (e.g., highly relevant, somewhat relevant, not relevant) and heavily penalizes placing relevant docs lower in the ranking.
Formula
DCG@K = Σ (rel_i / log2(i + 1)) for i = 1 to K
NDCG@K = DCG@K / IDCG@K
where rel_i is the graded relevance of the result at rank i, and IDCG@K is the DCG of the ideal ranking (results sorted by relevance). Note that some formulations use the gain 2^rel_i − 1 instead of rel_i; the example below uses the linear form.
Python Implementation
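A minimal sketch using the linear-gain form (rel_i / log2(i + 1)); the function names are illustrative:

```python
import math

def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain of the first k graded relevance scores."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances: list, k: int) -> float:
    """DCG normalized by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevances from the example below, in ranked order
print(round(ndcg_at_k([3, 2, 0, 1, 2], 5), 3))  # ≈ 0.96
```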
Real-World Example
Query: "best laptop for programming"
Your system returns (with graded relevance 0-3):
| Rank | Document | Relevance | Discount (1/log2(rank+1)) | Contribution |
|---|---|---|---|---|
| 1 | "Top Programming Laptops 2024" | 3 | 1/log2(2) = 1.000 | 3.000 |
| 2 | "Developer Laptop Guide" | 2 | 1/log2(3) = 0.631 | 1.262 |
| 3 | "Gaming Laptop Review" | 0 | 1/log2(4) = 0.500 | 0.000 |
| 4 | "Budget Coding Laptops" | 1 | 1/log2(5) = 0.431 | 0.431 |
| 5 | "MacBook Pro for Developers" | 2 | 1/log2(6) = 0.387 | 0.774 |
DCG@5 = 3.000 + 1.262 + 0.000 + 0.431 + 0.774 = 5.467
Ideal ranking (sorted by relevance): [3, 2, 2, 1, 0]
| Rank | Relevance | Discount | Contribution |
|---|---|---|---|
| 1 | 3 | 1.000 | 3.000 |
| 2 | 2 | 0.631 | 1.262 |
| 3 | 2 | 0.500 | 1.000 |
| 4 | 1 | 0.431 | 0.431 |
| 5 | 0 | 0.387 | 0.000 |
IDCG@5 = 3.000 + 1.262 + 1.000 + 0.431 + 0.000 = 5.693
NDCG@5 = 5.467 / 5.693 = 0.960 (96%)
This is a very good ranking! The system is close to optimal.
Key Insights
✅ Strengths:
Handles graded relevance (not just binary relevant/not relevant)
Heavily penalizes relevant docs appearing low in ranking (logarithmic discount)
Industry standard for ranking evaluation
Normalized (0-1 scale), easy to compare across queries
❌ Limitations:
Requires relevance judgments (expensive to obtain)
More complex to calculate and explain
Logarithmic discount might not match user behavior in all scenarios
Final Thoughts
In the world of search engines, delivering the right result at the right time is everything. Evaluation metrics provide the tools to measure and improve that ability. By tracking these metrics, developers can fine-tune their systems to make sure users always get the best possible results.