Statistical Significance in Retrieval Evals

Q: Bootstrap or permutation — which should I use?

Permutation test if you have paired query results from two systems on the same queries (almost always the case). It's exact under the null hypothesis of no difference. Bootstrap when you want a confidence interval on the metric itself, not just a p-value. In practice, both run in milliseconds; report both.

Q: How many queries do I need for reliable significance?

Rough rule: for NDCG@10 differences of 0.01-0.02 to be detectable at p<0.05, you need ~500-1000 queries. For 0.005 differences (typical leaderboard noise), 5000+. Most BEIR sub-benchmarks have 50-1000 — so single-dataset claims are usually underpowered. Average across BEIR datasets to gain power.
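The rule of thumb above can be sanity-checked with a textbook power calculation for a paired test. The sketch below is only an approximation under stated assumptions: it uses the normal approximation, and `sd_diff` (the standard deviation of the per-query paired differences) is a hypothetical input you would estimate from your own runs. The function name `required_queries` is illustrative, not from any library.

```python
import math
from statistics import NormalDist

def required_queries(delta, sd_diff, alpha=0.05, power=0.8):
    """Approximate number of queries needed to detect a mean paired
    difference `delta`, given the SD of per-query differences `sd_diff`.
    Normal approximation; two-sided test at level `alpha`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(((z_alpha + z_power) * sd_diff / delta) ** 2)

# Assumed SD of paired differences (0.1 is a placeholder, not a measured value).
for delta in (0.02, 0.01, 0.005):
    print(delta, required_queries(delta, sd_diff=0.1))
```

Halving the detectable difference roughly quadruples the required query count, which is why small leaderboard gaps need thousands of queries.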

Q: Why is a paired test usually correct here?

Both retrievers see the same queries — there's natural pairing. Paired tests (paired bootstrap, paired permutation, paired t-test on per-query NDCG) cancel out per-query difficulty variance, dramatically improving power. An unpaired test treats your two systems as independent samples and wastes most of your data.

Also known as: significance testing, bootstrap testing, permutation test

TL;DR

Retrieval evals report metrics like NDCG@10 averaged across queries, but each query is one sample, and most public benchmarks have hundreds of queries, not thousands. A '+0.5 NDCG' difference (half a point, or 0.005 on the 0-1 scale) is often noise.

A retrieval metric like NDCG@10 averaged over 100 queries is one number with a real standard error. Two retrievers reporting 0.421 and 0.426 NDCG@10 may differ at p around 0.5 — i.e., not at all in any meaningful sense. Eval reports without significance tests are reports of noise as if it were signal, and the fix is a fifty-line bootstrap.

The shape of retrieval-eval noise

For a benchmark of N queries, each query produces a per-query metric (NDCG@10 for that query). The reported headline is the mean. The noise is the standard error of that mean: roughly $\sigma / \sqrt{N}$, where $\sigma$ is the per-query standard deviation of the metric.

For NDCG@10, per-query SD is typically 0.2-0.4 across BEIR datasets. With N=100 queries, that’s a standard error of 0.02-0.04 on each system’s mean. Judged against that unpaired yardstick, a “+0.01 NDCG” improvement is well inside noise; “+0.05” is borderline; “+0.10” is real.
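As a minimal sketch of that arithmetic, the snippet below computes the mean and its standard error from an array of per-query scores. The scores are simulated purely for illustration (uniform random values, not real NDCG@10 results), and the helper name `mean_and_standard_error` is an assumption of this example.

```python
import numpy as np

def mean_and_standard_error(per_query_scores):
    """Return (mean, standard error of the mean) for per-query metric values."""
    scores = np.asarray(per_query_scores, dtype=float)
    # ddof=1 gives the sample standard deviation (divides by n - 1).
    sd = scores.std(ddof=1)
    return scores.mean(), sd / np.sqrt(scores.size)

# Simulated stand-in for 100 per-query NDCG@10 values (illustration only).
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=100)
mean, se = mean_and_standard_error(scores)
print(f"mean={mean:.3f}  SE={se:.3f}")
```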

The right tests

Paired permutation test

The cleanest answer for “is system A better than system B on this query set?” Both systems run on the same N queries. For each query, compute the difference in per-query metric. The null hypothesis is that the two systems are interchangeable, so the sign of each difference is random. Flip the signs at random many times (10,000 is typical), compute the mean difference each time, and count how often the absolute shuffled mean is at least as large as the absolute observed mean. That fraction is your two-sided p-value.

Paired bootstrap

Resample the N queries with replacement, compute mean metric for each system on the resample, take the difference, repeat. The bootstrap distribution of differences gives you a confidence interval directly. 95% CI excluding zero ≡ $p < 0.05$ (two-sided), at least approximately.

Paired t-test on per-query metrics

Old-school but cheap. The mean of the per-query metric differences is approximately Gaussian for large N (central limit theorem), so a paired t-test on the differences gives a p-value. Less robust to outliers than permutation/bootstrap, but agrees with them in the easy cases.
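A rough sketch of the paired t-test on per-query differences follows. To stay dependency-light it converts the t statistic to a p-value with a normal approximation, which is an assumption that is only reasonable for large N; if SciPy is available, `scipy.stats.ttest_rel` computes the exact t-based version. The helper names here are illustrative.

```python
import math
import numpy as np

def paired_t_statistic(a, b):
    """t statistic for the mean of per-query differences a - b."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # Standard error of the mean difference (sample SD, ddof=1).
    # Note: undefined if every difference is identical (zero SD).
    se = diffs.std(ddof=1) / math.sqrt(diffs.size)
    return diffs.mean() / se

def approx_two_sided_p(t_stat):
    """Two-sided p-value using the normal approximation to the t distribution.
    Assumption: N is large enough that t and normal tails are close."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

# Toy per-query scores, made up for illustration.
a = [0.61, 0.40, 0.75, 0.52, 0.33, 0.68]
b = [0.58, 0.41, 0.70, 0.49, 0.30, 0.66]
t = paired_t_statistic(a, b)
print(t, approx_two_sided_p(t))
```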

An unpaired test treats System A’s per-query scores and System B’s per-query scores as two independent samples from two populations. But they’re not independent — both systems see the same queries, and per-query difficulty is a massive variance term that paired tests cancel.

Concretely: imagine per-query NDCG varies widely with query difficulty for both systems (averaging around 0.5), and B scores 0.05 higher than A on every single query. Paired test: clearly significant (every paired difference is +0.05). Unpaired test: the pooled SD includes per-query difficulty variance, which dwarfs the 0.05 systematic improvement, so at typical benchmark sizes the p-value lands nowhere near 0.05.

The paired structure is free signal. Always use it.
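The contrast can be seen in a small simulation. This is a sketch on synthetic data, not real retrieval results: per-query difficulty is drawn uniformly, system B is assumed to be about 0.05 better on every query with a little noise, and both helpers are simple z-style statistics written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Simulated per-query scores for system A, dominated by query difficulty.
a = rng.uniform(0.1, 0.85, size=n)
# System B: about 0.05 better on every query, plus small per-query noise.
b = a + 0.05 + rng.normal(0.0, 0.02, size=n)

def paired_z(a, b):
    """Mean paired difference divided by its standard error."""
    d = b - a
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

def unpaired_z(a, b):
    """Difference of means divided by the unpaired (Welch-style) standard error."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (b.mean() - a.mean()) / se

print("paired statistic:  ", paired_z(a, b))
print("unpaired statistic:", unpaired_z(a, b))
```

The mean difference is identical in both cases; only the denominator changes. Pairing removes the difficulty variance from the standard error, so the same improvement produces a far larger test statistic.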

A concrete recipe

Compute per-query metrics

For each query in your eval set, compute NDCG@10 (or whatever metric) for both systems. Result: two arrays of length N.

Run paired permutation test

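import numpy as np

def paired_perm_test(a, b, n_iter=10000, seed=None):
    """Two-sided paired (sign-flip) permutation test on per-query scores."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    obs = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    # Under the null, each per-query difference is equally likely to be + or -.
    signs = rng.choice([-1.0, 1.0], size=(n_iter, diffs.size))
    perm_means = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction so the estimated p-value is never exactly 0.
    return (np.sum(perm_means >= obs) + 1) / (n_iter + 1)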

This returns the two-sided p-value.

Report with confidence interval

Use bootstrap to get a 95% CI on the difference. Report, for example: System A: 0.482 NDCG@10. System B: 0.467 NDCG@10. Δ=+0.015, 95% CI [+0.008, +0.022], p<0.001 (paired permutation, 10,000 permutations).

That sentence is shippable. “System A wins by 0.015” without the CI and p-value is not.
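One way to produce that interval is a percentile paired bootstrap, sketched below. The percentile method is an assumption of this example (other bootstrap intervals exist), the function name `paired_bootstrap_ci` is illustrative, and the scores are simulated rather than taken from any benchmark.

```python
import numpy as np

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query difference a - b.
    Returns (observed_mean_difference, lower, upper)."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = diffs.size
    rng = np.random.default_rng(seed)
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        # Resample queries with replacement; resampling the differences
        # keeps each query's A and B scores paired.
        idx = rng.integers(0, n, size=n)
        boot_means[i] = diffs[idx].mean()
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lower), float(upper)

# Simulated per-query scores (illustration only).
rng = np.random.default_rng(1)
sys_b = rng.uniform(0.1, 0.85, size=200)
sys_a = sys_b + 0.015 + rng.normal(0.0, 0.05, size=200)
delta, lower, upper = paired_bootstrap_ci(sys_a, sys_b)
print(f"delta={delta:+.3f}, 95% CI [{lower:+.3f}, {upper:+.3f}]")
```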

Multiple-testing pitfalls

Running 18 BEIR datasets and reporting “we win on 12 of 18” without multiple-testing correction is close to meaningless: under the null, you’d expect to win about 9 of 18 by chance alone. Use Bonferroni correction (divide your significance threshold by 18) or a per-dataset Wilson interval. The right report: “we win significantly on K of 18 after Bonferroni correction”, where K is usually much smaller than the naive count.
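Both points can be sketched in a few lines of standard-library Python. The first helper counts per-dataset wins that survive a Bonferroni threshold; the second is an exact two-sided sign test on the raw win count, included here as one assumed way to quantify the "9 of 18 by chance" argument (it ignores ties). The p-values in the example are placeholders, not results from any benchmark.

```python
from math import comb

def bonferroni_wins(p_values, alpha=0.05):
    """Count tests that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)  # e.g. 0.05 / 18 for 18 datasets
    return sum(p < threshold for p in p_values)

def sign_test_p(wins, n):
    """Exact two-sided binomial p-value for `wins` out of `n` under a fair coin."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Placeholder per-dataset p-values, invented for illustration.
p_values = [0.0004, 0.001, 0.03, 0.04, 0.2, 0.6]
print("naive wins at 0.05:", sum(p < 0.05 for p in p_values))
print("wins after Bonferroni:", bonferroni_wins(p_values))
print("sign test for 12 of 18:", round(sign_test_p(12, 18), 3))
```

For 12 wins out of 18, the sign test gives a two-sided p-value of about 0.24, so the raw win count by itself is entirely compatible with chance.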

The same pitfall hits ablations. If you try 10 hyperparameter variants and report the best, you’ve implicitly multiple-tested. Hold out a final test set for the single chosen configuration.

When significance doesn’t matter

For shipping decisions, you sometimes care about effect size more than p-values: a statistically significant 0.001 NDCG improvement is rarely worth shipping, while a marginally significant 0.05 improvement might be the right call. The discipline is to know both the p-value and the effect size, then make the call deliberately.

Pair this discipline with drift detection in production: yesterday’s significant lift can become tomorrow’s regression as the query distribution shifts.

