Introducing ZeroEntropy's Rerankers
At ZeroEntropy (YC W25) we just released our latest retrieval models zerank-1 and zerank-1-small on HuggingFace. Now, in order to further support the open-source community, we're releasing a very detailed breakdown of our entire training pipeline.
We're doing this for several reasons:
Showing how to train and fine-tune your own retrieval models, using a system that's 10x better than what's described in existing research on the topic.
Training our own models using this pipeline, offering them on HuggingFace and through our API, and beating SoTA eval results on public datasets, outperforming models trained by multi-billion dollar companies.
Going through what it's like to invent entirely new training pipelines and RL methods, and to solve entirely novel problems in mathematics and computer science, from the perspective of former quants, competitive programmers, and mathematicians from the IOI & IMO.
Let's get started!
The Problem Statement
The fundamental problem of information retrieval is: Given a corpus C of documents d₁,…,dₙ and a query q, the goal is to output a ranking R₁,…,Rₙ of the documents, ordered by the human-judged relevance of each document to the query.
One solution to the problem of retrieval is a reranker. For context, a reranker is a neural network that takes in a query: str and a document: str, and outputs a score: float between 0.0 and 1.0. Because the model sees the full text of both the query and the document at the same time, it is much more context-aware than first-stage methods (like BM25 or Vector Search), where the document has to be encoded independently of the query itself. The model is called a reranker because exhaustively scanning the corpus with it is computationally infeasible, so we must use first-stage methods to find a candidate list of potential answers before "reranking" them with such a model.
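To make that concrete, here is a minimal sketch of the two-stage retrieve-then-rerank pattern. The corpus, the libraries (rank_bm25, sentence-transformers), and the model name are illustrative stand-ins rather than our actual stack:

```python
# A minimal sketch of retrieve-then-rerank. Corpus, libraries, and model
# name are illustrative placeholders, not ZeroEntropy's actual stack.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Gravitational waves were first observed in September 2015 by LIGO.",
    "The 2017 Nobel Prize in Physics was awarded for the observation of gravitational waves.",
    "My physics teacher wished he won the Nobel Prize last year.",
]
query = "Who won the Nobel Prize in Physics in 2017?"

# First stage: cheap keyword search narrows the corpus to a candidate list.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
candidates = bm25.get_top_n(query.lower().split(), corpus, n=3)

# Second stage: the reranker scores each (query, document) pair jointly,
# seeing the full text of both at once.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # best candidate according to the reranker
```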
But, how do we train the reranker?
Suppose there existed an oracle function f where f(q,d) outputs a relevance score s ∈ [0,1], representing the relevance between the query q and the document d. Then the solution would be simple: sample f millions of times, amass a large quantity of (q,d,s) triplets as training data, and our problem becomes a regression task that a neural network should easily learn!
Not quite.
Motivation
Existing Solutions
If you research existing SoTA solutions on arXiv, most use the same approach: take human-annotated pairs (q,d) (i.e. where a human has confirmed that the document is relevant to the query), and train the model to recognize those as a score of 1. This begs the question: what gets an annotation of 0? One solution is to sample a completely random document out of the set of all documents, on the thesis that it's probably not relevant. We call (q,d) with score = 1 a "positive pair", and (q,d) with score = 0 a "negative pair".
Obviously, that works. However, you won't get a very nuanced reranker. It will distinguish very obviously relevant documents from completely irrelevant documents, but that's it. That's because the randomly sampled "negative" documents are "too easy" - they're usually very obviously irrelevant. So, SoTA models use BM25 and Vector Search to find "hard negatives", i.e. more relevant than completely random, but still negative. You can find numerous resources online to "mine hard negatives" using BM25 and embedding search, and finetuning on your data with this method can create very good results!
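As a rough sketch of that recipe (the embedding model and the top-k heuristic here are illustrative, not any specific paper's setup):

```python
# A rough sketch of standard hard-negative mining with a bi-encoder.
# The embedding model and top-k heuristic are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def mine_hard_negatives(query: str, positive_doc: str, corpus: list[str], k: int = 5) -> list[str]:
    """Return the k corpus documents most similar to the query that are not
    the annotated positive -- the classic 'hard negative' heuristic."""
    doc_embs = embedder.encode(corpus, normalize_embeddings=True)
    query_emb = embedder.encode(query, normalize_embeddings=True)
    sims = doc_embs @ query_emb  # cosine similarity, since embeddings are normalized
    ranked = np.argsort(-sims)
    negatives = [corpus[i] for i in ranked if corpus[i] != positive_doc]
    return negatives[:k]

# Each (query, positive_doc) pair becomes a score-1 training example, and each
# (query, hard_negative) pair becomes a score-0 example -- with no guarantee
# that the "negative" is actually irrelevant.
```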
The plague of false negatives
In the previous paragraph, notice the phrase "but still negative". When did we check that our negative was actually irrelevant? Never. At ZeroEntropy we wanted to make the best reranker by mining the hardest negatives, but as we tried to squeeze out performance with techniques such as ensemble-of-embeddings and LLM-as-a-reranker, that pesky phrase "but still negative" became a big issue. As you improve your "hard negative miner", your dataset becomes plagued with false negatives, i.e. documents that are legitimately more relevant than your human-annotated positive pair. We tried to find datasets that avoid this issue, but it's impossible: there are no datasets where humans exhaustively scanned the entire corpus (it's way too much work), so false negatives are always lurking in the corpus, showing up in droves as you improve your "miner".
The interesting part is that when you make your "hard negative miner" exceptionally strong (e.g. by calling large LLMs), the resulting model is unable to learn anything at all. It gives up and outputs "0.5" for everything, since it has no way to tell the pairs apart. You end up drawing a Laffer curve, where you want your miner to be "good", but not "too good".
Eventually, in our search for ever-higher performance, we had to abandon the human annotations - they weren't good enough or exhaustive enough. If there are better positive pairs in the corpus, we need to use those instead, leaving 2nd best as the "negative".
Nuanced scores
But we can't give 2nd place a score of 0: it's often also very relevant, just less relevant than 1st place. When you try to carefully distinguish the quality of 1st place, 2nd place, 3rd place, all the way down to 20th place (the exact task of a reranker), you realize that clean binary 1/0 annotations aren't enough. You need to use the continuous range of scores from 0 to 1 to capture the different "levels" of relevancy. But nuanced judgements are hard to find.
As an illustration of the difficulty of obtaining such a relevance function f from a human, consider the following (assigning a score in the interval [0,10]):
Query: "Who won the Nobel Prize in Physics in 2017?"
Document: Gravitational waves were first observed in September 2015, by the LIGO gravitational wave detectors.
Asking a human to assign a relevance score here would produce highly variable answers, depending on what they choose to latch on to. On one hand, the year is completely wrong, and this doesn't answer the question at all. On the other hand, it is somewhat related to physics. And in fact, to some physics enthusiasts, this might get a very high relevance score, given that this observation was exactly what the 2017 Nobel Prize in Physics was awarded for!
The noise here would completely dominate the signal; any score from 0 to 9 would have some justification!
The results in our testing were clear: prompting either humans or LLMs to output absolute scores for (q,d) pairs creates a very noisy training set, even when averaging scores from many humans and LLMs.
Pairwise Comparisons come to the rescue!
Now suppose instead that you asked a human which of TWO documents they think is more relevant to the query:
Query: "Who won the Nobel Prize in Physics in 2017?"
Document 1: Gravitational waves were first observed in September 2015, by the LIGO gravitational wave detectors.
Document 2: My physics teacher wished he won the Nobel Prize last year.
Most humans would pick Document 1, especially if they have access to web search when answering the question. The reference point of the other document, along with the question asking for a 'local' score (comparison) as opposed to a 'global' relevance score, results in an extremely high signal-to-noise ratio.
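For illustration, a pairwise prompt along the following lines makes that relative framing explicit. This is not our exact ensemble prompt, just a minimal stand-in:

```python
# An illustrative pairwise-comparison prompt (not our exact ensemble prompt).
# The key design choice: ask for a relative judgement, not an absolute score.
PAIRWISE_PROMPT = """You are judging search relevance.

Query: {query}

Document A: {doc_a}

Document B: {doc_b}

Which document is more relevant to the query? Answer with exactly "A" or "B"."""

def build_pairwise_prompt(query: str, doc_a: str, doc_b: str) -> str:
    return PAIRWISE_PROMPT.format(query=query, doc_a=doc_a, doc_b=doc_b)

def pairwise_preference(votes: list[str]) -> float:
    """Average an ensemble of LLM votes ("A" or "B") into a score in [0, 1],
    where 1.0 means unanimous preference for document A."""
    return sum(1.0 for v in votes if v.strip().upper() == "A") / len(votes)
```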
Outline of Training Steps
Given this insight, we developed the following training pipeline:
For each query q, we take a candidate list of 100 documents (retrieved via BM25 keyword search and vector search). We then sample random triplets (q, dᵢ, dⱼ) and ask an ensemble of large LLMs which document is more relevant to the query.
We use this data to train a small and cheap pairwise reranker R_pair to model pairwise relevance scores given a tuple (q, d₁, d₂). This allows us to scale into the millions of inferences.
Using R_pair, for a given query q, we infer scores sᵢⱼ = R_pair(q, dᵢ, dⱼ) for many pairs (i, j) (ensuring each document "battles" at least ~4 other documents). Treating these results as "chess games", we fit Elo ratings e₁,…,eₙ to the documents, intended to capture how relevant each document is to the query, relative to the rest of the candidate list.
We train our pointwise reranker R_point(q, d) to approximate this dataset of Elo-scored documents.
After training our first version of the model, we use Reinforcement Learning to generate a significantly better reranker.
Pairwise Scoring and Elos
Training a Pairwise Reranker
For every query q in our overall data, we take a small number of document pairs dᵢ, dⱼ and infer the relative preference between dᵢ and dⱼ given q using an ensemble of 3 LLMs, as a proxy for human annotations.
ⓘ As a sanity check, we confirmed that when the LLMs reach complete consensus, human annotators coincide with that consensus >96% of the time. Since SoTA rerankers often concur with it only 60-70% of the time, this shows that there are enormous gains to be made.
We then average the three LLM preferences to get an ensemble score ∈ [0,1], with 1 favoring dᵢ and 0 favoring dⱼ.
We then train an efficient pairwise reranker by supervised fine-tuning a small open-source model on the ensemble annotations. We use standard Binary Cross Entropy loss.
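A compressed sketch of this fine-tuning step is below. The backbone model, input packing, and hyperparameters are illustrative placeholders, not our production setup:

```python
# Sketch of supervised fine-tuning for the pairwise comparator R_pair.
# Backbone, input packing, and hyperparameters are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

backbone = "distilbert-base-uncased"  # stand-in for a small open-source model
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def pack(query: str, doc_a: str, doc_b: str) -> str:
    # One simple way to feed a (q, d_a, d_b) triple to a single-sequence model.
    return f"Query: {query}\n\nDocument A: {doc_a}\n\nDocument B: {doc_b}"

def train_step(batch):
    # batch: list of (query, doc_a, doc_b, ensemble_score), where the score in
    # [0, 1] is the averaged LLM-ensemble preference for document A.
    texts = [pack(q, da, db) for q, da, db, _ in batch]
    labels = torch.tensor([s for *_, s in batch], dtype=torch.float)
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**inputs).logits.squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```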
Obtaining Elo Ratings
Define σ(x) = 1 / (1 + e⁻ˣ) (the canonical sigmoid), and let lgt(x) = log(x / (1 − x)) be the inverse of this. Throughout this article, log is base e.
Given a set of n "players" (documents) with Elo ratings e₁,…,eₙ and a match between players i and j, the Elo model expects that the probability i beats j is σ(eᵢ − eⱼ). Similarly, we would expect eᵢ − eⱼ to be lgt(P(i beats j)). Given a set of pairwise matchups, we then performed Maximum Likelihood Estimation to fit Elo ratings that would have been the most likely explanations for the results observed; see the full report for more detail. If wᵢⱼ are the entries of our scoring matrix (R_pair(q, dᵢ, dⱼ) if inferred, otherwise simply 0), and pᵢ = exp(eᵢ), the negative log-likelihood (loss) is:
L = −∑ᵢ,ⱼ wᵢⱼ log(pᵢ / (pᵢ + pⱼ)) = ∑ᵢ,ⱼ wᵢⱼ log(1 + exp(eⱼ − eᵢ))
We can then perform gradient descent with this loss to get optimal Elo scores. We apply an offset so that e₁ + … + eₙ = 0, although we could technically choose any offset. (This becomes relevant later)
Practically speaking, we use n = 100 documents per query, with N_Q = 100,000 total queries; as a result, dense inference on all n² pairs per query is far too costly. It is essential that we infer only O(n) pairs while still fitting Elos close to those we would obtain from dense inference. If we consider the graph with documents d₁,…,dₙ as the vertices and an inference on (q, dᵢ, dⱼ) as an edge, we require some key properties of this graph (see the full report!); we ultimately used N = 400 inferences per query by sampling 4 randomly-chosen cycles (a "cycle" being a graph whose edges form a single closed ring through the documents).
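As a minimal sketch of this step, combining the cycle sampling with the maximum-likelihood Elo fit via gradient descent (the optimizer settings are illustrative):

```python
# Minimal sketch of the sparse Elo fit: sample O(n) comparison edges via random
# cycles, then fit Elos by minimizing the negative log-likelihood with gradient
# descent. Sampling and optimizer settings are illustrative.
import torch

def sample_cycle_edges(n: int, num_cycles: int = 4) -> list[tuple[int, int]]:
    """Edges from `num_cycles` random closed rings over n documents, keeping the
    comparison graph connected with only num_cycles * n edges."""
    edges = []
    for _ in range(num_cycles):
        order = torch.randperm(n).tolist()
        edges += [(order[k], order[(k + 1) % n]) for k in range(n)]
    return edges

def fit_elos(edges, scores, n, steps=500, lr=0.1):
    """edges[k] = (i, j); scores[k] = R_pair(q, d_i, d_j), i.e. P(d_i beats d_j).
    Minimizes sum_k w_k * log(1 + exp(e_j - e_i)) plus the symmetric term with
    weight (1 - w_k), then centers the Elos so they sum to zero."""
    e = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([e], lr=lr)
    i_idx = torch.tensor([i for i, _ in edges])
    j_idx = torch.tensor([j for _, j in edges])
    w = torch.tensor(scores)
    for _ in range(steps):
        diff = e[i_idx] - e[j_idx]
        # softplus(-diff) = -log sigma(e_i - e_j) = -log P(i beats j)
        loss = (w * torch.nn.functional.softplus(-diff)
                + (1 - w) * torch.nn.functional.softplus(diff)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        e -= e.mean()  # apply the zero-sum offset e_1 + ... + e_n = 0
    return e.detach()
```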
Cross-Query Bias Adjustment
If you notice, we made an odd presumption: that e₁ + … + eₙ = 0. Given that the definition of Elo only constrains P(i beats j) = σ(eᵢ − eⱼ), this "offset" doesn't have any effect on the Elo model. But you can already see the issue:
If I have a query with no relevant results, I'll assign a very high Elo to the Top 1 result, even though it's bad (It's the "least bad result")
If I have a query with hundreds of relevant results, I'll assign a very low Elo to the bottom results, even though they're still somewhat relevant.
This is extremely confusing for our pointwise reranker, because in isolation it sees a triplet (q, d, s_high) even when d is irrelevant to q, and a triplet (q, d, s_low) even when d is relevant. This introduces noise, the very thing we wanted to avoid!
Of course, the only free parameter in the Elo model is this bias term, so we just need a way to calculate a per-query bias b_q that "shifts" the Elos based on the "absolute relevancy" of the candidate documents.
The math gets a bit complicated here, but the gist is:
We introduce cross-query pairs, i.e. two pairs Pᵢⱼ = (qᵢ, dⱼ), Pₐᵦ = (qₐ, dᵦ), where the pairwise comparison model must choose "Which pair has its document more relevant to its query?". This is inherently "apples-to-oranges", so it's much noisier, and it requires custom prompt engineering for the ensemble to reach consensus often. When qᵢ = qₐ, we use the same prompt engineering as the standard "same-query, different-document" pairwise comparison.
We add this training data to the tiny pairwise comparator model, so that it can learn to distinguish both intra-query and inter-query document pairs, cheaply mimicking the large ensemble.
We run inference on millions of cross-query pairs P₁, P₂, and then feed that data into our Elo calculation, again using Maximum Likelihood Estimation to calculate the "ideal bias" for each query, such that the cross-query results are best explained by the formula:
P(Pᵢⱼ beats Pₐᵦ) = σ((bᵢ + eⱼ) − (bₐ + eᵦ))
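A sketch of that bias fit, holding the already-fitted within-query Elos fixed (the data layout and optimizer settings are illustrative):

```python
# Sketch: fit per-query bias terms b_q from cross-query comparisons by maximum
# likelihood, holding the within-query Elos fixed. Data layout is illustrative.
import torch

def fit_query_biases(matches, elos, num_queries, steps=500, lr=0.1):
    """matches[k] = (q_i, d_j, q_a, d_b, w), where w in [0, 1] is the comparator's
    preference that pair (q_i, d_j) beats pair (q_a, d_b), and elos is e.g. a dict
    of dicts of floats, elos[query_id][doc_id], from the within-query Elo fit.
    Models P((q_i, d_j) beats (q_a, d_b)) = sigmoid((b_i + e_j) - (b_a + e_b))."""
    b = torch.zeros(num_queries, requires_grad=True)
    opt = torch.optim.Adam([b], lr=lr)
    q1 = torch.tensor([m[0] for m in matches])
    q2 = torch.tensor([m[2] for m in matches])
    e1 = torch.tensor([elos[m[0]][m[1]] for m in matches])
    e2 = torch.tensor([elos[m[2]][m[3]] for m in matches])
    w = torch.tensor([m[4] for m in matches])
    for _ in range(steps):
        diff = (b[q1] + e1) - (b[q2] + e2)
        loss = (w * torch.nn.functional.softplus(-diff)
                + (1 - w) * torch.nn.functional.softplus(diff)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return b.detach()
```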
Training the Pointwise Model
And just like that, we've created our magic function f. Having modeled absolute relevance scores as:
f(q, d) = elo(C, q, d) + b(C, q)
...from the candidate lists and the small pairwise comparator, we now supervised fine-tune a reranker using standard mean-squared error loss, and get a model that very accurately one-shots f in <100ms! Getting the best results required extensive ablation studies and hyperparameter search (it also involved Reinforcement Learning, which we'll discuss in a follow-up post). Our conclusion was that fine-tuning Qwen 4B and Qwen 1.7B on f produced the best rerankers, leading to zerank-1 and zerank-1-small respectively!
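Compressed to a sketch, this final pointwise fine-tune looks like the following; the Hugging Face model ID, input packing, and hyperparameters are placeholders rather than our exact training code:

```python
# Sketch of the pointwise fine-tune: regress R_point(q, d) onto the Elo-plus-bias
# targets f(q, d) with MSE. Model ID, packing, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

backbone = "Qwen/Qwen3-1.7B"  # placeholder ID for a small Qwen base model
tokenizer = AutoTokenizer.from_pretrained(backbone)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(batch):
    # batch: list of (query, document, target) where target = elo(C, q, d) + b(C, q).
    texts = [f"Query: {q}\n\nDocument: {d}" for q, d, _ in batch]
    targets = torch.tensor([t for *_, t in batch], dtype=torch.float)
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    preds = model(**inputs).logits.squeeze(-1)
    loss = torch.nn.functional.mse_loss(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```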
While cross-encoders have become standard practice for NN-based rerankers, existing SoTA research on rerankers focuses primarily on architectural tweaks or on augmenting human-annotated data. Our training pipeline, centered around mathematically modeling query-document relevance scores, represents a unique approach to reranking, and the results speak for themselves.
Try it today on HuggingFace or via our API!