Improving Retrieval with ELO Scores

Jul 8, 2025


At ZeroEntropy (YC W25), we are a team of mathematicians and competitive programmers working on making search highly accurate. We just released our latest reranker models, zerank-1 and zerank-1-small (Apache 2.0!), on HuggingFace and through our API.

In this blog, we will explain how we used the concept of chess Elo Scores to train a reranker model that outperforms every other model we tested.

Let's get started!

But first, what is a reranker?

In this section, we will briefly go through what a reranker is in retrieval systems, and why it can be useful. We wrote a full blog about this in case you are interested. If you’re already familiar with the concept of rerankers, you can skip this section.

If you’re building AI systems like RAG or AI Agents, you’re probably familiar with keyword search and semantic search.

  • Keyword Search (BM25): lightning-fast inverted-index lookups, perfect for exact matches when you know what you're looking for (“try except syntax Python”), but recall drops when phrasing shifts (“how to catch errors in Python”).

  • Semantic Search: You embed each document into high-dimensional vectors. At query time, you embed the query into a vector, and find matching results with the highest dot product (cosine similarity). This is much better at conceptual queries, as vectors are based on the semantic meaning of the content, rather than any particular keywords.

At ZeroEntropy, we combine those two methods using Reciprocal Rank Fusion to maximize recall. But recall alone isn’t enough. You might retrieve the correct answer out of millions of documents, but if it sits at position 67, your LLM, or your users, will probably never see it. This is the problem that rerankers address.
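To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The document ids and the constant k = 60 (the value commonly used in the RRF literature) are illustrative assumptions, not our production settings.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF) over two ranked lists.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # earlier ranks contribute more
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_42", "doc_7", "doc_13"]     # from keyword search
vector_results = ["doc_7", "doc_99", "doc_42"]   # from semantic search
fused = reciprocal_rank_fusion([bm25_results, vector_results])
```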

A reranker is a cross-encoder neural network that refines search results by re-scoring and reordering them based on relevance. Because the reranker sees the full text of both query and document at the same time, it is much more context-aware than first-stage methods where the document was encoded without knowledge of the specific query.
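As an illustration of how a cross-encoder rescoring step slots in after first-stage retrieval, here is a hedged sketch using the sentence-transformers CrossEncoder API with a generic open checkpoint as a stand-in; swap in whichever reranker you actually deploy.

```python
# Illustrative cross-encoder rescoring of first-stage candidates.
# The checkpoint below is a generic public cross-encoder used as a stand-in.
from sentence_transformers import CrossEncoder

query = "how to catch errors in Python"
candidates = [
    "Use try/except blocks to handle exceptions in Python.",
    "Python lists support append, pop, and slicing.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, doc) for doc in candidates])  # one score per (q, d) pair
reranked = [doc for _, doc in sorted(zip(scores, candidates),
                                     key=lambda x: x[0], reverse=True)]
```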

In the next few sections, we will explain how we invented a brand new pipeline based on Elo scores (yes, just like in chess) to train our state-of-the-art reranker.

The Problem Statement

Given a corpus 𝐶 of documents 𝑑₁,…,𝑑ₙ and a query 𝑞, the goal here is to output a ranking 𝑅₁,…,𝑅ₙ ordered by human relevance.

Suppose there existed an oracle function 𝑓 where 𝑓(𝑞, 𝑑) outputs a relevance score 𝑠 ∈ [0, 1], which is the relevancy between the query 𝑞 and the document 𝑑. Since this function takes in both the query and document, it is much more context-aware than first-stage methods, where the document has to be encoded independently of the query. To train such a model, the solution would be to simply sample 𝑓 millions of times, amassing a large quantity of (𝑞, 𝑑, 𝑠) triplets for training data, and now our problem becomes a regression task that a Neural Network should easily learn from!

Not quite.

Existing Solutions: Human Binary Annotations and The Plague of False Negatives

Most state-of-the-art solutions use the same approach: take human-annotated pairs (𝑞,𝑑), i.e. where the human has tagged the document as relevant to the query. We call (𝑞,𝑑) with score = 1 a "positive pair", and (𝑞,𝑑) with score = 0 a "negative pair". We can then train the model to recognize the positive pairs from the negative ones.



But how do we create negative pairs?

One solution is to sample a completely random document out of the set of all documents, on the thesis that it's probably not relevant. That works, but it doesn't make for a very nuanced reranker: the randomly sampled negative documents are just too obvious!

Another common approach is to first retrieve candidates using BM25 or vector search and mine negatives from that pool. But that's where the plague of false negatives occurs: when did we ever check that our negative was actually irrelevant? Never.

At ZeroEntropy, as our hard negative miner grew smarter it surfaced more and more false negatives, documents actually more relevant than the supposed positives. Because no dataset has humans exhaustively scanning the entire corpus, these hidden false negatives are inevitable and flood in whenever the miner improves.

That’s why we eventually decided to completely abandon the human annotations and to explore new ways of creating nuanced and exhaustive scores.

About The Need and Difficulty of Nuanced Scores

Binary 1 / 0 labels collapse the nuance in the notion of relevance of a query-document pair. We need to use the continuous range of scores from 0 to 1 to classify the different "levels" of relevancy. But nuanced judgements are hard to find.

As an illustration of the difficulty of obtaining such a relevance function 𝑓 from a human, consider the following (assigning a score in the interval [0,10]).

Take a human and ask:

Query: “Who won the Nobel Prize in Physics in 2017?”
Doc: “Gravitational waves were first observed in 2015 by LIGO.”

Asking a human to assign a relevance score to this would have high variability, depending on what they choose to catch on to. On one hand, the year is completely wrong, and this doesn't answer the question at all. However, it is somewhat related to physics. And in fact, to some physics enthusiasts, this might get a very high relevance score, given this observation was actually what the 2017 Nobel Prize in Physics was awarded for!

There's just no way you can give a query-document pair (𝑞,𝑑) and expect a reasonable and self-consistent number between 0 and 1.

I mean, c'mon, just look at what happens when you ask humans to output a random number between 1 and 10.

The noise here would completely dominate the signal; any score from 0 to 9 would have some justification!

The results in our testing were clear: Prompting either humans or LLMs to output absolute scores for (𝑞,𝑑) pairs creates a very noisy training set, even when averaging scores from many humans and LLMs.

Pairwise Comparisons come to the rescue!

That said, suppose you prompted a human to answer with which of TWO documents they think is more relevant to the query:

Query: "Who won the Nobel Prize in Physics in 2017?"
Document 1: The Nobel Prize was awarded to those who discovered gravitational waves in 2015.
Document 2: Gravitational waves were first observed in September 2015 by the LIGO gravitational wave detectors.

Almost everybody would pick Document 1. The reference point of the other document, along with the question asking for a 'local' score (comparison) as opposed to a 'global' relevance score, results in an extremely high signal-to-noise ratio.

If you asked for absolute scores instead, some people would give Document 1 a low score because it's missing 2017, while others would give Document 2 a high score because of domain-specific knowledge, leaving it unclear which document is more relevant; a proper pairwise comparison removes that ambiguity.

Of course, we still need scores at the end of the day, so now we need to convert an NxN matrix of pairwise comparisons into an N-dimensional vector of absolute scores. That’s where things get interesting.

Outline of Training Steps

Given this insight, we developed the following training pipeline:
  1. Triplet sampling: For each query $q$, retrieve a candidate list of 100 documents via BM25 keyword search and vector search. Sample random triplets $(q,d_i,d_j)$ and query an ensemble of large language models to decide which document is more relevant to $q$.
  2. Pairwise reranker training: Use the labeled triplets to train a lightweight pairwise reranker $\mathrm{R}_{\mathrm{pair}}$ that estimates pairwise relevance scores for tuples $(q,d_i,d_j)$, enabling scalable inference.
  3. ELO rating computation: For each query $q$, apply $\mathrm{R}_{\mathrm{pair}}$ to infer scores $$s_{ij} = \mathrm{R}_{\mathrm{pair}}(q,d_i,d_j)$$ across many pairs $(i,j)$, ensuring each document competes in approximately four pairwise comparisons. Treat these as “games” and fit ELO ratings $e_1,\dots,e_n$ to the documents, yielding their relative relevance.
  4. Pointwise reranker training: Train a pointwise reranker $\mathrm{R}_{\mathrm{point}}(q,d)$ on the dataset of ELO-rated documents to directly predict document relevance.
  5. Reinforcement learning fine-tuning: After the initial supervised training, employ reinforcement learning to further refine the reranker and enhance performance.
This design ensures high-quality ranking, based on pairwise-derived ELO targets, while cutting inference cost and latency by roughly half compared to pure pairwise scoring.

Pairwise Scoring and ELOs

Training a Pairwise Reranker

For every query 𝑞 in our overall data, we take a small number of document pairs 𝑑𝑖, 𝑑𝑗 and infer the relative preference between 𝑑𝑖 and 𝑑𝑗 given 𝑞 using an ensemble of 3 LLMs, as a proxy for human annotations.

ⓘ As a sanity check, we confirm that when the LLMs reach complete consensus, human annotators agree with that consensus >96% of the time. Since SoTA rerankers often concur only 60–70% of the time, there are clearly enormous gains to be made (random chance is only 50%…).

We then average the three LLM preferences to get an ensemble score ∈ [0,1], with 1 favoring 𝑑𝑖 and 0 favoring 𝑑𝑗.

Of course, trying to fill an NxN matrix across Q questions is way too expensive. So, we train an efficient pairwise reranker by supervised fine-tuning a small open-source model on the ensemble annotations. We use standard Binary Cross Entropy loss.
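For concreteness, here is a rough sketch of that supervised fine-tuning objective: the comparator packs (query, dᵢ, dⱼ) into one sequence and is trained with binary cross-entropy against the ensemble's soft preference. The backbone name, pooling choice, and prompt format are assumptions for illustration, not our exact recipe.

```python
# Sketch of pairwise-comparator training with BCE against the ensemble's
# soft preference in [0, 1]. Backbone and packing format are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

BASE = "Qwen/Qwen2.5-0.5B"  # hypothetical small backbone for illustration

class PairwiseReranker(nn.Module):
    def __init__(self, base=BASE):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = h[:, -1]                     # last-token pooling (one choice among many)
        return self.head(pooled).squeeze(-1)  # logit for "d_i preferred over d_j"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = PairwiseReranker()
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(query, doc_i, doc_j, ensemble_score):
    text = f"Query: {query}\nDocument A: {doc_i}\nDocument B: {doc_j}"
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    logit = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logit, torch.tensor([ensemble_score]))  # soft label from the LLM ensemble
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```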


Obtaining ELO Ratings

Define $$ \sigma(x) = \frac{1}{1 + e^{-x}} \quad\text{(the canonical sigmoid)}, \quad \mathrm{lgt}(x) = \log\!\biggl(\frac{x}{1 - x}\biggr) \quad\text{(its inverse)}. $$ Throughout this article, $\log$ is base $e$. Given a set of $n$ “players” (documents) with ELO ratings $e_{1},\dots,e_{n}$ and a match between players $i$ and $j$, the ELO model predicts $$ P(i \text{ beats } j) = \sigma(e_{i} - e_{j}). $$ Equivalently, $$ e_{i} - e_{j} = \mathrm{lgt}\bigl(P(i \text{ beats } j)\bigr). $$ Given a set of pairwise matchups, we perform Maximum Likelihood Estimation to fit ELO ratings that best explain the observed results (see the full report for details). If $w_{ij}$ are the entries of our scoring matrix (either $R_{\mathrm{pair}}(q,d_{i},d_{j})$ if inferred, otherwise $0$), and we set $p_{i} = e^{e_{i}}$, then the negative log-likelihood (loss) is $$ \mathcal{L} = -\sum_{i,j} w_{ij}\,\log\!\biggl(\frac{p_{i}}{p_{i} + p_{j}}\biggr) = \sum_{i,j} w_{ij}\,\log\!\bigl(1 + e^{e_{j} - e_{i}}\bigr). $$

Once we’ve got our loss function defined, it’s straightforward to run gradient descent and recover the best ELO scores. We then shift all the scores so that they sum to zero, 𝑒₁ + … + 𝑒ₙ = 0. Really, any offset would work, but zero-centering keeps things tidy, and we’ll see why shortly.
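Here is a minimal sketch of that fit, assuming a dense n×n score matrix w whose entries are zero for pairs we never inferred; it minimizes the loss above with gradient descent and zero-centers the result.

```python
# Fit per-document ELO scores by minimizing
#   L = sum_ij w_ij * log(1 + exp(e_j - e_i)),
# then shift so the ELOs sum to zero.
import torch
import torch.nn.functional as F

def fit_elos(w: torch.Tensor, steps: int = 500, lr: float = 0.1) -> torch.Tensor:
    n = w.shape[0]
    e = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        diff = e.unsqueeze(0) - e.unsqueeze(1)      # diff[i, j] = e_j - e_i
        loss = (w * F.softplus(diff)).sum()         # softplus(x) = log(1 + exp(x))
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        e -= e.mean()                               # enforce e_1 + ... + e_n = 0
    return e.detach()
```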


In practice, we use n = 100 documents per query and have N_Q = 100 000 queries total. Computing all n² pairwise inferences for each query would blow up computationally!


Instead we only sample O(n) matchups and still get almost the same ELOs.


Think of your documents d₁…dₙ as nodes in a graph and each inferred comparison (q, dᵢ, dⱼ) as an edge. To cover the space efficiently, we pick N = 400 inferences per query by sampling four random “cycles” (each cycle is just a closed ring of edges), which gives us enough information to fit scores that closely mirror the exhaustive approach.
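A small sketch of that sampling scheme, under the assumption that each cycle is a random closed ring visiting every document exactly once: each cycle contributes n edges, so four cycles over 100 documents give the ~400 comparisons per query mentioned above.

```python
# Sample O(n) pairwise matchups per query by drawing a few random cycles
# over the documents instead of all n^2 pairs.
import random

def sample_cycle_matchups(n_docs: int, n_cycles: int = 4):
    matchups = []
    for _ in range(n_cycles):
        order = list(range(n_docs))
        random.shuffle(order)                       # a random closed ring of documents
        for a, b in zip(order, order[1:] + order[:1]):
            matchups.append((a, b))                 # compare d_a vs d_b for this query
    return matchups

pairs = sample_cycle_matchups(100)                  # 400 (i, j) pairs per query
```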


Cross-Query Bias Adjustment

If you notice, we made an odd presumption: $e_{1} + \dots + e_{n} = 0$. Given that the ELO model only constrains $\mathbb{P}(i \text{ beats } j) = \sigma(e_{i} - e_{j})$, this “offset” has no effect on the model itself. But you can already see the issue:
  1. If I have a query with no relevant results, I’ll assign a very high ELO to the Top 1 result, even though it’s bad (it’s the “least bad result”).
  2. If I have a query with hundreds of relevant results, I’ll assign a very low ELO to the bottom results, even though they’re still somewhat relevant.
This is extremely confusing for our pointwise reranker: in isolation it sees $(q,d,s_{\text{high}})$ even when $q$ and $d$ are irrelevant, and it sees $(q,d,s_{\text{low}})$ even when $q$ and $d$ are relevant. This introduces noise, the very thing we wanted to avoid! Of course, the only free parameter in the ELO model is this offset, so we just need a way to calculate a per-query bias $b_q$ that “shifts” the ELOs based on the absolute relevancy of the candidate documents. The math gets a bit complicated here, but the gist is:
  1. We introduce cross-query pairs, i.e. two pairs $P_{i,j}\!=\!(q_i,d_j)$ and $P_{a,b}\!=\!(q_a,d_b)$, where the pairwise comparison model must choose which pair has its query more relevant to its document. This is inherently “apples-to-oranges”, so it is much noisier and requires custom prompt engineering for the Ensemble to reach consensus often. When $q_i = q_a$, we use the same prompt engineering as the standard “same-query, different-document” pairwise comparison.
  2. We feed this training data into the tiny pairwise comparator model, so that it can learn to distinguish both intra-query and inter-query document pairs, cheaply mimicking the large Ensemble.
  3. We run inference on millions of cross-query pairs $P_1$, $P_2$ and feed the results into our ELO calculation, again using Maximum Likelihood Estimation to calculate the “ideal bias” for each query, such that the cross-query results are explained by the formula below (see the code sketch that follows): $$\mathbb{P}\bigl(P_{i,j} \text{ beats } P_{a,b}\bigr)= \sigma\!\bigl((b_i + e_j) - (b_a + e_b)\bigr)$$
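As mentioned in step 3, here is a rough sketch of that bias fit: holding each query's zero-centered ELOs fixed, we optimize one scalar per query by maximum likelihood over the cross-query outcomes. The data layout (index tensors plus a soft win probability) is an assumption for illustration.

```python
# Fit per-query biases b_q so that cross-query comparisons are explained by
#   P(P_{i,j} beats P_{a,b}) = sigmoid((b_i + e_j) - (b_a + e_b)).
import torch
import torch.nn.functional as F

def fit_query_biases(elos, q1, d1, q2, d2, wins, steps=500, lr=0.1):
    """elos: dict query_idx -> tensor of that query's (zero-centered) document ELOs.
    q1, d1, q2, d2: long tensors indexing each cross-query comparison.
    wins: float tensor in [0, 1], the comparator's preference for the first pair."""
    b = torch.zeros(len(elos), requires_grad=True)
    opt = torch.optim.Adam([b], lr=lr)
    e1 = torch.stack([elos[q.item()][d.item()] for q, d in zip(q1, d1)])
    e2 = torch.stack([elos[q.item()][d.item()] for q, d in zip(q2, d2)])
    for _ in range(steps):
        logits = (b[q1] + e1) - (b[q2] + e2)
        loss = F.binary_cross_entropy_with_logits(logits, wins)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return b.detach()
```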

Training the Pointwise Model

And just like that, we've created our magic function 𝑓. Having modeled absolute relevance scores as 𝑓(𝑞, 𝑑) = elo(𝐶, 𝑞, 𝑑) + 𝑏(𝐶, 𝑞) from the candidate lists and the small pairwise comparator, we now supervised fine-tune a reranker using a standard mean-squared-error loss, and get a model that very accurately one-shots 𝑓 in <100ms! Getting the best results required extensive ablation studies and hyperparameter search (it also involved Reinforcement Learning, which we'll discuss in a following post). Our conclusion was that fine-tuning Qwen3-4B and Qwen3-1.7B on 𝑓 produced the best rerankers, leading to zerank-1 and zerank-1-small respectively!
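Here is a hedged sketch of what that final pointwise supervision can look like: regressing (query, document) pairs onto their bias-adjusted ELO targets with MSE. The backbone name and the single-sequence packing are placeholders, not our exact setup.

```python
# Sketch of pointwise fine-tuning: regress (query, document) onto the
# bias-adjusted ELO target with MSE. Backbone is a hypothetical stand-in.
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NAME = "Qwen/Qwen2.5-0.5B"   # placeholder backbone for illustration
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
mse = nn.MSELoss()

def train_step(query, document, target_score):
    batch = tokenizer(f"Query: {query}\nDocument: {document}",
                      return_tensors="pt", truncation=True)
    pred = model(**batch).logits.squeeze(-1)          # predicted relevance score
    loss = mse(pred, torch.tensor([target_score]))    # target = elo + per-query bias
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```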


While cross-encoders have become standard practice for NN-based rerankers, existing state-of-the-art research on rerankers focuses primarily on architectural tweaks or on augmenting human-annotated data. Our training pipeline, centered around mathematically modeling query-document relevance scores, represents a unique approach to reranking, and the results speak for themselves.


Try it today on HuggingFace or via our API!
