At ZeroEntropy (YC W25), we are a team of mathematicians and competitive programmers working on making search highly accurate. We just released our latest reranker models, zerank-1 and zerank-1-small (Apache 2.0!), on HuggingFace and through our API.
In this blog, we will explain how we used the concept of chess Elo scores to train a reranker model that outperforms every other model we tested.
Let's get started!

But first, what is a reranker?
In this section, we will briefly go through what a reranker is in retrieval systems, and why it can be useful. We wrote a full blog about this in case you are interested. If you’re already familiar with the concept of rerankers, you can skip this section.
If you’re building AI systems like RAG or AI Agents, you’re probably familiar with keyword search and semantic search.
Keyword Search (BM25): lightning-fast inverted-index lookups, perfect for exact matches when you know what you're looking for (“try except syntax Python”), but recall drops when phrasing shifts (“how to catch errors in Python”).
Semantic Search: You embed each document into a high-dimensional vector. At query time, you embed the query as well and find the documents whose vectors have the highest dot product with the query vector (equivalently, cosine similarity when vectors are normalized). This is much better at conceptual queries, since vectors capture the semantic meaning of the content rather than any particular keywords.
At ZeroEntropy, we combine those two methods using Reciprocal Rank Fusion to maximize recall. But recall alone isn’t enough. You might retrieve the correct answer out of millions of documents, but if it sits at position 67, your LLM, or your users, will probably never see it. This is the problem that rerankers address.
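To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion. Our production fusion has more moving parts, but the core idea fits in a few lines; k = 60 is the constant from the original RRF paper, shown here purely for illustration:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs: each document's fused score is
    the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d2"]    # from keyword search
vector_ranking = ["d1", "d3", "d4"]  # from semantic search
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))  # d1, d3 rise to the top
```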
A reranker is a cross-encoder neural network that refines search results by re-scoring and reordering them based on relevance. Because the reranker sees the full text of both query and document at the same time, it is much more context-aware than first-stage methods where the document was encoded without knowledge of the specific query.
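For illustration, here is what the rerank step looks like in code. This sketch uses the sentence-transformers CrossEncoder API with a public MS MARCO checkpoint as a stand-in; substitute your reranker of choice:

```python
from sentence_transformers import CrossEncoder

# A public cross-encoder checkpoint, used here purely as a stand-in.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how to catch errors in Python"
candidates = [
    "Use try/except blocks to handle exceptions in Python.",
    "Gravitational waves were first observed in 2015 by LIGO.",
]

# The cross-encoder reads the query and each document jointly, so its
# scores are query-aware in a way first-stage retrieval cannot be.
scores = model.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```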
In the next few sections, we will explain how we invented a brand-new pipeline based on Elo scores (yes, just like in chess) to train our state-of-the-art reranker.
The Problem Statement
Given a corpus 𝐶 of documents 𝑑₁,…,𝑑ₙ and a query 𝑞, the goal is to output a ranking 𝑅₁,…,𝑅ₙ of the documents, ordered by relevance as a human would judge it.
Suppose there existed an oracle function 𝑓 where 𝑓(𝑞, 𝑑) outputs a relevance score 𝑠 ∈ [0, 1], which is the relevancy between the query 𝑞 and the document 𝑑. Since this function takes in both the query and document, it is much more context-aware than first-stage methods, where the document has to be encoded independently of the query. To train such a model, the solution would be to simply sample 𝑓 millions of times, amassing a large quantity of (𝑞, 𝑑, 𝑠) triplets for training data, and now our problem becomes a regression task that a neural network should easily learn from!
Not quite.

Existing Solutions: Human Binary Annotations and The Plague of False Negatives
Most state-of-the-art solutions use the same approach: take human-annotated pairs (𝑞,𝑑), i.e. where the human has tagged the document as relevant to the query. We call (𝑞,𝑑) with score = 1 a "positive pair", and (𝑞,𝑑) with score = 0 a "negative pair". We can then train the model to distinguish positive pairs from negative ones.
But how do we create negative pairs?
One solution is to sample a completely random document out of the set of all documents, on the thesis that it's probably not relevant. That works, but it doesn't make for a very nuanced reranker: the randomly sampled negative documents are just too obvious!
Another common approach is to first retrieve potential candidates using BM25 or vector search, and mine negatives from that candidate pool. But that's where the plague of false negatives occurs: when did we check that our negative was actually irrelevant? Never.
At ZeroEntropy, as our hard-negative miner grew smarter, it surfaced more and more false negatives: documents actually more relevant to the query than the supposed positives. Because no dataset has humans exhaustively scanning the entire corpus, these hidden false negatives are inevitable, and they flood in whenever the miner improves.
That’s why we eventually decided to abandon human annotations entirely and to explore new ways of creating nuanced and exhaustive scores.
About The Need and Difficulty of Nuanced Scores
Binary 1/0 labels collapse the nuance in the relevance of a query-document pair. We need the continuous range of scores from 0 to 1 to capture the different "levels" of relevancy. But nuanced judgements are hard to obtain.
As an illustration of the difficulty of obtaining such a relevance function 𝑓 from a human, ask one to assign a score in the interval [0,10] to the following pair:
Query: “Who won the Nobel Prize in Physics in 2017?”
Doc: “Gravitational waves were first observed in 2015 by LIGO.”
Asking a human to assign a relevance score to this would have high variability, depending on what they choose to catch on to. On one hand, the year is completely wrong, and this doesn't answer the question at all. However, it is somewhat related to physics. And in fact, to some physics enthusiasts, this might get a very high relevance score, given this observation was actually what the 2017 Nobel Prize in Physics was awarded for!
There's just no way you can give a query-document pair (𝑞,𝑑) and expect a reasonable and self-consistent number between 0 and 1.
I mean, c'mon, just look at what happens when you task humans to output a random number between 0 and 10.

The noise here would completely dominate the signal; any score from 0 to 9 would have some justification!
The results in our testing were clear: Prompting either humans or LLMs to output absolute scores for (𝑞,𝑑) pairs creates a very noisy training set, even when averaging scores from many humans and LLMs.
Pairwise Comparisons Come to the Rescue!
Now suppose you instead prompted a human to answer which of TWO documents they think is more relevant to the query:
Query: "Who won the Nobel Prize in Physics in 2017?"
Document 1: The Nobel Prize was awarded to those who discovered gravitational waves in 2015.
Document 2: Gravitational waves were first observed in September 2015 by the LIGO gravitational wave detectors.
Almost everybody would pick Document 1. Having the other document as a reference point, combined with asking for a 'local' comparison rather than a 'global' relevance score, results in an extremely high signal-to-noise ratio.
Scored in isolation, some people would give Document 1 a low score because it's missing 2017, and others would give Document 2 a high score because of domain-specific knowledge; which document is more relevant only becomes clear once you do a proper pairwise comparison.
Of course, we still need absolute scores at the end of the day, so now we need to convert an N×N matrix of pairwise comparisons into an N-dimensional vector of per-document scores. That's where things get interesting.

Outline of Training Steps
- Triplet sampling: For each query $q$, retrieve a candidate list of 100 documents via BM25 keyword search and vector search. Sample random triplets $(q,d_i,d_j)$ and query an ensemble of large language models to decide which document is more relevant to $q$ (a code sketch of this step follows the list).
- Pairwise reranker training: Use the labeled triplets to train a lightweight pairwise reranker $\mathrm{R}_{\mathrm{pair}}$ that estimates pairwise relevance scores for tuples $(q,d_i,d_j)$, enabling scalable inference.
- Elo rating computation: For each query $q$, apply $\mathrm{R}_{\mathrm{pair}}$ to infer scores $s_{ij} = \mathrm{R}_{\mathrm{pair}}(q,d_i,d_j)$ across many pairs $(i,j)$, ensuring each document competes in approximately eight pairwise comparisons. Treat these as "games" and fit Elo ratings $e_1,\dots,e_n$ to the documents, yielding their relative relevance.
- Pointwise reranker training: Train a pointwise reranker $\mathrm{R}_{\mathrm{point}}(q,d)$ on the dataset of Elo-rated documents to directly predict document relevance.
- Reinforcement learning fine-tuning: After the initial supervised training, employ reinforcement learning to further refine the reranker and enhance performance.
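Here is a sketch of the first step, triplet sampling. The judges argument stands in for the LLM ensemble; the toy judge below is a placeholder for a real prompted LLM call:

```python
import random

def sample_labeled_triplets(query: str, candidates: list[str], judges, n_pairs: int = 400):
    """Sample random document pairs for a query and label each with an
    ensemble of judges. A real judge prompts an LLM to pick the more
    relevant document, returning 1.0 if it prefers d_i and 0.0 otherwise."""
    triplets = []
    for _ in range(n_pairs):
        d_i, d_j = random.sample(candidates, 2)
        votes = [judge(query, d_i, d_j) for judge in judges]
        score = sum(votes) / len(votes)  # ensemble preference in [0, 1]
        triplets.append((query, d_i, d_j, score))
    return triplets

# Toy judge for illustration only: prefers the longer document.
toy_judge = lambda q, a, b: 1.0 if len(a) > len(b) else 0.0
triplets = sample_labeled_triplets("who won?", ["doc A", "doc B text", "doc C!"], [toy_judge] * 3)
```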
Pairwise Scoring and Elo Ratings
Training a Pairwise Reranker
For every query 𝑞 in our overall data, we sample a small number of document pairs (𝑑ᵢ, 𝑑ⱼ) and infer the relative preference of 𝑑ᵢ over 𝑑ⱼ given 𝑞 using an ensemble of 3 LLMs as a proxy for human annotations.
ⓘ As a sanity check, we confirmed that when the LLMs reach complete consensus, human annotators coincide with that consensus >96% of the time. Since SoTA rerankers often concur only 60–70% of the time, there are clearly enormous gains to be made (random chance is only 50%…).
We then average the three LLM preferences to get an ensemble score ∈ [0,1], with 1 favoring 𝑑𝑖 and 0 favoring 𝑑𝑗.
Of course, filling an N×N comparison matrix with LLM calls for every query is way too expensive. So we train an efficient pairwise reranker by supervised fine-tuning of a small open-source model on the ensemble annotations, using a standard binary cross-entropy (BCE) loss.
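A minimal PyTorch sketch of that training step, assuming a cross-encoder that emits one logit per jointly encoded (q, dᵢ, dⱼ) input (the model and batch layout here are placeholders, not our exact setup):

```python
import torch.nn.functional as F

def pairwise_bce_step(model, batch, optimizer):
    """One supervised step for the pairwise reranker.

    batch["inputs"] jointly encodes (q, d_i, d_j); batch["labels"] holds the
    ensemble scores in [0, 1]. BCE-with-logits accepts soft targets, so a
    2-of-3 LLM consensus trains against the label 0.667 rather than 1.
    """
    logits = model(**batch["inputs"]).logits.squeeze(-1)  # one logit per pair
    loss = F.binary_cross_entropy_with_logits(logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```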
Obtaining Elo Ratings
To turn pairwise scores into per-document ratings, we treat each comparison as a "game": under the Bradley–Terry model that underlies Elo, the probability that 𝑑ᵢ beats 𝑑ⱼ is σ(𝑒ᵢ − 𝑒ⱼ), where σ is the logistic function, and the loss is the cross-entropy between these predicted win probabilities and the observed scores 𝑠ᵢⱼ. Once we've got our loss function defined, it's straightforward to run gradient descent and recover the best-fit Elo scores. We then shift all the scores so that they sum to zero, 𝑒₁ + … + 𝑒ₙ = 0. Really, any offset would work, but zero-centering keeps things tidy, and we'll see why shortly.
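A compact version of that fit in PyTorch (hyperparameters illustrative):

```python
import torch
import torch.nn.functional as F

def fit_elos(pairs, scores, n_docs, steps=500, lr=0.1):
    """Fit per-document Elo ratings from pairwise preferences.

    pairs  : list of (i, j) index pairs that played a "game"
    scores : tensor of preferences s_ij in [0, 1] (1.0 means d_i won)
    """
    e = torch.zeros(n_docs, requires_grad=True)
    opt = torch.optim.Adam([e], lr=lr)
    i_idx = torch.tensor([i for i, _ in pairs])
    j_idx = torch.tensor([j for _, j in pairs])
    for _ in range(steps):
        # Bradley-Terry / Elo model: P(d_i beats d_j) = sigmoid(e_i - e_j)
        loss = F.binary_cross_entropy_with_logits(e[i_idx] - e[j_idx], scores)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        e -= e.mean()  # zero-center: Elo is only defined up to an offset
    return e.detach()
```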
In practice, we use n = 100 documents per query and N_Q = 100,000 queries in total. Computing all n² pairwise inferences for each query would blow up computationally! Instead, we sample only O(n) matchups and still recover almost the same Elo scores.
Think of your documents d₁…dₙ as nodes in a graph and each inferred comparison (q, dᵢ, dⱼ) as an edge. To cover the space efficiently, we pick N = 400 inferences per query by sampling four random "cycles" (each cycle is just a closed ring of edges visiting every document), which gives us enough information to fit scores that closely mirror the exhaustive approach.
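Sampling those cycles takes only a few lines (a sketch):

```python
import random

def sample_cycle_edges(n_docs: int, n_cycles: int = 4) -> list[tuple[int, int]]:
    """Sample comparison pairs as random closed rings over the documents.
    Each cycle contributes n_docs edges, so every document plays
    2 * n_cycles games -- O(n) comparisons instead of O(n^2)."""
    edges = []
    for _ in range(n_cycles):
        order = list(range(n_docs))
        random.shuffle(order)
        # Consecutive documents in the shuffled order play each other,
        # and the last wraps around to the first to close the ring.
        edges += [(order[k], order[(k + 1) % n_docs]) for k in range(n_docs)]
    return edges

edges = sample_cycle_edges(100)  # 4 cycles x 100 docs = 400 comparisons
```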
Cross-Query Bias Adjustment
One subtlety remains: Elo scores are only relative within a single query's candidate list. Because we zero-center each list, a mediocre match for an easy query can end up with the same rating as a great match for a hard one. To make scores comparable across queries, we add a per-query bias term 𝑏(𝐶, 𝑞) that shifts each query's Elo scores onto a common scale (this is why the zero-centering was convenient).
Training the Pointwise Model
And just like that, we've created our magic function 𝑓. Having modeled absolute relevance scores as 𝑓(𝑞, 𝑑) = elo(𝐶, 𝑞, 𝑑) + 𝑏(𝐶, 𝑞) from the candidate lists and the small pairwise comparator, we now supervised fine-tune a reranker using a standard mean-squared-error loss, and get a model that very accurately one-shots 𝑓 in under 100ms! Getting the best results required extensive ablation studies and hyperparameter search (it also involved reinforcement learning, which we'll discuss in a follow-up post). Our conclusion was that fine-tuning Qwen3-4B and Qwen3-1.7B on 𝑓 produced the best rerankers, leading to zerank-1 and zerank-1-small respectively!
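Concretely, the supervised stage reduces to a plain regression onto the Elo-plus-bias targets; a minimal sketch (model and batch layout again placeholders):

```python
import torch.nn.functional as F

def pointwise_mse_step(model, batch, optimizer):
    """One supervised step for the pointwise reranker. batch["inputs"]
    encodes a single (query, document) pair; batch["targets"] holds the
    fitted elo(C, q, d) + b(C, q) relevance scores."""
    preds = model(**batch["inputs"]).logits.squeeze(-1)
    loss = F.mse_loss(preds, batch["targets"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```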
While cross-encoders have become standard practice for NN-based rerankers, existing state-of-the-art research on rerankers focuses primarily on architectural tweaks or augmenting human-annotated data. Our training pipeline, centered around mathematically modeling query-document relevance scores, represents a unique approach to reranking, and the results speak for themselves.
Try it today on HuggingFace or via our API!