Should You Use LLMs for Reranking? A Deep Dive into Pointwise, Listwise, and Cross-Encoders
Sep 5, 2025
TLDR:
Use cheap recall first. Use a reranker (i.e., a cross-encoder) to reorder the top 50-100. Only run expensive LLM listwise reasoning on the top few if the use case demands it.
Pointwise LLM scoring is uncalibrated and slow. Listwise LLM ranking is more consistent, but is only viable for very small candidate lists due to context length and latency.
Pro: LLMs offer the flexibility of customizable prompts.
Con: They output noisy, inconsistent scores, can miss relevant documents or reference the wrong document ID, and can fail to assign a score to every document.
Dedicated reranker models output consistent, interpretable, and deterministic scores between 0 and 1 for how relevant a document is to a query.
The result is that LLMs are less accurate, slower, and more expensive than dedicated rerankers such as ZeroEntropy's zerank-1.
On 17 datasets, zerank-1 reaches about 0.78 NDCG@10, while GPT-5-mini and GPT-5-nano score around 0.70 NDCG@10 (even after excluding format failures and timeouts from the LLM results). Meanwhile, zerank-1 has a p50 latency of 130ms and is 10-30x cheaper than the small LLMs used for reranking.
You might think to fine-tune an LLM on ranking tasks to improve results and resolve formatting issues. And you'd be right! That's exactly what a dedicated reranker model is.
At ZeroEntropy (YC W25), we spend a lot of time thinking about one deceptively simple question:
How do you ensure the right document appears at the top of search results?
If you’re building retrieval-heavy systems like RAG or AI Agents, you already know the story: keyword search (BM25) is fast but brittle, semantic embeddings are flexible but noisy. The missing ingredient is reranking: a second-pass model that takes a query and candidate documents and reorders them to maximize precision.
In this post, we’ll look at reranking through the lens of large language models (LLMs), why naïvely applying them often fails, and why specialized cross-encoders still remain the most practical tool in real-world pipelines.
Why Reranking Matters
First-stage retrieval (BM25 or vector search) maximizes recall. That ensures the correct answer is somewhere in the top-100 results. But unless it’s boosted to the top few positions, it will never reach the user or the LLM that consumes it.
That’s where reranking comes in: given a query–document pair, a reranker assigns a calibrated relevance score and reorders the candidates accordingly. Unlike embeddings, which encode each document in isolation, rerankers see both query and document at the same time. This context-awareness is what makes rerankers indispensable in modern retrieval pipelines.
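To make this concrete, here is a minimal sketch of second-stage reranking with an off-the-shelf cross-encoder from the sentence-transformers library. The model ID is a public MS MARCO cross-encoder used purely for illustration, and the query and candidates are made up; zerank-1 itself is served via our API and HuggingFace rather than this exact snippet.

```python
# Minimal second-stage reranking sketch (illustrative model and data).
from sentence_transformers import CrossEncoder

query = "how do rerankers differ from embeddings?"
candidates = [  # top-k documents returned by BM25 / vector search
    "Embeddings encode each document in isolation.",
    "A cross-encoder scores the query and document together.",
    "BM25 is a lexical retrieval function.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Score every (query, document) pair jointly, then reorder by score.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```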
Why Use LLMs for Reranking
There are two primary ways to repurpose autoregressive LLMs for reranking:
Pointwise reranking
You prompt the LLM with a query and a single candidate document and ask it to output a relevance score. Do this independently for each candidate and then sort by score.
Listwise reranking
You feed the LLM the query along with the entire candidate set and ask it to reorder the list or select the most relevant items. (Both styles are sketched in code below.)
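Roughly, the two prompting styles look like the following sketch, here written against the OpenAI chat API. The model name, prompt wording, and score parsing are illustrative assumptions rather than a recommended setup, and the parsing step in particular is where things tend to break.

```python
# Illustrative pointwise vs. listwise LLM reranking prompts (not production code).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder; any chat model works the same way

def pointwise_score(query: str, doc: str) -> float:
    """One LLM call per candidate: ask for a standalone relevance score."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Query: {query}\nDocument: {doc}\n"
                       "Reply with only a relevance score between 0 and 1.",
        }],
    )
    return float(resp.choices[0].message.content.strip())  # fragile: may not parse

def listwise_rank(query: str, docs: list[str]) -> str:
    """One LLM call for the whole list: ask for a reordered list of indices."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Query: {query}\nCandidates:\n{numbered}\n"
                       "Return the candidate indices ordered from most to least relevant.",
        }],
    )
    return resp.choices[0].message.content  # still needs parsing and validation
```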
At first glance, both sound appealing: why not leverage the reasoning ability of LLMs for reranking? But each comes with serious drawbacks.
Why Pointwise Reranking with LLMs Fails in Practice
Pointwise reranking seems simple, but in practice it’s brittle:
Calibration issues: LLMs are not trained to produce scores in a fixed range like [0, 1]. You may get out-of-range values, reasoning text mixed in with numbers, or wildly inconsistent scales.
Slow and costly: You must run the LLM once per candidate. At 100 candidates per query, latency and cost balloon.
Not fine-tuned for reranking: General-purpose models (even gpt-4.1-nano or gpt-5-mini) lack the supervision needed to produce stable scores. They sometimes hallucinate explanations, add random tokens, or collapse all results into the same score.
In short, pointwise reranking with LLMs gives you the worst of both worlds: slow inference and unreliable outputs.
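To make the cost point concrete, here is a rough back-of-envelope calculation. The per-call token count, price, and latency are assumptions chosen for illustration, not measured figures.

```python
# Back-of-envelope cost/latency for pointwise LLM scoring (assumed numbers).
candidates_per_query = 100
tokens_per_call = 600            # query + one document + instructions (assumed)
price_per_million_tokens = 0.60  # e.g. a small hosted LLM (assumed)
latency_per_call_s = 0.8         # p50 for one short completion (assumed)

cost_per_query = candidates_per_query * tokens_per_call * price_per_million_tokens / 1e6
sequential_latency_s = candidates_per_query * latency_per_call_s

print(f"~${cost_per_query:.3f} per query, ~{sequential_latency_s:.0f}s if calls run sequentially")
# Even with aggressive parallelism you still pay for 100x the tokens of a single
# cross-encoder pass, and the slowest call sets your tail latency.
```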
Why Listwise Reranking Makes More Sense
Listwise reranking aligns better with how humans judge relevance: by comparing candidates relative to each other. Given the full list, the LLM can reason globally and enforce consistency across rankings.
But there are trade-offs:
Still not trained for the task: Even listwise prompts risk random outputs—extra text, invalid indices, or reordered lists that don’t align with expectations.
Extremely expensive: Encoding all candidates into a single long prompt pushes token limits and multiplies latency. Remember that LLM attention costs O(n²), where n is the number of tokens.
Only feasible for very small candidate sets: Because of this, listwise reranking makes sense if you're refining the top 5-10 results, not the top 100-200 (see the quick calculation below).
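A quick token-budget calculation shows why. The per-document length and overhead below are assumptions for illustration.

```python
# Why listwise prompts only work for small candidate sets (assumed sizes).
tokens_per_doc = 500   # average chunk length in tokens (assumed)
overhead = 200         # query + instructions

for n_candidates in (10, 50, 100, 200):
    prompt_tokens = n_candidates * tokens_per_doc + overhead
    print(f"{n_candidates:>3} candidates -> ~{prompt_tokens:,} prompt tokens")
# 200 candidates already means ~100k tokens in a single call, and self-attention
# cost grows quadratically with that length.
```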
Accuracy Evaluation Across 17 Benchmarks
We evaluated multiple reranking models across MTEB, BEIR, MS MARCO, and domain-specific datasets. Below are highlights comparing ZeroEntropy’s reranker against listwise LLM-based baselines, where we limited the context to 50k tokens.
NDCG@10 (higher is better):
Model | Average NDCG@10 across all datasets |
---|---|
OpenAI text-embedding-small (baseline) | 0.61753 |
Cohere rerank-3.5 | 0.71939 |
gpt-4o-mini | 0.66319 |
gpt-4.1-mini | 0.71312 |
gpt-5-mini | 0.69800 |
gpt-5-nano | 0.71163 |
zerank-1 | 0.77667 |
The pattern is consistent: while LLMs occasionally spike on narrow tasks, cross-encoders trained for reranking outperform or match them across the board, with far lower latency and cost.
Latency and Cost Evaluation Across 17 Benchmarks
Here is a recap of p50 latency and price per million tokens for these models. Because they run at query time, the latency budget can be extremely tight: every extra millisecond counts. And when you rely on third-party LLM APIs, you also risk unpredictable delays, with random spikes of several seconds that can break the user experience.
Model | Cost per one million tokens (input tokens) | Multiplier compared to zerank-1 |
---|---|---|
Cohere rerank-3.5 | $0.050 | 2x |
gpt-4o-mini | $0.60 | 24x |
gpt-4.1-mini | $0.80 | 32x |
gpt-5-mini | $0.250 | 10x |
gpt-5-nano | $0.050 | 2x |
zerank-1 | $0.025 | 1x |
Model | p50 latency for 75KB input | Multiplier compared to zerank-1 |
---|---|---|
Cohere rerank-3.5 | 198.1ms | 1.5x |
gpt-4o-mini | 1090ms | 8.5x |
gpt-4.1-mini | 740ms | 5.5x |
gpt-5-mini | 2180ms | 17x |
gpt-5-nano | 1520ms | 11.5x |
zerank-1 | 129.7ms | 1x |
The General Wisdom: Cheap First, Expensive Last
The lesson is clear:
Always apply the most expensive techniques to the smallest number of candidates.
That’s why production systems layer retrieval like this (a sketch in code follows the list):
Cheap first-stage retrieval (BM25, embeddings, or hybrid) to maximize recall.
Specialized cross-encoder reranker to boost precision in the top 50–100.
(Optional) Apply an expensive listwise LLM-based reasoning step only on the top handful of candidates, if the use case demands it.
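The sketch below shows how the layers fit together. The retriever, reranker, and LLM calls are placeholders for whatever clients you use (for example, a cross-encoder such as zerank-1 behind `cross_encoder_score`); the cutoffs of 100 and 5 are assumptions, not tuned values.

```python
# Layered retrieval sketch: cheap recall -> cross-encoder rerank -> optional LLM step.
# first_stage_search, cross_encoder_score, and llm_listwise_rerank are placeholders
# for your own BM25/vector retriever, reranker client, and LLM call.

def rerank_pipeline(query: str,
                    first_stage_search,       # query -> list[str], high recall
                    cross_encoder_score,      # (query, doc) -> float relevance score
                    llm_listwise_rerank=None  # optional: (query, docs) -> list[str]
                    ) -> list[str]:
    # 1. Cheap first stage: cast a wide net (BM25, embeddings, or hybrid).
    candidates = first_stage_search(query)[:100]

    # 2. Cross-encoder reranker: reorder the top 50-100 by calibrated relevance.
    scored = sorted(candidates,
                    key=lambda doc: cross_encoder_score(query, doc),
                    reverse=True)

    # 3. Optional: expensive listwise LLM reasoning on the top handful only.
    if llm_listwise_rerank is not None:
        top = scored[:5]
        scored = llm_listwise_rerank(query, top) + scored[5:]

    return scored
```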
Cross-encoders like our zerank-1 and zerank-1-small are trained specifically for this task: given a query and document, output a calibrated relevance score. They deliver consistency, efficiency, and accuracy that general-purpose LLMs cannot match.
Score Calibration
• The problem. In production, a reranker’s score must be comparable across queries: a score like 0.82 should mean the same thing no matter the query, prompt, or document length. With pointwise LLM prompts it does not. In our studies, pointwise LLM prompts produced score drift across prompts and candidate sets, and per-query normalizations erased signal.
• What zerank-1 does. We stop asking for absolute numbers. We collect pairwise preferences, train a small comparator to predict them cheaply, then turn many head-to-heads into a single global scale with an Elo-style model. We learn a per-query bias from cross-query matchups to handle barren versus rich queries, and fit one temperature on a held-out set so probabilities line up with observed win rates (a toy sketch of the idea follows below).
• Result and impact. Stable, comparable probabilities across queries. One threshold to tune, clean blending with BM25 and embeddings, and rankings that stay consistent across releases. Lower calibration error with equal or better NDCG.
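As a rough illustration of the idea rather than the actual training code, the sketch below fits Bradley-Terry-style strengths from pairwise preferences and then squashes them with a single temperature so the outputs land on a comparable 0-1 scale. The preference data, learning rate, and temperature are placeholders, and the real pipeline (learned comparator, per-query bias, held-out temperature fit) is more involved.

```python
# Toy sketch of Elo/Bradley-Terry-style calibration from pairwise preferences.
import math
import random

# (winner_doc_id, loser_doc_id) preferences, e.g. from a judge or comparator model.
preferences = [(0, 1), (0, 2), (1, 2), (0, 2), (1, 2)]
n_docs, lr, epochs = 3, 0.1, 200
strength = [0.0] * n_docs  # latent "Elo-like" score per document

for _ in range(epochs):
    random.shuffle(preferences)
    for winner, loser in preferences:
        # P(winner beats loser) under a Bradley-Terry / logistic model.
        p_win = 1.0 / (1.0 + math.exp(strength[loser] - strength[winner]))
        grad = 1.0 - p_win            # gradient of the pair's log-likelihood
        strength[winner] += lr * grad
        strength[loser] -= lr * grad

temperature = 1.5  # in practice fit on held-out data so probabilities match win rates
calibrated = [1.0 / (1.0 + math.exp(-s / temperature)) for s in strength]
print([round(c, 3) for c in calibrated])  # comparable scores in (0, 1)
```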
Takeaways
Pointwise reranking with LLMs is impractical: uncalibrated, slow, and unreliable.
Listwise reranking is conceptually stronger but only viable for tiny candidate sets, due to cost and fragility.
Specialized rerankers like zerank-1 deliver calibrated, consistent, and scalable results—and should be your go-to tool for reranking.
Reserve expensive LLM reasoning for the very top-k documents where absolute precision matters most.
At ZeroEntropy, our mission is to push retrieval systems toward both higher accuracy and real-world deployability. Our latest reranker models (Apache 2.0 licensed) are now available on HuggingFace and via our API—try them out today.