Should You Use LLMs for Reranking? A Deep Dive into Pointwise, Listwise, and Cross-Encoders
Sep 5, 2025
TLDR:
Use cheap recall first. Use a reranker (i.e., a cross-encoder) to reorder the top 50-100. Only run expensive LLM listwise reasoning on the top few if the use case demands it.
Pointwise LLM scoring is uncalibrated and slow. Listwise LLM ranking is more consistent, but is only viable for very small candidate lists due to context length and latency.
Pro: LLMs offer the flexibility of customizable prompts.
Con: They output noisy, inconsistent scores, can miss relevant documents or reference the wrong document ID, and can fail to assign a score to every document.
Dedicated reranker models output consistent, interpretable, and deterministic scores between 0 and 1 for how relevant a document is to a query.
The result is that LLMs are less accurate, slower, and more expensive than dedicated rerankers such as ZeroEntropy's zerank-1.
On 17 datasets, zerank-1 reaches about 0.78 NDCG@10, while GPT-5-mini and GPT-5-nano score around 0.70 NDCG@10 (even after excluding format failures and timeouts from the LLM results). Meanwhile, zerank-1 has a p50 latency of 130ms and is 10-30x cheaper than the small LLMs used for reranking.
You might think to fine-tune an LLM on ranking tasks to improve results and resolve formatting issues. And you'd be right! That's exactly what a dedicated reranker model is.
At ZeroEntropy (YC W25), we spend a lot of time thinking about one deceptively simple question:
How do you ensure the right document appears at the top of search results?
If you’re building retrieval-heavy systems like RAG or AI Agents, you already know the story: keyword search (BM25) is fast but brittle, semantic embeddings are flexible but noisy. The missing ingredient is reranking: a second-pass model that takes a query and candidate documents and reorders them to maximize precision.
In this post, we’ll look at reranking through the lens of large language models (LLMs), why naïvely applying them often fails, and why specialized cross-encoders still remain the most practical tool in real-world pipelines.
Why Reranking Matters
First-stage retrieval (BM25 or vector search) maximizes recall. That ensures the correct answer is somewhere in the top-100 results. But unless it’s boosted to the top few positions, it will never reach the user or the LLM that consumes it.
That’s where reranking comes in: given a query–document pair, a reranker assigns a calibrated relevance score and reorders the candidates accordingly. Unlike embeddings, which encode each document in isolation, rerankers see both query and document at the same time. This context-awareness is what makes rerankers indispensable in modern retrieval pipelines.
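To make this concrete, here is a minimal sketch of second-stage reranking with an off-the-shelf cross-encoder from the sentence-transformers library. The model ID is a public MS MARCO cross-encoder used purely for illustration, and the query and candidates are made up; zerank-1 itself is served via our API and HuggingFace rather than this exact snippet.

```python
# Minimal second-stage reranking sketch (illustrative model and data).
from sentence_transformers import CrossEncoder

query = "how do rerankers differ from embeddings?"
candidates = [  # top-k documents returned by BM25 / vector search
    "Embeddings encode each document in isolation.",
    "A cross-encoder scores the query and document together.",
    "BM25 is a lexical retrieval function.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Score every (query, document) pair jointly, then reorder by score.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```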
Why Use LLMs for Reranking
There are two primary ways to repurpose autoregressive LLMs for reranking:
Pointwise reranking
You prompt the LLM with a query and a single candidate document and ask it to output a relevance score. Do this independently for each candidate and then sort by score.
Listwise reranking
You feed the LLM the query along with the entire candidate set and ask it to reorder the list or select the most relevant items. (Both styles are sketched in code below.)
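Roughly, the two prompting styles look like the following sketch, here written against the OpenAI chat API. The model name, prompt wording, and score parsing are illustrative assumptions rather than a recommended setup, and the parsing step in particular is where things tend to break.

```python
# Illustrative pointwise vs. listwise LLM reranking prompts (not production code).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder; any chat model works the same way

def pointwise_score(query: str, doc: str) -> float:
    """One LLM call per candidate: ask for a standalone relevance score."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Query: {query}\nDocument: {doc}\n"
                       "Reply with only a relevance score between 0 and 1.",
        }],
    )
    return float(resp.choices[0].message.content.strip())  # fragile: may not parse

def listwise_rank(query: str, docs: list[str]) -> str:
    """One LLM call for the whole list: ask for a reordered list of indices."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Query: {query}\nCandidates:\n{numbered}\n"
                       "Return the candidate indices ordered from most to least relevant.",
        }],
    )
    return resp.choices[0].message.content  # still needs parsing and validation
```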
At first glance, both sound appealing: why not leverage the reasoning ability of LLMs for reranking? But each comes with serious drawbacks.
Why Pointwise Reranking with LLMs Fails in Practice
Pointwise reranking seems simple, but in practice it’s brittle:
Calibration issues: LLMs are not trained to produce scores in a fixed range like [0, 1]. You may get out-of-range values, reasoning text mixed in with numbers, or wildly inconsistent scales.
Slow and costly: You must run the LLM once per candidate. At 100 candidates per query, latency and cost balloon.
Not fine-tuned for reranking: General-purpose models (even gpt-4.1-nano or gpt-5-mini) lack the supervision needed to produce stable scores. They sometimes hallucinate explanations, add random tokens, or collapse all results into the same score.
In short, pointwise reranking with LLMs gives you the worst of both worlds: slow inference and unreliable outputs.
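To make the cost point concrete, here is a rough back-of-envelope calculation. The per-call token count, price, and latency are assumptions chosen for illustration, not measured figures.

```python
# Back-of-envelope cost/latency for pointwise LLM scoring (assumed numbers).
candidates_per_query = 100
tokens_per_call = 600            # query + one document + instructions (assumed)
price_per_million_tokens = 0.60  # e.g. a small hosted LLM (assumed)
latency_per_call_s = 0.8         # p50 for one short completion (assumed)

cost_per_query = candidates_per_query * tokens_per_call * price_per_million_tokens / 1e6
sequential_latency_s = candidates_per_query * latency_per_call_s

print(f"~${cost_per_query:.3f} per query, ~{sequential_latency_s:.0f}s if calls run sequentially")
# Even with aggressive parallelism you still pay for 100x the tokens of a single
# cross-encoder pass, and the slowest call sets your tail latency.
```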
Why Listwise Reranking Makes More Sense
Listwise reranking aligns better with how humans judge relevance: by comparing candidates relative to each other. Given the full list, the LLM can reason globally and enforce consistency across rankings.
But there are trade-offs:
Still not trained for the task: Even listwise prompts risk random outputs—extra text, invalid indices, or reordered lists that don’t align with expectations.
Extremely expensive: Encoding all candidates into a single long prompt pushes token limits and multiplies latency. Remember that LLM attention costs O(n²), where n is the number of tokens.
Only feasible for very small candidate sets: Because of this, listwise reranking makes sense if you're refining the top 5-10 results, not the top 100-200 (see the quick calculation below).
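A quick token-budget calculation shows why. The per-document length and overhead below are assumptions for illustration.

```python
# Why listwise prompts only work for small candidate sets (assumed sizes).
tokens_per_doc = 500   # average chunk length in tokens (assumed)
overhead = 200         # query + instructions

for n_candidates in (10, 50, 100, 200):
    prompt_tokens = n_candidates * tokens_per_doc + overhead
    print(f"{n_candidates:>3} candidates -> ~{prompt_tokens:,} prompt tokens")
# 200 candidates already means ~100k tokens in a single call, and self-attention
# cost grows quadratically with that length.
```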
Accuracy Evaluation Across 17 Benchmarks
We evaluated multiple reranking models across MTEB, BEIR, MS MARCO, and domain-specific datasets. Below are highlights comparing ZeroEntropy’s reranker against listwise LLM-based baselines, where we limited the context to 50k tokens.
NDCG@10 (higher is better):
Model | Average NDCG@10 across all datasets |
---|---|
OpenAI text-embedding-small (baseline) | 0.61753 |
Cohere rerank-3.5 | 0.71939 |
gpt-4o-mini | 0.66319 |
gpt-4.1-mini | 0.71312 |
gpt-5-mini | 0.69800 |
gpt-5-nano | 0.71163 |
zerank-1 | 0.77667 |
The pattern is consistent: while LLMs occasionally spike on narrow tasks, cross-encoders trained for reranking outperform or match them across the board, with far lower latency and cost.
Latency and Cost Evaluation Across 17 Benchmarks
Here is a recap of p50 latency and price per million tokens for these models. Because they run at query time, the latency budget can be extremely tight: every extra millisecond counts. And when you rely on third-party LLM APIs, you also risk unpredictable delays, with random spikes of several seconds that can break the user experience.
Model | Cost per one million tokens (input tokens) | Multiplier compared to zerank-1 |
---|---|---|
Cohere rerank-3.5 | $0.050 | 2x |
gpt-4o-mini | $0.60 | 24x |
gpt-4.1-mini | $0.80 | 32x |
gpt-5-mini | $0.250 | 10x |
gpt-5-nano | $0.050 | 2x |
zerank-1 | $0.025 | 1x |
Model | p50 latency for 75KB input | Multiplier compared to zerank-1 |
---|---|---|
Cohere rerank-3.5 | 198.1ms | 1.5x |
gpt-4o-mini | 1090ms | 8.5x |
gpt-4.1-mini | 740ms | 5.5x |
gpt-5-mini | 2180ms | 17x |
gpt-5-nano | 1520ms | 11.5x |
zerank-1 | 129.7ms | 1x |
The General Wisdom: Cheap First, Expensive Last
The lesson is clear:
Always apply the most expensive techniques to the smallest number of candidates.
That’s why production systems layer retrieval like this (a sketch in code follows the list):
Cheap first-stage retrieval (BM25, embeddings, or hybrid) to maximize recall.
Specialized cross-encoder reranker to boost precision in the top 50–100.
(Optional) Apply an expensive listwise LLM-based reasoning step only on the top handful of candidates, if the use case demands it.
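The sketch below shows how the layers fit together. The retriever, reranker, and LLM calls are placeholders for whatever clients you use (for example, a cross-encoder such as zerank-1 behind `cross_encoder_score`); the cutoffs of 100 and 5 are assumptions, not tuned values.

```python
# Layered retrieval sketch: cheap recall -> cross-encoder rerank -> optional LLM step.
# first_stage_search, cross_encoder_score, and llm_listwise_rerank are placeholders
# for your own BM25/vector retriever, reranker client, and LLM call.

def rerank_pipeline(query: str,
                    first_stage_search,       # query -> list[str], high recall
                    cross_encoder_score,      # (query, doc) -> float relevance score
                    llm_listwise_rerank=None  # optional: (query, docs) -> list[str]
                    ) -> list[str]:
    # 1. Cheap first stage: cast a wide net (BM25, embeddings, or hybrid).
    candidates = first_stage_search(query)[:100]

    # 2. Cross-encoder reranker: reorder the top 50-100 by calibrated relevance.
    scored = sorted(candidates,
                    key=lambda doc: cross_encoder_score(query, doc),
                    reverse=True)

    # 3. Optional: expensive listwise LLM reasoning on the top handful only.
    if llm_listwise_rerank is not None:
        top = scored[:5]
        scored = llm_listwise_rerank(query, top) + scored[5:]

    return scored
```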
Cross-encoders like our zerank-1 and zerank-1-small are trained specifically for this task: given a query and document, output a calibrated relevance score. They deliver consistency, efficiency, and accuracy that general-purpose LLMs cannot match.
Score Calibration
• The problem. In production, a reranker’s score must be comparable across queries: a score like 0.82 should mean the same thing no matter the query, prompt, or document length. With pointwise LLM prompts it does not. In our studies, pointwise LLM prompts produced score drift across prompts and candidate sets, and per-query normalizations erased signal.
• What zerank-1 does. We stop asking for absolute numbers. We collect pairwise preferences, train a small comparator to predict them cheaply, then turn many head-to-heads into a single global scale with an Elo-style model. We learn a per-query bias from cross-query matchups to handle barren versus rich queries, and fit one temperature on a held-out set so probabilities line up with observed win rates (a toy sketch of the idea follows below).
• Result and impact. Stable, comparable probabilities across queries. One threshold to tune, clean blending with BM25 and embeddings, and rankings that stay consistent across releases. Lower calibration error with equal or better NDCG.
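As a rough illustration of the idea rather than the actual training code, the sketch below fits Bradley-Terry-style strengths from pairwise preferences and then squashes them with a single temperature so the outputs land on a comparable 0-1 scale. The preference data, learning rate, and temperature are placeholders, and the real pipeline (learned comparator, per-query bias, held-out temperature fit) is more involved.

```python
# Toy sketch of Elo/Bradley-Terry-style calibration from pairwise preferences.
import math
import random

# (winner_doc_id, loser_doc_id) preferences, e.g. from a judge or comparator model.
preferences = [(0, 1), (0, 2), (1, 2), (0, 2), (1, 2)]
n_docs, lr, epochs = 3, 0.1, 200
strength = [0.0] * n_docs  # latent "Elo-like" score per document

for _ in range(epochs):
    random.shuffle(preferences)
    for winner, loser in preferences:
        # P(winner beats loser) under a Bradley-Terry / logistic model.
        p_win = 1.0 / (1.0 + math.exp(strength[loser] - strength[winner]))
        grad = 1.0 - p_win            # gradient of the pair's log-likelihood
        strength[winner] += lr * grad
        strength[loser] -= lr * grad

temperature = 1.5  # in practice fit on held-out data so probabilities match win rates
calibrated = [1.0 / (1.0 + math.exp(-s / temperature)) for s in strength]
print([round(c, 3) for c in calibrated])  # comparable scores in (0, 1)
```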
Takeaways
Pointwise reranking with LLMs is impractical: uncalibrated, slow, and unreliable.
Listwise reranking is conceptually stronger but only viable for tiny candidate sets, due to cost and fragility.
Specialized rerankers like zerank-1 deliver calibrated, consistent, and scalable results—and should be your go-to tool for reranking.
Reserve expensive LLM reasoning for the very top-k documents where absolute precision matters most.
At ZeroEntropy, our mission is to push retrieval systems toward both higher accuracy and real-world deployability. Our latest reranker models (Apache 2.0 licensed) are now available on HuggingFace and via our API—try them out today.