Introduction
Retrieval-Augmented Generation (RAG) pipelines live and die by the quality of reranking. A good reranker takes your top-k candidates and decides which ones are actually worth showing to the user or sending to the LLM.
With large language models becoming cheaper and more capable, a natural question comes up: why not just use an LLM as your reranker?
The short answer: pointwise reranking with an LLM doesn’t make sense. The only real case is listwise reranking, and even there the cost/latency tradeoffs are brutal. Let’s unpack why.
Pointwise LLM Reranking: Why It Doesn’t Work
In a pointwise setup, you take a query, pair it with each candidate document, and ask the LLM for a relevance score. This looks like what a cross-encoder does, but in practice it fails on every axis:
Latency: you’re calling the model k times per query. At k=50 or 100, that’s 50–100 forward passes through an LLM. Even “fast” models like Gemini Flash choke here.
Cost: 500 tokens × k=75 candidates × $5.00 per 1M tokens ≈ $0.19 per query, or about $187.50 per 1,000 queries just for input. Add output tokens and you’re easily north of $200 per 1,000 queries. A specialized reranker costs at least an order of magnitude less.
Calibration: LLMs don’t produce stable absolute scores. A “0.8 relevance” on one run means nothing on another. Even with carefully engineered prompts, evaluation of Gemini Flash across 17 datasets shows high variance and weaker average NDCG than traditional rerankers.
No upside: if you only need query–document scores, a cross-encoder or a purpose-built reranker (zerank-1, monoT5, etc.) is strictly better: faster, cheaper, more accurate.
Conclusion: Pointwise reranking with an LLM is just paying more to get worse results.
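
To make the pointwise setup concrete, here is a minimal sketch of what it looks like in code. It is not any particular library’s API: call_llm is a hypothetical helper standing in for whatever chat-completion client you use, and the 0.0–1.0 scoring prompt is an illustrative assumption.

```python
# Minimal pointwise LLM reranking sketch.
# call_llm(prompt) -> str is a hypothetical stand-in for your chat-completion
# client; the 0.0-1.0 scoring prompt is an illustrative assumption.

POINTWISE_PROMPT = """Query: {query}

Document: {doc}

On a scale from 0.0 to 1.0, how relevant is the document to the query?
Answer with a single number and nothing else."""

def pointwise_rerank(query: str, candidates: list[str], call_llm) -> list[tuple[str, float]]:
    scored = []
    for doc in candidates:  # k separate LLM calls per query
        raw = call_llm(POINTWISE_PROMPT.format(query=query, doc=doc))
        try:
            score = float(raw.strip())
        except ValueError:
            score = 0.0  # unparseable output: treat as irrelevant
        scored.append((doc, score))
    # Sort by the LLM's absolute scores, best first. These scores are not
    # calibrated across runs, which is exactly the problem described above.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The structural problems are visible right in the loop: k calls’ worth of latency, k times the per-candidate token cost, and a final sort over absolute scores the model was never trained to keep stable.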
The Only Interesting Case: Listwise Reranking
Where LLMs are interesting is listwise reranking. Instead of scoring each document separately, you hand the LLM the query and all k candidates in a single prompt, and ask it to order them.
This flips the setup:
Relative vs. absolute: Traditional rerankers give absolute scores. LLMs shine at comparisons: “which of these is better?” If you only need a relative order, that’s where an LLM can add unique value.
Cross-document reasoning: An LLM can notice that Doc A partially answers the question but Doc B actually resolves it more directly. Cross-encoders can’t do this because they only ever see one doc at a time.
Flexible ranking criteria: You can ask the LLM to prefer “most factual,” “fastest method,” or “least redundant” in the same pass. This is much harder to encode into a single scoring model.
But there’s a catch:
Hardcore latency: Now the model has to ingest the query plus all k candidates in one context. At k=50 with 500-token docs, that’s 25k tokens per query. Even on a fast LLM, you’re in hundreds of milliseconds to multiple seconds territory.
Hardcore cost: 25k input tokens × $5.00 per 1M ≈ $0.125 per query before output, i.e. about $125 per 1,000 queries. Run this at scale and you’ll set your cloud budget on fire.
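
For contrast, here is a minimal listwise sketch under the same assumptions: call_llm is a hypothetical chat-completion helper, and the numbered-document prompt plus the "[2] > [1] > [3]" output format are illustrative choices, not a prescribed API. The criterion argument shows how the flexible ranking criteria mentioned above drop in without changing the model.

```python
# Minimal listwise LLM reranking sketch: one call, all k candidates in context.
# call_llm(prompt) -> str is a hypothetical helper; the prompt layout and the
# "[2] > [1] > [3]" output format are illustrative assumptions.
import re

LISTWISE_PROMPT = """Query: {query}

Rank the documents below from best to worst according to this criterion:
{criterion}.

{docs}

Answer only with the ranking, e.g. [2] > [1] > [3]."""

def listwise_rerank(query: str, candidates: list[str], call_llm,
                    criterion: str = "most relevant to the query") -> list[str]:
    docs_block = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(candidates))
    raw = call_llm(LISTWISE_PROMPT.format(query=query, criterion=criterion,
                                          docs=docs_block))
    # Pull bracketed indices out of the answer in the order the model gave them.
    order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", raw)]
    seen, ranking = set(), []
    for idx in order:
        if 0 <= idx < len(candidates) and idx not in seen:
            seen.add(idx)
            ranking.append(candidates[idx])
    # Append anything the model dropped, keeping the original retrieval order.
    ranking += [doc for i, doc in enumerate(candidates) if i not in seen]
    return ranking
```

The single call is where both the upside and the pain come from: the model sees every candidate at once, so it can reason across documents, but it also has to ingest the entire 25k-token context before producing one ranking.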
Benchmark Reality Check
We ran Gemini Flash against 17 datasets across multiple verticals. The results back this up:
On pointwise tasks, Gemini Flash consistently underperforms specialized rerankers on NDCG while costing 5–10× more and running slower.
On listwise tasks, it can capture nuanced orderings, but the compute requirements (tens of thousands of tokens per query) make it impractical for production search unless your QPS is tiny and you care more about subtle ordering than throughput.
Takeaways for Developers
Don’t use LLMs pointwise. It’s strictly worse: higher cost, higher latency, uncalibrated scores, and weaker accuracy.
Listwise is the only reasonable LLM reranking use case. Use it if you have:
Complex queries where relative comparison is more important than absolute scores.
A low-throughput, high-value domain (e.g. legal, compliance, medical research).
Budget to absorb multi-second latencies and per-query costs that are orders of magnitude above a specialized reranker (a back-of-envelope sketch follows this list).
For everything else, use a specialized reranker. It’s cheaper, faster, and usually more accurate.
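
If you want to sanity-check where your own workload lands, a quick back-of-envelope is enough. The prices and token counts below are the illustrative assumptions used earlier in this post, not quotes from any provider.

```python
# Back-of-envelope LLM reranking cost, per 1,000 queries.
# Prices and token counts are illustrative assumptions, not provider quotes.

def cost_per_1k_queries(tokens_per_doc: int, k: int, usd_per_1m_input: float,
                        output_tokens: int = 0, usd_per_1m_output: float = 0.0) -> float:
    # Total input tokens are the same whether you send k small prompts
    # (pointwise) or one big prompt (listwise): tokens_per_doc * k.
    input_tokens = tokens_per_doc * k
    per_query = (input_tokens * usd_per_1m_input
                 + output_tokens * usd_per_1m_output) / 1_000_000
    return per_query * 1_000

# Pointwise example from above: 500-token docs, k=75, $5.00 per 1M input tokens
print(cost_per_1k_queries(500, 75, 5.00))   # ~187.5 USD per 1k queries
# Listwise example from above: 500-token docs, k=50, $5.00 per 1M input tokens
print(cost_per_1k_queries(500, 50, 5.00))   # ~125.0 USD per 1k queries
```

Swap in your own candidate counts, chunk sizes, and output-token estimates, then compare against your reranker’s pricing before committing to listwise LLM reranking in production.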