Introduction
Retrieval-Augmented Generation (RAG) pipelines live and die by the quality of reranking. A good reranker takes your top-k candidates and decides which ones are actually worth showing to the user or sending to the LLM.
With large language models becoming cheaper and more capable, a natural question comes up: why not just use an LLM as your reranker?
The short answer: pointwise reranking with an LLM doesn’t make sense. The only real case is listwise reranking, and even there the cost/latency tradeoffs are brutal. Let’s unpack why.
Pointwise LLM Reranking: Why It Doesn’t Work
In a pointwise setup, you take a query, pair it with each candidate document, and ask the LLM for a relevance score. This looks like what a cross-encoder does, but in practice it fails on every axis:
Latency: you’re calling the model k times per query. At k=50 or 100, that’s 50–100 forward passes through an LLM. Even “fast” models like Gemini Flash choke here.
Cost: 500 tokens × k=75 candidates × $5.00 per 1M tokens ≈ $0.19 per query, or about $187.50 per 1,000 queries just for input. Add output tokens and you’re easily north of $200 per 1,000 queries. A specialized reranker costs at least an order of magnitude less.
Calibration: LLMs don’t produce stable absolute scores. A “0.8 relevance” on one run means nothing on another. Even with carefully engineered prompts, evaluation of Gemini Flash across 17 datasets shows high variance and weaker average NDCG than traditional rerankers.
No upside: if you only need query–document scores, a cross-encoder or a purpose-built reranker (zerank-1, monoT5, etc.) is strictly better: faster, cheaper, more accurate.
Conclusion: Pointwise reranking with an LLM is just paying more to get worse results.
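
To make the pointwise setup concrete, here is a minimal sketch of what it looks like in code. It is not any particular library’s API: call_llm is a hypothetical helper standing in for whatever chat-completion client you use, and the 0.0–1.0 scoring prompt is an illustrative assumption.

```python
# Minimal pointwise LLM reranking sketch.
# call_llm(prompt) -> str is a hypothetical stand-in for your chat-completion
# client; the 0.0-1.0 scoring prompt is an illustrative assumption.

POINTWISE_PROMPT = """Query: {query}

Document: {doc}

On a scale from 0.0 to 1.0, how relevant is the document to the query?
Answer with a single number and nothing else."""

def pointwise_rerank(query: str, candidates: list[str], call_llm) -> list[tuple[str, float]]:
    scored = []
    for doc in candidates:  # k separate LLM calls per query
        raw = call_llm(POINTWISE_PROMPT.format(query=query, doc=doc))
        try:
            score = float(raw.strip())
        except ValueError:
            score = 0.0  # unparseable output: treat as irrelevant
        scored.append((doc, score))
    # Sort by the LLM's absolute scores, best first. These scores are not
    # calibrated across runs, which is exactly the problem described above.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The structural problems are visible right in the loop: k calls’ worth of latency, k times the per-candidate token cost, and a final sort over absolute scores the model was never trained to keep stable.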
The Only Interesting Case: Listwise Reranking
Where LLMs are interesting is listwise reranking. Instead of scoring each document separately, you hand the LLM the query and all k candidates in a single prompt, and ask it to order them.
This flips the setup:
Relative vs. absolute: Traditional rerankers give absolute scores. LLMs shine at comparisons: “which of these is better?” If you only need a relative order, that’s where an LLM can add unique value.
Cross-document reasoning: An LLM can notice that Doc A partially answers the question but Doc B actually resolves it more directly. Cross-encoders can’t do this because they only ever see one doc at a time.
Flexible ranking criteria: You can ask the LLM to prefer “most factual,” “fastest method,” or “least redundant” in the same pass. This is much harder to encode into a single scoring model.
But there’s a catch:
Hardcore latency: Now the model has to ingest the query plus all k candidates in one context. At k=50 with 500-token docs, that’s 25k tokens per query. Even on a fast LLM, you’re in hundreds of milliseconds to multiple seconds territory.
Hardcore cost: 25k input tokens × $5.00 per 1M ≈ $0.125 per query before output, i.e. about $125 per 1,000 queries. Run this at scale and you’ll set your cloud budget on fire.
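
For contrast, here is a minimal listwise sketch under the same assumptions: call_llm is a hypothetical chat-completion helper, and the numbered-document prompt plus the "[2] > [1] > [3]" output format are illustrative choices, not a prescribed API. The criterion argument shows how the flexible ranking criteria mentioned above drop in without changing the model.

```python
# Minimal listwise LLM reranking sketch: one call, all k candidates in context.
# call_llm(prompt) -> str is a hypothetical helper; the prompt layout and the
# "[2] > [1] > [3]" output format are illustrative assumptions.
import re

LISTWISE_PROMPT = """Query: {query}

Rank the documents below from best to worst according to this criterion:
{criterion}.

{docs}

Answer only with the ranking, e.g. [2] > [1] > [3]."""

def listwise_rerank(query: str, candidates: list[str], call_llm,
                    criterion: str = "most relevant to the query") -> list[str]:
    docs_block = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(candidates))
    raw = call_llm(LISTWISE_PROMPT.format(query=query, criterion=criterion,
                                          docs=docs_block))
    # Pull bracketed indices out of the answer in the order the model gave them.
    order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", raw)]
    seen, ranking = set(), []
    for idx in order:
        if 0 <= idx < len(candidates) and idx not in seen:
            seen.add(idx)
            ranking.append(candidates[idx])
    # Append anything the model dropped, keeping the original retrieval order.
    ranking += [doc for i, doc in enumerate(candidates) if i not in seen]
    return ranking
```

The single call is where both the upside and the pain come from: the model sees every candidate at once, so it can reason across documents, but it also has to ingest the entire 25k-token context before producing one ranking.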
Benchmark Reality Check
We ran Gemini Flash against 17 datasets across multiple verticals. The results back this up:
On pointwise tasks, Gemini Flash consistently underperforms specialized rerankers on NDCG while costing 5–10× more and running slower.
On listwise tasks, it can capture nuanced orderings, but the compute requirements (tens of thousands of tokens per query) make it impractical for production search unless your QPS is tiny and you care more about subtle ordering than throughput.
Takeaways for Developers
Don’t use LLMs pointwise. It’s strictly worse: higher cost, higher latency, uncalibrated scores, and weaker accuracy.
Listwise is the only reasonable LLM reranking use case. Use it if you have:
Complex queries where relative comparison is more important than absolute scores.
A low-throughput, high-value domain (e.g. legal, compliance, medical research).
Budget to absorb multi-second latencies and per-query costs that are orders of magnitude above a specialized reranker (a back-of-envelope sketch follows this list).
For everything else, use a specialized reranker. It’s cheaper, faster, and usually more accurate.
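
If you want to sanity-check where your own workload lands, a quick back-of-envelope is enough. The prices and token counts below are the illustrative assumptions used earlier in this post, not quotes from any provider.

```python
# Back-of-envelope LLM reranking cost, per 1,000 queries.
# Prices and token counts are illustrative assumptions, not provider quotes.

def cost_per_1k_queries(tokens_per_doc: int, k: int, usd_per_1m_input: float,
                        output_tokens: int = 0, usd_per_1m_output: float = 0.0) -> float:
    # Total input tokens are the same whether you send k small prompts
    # (pointwise) or one big prompt (listwise): tokens_per_doc * k.
    input_tokens = tokens_per_doc * k
    per_query = (input_tokens * usd_per_1m_input
                 + output_tokens * usd_per_1m_output) / 1_000_000
    return per_query * 1_000

# Pointwise example from above: 500-token docs, k=75, $5.00 per 1M input tokens
print(cost_per_1k_queries(500, 75, 5.00))   # ~187.5 USD per 1k queries
# Listwise example from above: 500-token docs, k=50, $5.00 per 1M input tokens
print(cost_per_1k_queries(500, 50, 5.00))   # ~125.0 USD per 1k queries
```

Swap in your own candidate counts, chunk sizes, and output-token estimates, then compare against your reranker’s pricing before committing to listwise LLM reranking in production.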