Should You Use an LLM as a Reranker? Pros, Cons, and Benchmarks

Jul 20, 2025

Introduction

Retrieval-Augmented Generation (RAG) pipelines live and die by the quality of reranking. A good reranker takes your top-k candidates and decides which ones are actually worth showing to the user or sending to the LLM.

With large language models becoming cheaper and more capable, a natural question comes up: why not just use an LLM as your reranker?

The short answer: pointwise reranking with an LLM doesn’t make sense. The only real case is listwise reranking, and even there the cost/latency tradeoffs are brutal. Let’s unpack why.


Pointwise LLM Reranking: Why It Doesn’t Work

In a pointwise setup, you take a query, pair it with each candidate document, and ask the LLM for a relevance score (a minimal sketch of this pattern follows the list below). This looks like what a cross-encoder does, but in practice it fails on every axis:

  • Latency: you’re calling the model k times per query. At k=50 or 100, that’s 50–100 forward passes through an LLM. Even “fast” models like Gemini Flash choke here.

  • Cost: 500 tokens × k=75 candidates × 1,000 queries × $5.00 per 1M input tokens = $187.50 per 1,000 queries just for input. Add output tokens and you’re easily north of $200/1k queries. A specialized reranker costs an order of magnitude less.

  • Calibration: LLMs don’t produce stable absolute scores. A “0.8 relevance” on one run means nothing on another. Even with carefully engineered prompts, evaluation of Gemini Flash across 17 datasets shows high variance and weaker average NDCG than traditional rerankers.

  • No upside: if you only need query–document scores, a cross-encoder or a purpose-built reranker (zerank-1, monoT5, etc.) is strictly better: faster, cheaper, more accurate.
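
Here’s what that pattern looks like in code. This is a minimal sketch, not a recommendation: `call_llm` is a hypothetical placeholder for whatever client you use (Gemini, OpenAI, etc.), and the loop makes one full LLM call per candidate, which is exactly where the latency and cost blow up.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client; returns the raw completion text."""
    raise NotImplementedError

def pointwise_rerank(query: str, docs: list[str]) -> list[tuple[float, str]]:
    scored = []
    for doc in docs:  # k separate LLM calls -- this is the core problem
        prompt = (
            "Rate the relevance of the document to the query on a scale "
            "from 0 to 1. Reply with only the number.\n\n"
            f"Query: {query}\n\nDocument: {doc}"
        )
        try:
            score = float(call_llm(prompt).strip())
        except ValueError:
            score = 0.0  # uncalibrated scores, and parsing can fail outright
        scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```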

Conclusion: Pointwise reranking with an LLM is just paying more to get worse results.

The Only Interesting Case: Listwise Reranking

Where LLMs are interesting is listwise reranking. Instead of scoring each document separately, you hand the LLM the query and all k candidates in a single prompt, and ask it to order them.

This flips the setup (a code sketch follows the list):

  • Relative vs. absolute: Traditional rerankers give absolute scores. LLMs shine at comparisons: “which of these is better?” If you only need a relative order, that’s where an LLM can add unique value.

  • Cross-document reasoning: An LLM can notice that Doc A partially answers the question but Doc B actually resolves it more directly. Cross-encoders can’t do this because they only ever see one doc at a time.

  • Flexible ranking criteria: You can ask the LLM to prefer “most factual,” “fastest method,” or “least redundant” in the same pass. This is much harder to encode into a single scoring model.
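
A minimal sketch of the listwise pattern, again assuming the same hypothetical `call_llm` placeholder as in the pointwise sketch. The key difference: one call, but the prompt carries every candidate, and you have to defend against the model dropping, duplicating, or inventing indices.

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client; returns the raw completion text."""
    raise NotImplementedError

def listwise_rerank(query: str, docs: list[str]) -> list[str]:
    numbered = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        "Order the documents below from most to least relevant to the "
        "query. Reply with the bracketed indices only, e.g. 2, 0, 1.\n\n"
        f"Query: {query}\n\nDocuments:\n{numbered}"
    )
    reply = call_llm(prompt)  # one call, but the prompt holds all k docs
    order = [int(i) for i in re.findall(r"\d+", reply)]
    # Keep only valid, first-seen indices; append anything the model dropped.
    seen = list(dict.fromkeys(i for i in order if 0 <= i < len(docs)))
    missing = [i for i in range(len(docs)) if i not in seen]
    return [docs[i] for i in seen + missing]
```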

But there’s a catch:

  • Hardcore latency: Now the model has to ingest the query plus all k candidates in one context. At k=50 with 500-token docs, that’s 25k tokens per query. Even on a fast LLM, you’re in hundreds of milliseconds to multiple seconds territory.

  • Hardcore cost: 25k input tokens × $5.00 per 1M = $0.125 per query before output, i.e. $125 per 1,000 queries (worked out below). Run this at scale and you’ll set your cloud budget on fire.
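
The back-of-envelope math, made explicit. The $5.00-per-1M input rate is an assumption carried over from the figures above; substitute your provider’s actual pricing.

```python
PRICE_PER_M_INPUT = 5.00  # USD per 1M input tokens (assumed rate from above)

def input_cost(doc_tokens: int, k: int, queries: int = 1000) -> float:
    """Input-token cost in USD for `queries` reranking calls.

    Ignores the query/instruction overhead that pointwise repeats k times,
    so it's a lower bound for the pointwise case.
    """
    total_tokens = doc_tokens * k * queries
    return total_tokens / 1e6 * PRICE_PER_M_INPUT

print(input_cost(doc_tokens=500, k=75))  # pointwise: 187.5 -> $187.50 / 1k queries
print(input_cost(doc_tokens=500, k=50))  # listwise:  125.0 -> $125.00 / 1k queries
```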

Benchmark Reality Check

We ran Gemini Flash against 17 datasets across multiple verticals. The results back this up:

  • On pointwise tasks, Gemini Flash consistently underperforms specialized rerankers on NDCG while costing 5–10× more and running slower.

  • On listwise tasks, it can capture nuanced orderings, but the compute requirements (tens of thousands of tokens per query) make it impractical for production search unless your QPS is tiny and you care more about subtle ordering than throughput.

Takeaways for Developers

  • Don’t use LLMs pointwise. It’s strictly worse: higher cost, higher latency, uncalibrated scores, weaker accuracy.

  • Listwise is the only reasonable LLM reranking use case. Use it if you have:


    • Complex queries where relative comparison is more important than absolute scores.

    • A low-throughput, high-value domain (e.g. legal, compliance, medical research).

    • Budget to absorb multi-second latencies and per-query costs that dwarf a specialized reranker’s.


  • For everything else, use a specialized reranker. It’s cheaper, faster, and usually more accurate (see the example below).
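
For reference, here’s what the specialized-reranker path can look like with an off-the-shelf open-source cross-encoder via sentence-transformers (an illustrative model choice; hosted rerankers like zerank-1 are called through their own APIs):

```python
from sentence_transformers import CrossEncoder

# A small public MS MARCO cross-encoder; one batched forward pass
# scores every (query, doc) pair -- no per-document LLM calls.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I rotate an API key?"
docs = [
    "API keys can be rotated from the dashboard under Settings > Keys.",
    "Our SDK supports Python 3.9 and above.",
    "Key rotation invalidates the old key after a 24-hour grace period.",
]

scores = model.predict([(query, doc) for doc in docs])
for score, doc in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True):
    print(f"{score:.3f}  {doc}")
```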
