Should You Use an LLM as a Reranker? Pros, Cons, and Benchmarks

Jul 20, 2025


Introduction

LLM Reranking in RAG: A Pragmatic Cost-Benefit Analysis

Reranking quality often determines whether your RAG pipeline succeeds or fails. As LLMs become cheaper and more capable, many engineers wonder: should we replace our specialized rerankers with LLMs?

After benchmarking Gemini Flash across 17 retrieval datasets and deploying LLM-based reranking in production, here's what we've learned about when it makes sense and when it doesn't.

The Economics of Pointwise LLM Reranking

Pointwise reranking treats each query-document pair independently, scoring them one at a time. This mirrors how cross-encoders work, but with fundamentally different economics.
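To make the mechanics concrete, here is a minimal pointwise sketch. The `call_llm` wrapper, the 0-10 scoring prompt, and the score parsing are illustrative assumptions rather than any specific provider's API; the point is simply that every candidate costs one LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

POINTWISE_PROMPT = (
    "Query: {query}\n"
    "Document: {document}\n"
    "On a scale of 0 to 10, how relevant is this document to the query? "
    "Answer with a single number."
)

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's client (e.g. Gemini Flash)."""
    raise NotImplementedError

def score_pair(query: str, document: str) -> float:
    raw = call_llm(POINTWISE_PROMPT.format(query=query, document=document))
    try:
        return float(raw.strip())
    except ValueError:
        # LLM scores can be unparseable or unstable, even at temperature=0.
        return 0.0

def pointwise_rerank(query: str, candidates: list[str], max_workers: int = 8) -> list[str]:
    # One call per candidate: latency and cost scale linearly with k.
    # Parallel calls cut wall-clock time but run into provider rate limits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda d: score_pair(query, d), candidates))
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked]
```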

The math doesn't work out. Consider a typical setup with k=75 candidates:

  • Latency: 75 sequential LLM calls per query. Even with Gemini Flash at ~200ms per call, you're looking at 15+ seconds end-to-end. Parallelization helps but introduces orchestration complexity and still bottlenecks on rate limits.

  • Cost: With 500 tokens per query-document pair at $0.50/1M input tokens, you're paying $18.75 per 1,000 queries just for inputs. Add output tokens for scores and explanations, and you're approaching $25-30 per 1,000 queries. A specialized reranker like Cohere Rerank costs about $2 per 1,000 queries, roughly an order of magnitude cheaper.

  • Accuracy: Our benchmarks show Gemini Flash averaging 0.68 NDCG@10 across datasets, compared to 0.74 for purpose-built rerankers like BGE-reranker-v2. LLMs produce unstable scores that vary between runs even with temperature=0, making threshold-based filtering unreliable.

When pointwise might make sense: If you need natural language explanations for why each document was scored a certain way (for debugging, auditing, or user-facing justifications), an LLM becomes the only option. But recognize you're paying a premium for interpretability, not accuracy.

Listwise Reranking: The Only Compelling Use Case

Listwise reranking flips the paradigm. Instead of scoring documents independently, you provide the LLM with the query and all k candidates in a single context window, asking it to produce a ranked ordering.
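Here is a minimal listwise sketch, again with a hypothetical `call_llm` wrapper and an assumed `[i] > [j]` output convention; the key difference from the pointwise version is that the whole candidate set travels in a single prompt and the model returns a permutation.

```python
import re

LISTWISE_PROMPT = (
    "Query: {query}\n\n"
    "Candidate passages:\n{passages}\n\n"
    "Rank the passages from most to least relevant to the query. "
    "Return only the passage numbers, e.g. [3] > [1] > [2]."
)

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's client."""
    raise NotImplementedError

def listwise_rerank(query: str, candidates: list[str]) -> list[str]:
    passages = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(candidates))
    raw = call_llm(LISTWISE_PROMPT.format(query=query, passages=passages))
    order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", raw)]
    # Guard against dropped or hallucinated indices: keep valid, unique ones,
    # then append anything the model omitted in its original order.
    seen: set[int] = set()
    ranked: list[str] = []
    for i in order:
        if 0 <= i < len(candidates) and i not in seen:
            seen.add(i)
            ranked.append(candidates[i])
    ranked.extend(d for i, d in enumerate(candidates) if i not in seen)
    return ranked
```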

This approach unlocks capabilities that traditional rerankers can't match:

Cross-document reasoning: An LLM can identify that Document A provides background context while Document B directly answers the question, even if both score similarly on surface-level relevance. Cross-encoders see one document at a time and can't make these comparisons.

Flexible ranking criteria: You can adapt ranking logic in natural language without retraining models. "Prioritize recent sources," "prefer academic papers over blog posts," or "rank by completeness of answer" become prompt modifications rather than model architecture changes.

Deduplication and complementarity: An LLM can recognize when two highly-ranked documents are near-duplicates and demote one, or identify when documents complement each other and should appear together.

The tradeoffs are severe:

  • Latency: At k=50 with 500-token documents, you're sending 25,000+ tokens per query. Even optimized LLM inference takes 500ms-2s for this workload. For user-facing search, this is often unacceptable.

  • Cost: 25k input tokens at $5.00/1M tokens = $0.125 per query, or $125 per 1,000 queries before outputs (the arithmetic is sketched after this list). This works for high-value, low-QPS scenarios but becomes prohibitive at scale.

  • Context window constraints: You're limited by the model's context length. For k=100 or documents with citations/metadata, you may exceed even 128k-token windows.
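To sanity-check the cost and context-window numbers above, a back-of-the-envelope helper; the $5.00/1M price and the 128k window are the illustrative figures from the bullets, not any specific model's pricing.

```python
def listwise_budget(k: int, tokens_per_doc: int, price_per_million_input: float,
                    context_window: int = 128_000, prompt_overhead: int = 0) -> dict:
    """Rough input-side estimate; output tokens and provider-specific pricing excluded."""
    input_tokens = k * tokens_per_doc + prompt_overhead
    return {
        "input_tokens": input_tokens,
        "cost_per_query_usd": input_tokens / 1_000_000 * price_per_million_input,
        "fits_context_window": input_tokens <= context_window,
    }

# k=50 docs of 500 tokens at $5.00 per 1M input tokens:
# -> 25,000 tokens, $0.125 per query (~$125 per 1,000 queries), well inside 128k.
print(listwise_budget(k=50, tokens_per_doc=500, price_per_million_input=5.00))
```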

A Hybrid Strategy That Actually Works

In production, we've found success with a staged approach:

  1. Initial retrieval: Use a fast embedding model to pull top-200 candidates (sub-100ms, effectively free)

  2. First-stage reranking: Apply a specialized cross-encoder to narrow to top-20 (5-10ms, $0.50/1k queries)

  3. LLM listwise reranking: Use an LLM to produce final ordering of top-10 (200-500ms, $5-10/1k queries depending on document length)

This gives you the best of both worlds: the cross-document reasoning of LLM reranking where it matters most (the final results the user sees), while keeping costs and latency manageable by limiting the LLM's workload.
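A sketch of how the three stages compose. The `retrieve`, `cross_encode`, and `llm_listwise` callables are stand-ins for whatever embedding model, cross-encoder, and LLM reranker you run (the `listwise_rerank` sketch above would slot into the last role); the cutoffs mirror the staged numbers listed above.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]       # (query, k) -> candidate docs
Reranker = Callable[[str, list[str]], list[str]]  # (query, docs) -> reordered docs

def hybrid_rerank(query: str, retrieve: Retriever,
                  cross_encode: Reranker, llm_listwise: Reranker) -> list[str]:
    candidates = retrieve(query, 200)                 # stage 1: fast embedding retrieval, top-200
    shortlist = cross_encode(query, candidates)[:20]  # stage 2: specialized cross-encoder, top-20
    top10 = llm_listwise(query, shortlist[:10])       # stage 3: LLM orders the results the user sees
    return top10 + shortlist[10:]                     # keep the cross-encoder tail as backfill
```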

Cost optimization techniques that can make LLM reranking more viable:

  • Prompt caching: If your queries share common structure, cache the instruction portion to reduce effective token counts by 30-50%

  • Document summarization: Compress each document to 100-200 tokens before sending to the reranker, especially if you're ranking on relevance rather than completeness (see the sketch after this list)

  • Batch processing: For offline pipelines or async workflows, batch multiple queries together to amortize overhead

  • Smaller fine-tuned models: Fine-tune a 7B model specifically for your domain's reranking task. We've seen 8x cost reduction with comparable quality to GPT-4 on domain-specific corpora.
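As an example of the summarization idea from the list above: compress candidates once, ideally at index time, so the listwise prompt carries far fewer input tokens per query. The prompt, the `call_llm` wrapper, and the word-count fallback are all illustrative assumptions.

```python
SUMMARIZE_PROMPT = (
    "Summarize the following passage in roughly 150 tokens, keeping only the facts "
    "needed to judge its relevance to a search query:\n\n{document}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's client."""
    raise NotImplementedError

def compress_document(document: str, max_words: int = 150, use_llm: bool = True) -> str:
    if use_llm:
        return call_llm(SUMMARIZE_PROMPT.format(document=document))
    # Cheap fallback: naive word truncation (a rough proxy, not a real tokenizer).
    return " ".join(document.split()[:max_words])

def compress_candidates(candidates: list[str]) -> list[str]:
    # Run this at indexing time where possible, so the cost isn't paid per query.
    return [compress_document(d) for d in candidates]
```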

Benchmark Results: Setting Expectations

We evaluated Gemini Flash against BGE-reranker-v2 and Cohere Rerank across 17 datasets spanning e-commerce, legal documents, technical documentation, and news articles.

Pointwise results (NDCG@10):

  • BGE-reranker-v2: 0.74 average, 12ms median latency

  • Gemini Flash: 0.68 average, 185ms median latency

  • Cost per 1k queries: $2 (specialized reranker) vs $27 (Gemini Flash)

Listwise results (NDCG@10, top-20 candidates):

  • BGE-reranker-v2: 0.74 average

  • Gemini Flash: 0.78 average, 420ms median latency

  • Cost per 1k queries: $2 (specialized reranker) vs $18 (Gemini Flash)

The listwise improvement is real but modest, and comes at 9x the cost and 35x the latency.
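For readers less familiar with the metric: NDCG@10 rewards placing highly relevant documents near the top of the list, applies a logarithmic position discount, and normalizes against the ideal ordering. A minimal reference implementation:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k over graded relevance labels, in the order a reranker returned the documents."""
    def dcg(rels: list[float]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2]))  # ~0.96
```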

When to Use What

Use a specialized reranker (BGE, Cohere, jina-reranker) when:

  • You need low latency (<50ms) and high throughput

  • Your budget is constrained and you're processing high query volumes

  • Your ranking criteria are stable and can be captured in training data

  • You need consistent, calibrated scores for downstream filtering

Use LLM listwise reranking when:

  • You're in a low-QPS, high-value domain (legal research, medical literature review, compliance)

  • Cross-document reasoning materially improves result quality

  • You need flexible ranking criteria that change frequently

  • You can afford 500ms-2s latency and $10-100 per 1,000 queries

  • You're implementing a hybrid pipeline where the LLM only reranks top-10 results

Avoid LLM pointwise reranking unless:

  • You specifically need natural language explanations for each relevance score

  • You're in research/experimentation mode and cost doesn't matter
