Should You Use LLMs for Reranking? A Deep Dive into Pointwise, Listwise, and Cross-Encoders

Sep 5, 2025


TLDR:

  • Use cheap, high-recall retrieval first. Then use a reranker (i.e., a cross-encoder) to reorder the top 50-100 candidates. Only run expensive listwise LLM reasoning on the top few results if the use case demands it.

  • Pointwise LLM scoring is uncalibrated and slow. Listwise LLM ranking is more consistent, but is only viable for very small candidate lists due to context length and latency.

    • Pro: LLMs offer the flexibility of customizable prompt engineering.

    • Con: They output noisy, inconsistent scores, can miss relevant documents or return the wrong document ID, and can fail to assign a score to every document.

  • Dedicated reranker models output consistent, interpretable, and deterministic scores between 0 and 1 for how relevant a document is to a query.

  • The result is that LLMs are less accurate, slower, and more expensive than dedicated rerankers such as ZeroEntropy's zerank-1.

    • On 17 datasets, zerank-1 reaches about 0.78 NDCG@10, while gpt-5-mini and gpt-5-nano score around 0.70 (even when excluding format failures and timeouts from the LLM runs). zerank-1 also has a p50 latency of about 130ms and is 10-30x cheaper than the small LLMs used for reranking.

-> You might think to fine-tune an LLM on ranking tasks in order to improve results and resolve formatting issues. And you'd be right! That's exactly what a dedicated reranker model is.

At ZeroEntropy (YC W25), we spend a lot of time thinking about one deceptively simple question:

How do you ensure the right document appears at the top of search results?

If you’re building retrieval-heavy systems like RAG or AI Agents, you already know the story: keyword search (BM25) is fast but brittle, semantic embeddings are flexible but noisy. The missing ingredient is reranking: a second-pass model that takes a query and candidate documents and reorders them to maximize precision.

In this post, we’ll look at reranking through the lens of large language models (LLMs), why naïvely applying them often fails, and why specialized cross-encoders still remain the most practical tool in real-world pipelines.

Why Reranking Matters

First-stage retrieval (BM25 or vector search) maximizes recall. That ensures the correct answer is somewhere in the top-100 results. But unless it’s boosted to the top few positions, it will never reach the user or the LLM that consumes it.

That’s where reranking comes in: given a query–document pair, a reranker assigns a calibrated relevance score and reorders the candidates accordingly. Unlike embeddings, which encode each document in isolation, rerankers see both query and document at the same time. This context-awareness is what makes rerankers indispensable in modern retrieval pipelines.
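
To make this concrete, here is a minimal sketch of the second-pass pattern using the open-source sentence-transformers library and a public MS MARCO cross-encoder checkpoint (purely illustrative choices, not ZeroEntropy's API): the model scores each query–document pair jointly, and the candidates are re-sorted by that score.

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Illustrative public checkpoint; any cross-encoder reranker follows the same pattern.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I rotate an API key?"
candidates = [
    "To rotate an API key, generate a new key and revoke the old one in the dashboard.",
    "Our API supports JSON and protobuf request payloads.",
    "Billing is calculated per million input tokens.",
]

# The cross-encoder sees the query and each document together and
# returns one relevance score per pair (higher = more relevant).
scores = model.predict([(query, doc) for doc in candidates])

# Reorder candidates by descending relevance.
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```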

Why Use LLMs for Reranking

There are two primary ways to repurpose autoregressive LLMs for reranking:

  1. Pointwise reranking

    You prompt the LLM with a query and a single candidate document and ask it to output a relevance score. Do this independently for each candidate and then sort by score.


  2. Listwise reranking

    You feed the LLM the query along with the entire candidate set and ask it to reorder the list or select the most relevant items.
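
For concreteness, here is a rough sketch of what both prompting strategies look like with the OpenAI Python SDK (the model name and prompt wording are illustrative assumptions, not a recommendation):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # illustrative; any chat model is called the same way

def pointwise_score(query: str, doc: str) -> str:
    # One LLM call per (query, document) pair; candidates are then sorted by the returned score.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Rate the relevance of the document to the query on a 0-1 scale. "
                       "Reply with a number only.\n\n"
                       f"Query: {query}\n\nDocument: {doc}",
        }],
    )
    return resp.choices[0].message.content  # ideally a bare number

def listwise_rank(query: str, docs: list[str]) -> str:
    # One LLM call with every candidate in the prompt; the model returns an ordering of indices.
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Order the documents from most to least relevant to the query. "
                       "Reply with a comma-separated list of indices only.\n\n"
                       f"Query: {query}\n\nDocuments:\n{numbered}",
        }],
    )
    return resp.choices[0].message.content  # ideally something like "2, 0, 1"
```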

At first glance, both sound appealing: why not leverage the reasoning ability of LLMs for reranking? But each comes with serious drawbacks.

Why Pointwise Reranking with LLMs Fails in Practice

Pointwise reranking seems simple, but in practice it’s brittle:

  • Calibration issues: LLMs are not trained to produce scores in a fixed range like [0, 1]. You may get out-of-range values, reasoning text mixed in with numbers, or wildly inconsistent scales.

  • Slow and costly: You must run the LLM once per candidate. At 100 candidates per query, latency and cost balloon.

  • Not fine-tuned for reranking: General-purpose models (even gpt-4.1-nano or gpt-5-mini) lack the supervision needed to produce stable scores. They sometimes hallucinate explanations, add random tokens, or collapse all results into the same score.
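
In practice, even a "reply with a number only" instruction (as in the sketch above) is not reliably followed, so pointwise pipelines end up wrapped in defensive parsing. A sketch of what that typically looks like (the rescaling rules and fallback value here are arbitrary assumptions):

```python
import re

def parse_score(raw: str, fallback: float = 0.0) -> float:
    """Extract a relevance score from free-form LLM output and clamp it to [0, 1]."""
    match = re.search(r"-?\d+(\.\d+)?", raw or "")
    if match is None:
        # The model returned prose, an apology, or nothing at all.
        return fallback
    value = float(match.group())
    if value > 1.0:
        # Some responses come back on a 0-10 or 0-100 scale despite the instructions.
        value = value / 10.0 if value <= 10.0 else value / 100.0
    return min(max(value, 0.0), 1.0)
```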

In short, pointwise reranking with LLMs gives you the worst of both worlds: slow inference and unreliable outputs.

Why Listwise Reranking Makes More Sense

Listwise reranking aligns better with how humans judge relevance: by comparing candidates relative to each other. Given the full list, the LLM can reason globally and enforce consistency across rankings.

But there are trade-offs:

  • Still not trained for the task: Even listwise prompts risk random outputs—extra text, invalid indices, or reordered lists that don’t align with expectations.

  • Extremely expensive: Encoding all candidates into a single long prompt pushes token limits and multiplies latency. Remember that LLM attention costs O(n²), where n is the number of tokens.

  • Only feasible for very small candidate sets: Because of this, listwise reranking makes sense if you’re refining the top 5–10 results, not the top 100-200.
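
Listwise output needs similar validation before it can be trusted: dropping invalid or duplicate indices and re-appending anything the model forgot. A minimal sketch:

```python
import re

def parse_ranking(raw: str, num_docs: int) -> list[int]:
    """Turn an LLM's listwise reply into a valid permutation of document indices."""
    order: list[int] = []
    for token in re.findall(r"\d+", raw or ""):
        idx = int(token)
        if 0 <= idx < num_docs and idx not in order:
            order.append(idx)  # keep valid, previously unseen indices only
    # Re-append any documents the model dropped so nothing silently disappears.
    order.extend(i for i in range(num_docs) if i not in order)
    return order
```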

Accuracy Evaluation Across 17 Benchmarks

We evaluated multiple reranking models across MTEB, BEIR, MS MARCO, and domain-specific datasets. Below are highlights comparing ZeroEntropy’s reranker against listwise LLM-based baselines, where we limited the context to 50k tokens.

NDCG@10 (higher is better):

| Model | Finance | Legal | Code | Biomed | GeneralQA | Summarization |
| --- | --- | --- | --- | --- | --- | --- |
| openai_small | 0.83113 ± 0.01375 | 0.80430 ± 0.01106 | 0.74589 ± 0.02607 | 0.75608 ± 0.01480 | 0.48821 ± 0.01977 | 0.06687 ± 0.01163 |
| cohere | 0.77950 ± 0.01600 | 0.86830 ± 0.00960 | 0.90520 ± 0.01578 | 0.87319 ± 0.01092 | 0.60654 ± 0.02073 | 0.38622 ± 0.02703 |
| gpt-4o-mini | 0.85127 ± 0.01331 | 0.83868 ± 0.01058 | 0.86173 ± 0.02079 | 0.87285 ± 0.01217 | 0.54064 ± 0.02107 | 0.07144 ± 0.01258 |
| gpt-4.1-mini | 0.87215 ± 0.01369 | 0.85756 ± 0.01070 | 0.95349 ± 0.01490 | 0.90664 ± 0.01383 | 0.63403 ± 0.04691 | 0.07305 ± 0.01291 |
| gpt-5-mini | 0.84039 ± 0.01406 | 0.85962 ± 0.01015 | 0.96830 ± 0.00985 | 0.90755 ± 0.01055 | 0.57998 ± 0.02329 | 0.07002 ± 0.01227 |
| gpt-5-nano | 0.83590 ± 0.01551 | 0.82983 ± 0.01159 | 0.90037 ± 0.02018 | 0.89926 ± 0.01449 | 0.72253 ± 0.04160 | 0.07129 ± 0.01262 |
| zeroentropy-large | 0.85130 ± 0.01415 | 0.89705 ± 0.00851 | 0.97054 ± 0.00736 | 0.93283 ± 0.00963 | 0.84780 ± 0.01320 | 0.34115 ± 0.02575 |

Average NDCG@10 across all datasets:

| Model | Average NDCG@10 |
| --- | --- |
| OpenAI text-embedding-small (baseline) | 0.61753 |
| Cohere rerank-3.5 | 0.71939 |
| gpt-4o-mini | 0.66319 |
| gpt-4.1-mini | 0.71312 |
| gpt-5-mini | 0.69800 |
| gpt-5-nano | 0.71163 |
| zerank-1 | 0.77667 |

The pattern is consistent: while LLMs occasionally spike on narrow tasks, cross-encoders trained for reranking outperform or match them across the board, with far lower latency and cost.
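
For reference, NDCG@10 rewards placing relevant documents as close to the top of the list as possible. A minimal sketch of the computation, assuming binary relevance labels for simplicity:

```python
import math

def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    # Discounted cumulative gain: each relevance is discounted by log2 of its position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    # Normalize by the DCG of the ideal (perfectly sorted) ordering.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the single relevant document was ranked third.
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```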

Latency and Cost Evaluation Across 17 Benchmarks

Here is a recap of p50 latency and price per million tokens for these models. Because they run at query time, the latency budget can be extremely tight: every extra millisecond counts. And when you rely on third-party LLM APIs, you also risk unpredictable delays, with random spikes of several seconds that can break the user experience.

| Model | Cost per one million input tokens | Multiplier compared to zerank-1 |
| --- | --- | --- |
| Cohere rerank-3.5 | $0.050 | 2x |
| gpt-4o-mini | $0.60 | 24x |
| gpt-4.1-mini | $0.80 | 32x |
| gpt-5-mini | $0.250 | 10x |
| gpt-5-nano | $0.050 | 2x |
| zerank-1 | $0.025 | 1x |

| Model | p50 latency for 75KB input | Multiplier compared to zerank-1 |
| --- | --- | --- |
| Cohere rerank-3.5 | 198.1ms | 1.5x |
| gpt-4o-mini | 1090ms | 8.5x |
| gpt-4.1-mini | 740ms | 5.5x |
| gpt-5-mini | 2180ms | 17x |
| gpt-5-nano | 1520ms | ~12x |
| zerank-1 | 129.7ms | 1x |

The General Wisdom: Expensive Last, Cheap First

The lesson is clear:

Always apply the most expensive techniques to the smallest number of candidates.

That’s why production systems layer retrieval like this:

  1. Cheap first-stage retrieval (BM25, embeddings, or hybrid) to maximize recall.

  2. Specialized cross-encoder reranker to boost precision in the top 50–100.

  3. (Optional) Apply an expensive listwise LLM-based reasoning step only on the top handful of candidates, if the use case demands it.
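
A minimal sketch of that layering as a single function, with each stage passed in as a callable so it stays agnostic to the specific retrieval stack (all names and default cutoffs here are placeholders):

```python
from typing import Callable, Optional

def retrieve_and_rerank(
    query: str,
    first_stage: Callable[[str, int], list[str]],        # BM25, embeddings, or a hybrid of both
    rerank_score: Callable[[str, str], float],           # cross-encoder: (query, doc) -> relevance score
    llm_listwise: Optional[Callable[[str, list[str]], list[str]]] = None,  # optional final LLM pass
    k_recall: int = 100,
    k_rerank: int = 50,
    k_final: int = 5,
) -> list[str]:
    # 1. Cheap first-stage retrieval to maximize recall.
    candidates = first_stage(query, k_recall)

    # 2. Cross-encoder reranking of the top candidates to boost precision.
    scored = sorted(
        ((doc, rerank_score(query, doc)) for doc in candidates[:k_rerank]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    top_docs = [doc for doc, _ in scored[:k_final]]

    # 3. (Optional) expensive listwise LLM reasoning on only the top handful.
    return llm_listwise(query, top_docs) if llm_listwise is not None else top_docs
```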

Cross-encoders like our zerank-1 and zerank-1-small are trained specifically for this task: given a query and document, output a calibrated relevance score. They deliver consistency, efficiency, and accuracy that general-purpose LLMs cannot match.

Score Calibration

The problem. In production, a reranker’s score must be comparable across queries: a score like 0.82 should mean the same thing no matter the query, prompt, or document length. With pointwise LLM prompts, it does not. In our studies, pointwise prompting produced score drift across prompts and candidate sets, and per-query normalizations erased the signal.

What zerank-1 does. We stop asking for absolute numbers. We collect pairwise preferences, train a small comparator to predict those cheaply, then turn many head-to-heads into a single global scale with an Elo-style model. We learn a per-query bias from cross-query matchups to handle barren versus rich queries, and fit one temperature on a held-out set so probabilities line up with observed win rates.
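
To illustrate the idea (a toy sketch of the general Elo-plus-temperature recipe, not zerank-1's actual training code): pairwise preferences are folded into one global rating per document, and a single temperature maps ratings to probabilities.

```python
import math

def elo_ratings(num_docs: int, preferences: list[tuple[int, int]],
                k: float = 0.1, epochs: int = 50) -> list[float]:
    """Turn pairwise preferences (winner_idx, loser_idx) into one global scale via Elo-style updates."""
    ratings = [0.0] * num_docs
    for _ in range(epochs):
        for winner, loser in preferences:
            # Probability the current ratings assign to the observed winner.
            expected_win = 1.0 / (1.0 + math.exp(ratings[loser] - ratings[winner]))
            ratings[winner] += k * (1.0 - expected_win)
            ratings[loser] -= k * (1.0 - expected_win)
    return ratings

def calibrated_probability(rating: float, temperature: float = 1.0) -> float:
    # One temperature, fit on a held-out set, maps ratings to probabilities that track observed win rates.
    return 1.0 / (1.0 + math.exp(-rating / temperature))
```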

Result and impact. Stable, comparable probabilities across queries. One threshold to tune, clean blending with BM25 and embeddings, and rankings that stay consistent across releases. Lower calibration error with equal or better NDCG.

Takeaways

  • Pointwise reranking with LLMs is impractical: uncalibrated, slow, and unreliable.

  • Listwise reranking is conceptually stronger but only viable for tiny candidate sets, due to cost and fragility.

  • Specialized rerankers like zerank-1 deliver calibrated, consistent, and scalable results—and should be your go-to tool for reranking.

  • Reserve expensive LLM reasoning for the very top-k documents where absolute precision matters most.

At ZeroEntropy, our mission is to push retrieval systems toward both higher accuracy and real-world deployability. Our latest reranker models (Apache 2.0 licensed) are now available on HuggingFace and via our API—try them out today.
