Large language models are expensive to run — and the cost scales directly with how much text you put in the context window. In many real-world pipelines, the bottleneck isn't the LLM's reasoning ability; it's the sheer volume of context you have to provide before the LLM can reason at all.
This is where ZeroEntropy's zerank-2 reranker opens up a pattern that goes well beyond traditional search: using a reranker as a calibrated binary classifier to decide, page by page or chunk by chunk, what actually belongs in your LLM's context.
What Is zerank-2?
zerank-2 is ZeroEntropy's state-of-the-art multilingual cross-encoder reranker. Cross-encoders differ from embedding models in a fundamental way: rather than independently encoding a query and a document into vectors and comparing them, a cross-encoder reads the query and the document together and outputs a single relevance score. This joint attention makes cross-encoders substantially more accurate — at the cost of being slower for large-scale retrieval, which is why they are typically applied as a second-stage reranker on a shortlist of candidates.
zerank-2 pushes the state of the art on several axes:
Instruction-following: The model accepts a natural-language instruction alongside the query, letting you inject domain context, terminology, or custom ranking criteria. A healthcare query for "acute kidney injury" can be told to treat "AKI" as a synonym, or to prioritize lab values over clinical notes.
Calibrated scores: This is the key property for the use case in this post. The model is trained so that a score of 0.8 means approximately 80% relevance — consistently, across query types and domains. The score is not just a relative ranking signal; it carries absolute probabilistic meaning.
Multilingual: Trained across 100+ languages with near-English performance, including challenging scripts and code-switching queries.
Fast and cheap: At $0.025 per 1M tokens — half the price of Cohere Rerank 3.5 — and with p50 latency around 130ms, it fits comfortably into production pipelines.
The Core Insight: A Reranker Score Is a Relevance Probability
Most people use rerankers to sort a list. But zerank-2's calibrated scores unlock a different usage pattern: thresholding.
For any query-document pair, the model produces a score in [0, 1]. Because of the calibration guarantee, you can interpret this score as a probability: "how likely is it that this chunk is relevant to this query?" That turns the reranker into a binary classifier.
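In code, the classifier is nothing more than a comparison against a cutoff. A minimal sketch (the `is_relevant` helper and the 0.5 cutoff are illustrative, not part of the ZeroEntropy API):

```python
def is_relevant(score: float, threshold: float = 0.5) -> bool:
    """Treat a calibrated zerank-2 score as P(relevant) and threshold it."""
    return score >= threshold

# Scores near 1.0 read as "almost certainly relevant"; scores near 0.0 get discarded.
kept = [s for s in [0.82, 0.18, 0.55, 0.31] if is_relevant(s)]
```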
This is significantly cheaper than asking an LLM to classify relevance, or feeding it hundreds of thousands of tokens on every pass. A classification call to GPT-5 on a 2,000-token chunk costs roughly 80-100x more than a zerank-2 call on the same chunk. More importantly, zerank-2 can score hundreds of chunks in parallel with sub-200ms latency, while LLM classification calls are far slower end-to-end and return token-sampled yes/no answers rather than calibrated probabilities.
Use Case: Context Compression for Long Clinical Documents
To make this concrete, consider the problem ZeroEntropy tackled with a leading healthcare company that automates clinical review of prior authorization requests.
The setup: A prior authorization request is a lengthy document — often 100–200 pages of clinical notes, lab results, imaging reports, and physician letters. A clinical reviewer must assess whether the patient meets a set of coverage criteria, each expressed as a structured question:
"Is there documentation of a failed trial of first-line therapy?"
"Does the patient have a confirmed diagnosis of moderate-to-severe disease?"
"Are there contraindications documented for alternative treatments?"
There may be 50–100 such criteria per review, and the relevant evidence for each criterion is scattered across just a few pages of the full document.
The naive approach: Send the entire document to an LLM for each criterion. A 150-page document at ~2,000 characters per page is ~300,000 characters of context — multiplied across 80 criteria, that is 24 million characters fed to the LLM per case. At scale this is both slow and prohibitively expensive.
The zerank-2 approach: Score every page of the document against each criterion question. Keep only the pages that score above a threshold (or the top-K pages). Send only those to the LLM.
```python
from zeroentropy import AsyncZeroEntropy

zclient = AsyncZeroEntropy()

async def score_pages_for_criterion(
    pages: list[str],
    criterion_question: str,
    batch_size: int = 10,
) -> list[float]:
    """Score each page's relevance to a clinical criterion."""
    all_scores: list[float] = [0.0] * len(pages)
    for i in range(0, len(pages), batch_size):
        batch = pages[i:i + batch_size]
        response = await zclient.models.rerank(
            model="zerank-2",
            query=criterion_question,
            documents=batch,
        )
        for result in response.results:
            all_scores[i + result.index] = result.relevance_score
    return all_scores

def select_relevant_pages(
    pages: list[str],
    scores: list[float],
    threshold: float = 0.4,
) -> list[str]:
    """Return only pages that exceed the relevance threshold."""
    return [page for page, score in zip(pages, scores) if score >= threshold]
```
The result is a compressed context containing only the pages with evidence relevant to that specific criterion — typically 3–10 pages instead of 150. The LLM then reasons over a manageable, high-signal context.
The Recall vs. Context Size Tradeoff
The key question is: how much context do you need to preserve before you've captured essentially all the relevant evidence?
The chart below shows Recall@K (fraction of ground-truth relevant pages captured) against total characters seen, measured across a dataset of real prior authorization documents with human-annotated ground truth.
The curve has a characteristic shape: recall rises steeply at first, then flattens as you include more pages. In practice, the top 10–20 pages by zerank score capture the vast majority of ground-truth relevant pages across all criteria — often 90%+ recall at less than 15% of the total document characters.
This is the fundamental compression win: you can discard 85%+ of document characters while preserving 90%+ of the relevant content.
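If you have ground-truth page labels, the recall-vs-characters curve is straightforward to compute. A sketch, assuming `scores` are per-page zerank scores, `page_chars` are page lengths, and `relevant_idx` is the set of annotated relevant page indices:

```python
def recall_vs_chars(
    scores: list[float], page_chars: list[int], relevant_idx: set[int]
) -> list[tuple[int, float]]:
    """Recall over ground-truth pages vs. characters kept, as K grows."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, chars, hits = [], 0, 0
    for i in order:  # take pages in descending score order
        chars += page_chars[i]
        hits += i in relevant_idx
        points.append((chars, hits / len(relevant_idx)))
    return points
```

Plotting these points for a corpus of documents reproduces the curve shape above: steep early gains, then a long flat tail.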
Setting the Threshold
Choosing a threshold is a one-time calibration step, and zerank-2's calibrated scores make it interpretable rather than arbitrary.
Option 1: Fixed threshold based on semantics
Because zerank-2's scores are calibrated probabilities, you can pick a threshold with direct semantic meaning:
| Threshold | Meaning |
| --- | --- |
| 0.2 | Include anything with 20%+ chance of relevance (high recall, lower precision) |
| 0.4 | Balanced — good default for most RAG pipelines |
| 0.6 | Include only clearly relevant content (high precision, some recall loss) |
| 0.8 | Very conservative — near-certain relevance only |
Start at 0.4 for most applications and adjust based on downstream LLM performance.
Option 2: Calibrate on a labeled validation set
If you have ground-truth labels, you can directly optimize the threshold against recall:
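One way to do this is a coarse threshold sweep. A minimal sketch, assuming `scores` and binary ground-truth `labels` aligned per page (the recall target is illustrative):

```python
def calibrate_threshold(
    scores: list[float], labels: list[int], target_recall: float = 0.95
) -> float:
    """Highest threshold (0.01-step sweep) that still meets the target recall."""
    total = sum(labels)
    best = 0.0
    for t in [i / 100 for i in range(101)]:
        kept = sum(l for s, l in zip(scores, labels) if s >= t)
        if kept / total >= target_recall:
            best = t  # keep raising the cutoff while recall still holds
        else:
            break
    return best
```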
This gives you a principled threshold tied to a specific recall guarantee — for example, "include all content needed to answer 95% of criteria correctly."
Option 3: Top-K instead of threshold
If your downstream pipeline has a hard context budget (e.g. a 32K token limit), use top-K selection rather than a threshold:
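Top-K selection is a sort-and-slice. A sketch (the default K is illustrative):

```python
def select_top_k(pages: list[str], scores: list[float], k: int = 10) -> list[str]:
    """Keep the K highest-scoring pages, preserving original document order."""
    top = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]
    return [pages[i] for i in sorted(top)]
```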
Top-K and threshold approaches can be combined: take up to K pages, but only those above a minimum threshold score, ensuring you don't pad context with low-relevance material when K pages aren't needed.
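The combined rule (at most K pages, each clearing a minimum floor) is a one-line extension of top-K; the K and floor values here are illustrative:

```python
def select_top_k_min(
    pages: list[str], scores: list[float], k: int = 10, floor: float = 0.2
) -> list[str]:
    """Up to K pages, skipping anything below the minimum score floor."""
    top = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]
    return [pages[i] for i in sorted(top) if scores[i] >= floor]
```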
Beyond Context Compression: General Binary Classification
The context compression use case is specific to RAG, but the underlying pattern — zerank-2 as a binary classifier — generalizes broadly.
Document routing: In a multi-stage pipeline, use zerank to decide which documents warrant expensive downstream processing. Score each document against a query or policy description; route only those above 0.5 to an LLM for detailed analysis.
Duplicate / near-duplicate detection: Frame it as a relevance query: "Is this document essentially the same as the reference?" A high score flags near-duplicates.
Content moderation / policy compliance: Score content against a policy description query. "Does this text contain instructions that could cause harm?" with a low threshold catches borderline cases for human review.
Multi-label classification: Run one zerank call per label/class. Each call is cheap; the calibrated scores give you a probability per class that you can threshold independently.
```python
async def classify_document(
    document: str,
    class_descriptions: dict[str, str],
    threshold: float = 0.5,
) -> dict[str, bool]:
    """
    Classify a document against multiple categories using zerank-2.
    Each category is described as a natural-language query.
    """
    results = {}
    for class_name, description in class_descriptions.items():
        response = await zclient.models.rerank(
            model="zerank-2",
            query=description,
            documents=[document],
        )
        score = response.results[0].relevance_score
        results[class_name] = score >= threshold
    return results

# Example: clinical document triage
categories = {
    "contains_lab_results": "laboratory test results, blood work, or diagnostic measurements",
    "contains_imaging": "radiology report, MRI, CT scan, X-ray, or imaging findings",
    "contains_diagnosis": "diagnosis, clinical assessment, or documented medical condition",
    "contains_medication": "medication, prescription, dosage, or drug therapy",
}
labels = await classify_document(clinical_note, categories, threshold=0.4)
```
The key advantage over an LLM classifier: zerank-2 can score all categories in parallel (each call is independent), requires no prompt engineering for output formatting, and produces calibrated probabilities rather than token-sampled yes/no answers. For a document with 10 categories, a zerank-2 classification is roughly 50–100x cheaper than an equivalent GPT-5 classification.
Putting It Together
The pattern zerank-2 enables in these pipelines is simple but powerful:
Express your classification task as a relevance query. What would a relevant chunk look like? Describe it in natural language.
Score your candidates. Run zerank-2 over all chunks, pages, or documents. This is fast and cheap.
Apply a threshold. Use the calibrated score to make a binary keep/discard decision. The score's probabilistic interpretation makes threshold selection principled.
Send only what matters to the LLM. The LLM sees a high-signal, compressed context and reasons more accurately — at a fraction of the cost.
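Stripped of the API plumbing, the four steps above fit in a few lines. A sketch where `scores` stands in for per-chunk zerank-2 outputs and the chunk texts are made up:

```python
def compress_context(chunks: list[str], scores: list[float], threshold: float = 0.4) -> str:
    """Steps 1-3: keep only chunks whose calibrated score clears the cutoff."""
    kept = [c for c, s in zip(chunks, scores) if s >= threshold]
    return "\n\n".join(kept)

chunks = ["relevant evidence...", "boilerplate...", "more evidence..."]
scores = [0.87, 0.12, 0.61]  # hypothetical calibrated zerank-2 scores
context = compress_context(chunks, scores)  # step 4: send `context` to the LLM
```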
zerank-2 is not a replacement for LLMs. It is a fast, cheap filter that makes LLMs more effective by ensuring they spend their compute on content that actually matters.
Getting Started
ZeroEntropy offers a free 2-week trial with 1,000 queries.
```python
from zeroentropy import AsyncZeroEntropy

client = AsyncZeroEntropy()  # uses ZEROENTROPY_API_KEY env var

response = await client.models.rerank(
    model="zerank-2",
    query="your query or criterion here",
    documents=["chunk 1 text", "chunk 2 text", "..."],
)
for result in response.results:
    print(f"Document {result.index}: score={result.relevance_score:.3f}")
```
zerank-2 is also available as an open-weight model on HuggingFace for self-hosted deployments, with SOC 2 Type II and HIPAA-ready cloud options for regulated industries.
ZeroEntropy builds search and retrieval infrastructure for production AI pipelines.