Smarter Context Compression for LLM Pipelines: zerank-2 as a Calibrated Classifier

Apr 1, 2026


Large language models are expensive to run — and the cost scales directly with how much text you put in the context window. In many real-world pipelines, the bottleneck isn't the LLM's reasoning ability; it's the sheer volume of context you have to provide before the LLM can reason at all.

This is where ZeroEntropy's zerank-2 reranker opens up a pattern that goes well beyond traditional search: using a reranker as a calibrated binary classifier to decide, page by page or chunk by chunk, what actually belongs in your LLM's context.


What Is zerank-2?

zerank-2 is ZeroEntropy's state-of-the-art multilingual cross-encoder reranker. Cross-encoders differ from embedding models in a fundamental way: rather than independently encoding a query and a document into vectors and comparing them, a cross-encoder reads the query and the document together and outputs a single relevance score. This joint attention makes cross-encoders substantially more accurate — at the cost of being slower for large-scale retrieval, which is why they are typically applied as a second-stage reranker on a shortlist of candidates.

zerank-2 pushes the state of the art on several axes:

  • Instruction-following: The model accepts a natural-language instruction alongside the query, letting you inject domain context, terminology, or custom ranking criteria. A healthcare query for "acute kidney injury" can be told to treat "AKI" as a synonym, or to prioritize lab values over clinical notes.

  • Calibrated scores: This is the key property for the use case in this post. The model is trained so that a score of 0.8 means approximately 80% relevance — consistently, across query types and domains. The score is not just a relative ranking signal; it carries absolute probabilistic meaning.

  • Multilingual: Trained across 100+ languages with near-English performance, including challenging scripts and code-switching queries.

  • Fast and cheap: At $0.025 per 1M tokens — half the price of Cohere Rerank 3.5 — and with p50 latency around 130ms, it fits comfortably into production pipelines.


The Core Insight: A Reranker Score Is a Relevance Probability

Most people use rerankers to sort a list. But zerank-2's calibrated scores unlock a different usage pattern: thresholding.

For any query-document pair, the model produces a score in [0, 1]. Because of the calibration guarantee, you can interpret this score as a probability: "how likely is it that this chunk is relevant to this query?" That turns the reranker into a binary classifier:

score >= threshold  →  relevant      (include in context)
score <  threshold  →  not relevant  (discard)

This is significantly cheaper than asking an LLM to classify relevance, and far cheaper than feeding hundreds of thousands of tokens through the LLM on every pass. A classification call to GPT-5 on a 2,000-token chunk costs roughly 80-100x more than a zerank-2 call on the same chunk. More importantly, zerank-2 can score hundreds of chunks in parallel with sub-200ms latency, while LLM classification is far slower end-to-end and yields token-sampled yes/no answers rather than calibrated scores.


Use Case: Context Compression for Long Clinical Documents

To make this concrete, consider the problem ZeroEntropy tackled with a leading healthcare company that automates clinical review of prior authorization requests.

The setup: A prior authorization request is a lengthy document — often 100–200 pages of clinical notes, lab results, imaging reports, and physician letters. A clinical reviewer must assess whether the patient meets a set of coverage criteria, each expressed as a structured question:

  • "Is there documentation of a failed trial of first-line therapy?"

  • "Does the patient have a confirmed diagnosis of moderate-to-severe disease?"

  • "Are there contraindications documented for alternative treatments?"

There may be 50–100 such criteria per review, and the relevant evidence for each criterion is scattered across just a few pages of the full document.

The naive approach: Send the entire document to an LLM for each criterion. A 150-page document at ~2,000 characters per page is ~300,000 characters of context — multiplied across 80 criteria, that is 24 million characters fed to the LLM per case. At scale this is both slow and prohibitively expensive.
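The arithmetic above is easy to sanity-check:

```python
# Back-of-the-envelope context volume for the naive approach,
# using the numbers from the text.
pages = 150
chars_per_page = 2_000
criteria = 80

chars_per_criterion = pages * chars_per_page   # characters sent per criterion
total_chars = chars_per_criterion * criteria   # characters sent per case

print(f"{chars_per_criterion:,} chars per criterion, {total_chars:,} chars per case")
```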

The zerank-2 approach: Score every page of the document against each criterion question. Keep only the pages that score above a threshold (or the top-K pages). Send only those to the LLM.

from zeroentropy import AsyncZeroEntropy

zclient = AsyncZeroEntropy()

async def score_pages_for_criterion(
    pages: list[str],
    criterion_question: str,
    batch_size: int = 10,
) -> list[float]:
    """Score each page's relevance to a clinical criterion."""
    all_scores: list[float] = [0.0] * len(pages)

    for i in range(0, len(pages), batch_size):
        batch = pages[i : i + batch_size]
        response = await zclient.models.rerank(
            model="zerank-2",
            query=criterion_question,
            documents=batch,
        )
        for result in response.results:
            all_scores[i + result.index] = result.relevance_score

    return all_scores


def select_relevant_pages(
    pages: list[str],
    scores: list[float],
    threshold: float = 0.4,
) -> list[str]:
    """Return only pages that exceed the relevance threshold."""
    return [page for page, score in zip(pages, scores) if score >= threshold]

The result is a compressed context containing only the pages with evidence relevant to that specific criterion — typically 3–10 pages instead of 150. The LLM then reasons over a manageable, high-signal context.


The Recall vs. Context Size Tradeoff

The key question is: how much context do you need to preserve before you've captured essentially all the relevant evidence?

The chart below shows Recall@K (fraction of ground-truth relevant pages captured) against total characters seen, measured across a dataset of real prior authorization documents with human-annotated ground truth.

The curve has a characteristic shape: recall rises steeply at first, then flattens as you include more pages. In practice, the top 10–20 pages by zerank score capture the vast majority of ground-truth relevant pages across all criteria — often 90%+ recall at less than 15% of the total document characters.

This is the fundamental compression win: you can discard 85%+ of document characters while preserving 90%+ of the relevant content.
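To reproduce this kind of curve on your own data, you only need per-page scores, a ground-truth set of relevant page indices, and per-page character counts. A minimal sketch (the function names `recall_at_k` and `recall_curve` are our own, and pages are assumed 0-indexed):

```python
def recall_at_k(scores: list[float], relevant: set[int], k: int) -> float:
    """Fraction of ground-truth relevant pages captured in the top-k by score."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    selected = set(ranked[:k])
    return len(selected & relevant) / len(relevant) if relevant else 1.0


def recall_curve(
    scores: list[float],
    relevant: set[int],
    page_chars: list[int],
) -> list[tuple[int, float]]:
    """(characters seen, recall) points as k grows -- the curve described above."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    points: list[tuple[int, float]] = []
    chars_seen = 0
    for k, i in enumerate(ranked, start=1):
        chars_seen += page_chars[i]
        points.append((chars_seen, recall_at_k(scores, relevant, k)))
    return points
```

Plotting recall against characters seen shows directly where the curve flattens for your corpus.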


Setting the Threshold

Choosing a threshold is a one-time calibration step, and zerank-2's calibrated scores make it interpretable rather than arbitrary.

Option 1: Fixed threshold based on semantics

Because zerank-2's scores are calibrated probabilities, you can pick a threshold with direct semantic meaning:

Threshold   Meaning
0.2         Include anything with 20%+ chance of relevance (high recall, lower precision)
0.4         Balanced; a good default for most RAG pipelines
0.6         Include only clearly relevant content (high precision, some recall loss)
0.8         Very conservative; near-certain relevance only

Start at 0.4 for most applications and adjust based on downstream LLM performance.

Option 2: Calibrate on a labeled validation set

If you have ground-truth labels, you can directly optimize the threshold against recall:

def find_threshold_for_recall_target(
    scores_per_criterion: list[list[float]],
    ground_truth_pages: list[list[int]],
    target_recall: float = 0.95,
) -> float:
    """
    Find the highest threshold that still achieves target_recall
    on a labeled validation set. Recall only decreases as the
    threshold rises, so we search from high to low.
    """
    # Ground-truth page numbers are assumed 1-indexed, hence i + 1.
    for threshold in [t / 100 for t in range(99, -1, -1)]:
        recalls = []
        for scores, gt_pages in zip(scores_per_criterion, ground_truth_pages):
            selected = {i + 1 for i, s in enumerate(scores) if s >= threshold}
            gt = set(gt_pages)
            recall = len(selected & gt) / len(gt) if gt else 1.0
            recalls.append(recall)

        avg_recall = sum(recalls) / len(recalls)
        if avg_recall >= target_recall:
            return threshold

    return 0.0

This gives you a principled threshold tied to a specific recall guarantee — for example, "include all content needed to answer 95% of criteria correctly."

Option 3: Top-K instead of threshold

If your downstream pipeline has a hard context budget (e.g. a 32K token limit), use top-K selection rather than a threshold:

def select_top_k_pages(
    pages: list[str],
    scores: list[float],
    k: int = 10,
) -> list[str]:
    ranked = sorted(enumerate(scores), key=lambda x: -x[1])
    top_indices = {i for i, _ in ranked[:k]}
    # Return in original order to preserve document flow
    return [page for i, page in enumerate(pages) if i in top_indices]

Top-K and threshold approaches can be combined: take up to K pages, but only those above a minimum threshold score, ensuring you don't pad context with low-relevance material when K pages aren't needed.
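That combination can be sketched as a small variant of the top-K selector (the name `select_top_k_above_threshold` is ours):

```python
def select_top_k_above_threshold(
    pages: list[str],
    scores: list[float],
    k: int = 10,
    min_score: float = 0.3,
) -> list[str]:
    """Take at most k pages, but only those scoring at least min_score."""
    ranked = sorted(enumerate(scores), key=lambda x: -x[1])
    top_indices = {i for i, s in ranked[:k] if s >= min_score}
    # Return in original order to preserve document flow.
    return [page for i, page in enumerate(pages) if i in top_indices]
```

If fewer than k pages clear the floor, the function simply returns fewer pages rather than padding the context.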


Beyond Context Compression: General Binary Classification

The context compression use case is specific to RAG, but the underlying pattern — zerank-2 as a binary classifier — generalizes broadly.

Document routing: In a multi-stage pipeline, use zerank to decide which documents warrant expensive downstream processing. Score each document against a query or policy description; route only those above 0.5 to an LLM for detailed analysis.

Duplicate / near-duplicate detection: Frame it as a relevance query: "Is this document essentially the same as the reference?" A high score flags near-duplicates.

Content moderation / policy compliance: Score content against a policy description query. "Does this text contain instructions that could cause harm?" with a low threshold catches borderline cases for human review.

Multi-label classification: Run one zerank call per label/class. Each call is cheap; the calibrated scores give you a probability per class that you can threshold independently.

import asyncio

async def classify_document(
    document: str,
    class_descriptions: dict[str, str],
    threshold: float = 0.5,
) -> dict[str, bool]:
    """
    Classify a document against multiple categories using zerank-2.
    Each category is described as a natural language query, and all
    categories are scored concurrently.
    """
    async def score(description: str) -> float:
        response = await zclient.models.rerank(
            model="zerank-2",
            query=description,
            documents=[document],
        )
        return response.results[0].relevance_score

    names = list(class_descriptions)
    scores = await asyncio.gather(*(score(class_descriptions[n]) for n in names))
    return {name: s >= threshold for name, s in zip(names, scores)}


# Example: clinical document triage
categories = {
    "contains_lab_results": "laboratory test results, blood work, or diagnostic measurements",
    "contains_imaging": "radiology report, MRI, CT scan, X-ray, or imaging findings",
    "contains_diagnosis": "diagnosis, clinical assessment, or documented medical condition",
    "contains_medication": "medication, prescription, dosage, or drug therapy",
}

labels = await classify_document(clinical_note, categories, threshold=0.4)

The key advantage over an LLM classifier: zerank-2 can score all categories in parallel, requires no prompt engineering for output formatting, and produces calibrated probabilities rather than token-sampled yes/no answers. For a document with 10 categories, a zerank-2 classification is roughly 50–100x cheaper than an equivalent GPT-5 classification.


Putting It Together

The pattern zerank-2 enables in these pipelines is simple but powerful:

  1. Express your classification task as a relevance query. What would a relevant chunk look like? Describe it in natural language.

  2. Score your candidates. Run zerank-2 over all chunks, pages, or documents. This is fast and cheap.

  3. Apply a threshold. Use the calibrated score to make a binary keep/discard decision. The score's probabilistic interpretation makes threshold selection principled.

  4. Send only what matters to the LLM. The LLM sees a high-signal, compressed context and reasons more accurately — at a fraction of the cost.
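The four steps above reduce to a short orchestration function. This is a sketch with injected callables: `score_fn` stands in for something like `score_pages_for_criterion` above, and `llm_answer` is a placeholder for your LLM call, not part of the ZeroEntropy API:

```python
from typing import Awaitable, Callable

async def answer_with_compressed_context(
    question: str,
    pages: list[str],
    score_fn: Callable[[list[str], str], Awaitable[list[float]]],
    llm_answer: Callable[[str, str], Awaitable[str]],
    threshold: float = 0.4,
) -> str:
    """Score every page, keep only those above threshold, then ask the LLM."""
    scores = await score_fn(pages, question)                       # step 2: score
    kept = [p for p, s in zip(pages, scores) if s >= threshold]    # step 3: threshold
    context = "\n\n".join(kept)
    return await llm_answer(question, context)                     # step 4: compressed LLM call
```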

zerank-2 is not a replacement for LLMs. It is a fast, cheap filter that makes LLMs more effective by ensuring they spend their compute on content that actually matters.


Getting Started

ZeroEntropy offers a free 2-week trial with 1,000 queries.

from zeroentropy import AsyncZeroEntropy

client = AsyncZeroEntropy()  # uses ZEROENTROPY_API_KEY env var

response = await client.models.rerank(
    model="zerank-2",
    query="your query or criterion here",
    documents=["chunk 1 text", "chunk 2 text", "..."],
)

for result in response.results:
    print(f"Document {result.index}: score={result.relevance_score:.3f}")

zerank-2 is also available as an open-weight model on HuggingFace for self-hosted deployments, with SOC 2 Type II and HIPAA-ready cloud options for regulated industries.


ZeroEntropy builds search and retrieval infrastructure for production AI pipelines.
