penAI vs ZeroEntropy

/ openai-v3-large

Datasets won0 / 34zembed-1 > openai-v3-large

NDCG@10 lead+0.0 ptsabsolute, averaged over 28 datasets

Recall@100 lead+0.0 ptsfirst-pass recall, top-100

per-dataset NDCG@10 delta · bar height = margin · blue = ZE · tan = OpenAI

eroEntropy's zembed-1 outperforms OpenAI's text-embedding-3-large by 5-7 points Recall@100 across most verticals, and the gap widens once a reranker enters the pipeline — OpenAI ships no first-party reranker, so the honest production comparison is embed-only vs embed+rerank, not embed vs embed. text-embedding-3-large remains broadly suitable for general-purpose embedding tasks like classification and clustering where retrieval-specific tuning isn't the deciding factor. But for first-pass retrieval, which is what most teams actually need, zembed-1 is the model purpose-built for that workload — at materially lower per-query cost at production scale and with a first-party reranker pairing that closes a structural gap text-embedding-3-large simply doesn't have.

NDCG@10 lead +5.0 pts 0.721 vs 0.671

Recall@100 lead +5.9 pts 0.790 vs 0.731

Per-vertical Δ NDCG@10 (pts) sorted ZE-best → ZE-worst

Specialized

+12.6

Finance

+7.7

Instruction Following

+7.0

Medical

+6.3

Manufacturing

+5.4

Multilingual

+4.4

Legal

+4.3

QA & Knowledge

+1.6

Science

+1.6

OpenAI wins ← → ZE wins

Where the gap closes

Common compatibility — OpenAI embeddings are wired into more vector DBs, frameworks, and downstream tooling than anything else on this list, so the integration cost of staying put is effectively zero.
General-purpose embeddings beyond retrieval — text-embedding-3-large is broadly suitable for classification and clustering, where zembed-1 was specifically tuned for retrieval and may underweight other downstream tasks.

Where ZE wins

Recall@100 — 5-7 point gap across most verticals on the eval set.
The full retrieval pipeline — OpenAI ships no first-party reranker, so embed-only vs zembed-1 + zerank-2 is a single-stage-vs-two-stage comparison.
Per-query cost at production scale — zembed-1 is materially cheaper per million tokens at every tier.

Per-judge breakdown

Three independent judges, one verdict.

Gemini 3 Flashze leads

+0.0 pts NDCG@10

0 / 34 datasets won

zembed-1 0.728vsopenai-v3-large 0.680

GPT-5 Nanoze leads

+0.0 pts NDCG@10

0 / 34 datasets won

zembed-1 0.734vsopenai-v3-large 0.686

Grok 4 Fastze leads

+0.0 pts NDCG@10

0 / 34 datasets won

zembed-1 0.700vsopenai-v3-large 0.646

Unanimous · all 3 judges picked zembed-1

Each judge scored every passage independently — different model, different temperature, different system prompt — and we report the NDCG@10 their relevance grades produce. OpenAI's deltas don't depend on which judge you trust.

Start now Talk to an Engineer

Loading…

TL;DR · not yet authored

Headline deltas above are live from /evals/data/all-data.json; narrative copy will be added when the head-to-head blog post is published.

By vertical, by dataset

Drilling into the per-vertical roll-up.

Per-vertical NDCG@10 deltas, averaged across the datasets in each vertical. Blue numbers = zembed-1 wins the vertical; warm-tan numbers = OpenAI wins. Across 34 datasets head-to-head, zembed-1 wins on 30.

How we measure

No cherry-picking. No hand-tuned splits.

28 datasets

Heterogeneous coverage — legal, finance, medical, multilingual, instruction-following, long-context. Every model evaluated on the same set.

3 LLM judges

gemini-3-flash, gpt-5-nano, grok-4-fast. Continuous 0–10 relevance scores; inter-judge agreement (κ) ≥ 0.7 across the suite. See eval-set-quality for the discipline.

Paired bootstrap

Per-query deltas, not averaged independent samples. 95% CI on every reported number; statistical significance never asserted on n < 30.

All numbers on this page are sourced from /evals/. Latency figures use our open-source benchmark suite against public API endpoints.

See the raw eval data