penAI vs ZeroEntropy

/ openai-v3-large
Datasets won0 / 34zembed-1 > openai-v3-large
NDCG@10 lead+0.0 ptsabsolute, averaged over 28 datasets
Recall@100 lead+0.0 ptsfirst-pass recall, top-100
per-dataset NDCG@10 delta · bar height = margin · blue = ZE · tan = OpenAI

eroEntropy's zembed-1 outperforms OpenAI's text-embedding-3-large by 5-7 points Recall@100 across most verticals, and the gap widens once a reranker enters the pipeline — OpenAI ships no first-party reranker, so the honest production comparison is embed-only vs embed+rerank, not embed vs embed. text-embedding-3-large remains broadly suitable for general-purpose embedding tasks like classification and clustering where retrieval-specific tuning isn't the deciding factor. But for first-pass retrieval, which is what most teams actually need, zembed-1 is the model purpose-built for that workload — at materially lower per-query cost at production scale and with a first-party reranker pairing that closes a structural gap text-embedding-3-large simply doesn't have.

NDCG@10 lead +5.0 pts 0.721 vs 0.671
Recall@100 lead +5.9 pts 0.790 vs 0.731
Per-vertical Δ NDCG@10 (pts) sorted ZE-best → ZE-worst
Specialized
+12.6
Finance
+7.7
Instruction Following
+7.0
Medical
+6.3
Manufacturing
+5.4
Multilingual
+4.4
Legal
+4.3
QA & Knowledge
+1.6
Science
+1.6
OpenAI wins ← → ZE wins
Where the gap closes
  • Common compatibility — OpenAI embeddings are wired into more vector DBs, frameworks, and downstream tooling than anything else on this list, so the integration cost of staying put is effectively zero.
  • General-purpose embeddings beyond retrieval — text-embedding-3-large is broadly suitable for classification and clustering, where zembed-1 was specifically tuned for retrieval and may underweight other downstream tasks.
Where ZE wins
  • Recall@100 — 5-7 point gap across most verticals on the eval set.
  • The full retrieval pipeline — OpenAI ships no first-party reranker, so embed-only vs zembed-1 + zerank-2 is a single-stage-vs-two-stage comparison.
  • Per-query cost at production scale — zembed-1 is materially cheaper per million tokens at every tier.
Per-judge breakdown

Three independent judges, one verdict.

Gemini 3 Flashze leads
+0.0 pts NDCG@10
0 / 34 datasets won
zembed-1 0.728vsopenai-v3-large 0.680
GPT-5 Nanoze leads
+0.0 pts NDCG@10
0 / 34 datasets won
zembed-1 0.734vsopenai-v3-large 0.686
Grok 4 Fastze leads
+0.0 pts NDCG@10
0 / 34 datasets won
zembed-1 0.700vsopenai-v3-large 0.646
Unanimous · all 3 judges picked zembed-1

Each judge scored every passage independently — different model, different temperature, different system prompt — and we report the NDCG@10 their relevance grades produce. OpenAI's deltas don't depend on which judge you trust.

TL;DR · not yet authored

Headline deltas above are live from /evals/data/all-data.json; narrative copy will be added when the head-to-head blog post is published.

By vertical, by dataset

Drilling into the per-vertical roll-up.

Per-vertical NDCG@10 deltas, averaged across the datasets in each vertical. Blue numbers = zembed-1 wins the vertical; warm-tan numbers = OpenAI wins. Across 34 datasets head-to-head, zembed-1 wins on 30.

How we measure

No cherry-picking. No hand-tuned splits.

28 datasets

Heterogeneous coverage — legal, finance, medical, multilingual, instruction-following, long-context. Every model evaluated on the same set.

3 LLM judges

gemini-3-flash, gpt-5-nano, grok-4-fast. Continuous 0–10 relevance scores; inter-judge agreement (κ) ≥ 0.7 across the suite. See eval-set-quality for the discipline.

Paired bootstrap

Per-query deltas, not averaged independent samples. 95% CI on every reported number; statistical significance never asserted on n < 30.

All numbers on this page are sourced from /evals/. Latency figures use our open-source benchmark suite against public API endpoints.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord