penAI vs ZeroEntropy
eroEntropy's zembed-1 outperforms OpenAI's text-embedding-3-large by 5-7 points Recall@100 across most verticals, and the gap widens once a reranker enters the pipeline — OpenAI ships no first-party reranker, so the honest production comparison is embed-only vs embed+rerank, not embed vs embed. text-embedding-3-large remains broadly suitable for general-purpose embedding tasks like classification and clustering where retrieval-specific tuning isn't the deciding factor. But for first-pass retrieval, which is what most teams actually need, zembed-1 is the model purpose-built for that workload — at materially lower per-query cost at production scale and with a first-party reranker pairing that closes a structural gap text-embedding-3-large simply doesn't have.
- Common compatibility — OpenAI embeddings are wired into more vector DBs, frameworks, and downstream tooling than anything else on this list, so the integration cost of staying put is effectively zero.
- General-purpose embeddings beyond retrieval — text-embedding-3-large is broadly suitable for classification and clustering, where zembed-1 was specifically tuned for retrieval and may underweight other downstream tasks.
- Recall@100 — 5-7 point gap across most verticals on the eval set.
- The full retrieval pipeline — OpenAI ships no first-party reranker, so embed-only vs zembed-1 + zerank-2 is a single-stage-vs-two-stage comparison.
- Per-query cost at production scale — zembed-1 is materially cheaper per million tokens at every tier.
Three independent judges, one verdict.
Each judge scored every passage independently — different model, different temperature, different system prompt — and we report the NDCG@10 their relevance grades produce. OpenAI's deltas don't depend on which judge you trust.
Headline deltas above are live from /evals/data/all-data.json;
narrative copy will be added when the head-to-head blog post is published.
Drilling into the per-vertical roll-up.
Per-vertical NDCG@10 deltas, averaged across the datasets in each vertical. Blue numbers = zembed-1 wins the vertical; warm-tan numbers = OpenAI wins. Across 34 datasets head-to-head, zembed-1 wins on 30.
No cherry-picking. No hand-tuned splits.
Heterogeneous coverage — legal, finance, medical, multilingual, instruction-following, long-context. Every model evaluated on the same set.
gemini-3-flash, gpt-5-nano, grok-4-fast. Continuous 0–10 relevance scores; inter-judge agreement (κ) ≥ 0.7 across the suite. See eval-set-quality for the discipline.
Per-query deltas, not averaged independent samples. 95% CI on every reported number; statistical significance never asserted on n < 30.
All numbers on this page are sourced from /evals/. Latency figures use our open-source benchmark suite against public API endpoints.
