- We open-sourced AutoOptimize, a live arena where 9 AI agents race to approximate a mystery function — each team searching 1,468 ArXiv papers with a different embedding model
- Same LLM, same corpus, same tools, same iteration budget — the only variable is the embedding model
- The zembed-1 team finishes with the best score of the three, ahead of the OpenAI and Cohere teams — a gap that grows, not shrinks, over 64 iterations
- In agentic workflows, embedding quality compounds: better retrieval surfaces better techniques, which unlock better solutions, which get refined further — each iteration amplifies the gap
- Clone the repo, add your API keys, watch the race: github.com/zeroentropy-ai/auto-optimize
Your Embedding Model Decides What Your Agent Can Think
Most evaluations of embedding models measure retrieval in isolation: given a query, how many relevant documents land in the top 10? That matters. But it misses something important about how embeddings actually get used in production.
In agentic workflows — where an LLM searches, reasons, acts, and iterates — the embedding model isn’t just a retrieval component. It’s the agent’s perception. It determines which techniques the agent discovers, which approaches it considers, and ultimately, what solutions it can build. A few percentage points of retrieval accuracy don’t just mean a few more relevant documents. They mean entirely different trajectories of reasoning.
We wanted to measure that compounding effect directly. So we built AutoOptimize.
The Result
Over 64 iterations, the three teams diverge, and the gap compounds rather than settling. Team zembed-1 finds sharper techniques earlier, refines them further, and closes the race with a best-so-far score well ahead of the teams using OpenAI text-embedding-3-small or Cohere embed-english-v3.0.
Team zembed-1 finishes with the highest digits · √speed score, ahead of Team OpenAI and Team Cohere. Same corpus, same LLM, same 64-iteration budget.
The pattern is stable across runs: zembed-1 retrieves papers on Chebyshev node placement, barycentric interpolation, and rational approximation when those are the techniques that matter, while the baselines tend to return adjacent-but-weaker material — general interpolation theory, introductory surveys, polynomial methods. By iteration 10, zembed-1’s team is often already refining barycentric interpolation with optimal node placement; the other teams are still iterating on techniques that cannot, in principle, handle the target function’s Runge-like spike or high-frequency burst.
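To make the technique gap concrete, here is a small self-contained sketch (illustrative, not code from the repo) of the approach zembed-1's team converges on: barycentric interpolation at Chebyshev nodes, compared against the same-degree interpolant at equispaced nodes on the classic Runge function.

```python
import numpy as np
from math import comb

def barycentric(xj, w, yj):
    """Return p(x) for the polynomial interpolating (xj, yj), evaluated
    with the numerically stable barycentric formula."""
    def p(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        d = x[:, None] - xj[None, :]
        hit = np.isclose(d, 0.0)
        d[hit] = 1.0                      # avoid divide-by-zero at nodes
        out = (w * yj / d).sum(1) / (w / d).sum(1)
        mask = hit.any(1)
        out[mask] = yj[hit.argmax(1)[mask]]  # exact node values
        return out
    return p

runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)  # the classic Runge spike
n = 32
xs = np.linspace(-1, 1, 2001)

# Equispaced nodes: the interpolant oscillates wildly near the endpoints.
xe = np.linspace(-1, 1, n + 1)
we = np.array([(-1.0) ** j * comb(n, j) for j in range(n + 1)])
err_equi = np.abs(barycentric(xe, we, runge(xe))(xs) - runge(xs)).max()

# Chebyshev nodes: same degree, error drops by orders of magnitude.
xc = np.cos(np.arange(n + 1) * np.pi / n)
wc = (-1.0) ** np.arange(n + 1)
wc[0] *= 0.5
wc[-1] *= 0.5
err_cheb = np.abs(barycentric(xc, wc, runge(xc))(xs) - runge(xs)).max()
```

Same degree-32 polynomial, same function: only the node placement changes, and the equispaced error is orders of magnitude larger. This is the "cannot, in principle, handle the spike" failure mode the baseline teams keep iterating inside.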
The Setup: A Controlled Race
Three teams of AI agents compete to approximate an unknown function
| Team | Embedding Model | Dimensions | Agents |
|---|---|---|---|
| ZeroEntropy | zembed-1 | 2560 | 3 |
| OpenAI | text-embedding-3-small | 1536 | 3 |
| Cohere | embed-english-v3.0 | 1024 | 3 |
Same corpus. Same LLM (Gemini 3 Flash). Same tools. Same iteration budget. The only variable is which embedding model powers the search.
Each team runs 3 agents in parallel that share a notebook — they can see each other’s best code, scores, and findings. Every iteration, each agent decides whether to refine the current approach or search for something fundamentally different. The dashboard shows the race live, with score charts and function approximation plots updating in real time.
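One way to picture the shared notebook is as an append-only record of findings that every agent on a team can read before deciding whether to refine or explore. This sketch is illustrative only; the field names are hypothetical, not the repo's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    # All field names here are hypothetical, for illustration.
    iteration: int
    agent_id: str
    technique: str   # e.g. "barycentric interpolation"
    score: float
    code: str

@dataclass
class TeamNotebook:
    """Shared state visible to all 3 agents on a team (sketch)."""
    findings: list = field(default_factory=list)

    def record(self, finding: Finding) -> None:
        self.findings.append(finding)

    def best(self):
        # The team's best-so-far result, or None before any evaluation.
        return max(self.findings, key=lambda f: f.score, default=None)
```

An agent reading `best()` at the start of an iteration can choose to refine that code or to search the corpus for a fundamentally different technique.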
The Mystery Function
The target
```
f(x) = sin(x²·2 + 3x) · exp(-0.08x²)   // chirp — frequency sweeps with x
     + 2/(1 + 100·(x-1.7)²)            // sharp spike — Runge-like pole
     + 0.4·|sin(3x)|                   // periodic kinks — non-differentiable
     + 1.5·exp(-8·(x+3)²)·cos(25x)     // localized burst — high-freq near x=-3
     + 0.6·tanh(15·(x-3.5))            // steep step — sigmoid transition
```
Each term is chosen to defeat a different naive approach: the chirp resists fixed-frequency bases, the Runge spike breaks equispaced polynomial interpolation, the absolute-value kinks violate smoothness assumptions, the localized burst punishes uniform sampling, and the sigmoid step creates a region that any global polynomial has to "spend" degrees of freedom on.
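The target is easy to transcribe directly; here is a NumPy version of the formula above (a transcription of the post's formula, not code from the repo):

```python
import numpy as np

def f(x):
    """The mystery target, term by term."""
    x = np.asarray(x, dtype=float)
    chirp = np.sin(x**2 * 2 + 3 * x) * np.exp(-0.08 * x**2)   # frequency sweeps with x
    spike = 2 / (1 + 100 * (x - 1.7) ** 2)                    # Runge-like pole
    kinks = 0.4 * np.abs(np.sin(3 * x))                       # non-differentiable kinks
    burst = 1.5 * np.exp(-8 * (x + 3) ** 2) * np.cos(25 * x)  # high-freq burst near x=-3
    step = 0.6 * np.tanh(15 * (x - 3.5))                      # steep sigmoid transition
    return chirp + spike + kinks + burst + step
```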
Why This Is a Fair Test
It’s worth emphasizing what is not varying:
- LLM: All 9 agents use Gemini 3 Flash Preview — same model, same temperature, same max steps
- Corpus: All teams search the same 69,602 chunks from the same 1,468 papers
- Tools: Identical `searchPapers` and `evalCode` tools, same API signatures
- Budget: 64 iterations per team, 64 sample points per evaluation
- Scoring: `score = digits × √speed`, where `digits = -log₁₀(mean_absolute_error)` over 10,001 test points
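In code, the scoring rule is a one-liner. Note that the post doesn't pin down how `speed` is normalized, so the argument here is just an opaque positive number (e.g. evaluations per second):

```python
import math

def score(mean_absolute_error: float, speed: float) -> float:
    """score = digits × √speed, with digits = -log10(MAE).
    How 'speed' is normalized isn't specified in the post; any positive
    number works for illustration."""
    digits = -math.log10(mean_absolute_error)
    return digits * math.sqrt(speed)
```

For example, an approximation with MAE of 1e-3 (3 correct digits) and a speed of 4 scores 3 × 2 = 6.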
The embedding model is the only degree of freedom. If one team consistently outperforms, it’s because their retrieval surfaces better techniques — and that’s a direct measurement of embedding quality in an agentic context.
Architecture: How It Works
AutoOptimize is built with Mastra (TypeScript AI agent framework), the Vercel AI SDK, and Next.js.
Data Pipeline
1,468 ArXiv papers on numerical methods are downloaded, chunked at newline boundaries into ~1,500-character segments (69,602 chunks total), and pre-embedded with all three providers. Embeddings are stored as binary files on disk (~1.4 GB total).
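A greedy newline-boundary chunker matching that description might look like the following sketch (the repo's exact chunking rules may differ):

```python
def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    """Pack whole lines into a chunk until the next line would push it
    past max_chars, then start a new chunk. A single line longer than
    max_chars is kept whole, since splits only happen at newlines."""
    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Because splits land only at newline boundaries, concatenating the chunks reproduces the original text exactly, which keeps chunk offsets easy to map back to the source papers.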
Race Initialization
The server loads all pre-computed embeddings into memory. When the user clicks “Start Race,” 9 Mastra agents launch in parallel — 3 per team. Each agent gets a system prompt, access to searchPapers and evalCode tools, and a shared team notebook.
Agent Loop
Each iteration: the agent calls generate() with up to 5 tool-use steps. It can search papers (embed query → cosine similarity → top-5 chunks) and evaluate code (run the approximation function, measure accuracy and speed). Agents within a team share their best code, scores, and search findings via a shared notebook.
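The retrieval step in `searchPapers` reduces to a standard cosine-similarity top-k lookup over the pre-embedded chunk matrix. A minimal NumPy sketch (illustrative, not the repo's code):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_matrix, k=5):
    """Cosine-similarity retrieval: normalize, score all chunks, take top-k.
    chunk_matrix has shape (num_chunks, dim); rows are chunk embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    sims = m @ q                        # cosine similarity per chunk
    idx = np.argsort(-sims)[:k]         # indices of the k best chunks
    return idx, sims[idx]
```

With 69,602 chunks this is a single matrix-vector product per query, which is why the server can afford to keep all three providers' embeddings in memory and serve searches without an external vector database.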
Live Dashboard
The dashboard polls the server every second for score updates and every 3 seconds for function plot data. Score charts show best-so-far (solid lines) and raw iteration scores (dotted lines). Function plots overlay each team’s best approximation against ground truth.
Run It Yourself
AutoOptimize is fully open-source under Apache-2.0. Clone it, add your API keys, and watch the race live:
```bash
git clone https://github.com/zeroentropy-ai/auto-optimize
cd auto-optimize

# Download corpus data and pre-computed embeddings (~1.5 GB)
./download.sh

npm install
npm run build
NODE_OPTIONS="--max-old-space-size=8192" npx next start -p 3000 -H 0.0.0.0
```
Then open http://localhost:3000 and click “Start Race.”
The only thing you need is API keys for the three embedding providers and Gemini. The embedding scripts are idempotent — if you want to swap in a different model or re-chunk the corpus, they’ll resume from where they left off.
What This Means for Production
If your agents search a knowledge base, the embedding model isn’t a commodity input — it’s the bottleneck on what your agent can learn. A few points of retrieval accuracy, compounded over dozens of iterations, produce fundamentally different outcomes.
In agentic workflows, embedding quality doesn’t add — it multiplies.
This is why zembed-1 is built for accuracy first — the same property that puts it ahead of voyage-4 and harrier-27b on static retrieval benchmarks is the property that makes it compound through an agent loop. For the reranking side of the same thesis, see zerank-2. AutoOptimize is one way to see all of this end-to-end in a system you can run yourself.
Get Started
zembed-1 is available today through multiple deployment options:
```python
from zeroentropy import ZeroEntropy

zclient = ZeroEntropy()

response = zclient.models.embed(
    model="zembed-1",
    input_type="query",
    input="Chebyshev node placement for function approximation",
    dimensions=2560,
    encoding_format="float",
    latency="fast",
)
```

Documentation: docs.zeroentropy.dev
GitHub: github.com/zeroentropy-ai/auto-optimize
Get in touch: Discord community or contact@zeroentropy.dev
