"Contextual" Embedding Models Sure Seem to Struggle A Lot With Context

Oct 1, 2025


TL;DR

  • Embedding models were designed to solve the problem of polysemy (e.g., "bank" as "financial institution" vs. "riverbank" depending on context) and other contextual meaning problems, thus improving information retrieval, search, and other tasks.

  • Yet they still fail on many of the problems defining contextual meaning: fine-grained polysemy, compositional reasoning, statement negation/qualification, and intuiting user intent.

  • Even state-of-the-art embedding models still underperform on fine-grained polysemy, and display no meaningful improvement in compositional reasoning, negation/qualification, and user intent.

  • These failures are especially costly for AI agents, which can't self-correct the way humans can and which are highly sensitive to the quality of provided context (to the point that retrieval often proves the bottleneck for such systems).

  • LLM-based rerankers like zerank-1 significantly improve retrieval quality on high-stakes, nuanced queries, addressing all of the above issues.


Some Context on the Limits of Embedding Models

Contextual embedding models were initially designed to solve the problem of polysemy: understanding that the word "bank" means different things in the phrases "river bank" and "Chase Bank". In theory, vector embedding models use surrounding context to disambiguate meaning and capture nuanced semantic relationships. And while we're certainly far beyond the halcyon days of machine translation (which famously rendered 'hydraulic ram' as 'water sheep'), even SOTA embedding models in 2025 continue to struggle with what we will call fine-grained polysemy: cases where the same terminology carries subtly different technical meanings across adjacent domains, creating enough conceptual overlap to confuse embeddings but enough technical distinction to matter critically. "Attention mechanism", for example, refers to learned weight distributions over input sequences in transformer models, but to selective neural processing of stimuli in cognitive neuroscience; both involve selective focus and both process information, yet understanding one doesn't help you implement or understand the other.


Figure 1: Contextual embedding models can handle basic polysemy, but consistently struggle with fine-grained polysemy across related disciplines

Beyond polysemy, embedding models also fail on:

  • Compositional Structure ("utilizing method A for task B and method X for task Y" vs "utilizing method X for task B and method A for task Y")

  • Negation and Qualification ("effective treatment for depression" vs "not effective treatment for depression despite therapy non-respondence")

  • Inferring User Intent ("why my neural network won't converge" interpreted as a request for practical troubleshooting docs for PyTorch or TensorFlow vs. an academic paper on the theory of gradient descent and model convergence properties)

These failures all share a common thread: embeddings compress language into fixed vectors that capture semantic similarity but lose the precise logical relationships that determine whether a document actually answers a query. While that may be good enough for some tasks, it is problematic in information retrieval generally and agentic AI data pipelines especially.
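To make the mechanism concrete, below is a minimal sketch of how a bi-encoder scores one of these minimal pairs. The model and exact texts are illustrative stand-ins rather than our benchmark setup, but the pattern is representative: because the two documents share nearly all their tokens, the negation barely moves the cosine similarity.

```python
# Minimal sketch of the failure mechanism: score a query against a
# "minimal pair" of documents with a bi-encoder and compare cosine
# similarities. Model choice and texts are illustrative.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder works here

query = "effective treatment for depression"
docs = [
    "The trial found the drug to be an effective treatment for depression.",
    "The trial found the drug was not an effective treatment for depression, "
    "despite non-respondence to prior therapy.",
]

q_emb = model.encode(query, convert_to_tensor=True)
d_embs = model.encode(docs, convert_to_tensor=True)

for doc, score in zip(docs, cos_sim(q_emb, d_embs)[0]):
    print(f"{float(score):.3f}  {doc[:60]}...")
# Both documents share nearly all their tokens, so their vectors land close
# together; the "not" barely moves the similarity score.
```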

Note: Recent innovations in contextual embeddings — including Morris & Rush's Contextual Document Embeddings (CDE) and Voyage's *voyage-context-3* — help capture document-level context and corpus-specific term distributions, showing some improvements on domain-specific polysemy. However, empirical testing reveals they still underperform LLM-based rerankers on this task, while also continuing to fail on compositional reasoning, negation handling, and intent disambiguation. Contextualization teaches embeddings what terms mean in a corpus; it doesn't teach them to reason about what users actually want.

Why It Matters

So there are some semantic tasks on which embedding models (or at least cosine similarities on embeddings) fail. Why does it actually matter?

As our blog post on rerankers makes clear, most modern search is a simple combination of keyword search (BM25/TF-IDF) and cosine similarity on vector embeddings — and this means that for any of the aforementioned failure cases, users will struggle to access the information they actually need, increasing frustration and decreasing effectiveness.
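For reference, a hedged sketch of that standard hybrid stack is shown below. The library choices (rank_bm25, sentence-transformers) and the fusion weight are illustrative, but note that neither scoring path has any way to represent negation, composition, or intent.

```python
# Sketch of the "standard" hybrid search stack: BM25 keyword scores fused
# with cosine similarity over embeddings. Libraries and fusion weight are
# illustrative choices.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "Optimizing transformer attention for NLP workloads.",
    "Transformer design standards for power grid substations.",
    "Nonlinear programming methods for grid optimization.",
]
query = "optimizing transformers for NLP"

# Keyword side: BM25 over whitespace-tokenized text
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = np.array(bm25.get_scores(query.lower().split()))

# Vector side: cosine similarity over normalized bi-encoder embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = model.encode(corpus, normalize_embeddings=True)
q_emb = model.encode(query, normalize_embeddings=True)
cos_scores = doc_embs @ q_emb

# Naive linear fusion; production systems typically use RRF or tuned weights
alpha = 0.5
fused = alpha * (bm25_scores / (bm25_scores.max() + 1e-9)) + (1 - alpha) * cos_scores
print([corpus[i] for i in np.argsort(-fused)])
```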

More recently, the rise in agentic AI solutions has made the centrality of this problem even clearer. The inability to properly retrieve context to feed to LLMs at runtime has led users to either:

  • Force-feed huge quantities of mostly irrelevant information to agents — in the hope that at least some of the documents needed to perform the task are included — or

  • Accept that their AI agent — given insufficient, irrelevant, or simply downright misleading context — will simply fail to solve any given problem a frustratingly large portion of the time.

In particular, enterprise and academic use cases demand precision at a different level entirely. When a biotech researcher searches for "CRISPR applications in germline editing" versus "CRISPR limitations in somatic cells," or when a legal team needs documents discussing "liability under contract breach" rather than "liability excluding contract disputes," the difference between relevant and irrelevant results is not a minor point of ontological dispute; it constitutes a serious business problem which saps human effectiveness and which can render AI agents nearly useless.

The following sections break down four key failure modes where even state-of-the-art embedding models fall short, and where more sophisticated retrieval methods become essential.

In particular, LLM-based rerankers are currently the only viable automated method of solving the problems of fine-grained polysemy, compositional reasoning, grammatical qualification, and user intent awareness in information retrieval and search.

Minimal Examples of Model Failures in All Four Categories

We created a "minimal pair" (or minimal quintuplet, as it were) example for each of these four failure modes, to illustrate both the query types that traditional search fails on and the ubiquity and seeming innocence of such queries. While embedding models can trivially be made to fail on highly convoluted queries, the following sections show that they also fail on surprisingly straightforward requests.

Polysemy

As mentioned above, polysemy is the very problem that contextual embedding models were meant to solve, and on simpler versions of this problem they do represent a huge improvement over prior methods. It is only when a keyword (or a set of keywords) is used across similar technical disciplines to refer to similar yet functionally distinct things that contextual embedding models fail.


Figure 2: SOTA Embedding Models, even the contextualized chunk embedding voyage-context-3, fail to perfectly discern the meaning of "transformers" and "NLP" ("natural language processing" and "nonlinear programming") in different contexts

In the above graphic, we see that every SOTA embedding model (OpenAI, Cohere, Voyage) fails on this quite trivial example about optimizing transformers for NLP. Both OpenAI's text-embedding-3-large and Cohere's embed-v4.0 rate document B, a document on power grid transformer design (!), higher than a document actually relevant to ML and NLP. Voyage's contextualized chunk model correctly identifies document A as the most relevant, but still gives the edge to the document on power grid transformers and nonlinear programming ("NLP", as it were) over a partially relevant document on vision transformer models.

What this demonstrates is that rather than genuinely disambiguating "transformer" between ML and non-ML contexts, the embedding models appear to be keying on prose style (that of academic paper abstracts) and the presence or absence of a few keywords.
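As a rough illustration, the comparison behind Figure 2 can be reproduced along the following lines with OpenAI's embeddings API. The document snippets here are abbreviated placeholders; the full test texts live in the zcookbook repo.

```python
# Sketch of reproducing the Figure 2 comparison with OpenAI's embeddings API.
# Document snippets are abbreviated placeholders for the full test texts.
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

query = "optimizing transformer architectures for NLP"
docs = {
    "A (ML/NLP)": "We study attention optimizations for transformer language models...",
    "B (power grid)": "We present design optimizations for substation transformers "
                      "using nonlinear programming (NLP)...",
    "C (vision)": "Vision transformers for large-scale image classification...",
}

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

q = embed([query])[0]
D = embed(list(docs.values()))
sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
for name, sim in sorted(zip(docs, sims), key=lambda x: -x[1]):
    print(f"{sim:.3f}  {name}")
```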

And this is hardly limited to this specific example of "transformers and nlp". It is extremely common for researchers in academic fields, as well as for professionals in private organizations, to develop terminology with nuance specific to that group and to that discipline or company. Indeed, in order for end users to be able to query for domain-specific problems in ways natural to them, an immediate and intuitive understanding of these subtle domain-specific differences for polysemous disambiguation is critical.

Put simply, if you can think of any time that you searched for a keyword intending one thing but retrieved a deluge of information on something else entirely, that is a time when an AI agent would grind to a screeching halt.

Compositional Reasoning


Another common and well-understood issue with contextual embedding models is their inability to handle compositional reasoning. At best, they learn that the simultaneous presence of a few keywords implies a specific semantic relationship that should be reflected in the structure of the vector, but this is imprecise. In human language, the presence of a single "not", or "mostly", or "despite this, it is still the case that" can completely change the meaning of a document. Embedding models are often unable to distinguish this degree of grammatical fineness even on very broad tasks (like binary classification).


Above we present another minimal example where all SOTA embeddings fail. The exact text of the misleading E_wrong_connection is given below:

Machine learning models for drug design are enhanced through integration with protein structural databases. Our approach uses convolutional neural networks to analyze chemical compound libraries, predicting ADMET properties for potential oncological therapeutics. We maintain an extensive database of experimentally-determined protein folding structures from the Protein Data Bank, focusing on cancer-related targets like p53 and EGFR. However, these analyses operate independently - the neural networks select drug candidates based purely on pharmacokinetic predictions, while molecular docking simulations against known protein structures serve only as post-hoc validation. The machine learning component does not predict protein folding to guide drug design

In layman's terms, the problem is that while there is machine learning here, and it is being applied to cancer drug research (oncology), it is not predicting protein folding. Rather, it is being used to predict the general pharmacokinetic properties of substances (is it absorbed into the body, how is it metabolized, is it toxic, etc.).

To a researcher wanting to read the literature in the run-up to designing their own experiment, or to understand the current state of ML protein folding prediction, the incorrect documents retrieved here are not much better than random.
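This suggests a simple way to test your own stack: treat each minimal pair (or quintuplet) as a pass/fail case, where a retriever passes only if it ranks the genuinely relevant document above every near-miss distractor. A sketch of such a harness follows; the scoring callback and the variable names in the usage comment are placeholders for whatever you are testing.

```python
# Small evaluation harness for minimal-pair retrieval tests: a retriever
# "passes" a case only if it ranks the gold document above every distractor.
# `score_fn` is whatever you are testing -- cosine similarity over
# embeddings, a hybrid pipeline, or a reranker.
from typing import Callable, Sequence

def passes_minimal_pair(
    query: str,
    gold_doc: str,
    distractors: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], Sequence[float]],
) -> bool:
    docs = [gold_doc, *distractors]
    scores = score_fn(query, docs)
    # The gold document must strictly outrank every distractor
    return all(scores[0] > s for s in scores[1:])

# Hypothetical usage (variable names are placeholders):
# ok = passes_minimal_pair(
#     "machine learning for protein folding prediction in cancer drug design",
#     gold_doc=relevant_abstract,
#     distractors=[e_wrong_connection, e_unrelated],
#     score_fn=my_embedding_cosine_scores,
# )
```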

Negatives and Statement Qualification


Users naturally structure queries with conditional, qualified, and recursive statements to filter information precisely. We see this commonly with SQL queries or messages making use of formal logic, and its importance is such that many teams build two-stage retrieval mechanisms: one stage for normal search, the other for tag-based conditional filtering.

For example, lawyers searching for case law relevant to their client's lawsuit will often use a variety of chained boolean operations:

(discriminat! harass! "hostile work environment" retaliat!) /p 
(terminat! discharg! fir! "adverse employment action") /s 
(race racial color ethnic! national origin) AND 
(pretext! "but for" motivat! animus) /p 
("mixed motive" "legitimate reason" "business justification") AND 
DA(aft 2015) & CO(circuit /3 court) % 
(summary judgment "motion to dismiss") AND 
("McDonnell Douglas" "burden shifting" "prima facie")

While this syntax is certainly formalized, the underlying principle — chaining logical qualifiers to narrow results — appears throughout natural language queries. And, importantly, entire classes of users (such as lawyers and legal researchers) are accustomed to searching in this manner, and often express frustration when seeing this precise approach fail on alternate search engines (such as those using more traditional hybrid search).

LLM-based rerankers, able to directly understand language, are, unsurprisingly, not confused.
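To make that concrete, here is a hedged sketch of cross-encoder reranking, where the model reads the query and each document jointly and therefore actually sees qualifiers like "not" or "excluding". The open-source model id below is a generic stand-in rather than zerank-1 itself, which is served through ZeroEntropy's API (see the docs for the exact client call).

```python
# Sketch of cross-encoder reranking: the model reads query and document
# jointly, so qualifiers like "not" or "excluding" actually reach it.
# The model id below is a generic open-source stand-in, not zerank-1.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "cases finding liability under contract breach"
docs = [
    "The court found liability under breach of contract...",
    "The court addressed liability excluding contract disputes...",
]

# Score each (query, document) pair jointly, then sort by relevance
scores = reranker.predict([(query, doc) for doc in docs])
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```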


User Intent


For the specific task of query-document search/reranking, the implicit notion of user intent comes to the fore. A query may, in the abstract, be related to a document (for example, "how to fix my pytorch model that won't converge" may justifiably belong in a similar part of vector space to an academic document detailing theoretical convergence properties for certain ML architectures) and yet be of no use whatsoever to the user. Intuiting this user intent is not something that can be done without an internal world model, which necessitates intelligence and LLM-based solutions.


Even on simpler queries, such as "intro to quantum computing", LLM-based rerankers are able to intuit from the specific phrasing of the query what type of user is most likely asking the question, and therefore what type of document is most likely to be of use. Continuing with this example, one might speculate that the most likely user is a computer scientist with experience in canonical CS and software development. It would be unlikely, for example, for a middle school student or a particle physicist to write such a query: the former because they'd rather be on Roblox, the latter because they'd likely append "for physicists" (or similarly for engineers, mathematicians, etc.). Documents giving an overview that assumes prerequisite CS knowledge should thus be ranked most highly, whereas Quantum Computing For Babies or A Short Introduction to Quantum Computing For Physicists should reasonably be ranked lower, even though they all land in nearly identical places in embedding space.

AI Agents Are More Sensitive to Ranking Errors

Many companies are either utilizing third-party AI agents on their private data or directly developing such agentic systems for themselves and other enterprises. Unlike humans, who can quickly recognize irrelevant documents and avoid being led astray, LLMs are known to degrade rapidly on long contexts (so-called "context rot") and will of course suffer when they lack the prerequisite information needed to answer a given user query.

Combined with the fact that traditional vector or hybrid search struggles to differentiate very intricate or niche queries (which are usually the ones with the highest value to users), this means that one of the principal bottlenecks in developing truly human-level AI agents is the ability of the RAG pipeline to properly find and provide appropriate context to the agentic LLM. As such, LLM-based rerankers provide massive gains in both consistency and tail performance for most agentic systems.
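In pipeline terms, the fix is a narrow one: over-retrieve with your existing search, rerank, and pass only the survivors to the agent. A sketch of that pattern follows, where vector_search and rerank are placeholders for whatever retrieval backend and reranker client (e.g., zerank-1 via ZeroEntropy's API) your stack actually uses.

```python
# Sketch of the rerank-then-generate pattern for agentic RAG.
# `vector_search` and `rerank` are placeholders for your own retrieval
# backend and reranker client.
def build_context(query: str, vector_search, rerank,
                  candidates: int = 50, keep: int = 5) -> str:
    # 1. Cheap, high-recall retrieval: cast a wide net
    docs = vector_search(query, top_k=candidates)
    # 2. Precise reranking: a query-aware model reorders the candidates
    ranked = rerank(query, docs)  # assumed to return docs sorted by relevance
    # 3. Keep only the top few, so the agent's context stays short and clean
    return "\n\n".join(ranked[:keep])

# The agent then answers from build_context(query, ...) rather than from
# the raw top-k of the vector index.
```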


Your AI agent is only as good as the context it receives. When embeddings fail on the nuanced, high-stakes queries that matter most (queries about your company's specific domain, proprietary internal processes, or differentiating private datasets), your agent hallucinates, misunderstands, or (worse!) confidently delivers the wrong answer.


The gap between "pretty good" and "production-ready" in enterprise AI is not in the LLM or in the agentic loop; it is in the retrieval layer, and that is exactly where rerankers like zerank-1 make the difference between an AI that merely impresses in demos and one that actually works when it counts.

Try It Yourself

Want to see these failure modes in your own retrieval pipeline? The examples in this post are available as a test suite comparing OpenAI embeddings against zerank-1. Run it on your own domain-specific content to see where embedding-only approaches fall short. All test cases and code are available at: github.com/zeroentropy/zcookbook


