For years, CLMs and legaltech products relied on keyword search. BM25 indexes over contract repositories, Elasticsearch clusters for case law, exact-match filters for clause types. It worked well enough when lawyers knew the right terms to type.
That era is ending.
Legal AI products now need to answer questions like:
"Show me MSAs where limitation of liability excludes consequential damages."
"Do we have any customer contracts with a 30-day termination for convenience?"
"What is our standard position on assignment, and where have we accepted deviations?"
"Find all supplier agreements where we agreed to indemnify for IP infringement."
These are not keyword queries. They carry intent, implied definitions, and legal reasoning that surface terms alone cannot capture. The system needs to understand what the user means — and retrieve every document that satisfies it.
That is an embedding problem. And in legal, getting it wrong is not a UX issue. It is a liability.
Recall is the metric that matters in legal
In some consumer search products, precision can be the priority. Return ten great results from a million candidates and the user is satisfied. Missed results are invisible.
Legal retrieval is the opposite.
When a lawyer asks "which of our contracts contain this clause," the expected answer is all of them that do. A missed contract is not a relevance failure - it is a disclosure failure, a due diligence gap, or worse, a missed risk. The consequences of false negatives in legal are severe in ways they simply are not in e-commerce or content discovery.
This changes what you need from an embedding model.
You need a model that:
Retrieves every relevant document, not just the most obvious ones
Handles the paraphrase problem - the same clause can appear in dozens of formulations
Works across languages and jurisdictions without collapsing recall on non-English content
Scales economically across large corpora - enterprise contract repositories often run into the millions of documents
The biggest gap in most legal stacks today is not reranking precision. It is that the embedding stage is losing documents before the reranker even sees them. You can have the best reranker in the world; if the first-stage retrieval drops a key contract clause from the candidate set, nothing downstream fixes it.
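This failure mode is easy to quantify. A minimal recall@k sketch (the document IDs below are hypothetical) shows why a contract missing from the first-stage candidate set is unrecoverable downstream:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents present in the top-k retrieved set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

# Hypothetical evaluation: 4 relevant contracts, and a first-stage retriever
# that returns 100 candidates but surfaces only 3 of the 4 relevant ones.
retrieved = ["doc_17", "doc_03", "doc_88", "doc_41"] + [f"doc_{i}" for i in range(100, 196)]
relevant = {"doc_03", "doc_41", "doc_88", "doc_59"}  # doc_59 was never retrieved
print(recall_at_k(retrieved, relevant, k=100))  # 0.75
```

No reranker, however good, can restore `doc_59` to the results: it was never in the candidate set it is asked to reorder.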
Why keyword retrieval is not enough, and why this matters now
BM25 and its variants are still useful. For known-term queries - "find all contracts with Salesforce" - keyword search is reliable and fast.
The problem is legal language.
Limitation of liability clauses appear in a hundred different forms:
"In no event shall either party be liable for indirect, incidental, consequential, or punitive damages..."
"Neither party's aggregate liability shall exceed the fees paid in the prior twelve months..."
"Each party waives its right to claim consequential losses arising under or in connection with this agreement..."
All three say the same thing in legal substance. BM25 treats them as distinct documents with little overlap. An embedding model that has learned legal semantics treats them as near-identical.
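A crude way to see the gap: measure lexical overlap between the three formulations above. Jaccard similarity over word sets is a rough stand-in for what a term-matching scorer like BM25 has to work with:

```python
import re
from itertools import combinations

def tokens(text):
    """Lowercase word tokens, a rough proxy for what a lexical index matches on."""
    return set(re.findall(r"[a-z']+", text.lower()))

clauses = [
    "In no event shall either party be liable for indirect, incidental, consequential, or punitive damages",
    "Neither party's aggregate liability shall exceed the fees paid in the prior twelve months",
    "Each party waives its right to claim consequential losses arising under or in connection with this agreement",
]

def jaccard(a, b):
    return len(a & b) / len(a | b)

for (i, a), (j, b) in combinations(enumerate(clauses), 2):
    print(i, j, round(jaccard(tokens(a), tokens(b)), 2))
```

Every pair scores well under 0.2 despite near-identical legal substance; a lexical scorer has almost nothing to match on, while a legal-aware embedding model places all three close together in vector space.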
The paraphrase problem is the core challenge of legal retrieval. Contracts are adversarially drafted to say the same things in different ways. Embedding quality - specifically the ability to map semantically equivalent clauses to nearby vectors regardless of surface form - is what separates a retrieval system that works in production from one that works in demos.
What makes an embedding model well-suited to legal search
Not all embedding models perform equally on legal content. General-purpose models trained primarily on web text often underperform on legal corpora for several reasons.
Domain vocabulary. Legal English uses common words in specific technical senses ("consideration," "material breach," "time is of the essence") and has dense references to defined terms. A model that treats "consideration" as a synonym for "thought" will produce poor embeddings for contract clauses.
Long, structured documents. Contracts are not paragraphs. They are structured documents with hierarchical numbering, defined term references, and cross-clause dependencies. Chunking strategy matters, but the embedding model also needs to capture clause-level meaning rather than collapsing to document-level averages.
Multilingual content. Cross-border legal work - M&A, international supply chain, cross-jurisdiction employment - means your corpus is not English only. French, German, Spanish, and Portuguese legal content is common in many enterprise repositories. An embedding model that degrades on non-English legal text loses recall in exactly the cases where lawyers need it most.
Scale. Enterprise legal corpora grow continuously. The economics of embedding at scale matter: a difference of a few cents in cost per million tokens, multiplied over millions of documents and regular reindexing runs, is not trivial.
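As a back-of-the-envelope illustration, using the per-million-token prices cited later in this post - the corpus size, chunk length, and reindex cadence below are assumed figures, not measurements:

```python
# Illustrative assumptions: 10M chunks, ~500 tokens per chunk, 4 reindexes/year.
chunks = 10_000_000
tokens_per_chunk = 500
reindexes_per_year = 4
total_tokens = chunks * tokens_per_chunk * reindexes_per_year  # 20B tokens/year

def annual_cost_usd(price_cents_per_million_tokens):
    """Annual embedding spend given a price in cents per million tokens."""
    return total_tokens // 1_000_000 * price_cents_per_million_tokens / 100

print(annual_cost_usd(5))    # $0.05/M tokens -> $1,000/year
print(annual_cost_usd(13))   # $0.13/M tokens -> $2,600/year
```

The absolute numbers shift with chunk size and cadence, but the ratio between the two prices is what compounds across every reindex.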
Why we built zembed-1 for this
zembed-1 is our flagship embedding model. We built it to address precisely these failure modes in high-stakes retrieval.
A few things that matter for legal specifically:
Multilingual by design. zembed-1 supports 100+ languages with strong recall on non-English legal content. Cross-border legal work does not require maintaining separate embedding pipelines per language or accepting degraded recall on French or German contracts.
Optimized for retrieval recall. The model is text-focused and trained specifically for retrieval tasks, not general-purpose semantic similarity. The difference matters: a model trained to rank well on web document retrieval benchmarks will not necessarily maximize recall on contract clause retrieval. zembed-1 is built for the retrieve-everything problem, not the top-one-result problem.
Open-weight. Legal data is confidential. Client contracts, litigation materials, and privileged communications cannot be sent to a third-party API without careful legal and compliance review. zembed-1 model weights can be downloaded and self-hosted, meaning embeddings are computed entirely within your own infrastructure. OpenAI, Cohere, and Voyage do not offer this. For enterprise legal deployments with strict data governance, this is often a hard requirement rather than a preference.
Cost at scale. At $0.05 per million tokens, zembed-1 is significantly cheaper than comparable high-quality models (OpenAI text-embedding-3-large is $0.13/M, Cohere embed-v4.0 is $0.12/M). For a legal repository of 10 million document chunks, reindexed quarterly, that gap is meaningful.
Matryoshka dimension support. zembed-1 outputs 1,024-dimensional vectors by default, with Matryoshka support for reduction to smaller targets at inference time. If your vector store budget is constrained or you need to serve lower-latency approximate search on a large index, you can reduce dimensions without reembedding the corpus.
API latency. For interactive legal search - a lawyer typing a query and waiting for results - p90 API latency matters. zembed-1 delivers 115ms p90 on the hosted API, which keeps retrieval comfortably within budget for real-time assistant UX.
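Matryoshka reduction at inference time amounts to truncate-and-renormalize. A minimal NumPy sketch; the batch size and 1,024-to-256 reduction are illustrative:

```python
import numpy as np

def truncate_matryoshka(vectors, target_dim):
    """Keep the first target_dim components and re-normalize to unit length -
    the standard way Matryoshka-trained embeddings are reduced at inference."""
    reduced = vectors[:, :target_dim]
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / norms

# Illustrative: shrink a batch of 1,024-d vectors to 256-d.
rng = np.random.default_rng(0)
full = rng.normal(size=(8, 1024)).astype(np.float32)
small = truncate_matryoshka(full, 256)
print(small.shape)  # (8, 256)
```

Because the operation only reads the leading components of vectors you already have, you can store full-dimension embeddings once and serve reduced ones wherever latency or index size demands it.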
Three legal use cases where embedding recall is the critical path
1. Contract due diligence
The scenario: An M&A team needs to review 4,000 target company contracts for change of control provisions, assignment restrictions, and termination triggers before close.
The problem with keyword search: Change of control language appears as "change of control," "direct or indirect change in ownership," "acquisition," "merger," "transfer of a majority of voting securities," and dozens of other formulations depending on drafting style and vintage. BM25 misses most of them.
What high-recall embeddings unlock: A single query - "change of control or similar triggering events" - retrieves all semantically equivalent clauses regardless of surface formulation. Due diligence reviewers see the full universe, not the subset that happened to use a specific phrase. Missed provisions are the most expensive mistakes in M&A legal work. Recall is not a nice-to-have; it is the product requirement.
2. Clause analytics and playbook compliance
The scenario: A legal ops team wants to know whether the company's standard limitation of liability language is consistently enforced, or whether sales teams have accepted deviations in individual customer agreements.
The problem: Deviations are definitionally non-standard. If they used standard language, they would not be deviations. Keyword search on the standard clause finds the standard clause. It does not find the thirty agreements where the negotiated language expresses the same concept with different carve-outs or caps.
What high-recall embeddings unlock: An embedding-based search for "limitation of liability" retrieves every agreement that contains substantively equivalent language - standard or deviated. The analytics layer then compares what was retrieved to the playbook to surface deviations. This is not possible if the retrieval layer only returns agreements that match the exact playbook language. The value of the whole pipeline depends on recall at this first stage.
3. Legal research grounding for AI assistants
The scenario: A research assistant is answering "What are the enforceability requirements for non-competes in California for senior executives?"
The retrieval challenge: The relevant authorities include cases that address California Business and Professions Code 16600, cases that address at-will employment and restrictive covenants without explicitly mentioning non-competes, and secondary sources that synthesize the doctrine. A keyword search for "non-compete California enforceability" misses authorities that address the same doctrine under different terminology.
What high-recall embeddings unlock: Dense retrieval captures the full semantic space of enforceability doctrine, not just the subset that uses the same surface terms as the query. When the downstream LLM generates an answer, it grounds on a complete retrieval set rather than a partial one. The accuracy of the assistant output is directly downstream of retrieval recall. LLMs cannot reason their way to correct answers about cases they were never shown.
The full pipeline: zembed-1 + zerank-2
For most legal products, the highest-accuracy architecture is a two-stage pipeline:
1. Embed the corpus with zembed-1. Compute dense vectors for every document chunk and store them in a vector database (Milvus, Pinecone, Qdrant, pgvector, or similar).
2. At query time, retrieve broadly. Run the query against the vector index and pull back the top 50–200 candidates. Optimize this stage for recall: you want the relevant documents in this set, even at the cost of including some irrelevant ones.
3. Rerank with zerank-2. Pass the top candidates and the query to our cross-encoder reranker. zerank-2 reads each (query, document) pair jointly and produces calibrated relevance scores that reorder the list. This stage optimizes precision.
4. Send only the top K to the LLM. The LLM sees a tight, relevant context window - fewer tokens, better grounding, lower hallucination risk.
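The two stages can be sketched end to end. This is a toy illustration: the random vectors stand in for zembed-1 embeddings, and the word-overlap scorer stands in for zerank-2's cross-encoder scores:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=200):
    """Stage 1: recall-oriented dense retrieval (cosine similarity top-k)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k].tolist()

def rerank(query, docs, candidate_ids, score_fn, top_k=10):
    """Stage 2: precision-oriented reranking. score_fn stands in for a
    cross-encoder like zerank-2, scoring each (query, document) pair jointly."""
    scored = sorted(((score_fn(query, docs[i]), i) for i in candidate_ids), reverse=True)
    return [i for _, i in scored[:top_k]]

# Toy corpus: 100 random 64-d "embeddings" and trivially numbered documents.
rng = np.random.default_rng(1)
doc_vecs = rng.normal(size=(100, 64))
docs = [f"document {i}" for i in range(100)]
word_overlap = lambda q, d: len(set(q.split()) & set(d.split()))

candidates = retrieve(doc_vecs[7], doc_vecs, k=50)  # query with doc 7's own vector
final = rerank("document 7", docs, candidates, word_overlap, top_k=5)
```

In production, `retrieve` is your vector database's ANN query and `score_fn` is a call to the reranker; the structure - broad recall first, tight precision second - is the same.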
Each stage does what it is best at. zembed-1 maximizes recall: it surfaces the full candidate set. zerank-2 maximizes precision: it orders that set so the top results are actually what the user needs. You do not have to choose between them.
Data governance and self-hosting
Legal AI deployment is not just a technical decision. It is a legal and compliance decision.
Most law firms and legal departments operate under confidentiality obligations that restrict where client data can be sent. Sending contract text to a third-party embedding API requires legal review, vendor risk assessment, and often explicit client consent.
Self-hosting eliminates this problem. Because zembed-1 is open-weight, you can download the model and run inference entirely within your own VPC or on-premise infrastructure. No text leaves your environment. The embedding pipeline becomes an internal service.
For teams that prefer managed API access, we offer an EU-region endpoint (eu-api.zeroentropy.dev) for GDPR-sensitive deployments where data must stay within European infrastructure.
Getting started
zembed-1 is now generally available; an API key is available at dashboard.zeroentropy.dev. For self-hosting and fine-tuning services, contact us at support@zeroentropy.dev.
The gap between keyword retrieval and high-recall embedding-based retrieval is the biggest leverage point in most legal stacks today. Every clause the embedding stage misses is a result the rest of your pipeline cannot recover.