- zembed-1 leads MSMARCO with 0.946 NDCG@10 across all 16 models tested, with consistent quality across languages
- Over 50% non-English training data — multilingualism is foundational, not bolted on
- Cross-lingual retrieval out of the box: query in one language, retrieve documents in another
- Single model deployment for all languages — no translation pipelines, no language-specific models
- Averages 0.5561 NDCG@10 across all domain benchmarks, +10.1% over nearest competitor
The Best Multilingual Embedding Model
Most embedding models were built for English and then internationalized as an afterthought. They were trained primarily on English text, fine-tuned on some multilingual data, and shipped with claims of multilingual support that hold up reasonably well for Western European languages and collapse under the weight of anything more demanding.
zembed-1 by ZeroEntropy takes the opposite approach. Multilingualism is not a feature bolted on after the fact — it is foundational to the model’s design, with over half of all training data in languages other than English.
The result is the most capable multilingual embedding model available in 2026.
The Problem with “Multilingual” Embedding Models
When developers build multilingual AI systems, they quickly discover that most models claiming multilingual support have a dirty secret: their cross-lingual retrieval quality degrades significantly for non-English queries. A search query in Japanese, Arabic, or Swahili retrieves English documents well enough, but retrieval within non-English corpora, or across language pairs that don’t include English, often falls apart.
The underlying reason is training data imbalance. Most embedding models were trained on datasets where English comprises 80-90% of the corpus. The model learns excellent English semantic representations and mediocre representations for everything else. Cross-lingual alignment is approximate rather than precise.
- A customer support system in Japan retrieves the wrong FAQ entries for Japanese queries
- A legal AI in Germany misses relevant precedents because the query and document phrasings don’t align properly in German
- A clinical system in Brazil struggles with Portuguese medical terminology
- A multilingual RAG pipeline underperforms whenever users ask questions in anything other than English
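The imbalance described above can be pictured directly in embedding space: a well-aligned model places a sentence and its translation close together, while an English-centric model drifts or scatters the non-English side. A minimal sketch using synthetic vectors as stand-ins for real embeddings (the drift magnitudes are illustrative, not measured from any actual model):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
en = rng.normal(size=512)  # stand-in for an English sentence embedding

# A well-aligned model maps the translation to a nearby vector;
# an English-centric model maps it somewhere essentially unrelated.
aligned_ja = en + 0.1 * rng.normal(size=512)  # small cross-lingual drift
misaligned_ja = rng.normal(size=512)          # no meaningful alignment

print(f"aligned pair:    {cosine(en, aligned_ja):.3f}")    # near 1.0
print(f"misaligned pair: {cosine(en, misaligned_ja):.3f}")  # near 0.0
```

Running the same probe with real translation pairs against a candidate model is a quick way to check whether its multilingual claims hold for your languages.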
zembed-1 was built to fix this.
How zembed-1 Achieves True Multilingual Parity
50%+ Non-English Training Data
More than half of all training data used to distill zembed-1 is in non-English languages — a deliberate design decision to ensure non-English users get the same quality of semantic retrieval as English speakers.
The model covers all major world languages, with particular attention to high-stakes multilingual deployment scenarios across European, Asian, Middle Eastern, and Latin American language families.
zELO: Relevance Calibration Across Languages
zembed-1's relevance targets come from ZeroEntropy's zELO process, which assigns Elo-style relevance scores to query-document pairs and applies the same calibration in every language. In practice, this means a query in Arabic retrieves relevant Arabic documents with the same accuracy as an English query retrieves English documents — not a degraded, approximate version of that accuracy.
Cross-Lingual Alignment Out of the Box
zembed-1 is designed for cross-lingual retrieval — the ability to match a query in one language to a relevant document in another. Enterprise systems frequently need this: a German-speaking analyst searching a database of English documents, or an English-language chatbot retrieving content from a Spanish knowledge base.
ZeroEntropy trained zembed-1 with “well-calibrated cross-lingual query-document pairs,” meaning the model’s Elo-trained relevance scores are aligned across language pairs. A relevant document is ranked as relevant whether the query and document share a language or not.
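ZeroEntropy has not published the full zELO recipe, but the core idea of Elo-style relevance calibration can be sketched from pairwise preferences: each document carries a rating, and every "A is more relevant than B for this query" judgment nudges both ratings on a single shared scale, regardless of which languages the documents are in. A hypothetical illustration (the K-factor, ratings, and judgments are invented for the example, not zembed-1's actual training data):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update: the preferred document gains rating, the other loses it."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

# Hypothetical pairwise judgments for one query, mixing languages freely:
# because ratings live on one scale, a German document and a Japanese
# document become directly comparable once calibrated.
ratings = {"doc_de": 1500.0, "doc_ja": 1500.0, "doc_en": 1500.0}
judgments = [("doc_ja", "doc_en"), ("doc_ja", "doc_de"), ("doc_de", "doc_en")]

for winner, loser in judgments:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

for doc, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{doc}: {rating:.1f}")
```

The point of the sketch is the cross-lingual property: nothing in the update rule cares what language a document is in, which is what makes the resulting relevance scores comparable across language pairs.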
Performance: The Multilingual Benchmark Picture
zembed-1 leads the MSMARCO benchmark — the standard information retrieval benchmark and the closest proxy to real RAG workloads — with a score of 0.946 NDCG@10 across all 16 models tested. This top position holds across the multilingual dimensions of the evaluation, with zembed-1 delivering the same Elo-trained relevance judgement whether the query is in English, Japanese, Arabic, or any other major language.
On domain-specific benchmarks that include multilingual test sets, zembed-1 achieves:
| Domain | zembed-1 NDCG@10 |
|---|---|
| Finance (multilingual corpus) | 0.4476 |
| Healthcare (multilingual corpus) | 0.6260 |
| Legal (multilingual corpus) | 0.6723 |
| Conversational | 0.5385 |
| Average (all domains) | 0.5561 |
The model’s nearest competitor, voyage-4-nano, averages 0.5050 across the same benchmarks, leaving zembed-1 ahead by +10.1% (0.5561 / 0.5050 ≈ 1.101).
Real-World Multilingual Use Cases
Global Customer Support
Power multilingual knowledge base retrieval so that customer queries in any language retrieve the most relevant support articles, regardless of whether they exist in translation. zembed-1’s cross-lingual capabilities mean you don’t need a separate model or translation pipeline for each language you support.
International Legal and Compliance
Retrieve regulatory documents across jurisdictions — EU GDPR guidance in German, French, and Italian; financial regulations in Japanese; labor law in Spanish — with consistent retrieval quality across all languages.
Multinational Enterprise Search
Organizations with operations across multiple countries accumulate documents in many languages. zembed-1 enables unified search across these polyglot corpora without language-specific indexes or translation overhead.
Multilingual RAG Applications
Build retrieval-augmented generation systems that serve users in their native language while drawing on knowledge bases that may be partially or entirely in other languages. zembed-1 handles the cross-lingual matching transparently.
E-Commerce Product Search
Enable customers to search product catalogs in their native language, retrieving relevant products whether the product descriptions are in the same language or not.
Academic and Scientific Research
Research is increasingly international. zembed-1 can search across papers in English, German, French, Chinese, Japanese, and other major scientific publishing languages simultaneously.
Practical Advantages Over Multilingual Alternatives
No translation pipeline needed: zembed-1’s cross-lingual alignment means you don’t need to translate queries or documents before embedding — eliminating both latency and translation errors.
Single model deployment: One model for all your languages, simplifying your infrastructure compared to language-specific model deployments.
Self-hostable for data sovereignty: The open-weight HuggingFace model can be deployed within your own infrastructure — important for organizations with data residency requirements in specific jurisdictions.
Flexible compression: zembed-1’s binary quantization compresses vectors to under 128 bytes, making large multilingual corpora tractable from an infrastructure standpoint.
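The arithmetic behind that footprint: binarizing an embedding keeps one sign bit per dimension, so a Matryoshka-truncated 640-dimensional vector packs into 640 / 8 = 80 bytes. A sketch of this two-step compression with NumPy (the dimension options match the API's documented values; the sign-bit scheme shown is the generic approach to binary quantization, not necessarily zembed-1's exact recipe):

```python
import numpy as np

def compress(embedding: np.ndarray, dims: int = 640) -> bytes:
    """Truncate to `dims` dimensions, then keep one sign bit per dimension."""
    truncated = embedding[:dims]
    bits = (truncated > 0).astype(np.uint8)  # 1 bit of information per dim
    return np.packbits(bits).tobytes()       # 8 dims packed into each byte

full = np.random.default_rng(0).normal(size=2560).astype(np.float32)

print(f"float32, 2560 dims: {full.nbytes} bytes")          # 10240 bytes
print(f"binary,   640 dims: {len(compress(full))} bytes")  # 80 bytes
```

Binary vectors are compared with Hamming distance (XOR plus popcount), which is also far cheaper than float dot products at search time — a 128x storage reduction that compounds quickly over a multi-million-document multilingual corpus.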
A cross-lingual retrieval example with the open-weight model via sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "zeroentropy/zembed-1",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)

# Cross-lingual retrieval: Japanese query, documents in English, Japanese, and French
query_embeddings = model.encode_query(
    "金利リスクの管理方法について教えてください"  # "Please explain how to manage interest rate risk"
)
document_embeddings = model.encode_document([
    "Interest rate risk is managed through duration matching and derivative hedging strategies...",
    "金利リスクは、デュレーションマッチングとデリバティブヘッジ戦略によって管理されます...",  # Japanese
    "Le risque de taux d'intérêt est géré par l'appariement de durée et les stratégies de couverture...",  # French
])

similarities = model.similarity(query_embeddings, document_embeddings)
# zembed-1 correctly identifies and ranks relevant documents across all three languages
```

The same property lets you serve every language from a single index:

```python
# Build ONE multilingual index — no language detection, no routing, no translation
all_documents = {
    "en": load_english_docs(),
    "ja": load_japanese_docs(),
    "ar": load_arabic_docs(),
    "de": load_german_docs(),
    "fr": load_french_docs(),
}

# Flatten and tag
corpus = []
metadata = []
for lang, docs in all_documents.items():
    for doc in docs:
        corpus.append(doc["text"])
        metadata.append({"lang": lang, "id": doc["id"]})

# Single embedding pass — zembed-1 handles all languages uniformly
corpus_embeddings = model.encode_document(corpus, batch_size=64, show_progress_bar=True)

# Query in any language — retrieves across all languages
for query_text in [
    "金利リスクの管理",               # Japanese
    "إدارة مخاطر أسعار الفائدة",      # Arabic
    "interest rate risk management",  # English
]:
    q_emb = model.encode_query(query_text)
    scores = model.similarity(q_emb, corpus_embeddings)[0]
    top_idx = scores.argsort(descending=True)[:5]
    print(f"\nQuery ({query_text[:30]}...):")
    for i in top_idx:
        print(f"  [{metadata[i]['lang']}] Score: {scores[i]:.4f} | {corpus[i][:80]}...")
```

What Global AI Teams Are Saying
“We support 14 languages in our customer-facing AI. zembed-1 is the only model where the quality doesn’t visibly degrade when customers write to us in Arabic or Turkish.” — Head of AI Product, customer support company
The Bottom Line
The world does not speak English. AI systems that pretend otherwise leave the majority of the world’s population with degraded experiences and inferior results. zembed-1 was built from the ground up with multilingualism as a first-class concern — more than half its training data is non-English — and it shows in the benchmarks and in production deployments.
If you’re building AI systems that need to work in multiple languages, or work well for non-English-speaking users, zembed-1 is the only embedding model that takes multilingual performance as seriously as you do.
Get Started
zembed-1 is available today through multiple deployment options:
```python
from zeroentropy import ZeroEntropy

zclient = ZeroEntropy()

response = zclient.models.embed(
    model="zembed-1",
    input_type="query",        # "query" or "document"
    input="What is retrieval augmented generation?",  # string or list[str]
    dimensions=2560,           # optional: must be one of [2560, 1280, 640, 320, 160, 80, 40]
    encoding_format="float",   # "float" or "base64"
    latency="fast",            # "fast" or "slow"; omit for auto
)
```

Documentation: docs.zeroentropy.dev
HuggingFace: huggingface.co/zeroentropy
Get in touch: Discord community or contact@zeroentropy.dev
Talk to us if you need a custom deployment, volume pricing, or want to see how zembed-1 + zerank-2 performs on your data.
