Also known as: matryoshka embedding, MRL, nested embeddings
TL;DR
Matryoshka representation learning trains an embedding model so that prefixes of its output vector are themselves valid embeddings — letting you truncate from 2048 to 1024 to 512 dimensions at inference time without retraining.
The premise is simple: train an embedding model so that the first 64 dimensions of its 2048-dimensional output vector form a complete, usable 64-dimensional embedding. The first 128 do too, as do the first 256, 512, and so on. By the time you read all 2048 dimensions you have the highest-quality version. Truncate at any of those nested points and you get a valid (but lower-quality) embedding that is cheaper to store and search.
It’s named after the Russian nested dolls because of the recursive structure — every truncated prefix is itself a complete doll.
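As a minimal sketch of what truncation looks like on the consumer side, the snippet below keeps a prefix of a vector and rescales it to unit length so cosine similarity still behaves. The function name `truncate_embedding` and the random stand-in vector are assumptions for illustration; a real MRL model would supply the 2048-dim output.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and rescale the prefix to unit length."""
    if dim > vec.shape[-1]:
        raise ValueError("dim exceeds embedding size")
    prefix = vec[..., :dim]
    norm = np.linalg.norm(prefix, axis=-1, keepdims=True)
    return prefix / np.clip(norm, 1e-12, None)

# Stand-in for a model output (hypothetical data, not a real embedding).
rng = np.random.default_rng(0)
full = rng.normal(size=2048).astype(np.float32)

# Every nested prefix is a usable vector in its own right.
nested = {d: truncate_embedding(full, d) for d in (64, 128, 256, 512, 2048)}
```

Renormalizing after truncation matters when the index assumes unit-length vectors, since a prefix of a unit vector is generally shorter than unit length.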
Why teams want this
Storage and retrieval cost scale with embedding dimension. At a billion documents, 2048-dim float32 = 8 TB; the same index at 256-dim int8 = 256 GB. The same model gives you both options if it was trained Matryoshka-style — pick the dimensionality at indexing time based on cost/accuracy budget.
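The arithmetic behind those figures can be checked with a few lines. This counts raw vector bytes only (decimal units, no index overhead or metadata), and the helper name `index_size_bytes` is just an illustrative choice.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_component: int) -> int:
    """Raw storage for the vectors alone: count x dimensions x bytes per component."""
    return num_vectors * dim * bytes_per_component

N = 1_000_000_000  # one billion documents

full_fp32 = index_size_bytes(N, 2048, 4)   # float32 = 4 bytes per component
small_int8 = index_size_bytes(N, 256, 1)   # int8 = 1 byte per component

print(f"{full_fp32 / 1e12:.2f} TB")        # 8.19 TB
print(f"{small_int8 / 1e9:.0f} GB")        # 256 GB
print(full_fp32 // small_int8)             # 32
```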
Where teams actually use MRL truncation
First-pass retrieval over billions of docs at 256-512 dims; rerank top candidates with full vectors
Mobile / edge inference where memory is the binding constraint
Cost-tiered tenants — free tier at 256 dims, paid at 1536, same backing model
Multi-stage cascades where a tiny dim count screens, larger dims confirm
Hybrid index designs where ANN search runs at low dim and re-scoring uses full vectors
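The first-pass-then-rescore pattern from the list above can be sketched as follows. This is an illustration under stated assumptions: brute-force dot products stand in for a real ANN index, the data is synthetic, and `two_stage_search` with its parameters is a hypothetical helper rather than any library's API.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)

def two_stage_search(query, docs, low_dim=256, shortlist_size=100, top_k=10):
    """Screen with truncated prefixes, then re-score the shortlist at full dim."""
    # Stage 1: cheap cosine similarity on the first `low_dim` components.
    coarse = normalize(docs[:, :low_dim]) @ normalize(query[:low_dim])
    shortlist_size = min(shortlist_size, docs.shape[0])
    cand = np.argpartition(-coarse, shortlist_size - 1)[:shortlist_size]
    # Stage 2: full-dimension cosine similarity on the candidates only.
    fine = normalize(docs[cand]) @ normalize(query)
    order = np.argsort(-fine)[:top_k]
    return cand[order], fine[order]

# Synthetic corpus: the query is a noisy copy of document 42.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5000, 1024)).astype(np.float32)
query = docs[42] + 0.1 * rng.normal(size=1024).astype(np.float32)

ids, scores = two_stage_search(query, docs)
```

In a production setup the stage-1 matrix product would be replaced by a lookup in an ANN index built over the truncated vectors, with the full vectors fetched only for the shortlist.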
Standard contrastive training uses one loss on the full output vector. MRL training computes the same contrastive loss on each nested prefix (for example the first 64, 128, 256, ... dimensions) and sums the results, typically with equal weight, sometimes with a weighted schedule emphasizing smaller dims. The model is now optimizing the same alignment objective at multiple granularities simultaneously, which forces the most discriminative information into the lowest-index components. By the end of training, earlier blocks of dimensions carry more of the signal than later ones: a learned, coarse analogue of PCA's variance ordering baked into the embedding head. The ordering is only enforced at the trained prefix sizes, not dimension by dimension. The cost is a small concession in the highest-dim quality (the model has to share capacity), which is exactly the tradeoff zembed-1’s approach avoids.
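To make the summed objective concrete, here is a forward-pass-only sketch in NumPy. It assumes an in-batch InfoNCE loss as the contrastive objective and a hypothetical set of prefix sizes; actual training would express the same computation in an autodiff framework, and the specific loss, temperature, and prefix sizes vary by model.

```python
import numpy as np

def info_nce(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of `queries` should match row i of `docs`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_loss(queries, docs, dims=(64, 128, 256, 512), weights=None) -> float:
    """Weighted sum of the same contrastive loss applied to each nested prefix."""
    weights = [1.0] * len(dims) if weights is None else weights
    return sum(w * info_nce(queries[:, :m], docs[:, :m]) for w, m in zip(weights, dims))

# Toy batch: each "document" is a noisy copy of its query (hypothetical data).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 512))
d = q + 0.1 * rng.normal(size=(8, 512))
loss = matryoshka_loss(q, d)
```

Passing `weights` that decrease with prefix size would correspond to the schedule that emphasizes smaller dims.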
The accuracy tradeoff
Conventional wisdom is that MRL is “lossless”: you get the same accuracy at 256 dims as you’d get from a natively 256-dim model. In practice that’s not quite true. ZeroEntropy’s evals (see the “Matryoshka is dead” blog post) found measurable accuracy loss versus a non-MRL model at the same target dim count. The loss is small but real, and it compounds as you go to very small dims.
ZeroEntropy’s alternative: client-side linear truncation
Instead of MRL, zembed-1 uses a fixed (non-trained) linear projection at inference time to truncate from 2560 to any of [2560, 1280, 640, 320, 160, 80, 40] dimensions. Because the projection is a pure linear transform applied client-side after the model output, there’s no MRL training cost and the full-dim accuracy is preserved.
The accuracy at smaller dim counts is better than what MRL achieves at the same dim count, because the underlying model wasn’t penalized for serving multiple dim heads during training.
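The text does not specify what the projection matrix looks like, so the following is only a sketch of the general shape of a fixed, non-trained, client-side projection. It assumes a seeded random matrix with orthonormal columns purely for illustration; this is not ZeroEntropy's actual matrix, and `make_projection` and `project` are hypothetical helpers.

```python
import numpy as np

ALLOWED_DIMS = (2560, 1280, 640, 320, 160, 80, 40)

def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> np.ndarray:
    """Hypothetical fixed projection: orthonormal columns from a seeded QR."""
    if out_dim not in ALLOWED_DIMS:
        raise ValueError(f"out_dim must be one of {ALLOWED_DIMS}")
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(in_dim, out_dim)))
    return q  # shape (in_dim, out_dim); the same seed always gives the same matrix

def project(vecs: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Apply the projection after the model output, then renormalize."""
    out = vecs @ proj
    return out / np.clip(np.linalg.norm(out, axis=-1, keepdims=True), 1e-12, None)

# Stand-in for full-dimension model outputs (hypothetical data).
rng = np.random.default_rng(1)
full_vectors = rng.normal(size=(4, 2560))

P = make_projection(2560, 320)
small_vectors = project(full_vectors, P)
```

Because the matrix is fixed and applied after the model, queries and documents must use the same matrix, and the model's full-dimension output is untouched.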
When MRL still makes sense
Embedding models that ship as MRL, whether open-weight checkpoints or hosted APIs such as OpenAI’s text-embedding-3-* series, are still useful: the truncation is built into the inference contract and you don’t need to know how it’s implemented. If your downstream system can only handle one dim count and you don’t run the embedder yourself, MRL is fine. If you control the inference pipeline and care about every NDCG point, an MRL-free model with explicit truncation tends to be better.
Go further
Why does linear truncation work as well as MRL?
Trained embedding signal lives on a low-dimensional manifold inside the full vector, and the [Johnson-Lindenstrauss lemma](/concepts/johnson-lindenstrauss/) bounds how much pairwise distances can distort under a random linear projection to fewer dimensions. A fixed linear projection captures the same low-rank structure without forcing the model to serve multiple dim heads during training.
Truncation and quantization compose multiplicatively — a 2048-dim float32 vector at 8 KB can become a 256-dim int8 vector at 256 bytes (32× smaller), with each step trading a small accuracy hit for a big storage win. Most billion-scale indexes use both.
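A minimal sketch of the two steps composed, assuming symmetric int8 quantization with a single per-vector scale (one common scheme among several; production indexes often use other quantizers). The helper names are hypothetical.

```python
import numpy as np

def truncate_and_quantize(vec: np.ndarray, dim: int):
    """Keep the first `dim` components, renormalize, and store as int8 plus one scale."""
    prefix = vec[:dim].astype(np.float32)              # astype returns a copy
    prefix /= max(float(np.linalg.norm(prefix)), 1e-12)
    scale = max(float(np.abs(prefix).max()), 1e-12) / 127.0
    codes = np.clip(np.round(prefix / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

# Stand-in for a 2048-dim float32 model output (hypothetical data).
rng = np.random.default_rng(0)
full = rng.normal(size=2048).astype(np.float32)

codes, scale = truncate_and_quantize(full, 256)
print(full.nbytes, codes.nbytes, full.nbytes // codes.nbytes)  # 8192 256 32
```

The per-vector scale adds a few bytes on top of the 256, which is why the 32× figure is best read as the ratio of the vector payloads.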
What's the actual production pattern with truncated embeddings?
Use the smallest dim count that survives your eval suite for first-pass retrieval, then re-score top candidates with a [cross-encoder reranker](/concepts/cross-encoder/) that doesn't care about your embedding dim count. The reranker recovers most of the precision lost to truncation.
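One way to operationalize "the smallest dim count that survives your eval suite" is a simple sweep. The sketch below assumes recall@k on a labeled query set as the eval metric and uses synthetic data; `recall_at_k`, `smallest_passing_dim`, and the threshold are hypothetical stand-ins for whatever your suite measures.

```python
import numpy as np

def recall_at_k(queries, docs, relevant, dim, k=10) -> float:
    """Fraction of queries whose labeled document appears in the top k at `dim`."""
    q = queries[:, :dim]
    d = docs[:, :dim]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    return float((topk == relevant[:, None]).any(axis=1).mean())

def smallest_passing_dim(queries, docs, relevant, dims, min_recall, k=10):
    """Return the smallest candidate dim meeting the recall threshold, else None."""
    for dim in sorted(dims):
        if recall_at_k(queries, docs, relevant, dim, k) >= min_recall:
            return dim
    return None

# Synthetic eval set: each query is a noisy copy of its relevant document.
rng = np.random.default_rng(0)
docs = rng.normal(size=(500, 256))
relevant = np.arange(50)
queries = docs[relevant] + 0.3 * rng.normal(size=(50, 256))

best = smallest_passing_dim(queries, docs, relevant, (8, 16, 32, 64, 128, 256), 0.95)
```

If a reranker sits behind retrieval, the threshold can be set on recall at the shortlist size rather than on final precision, since the reranker only needs the right candidates to be present.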