Embedding Quantization

Also known as: int8 quantization, binary embeddings, vector compression

TL;DR

Quantization compresses each dimension of an embedding from 32-bit floats down to smaller representations — typically int8 (4× smaller) or single-bit binary (32× smaller) — to shrink index size and speed up similarity search.

32 bits → 8 bits → 1 bit1 VECTOR · 1024 DIMSFP32full precision32 bits / dim · 4 KB totalINT8rescaled int8 bits / dim · 1 KB · 4× smallerBINARY1-bit sign1 bit / dim · 128 B · 32× smallerACCURACY100%fp32100%int899%binary92%STORAGE PER VECTOR8 KB1 KB128 B64× less storage · 92% of the accuracy

A 2048-dimensional embedding in float32 is 8 KB per vector. At a billion documents that’s 8 TB just for the dense vectors. Quantization shrinks this by changing the per-dimension representation:

  • float32 — full precision, 4 bytes/dim. The default.
  • float16 / bfloat16 — half precision, 2 bytes/dim, 2× smaller. Tiny accuracy loss.
  • int8 — 8-bit integers, 1 byte/dim, 4× smaller. Small accuracy loss.
  • int4 — 4-bit integers, 0.5 bytes/dim, 8× smaller. Larger accuracy loss; viable for some models.
  • binary (1-bit) — sign of each dimension only, 1 bit/dim, 32× smaller. Significant accuracy loss but useful for first-pass filtering.

How int8 works

For each dimension, you find the min/max across the corpus and rescale that range to [-128, 127]. Comparing two int8 vectors is just an integer dot product, which is faster than float on most hardware. Modern ANN libraries (FAISS, ScaNN, HNSW) all support int8 natively.

Accuracy loss versus float32 is usually 1-3% on retrieval benchmarks for well-trained embedding models. For most production use cases that’s a fine trade for 4× index reduction.

The naive approach uses a single global min/max across the whole corpus, which wastes precision: most dimensions are tightly distributed, but one or two outlier dims with rare extreme values stretch the global range and squash everything else into a few of the 256 buckets. Per-dimension calibration gives each dim its own [min, max] mapping, so the int8 buckets are evenly spent on the values that actually occur. The loss budget stays the same; the loss is just spread sensibly. This is the difference between a 1% NDCG hit and a 5% one on the same model.

How binary works

Each dimension is collapsed to 0 or 1 (or +1/-1) based on the sign or whether it exceeds a threshold. Comparison is Hamming distance — count the bits that differ — which is extremely cheap (XOR + POPCOUNT). A billion-document index becomes 256 GB in 2048-dim binary; same in float32 is 8 TB.

But: accuracy loss is significant, often 5-15% on benchmarks. Binary is best used as a first-pass filter — Hamming-rank a million candidates fast, then re-score the top thousand with the full float32 vectors. Compound retrieval.

Quantization is a recall-vs-precision lever, not a quality lever. Binary preserves recall@1000 cheaply; the reranker on top recovers precision. The composition wins; either alone loses.

Storage at one billion 2048-dim vectors
  • float32: 8 KB per vector, 8 TB total — usually impossible to fit in RAM
  • float16: 4 KB per vector, 4 TB total — fits a sharded fleet
  • int8: 2 KB per vector, 2 TB total — common production sweet spot
  • int8 + 256-dim Matryoshka truncation: 256 B per vector, 256 GB — single-node feasible
  • binary: 256 B per vector, 256 GB — first-pass filter scale
  • binary + 256-dim truncation: 32 B per vector, 32 GB — fits in CPU L3 cache for some workloads

Combining with truncation

Quantization composes with dimension truncation ( or just slicing). A 2048-dim float32 vector at 8 KB → truncate to 256 dims at 1 KB → quantize to int8 at 256 bytes → quantize to binary at 32 bytes. Each step buys storage at some accuracy cost; the curve is well-behaved. Modern embedding models (zembed-1, Matryoshka-trained variants of OpenAI text-embedding-3 and Cohere embed-v3) expose dim truncation and quantization as per-call knobs without retraining.

Go further

Why doesn't binary quantization destroy retrieval quality?

[Orthogonality concentration](/concepts/orthogonality-concentration/) means most of the per-dimension variance in trained embeddings is sign-aligned with the meaningful signal — so collapsing to the sign bit preserves the dominant geometry. The accuracy hit is real but smaller than naive intuition suggests, and a reranker can recover the rest.

How do I pick a quantization level for production?

Run your retrieval eval at float32, int8, and binary — measure NDCG@10 and recall@100 separately, since binary often holds recall while losing precision. Use binary as a first-pass filter, then re-score the top candidates at float32 or with a reranker.

Does quantization compose with dimension truncation?

Yes, multiplicatively — 2048-dim float32 (8 KB) → 256-dim int8 (256 B) → 256-dim binary (32 B) is a 256× compression, and each step is a smooth accuracy curve rather than a cliff. Most billion-scale production indexes use both.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord