IVF Clustering

Also known as: IVF, Inverted File Index, IVF-PQ, IVFFlat

TL;DR

IVF (Inverted File Index) is the cluster-based ANN algorithm: K-means partitions the corpus into a few thousand cells, each query is matched to its nearest centroids, then exhaustively searched within only those cells.

IVF (Inverted File Index) is the cluster-based algorithm and the canonical approach. K-means partitions the corpus into nlist cells once at build time. Each vector is assigned to its nearest centroid. At query time, the query is compared to each centroid, the top nprobe cells are picked, and only the vectors in those cells are searched exhaustively. The rest of the corpus is pruned without ever being touched.

INVERTED FILE INDEXCluster the corpus, scan a few cells.dim 1dim 2qCORPUSCENTROIDQUERYNPROBETOP-KSCANNED13 / 66vectors3 / 14cellsEMBEDDING SPACE · 2D PROJECTION

How it works

The algorithm has two phases.

Build. Run K-means on the corpus to find nlist centroids. For each vector, find its nearest centroid and add it to that cell’s posting list. The result is a Voronoi partition of the embedding space: each cell is the set of points closer to its centroid than to any other.

Query. Compute the query’s distance to each of the nlist centroids. Pick the top nprobe cells. Compute the query’s distance to every vector in those cells, return the top-k overall.

If nprobe = 1, you visit only the single closest cell — fast, low recall (the true nearest neighbor often sits just over a Voronoi boundary). If nprobe = nlist, you scan everything — exact search. In between is the recall/latency curve.

Why it’s memory-efficient

Unlike , IVF stores no graph. The index is just the centroids (small) and the inverted lists pointing to vectors (essentially free — one int per vector). At billion-scale where every byte of RAM matters, this is decisive. Combined with on the vectors themselves, the entire index for a billion-vector corpus can fit in tens of gigabytes instead of terabytes.

Picking nlist and nprobe

Sizing rules
  • nlist — start at sqrt(N). For 1M vectors, nlist=1024; for 100M, nlist=16384; for 1B, nlist=65536. Too low: each cell is huge, scan dominates. Too high: the query-vs-centroid step dominates and the right cell is more likely to be just outside the top-nprobe.
  • nprobe — start at nlist/16, sweep up. nprobe=8-64 hits 95%+ recall on well-clustered corpora. The curve is sharp — going from nprobe=8 to nprobe=32 typically jumps recall from 0.85 to 0.97.
  • K-means iterations — 20-30 for large corpora. More converges further but with diminishing returns.
  • Training subset — you don’t need to run K-means on the full corpus. A few percent (millions of training vectors at billion scale) is enough.

The standard mistake: pick nlist arbitrarily, then over-tune nprobe to compensate. Recall ceiling is set by nlist; once that’s right, nprobe is the latency dial.

Where the boundaries hurt

Consider a query that lies very close to the boundary between cell A and cell B. Its true nearest neighbor might be in cell B, but the centroid distance computation places cell A first. With nprobe=1, you visit A and miss the answer entirely. This is the dominant failure mode of IVF and the reason nprobe matters.

The geometry of K-means makes this worse near cluster boundaries — vectors near a boundary belong to whichever centroid happens to be slightly closer, which is fragile. Increasing nprobe reduces this risk linearly: nprobe=8 means a vector has to be more than 8 cells deep into the wrong region for you to miss it. For most natural distributions of embeddings, the right neighbor is within the top 4-16 cells, and the recall curve plateaus accordingly.

This is why HNSW often wins on small-corpus recall: the graph traversal smoothly walks across what would be cell boundaries, where IVF has to make discrete jumps.

IVF in practice (and where it dominates)

The full production stack is usually IVF + PQ + a rescore stage: cluster-prune with IVF, score with PQ-compressed vectors, then re-rank the top few hundred with full-precision distance. Each stage trims compute by a factor and recovers a slice of the recall lost upstream.

Where IVF struggles

  • Skewed corpora. If 80% of vectors land in 5% of cells, K-means is doing a bad job and the cells you visit are huge. Hierarchical or balanced K-means helps; some systems run a multi-level IVF.
  • Very high dimensions. Beyond ~2000 dims, distance concentration hurts cell assignment quality. truncation or on a lower-dim head fixes it.
  • Frequent inserts after build. New vectors join existing cells without re-running K-means; centroid quality drifts. Periodic re-clustering matters at scale.

IVF is the default when memory binds and the corpus is large; is the default when memory is plentiful and the corpus is medium-sized.

Go further

How many cells should I use?

Rule of thumb: nlist roughly equal to the square root of the corpus size. For 1M vectors, nlist=1024 is a reasonable start; for 100M, nlist=16384. Too few cells and each cell is huge (slow scan). Too many and centroid distance computation dominates and recall drops because the right cell is more likely missed.

What does nprobe control?

nprobe is the number of cells visited per query. The recall/latency dial. nprobe=1 visits only the single closest centroid (fast, low recall). nprobe=nlist scans everything (exact, slow). Typical production values: nprobe=8-64 to hit 95%+ recall on well-distributed corpora.

Why pair IVF with PQ specifically?

IVF gives you cell-level pruning; PQ gives you per-vector compression. Combined, you avoid scanning most of the corpus and keep the surviving vectors in compressed form. IVF-PQ is the canonical billion-scale ANN configuration in FAISS — RAM footprint shrinks 32x or more vs raw float vectors, with a small rescore stage recovering recall.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord