Chunking

Also known as: passage segmentation, document chunking

TL;DR

Chunking is the process of splitting long documents into smaller passages that fit cleanly inside an embedding model's context window — and that align with semantic boundaries so each chunk is independently retrievable.

Chunking splits a long document (a 50-page contract, a 200-page manual, a 30-minute meeting transcript) into pieces small enough that an embedding model can encode each one and a retrieval system can return individual chunks. The choices you make here cascade through every later stage of retrieval quality; bad chunking will ruin a state-of-the-art embedding model.

[Figure: Chunking strategies. Three ways to split the same ~2K-token document: fixed-size (every N tokens, 12 chunks), sentence-aware (split on punctuation, 10 chunks), and semantic (split where the topic shifts, 4 chunks: intro, body, aside, conclusion). Scale: meaning per chunk.]

Why you can’t just embed the whole document

Embedding models have a fixed context length — 8K, 32K, 128K tokens depending on the model. Beyond it you either truncate (losing tail content) or chunk. Even when documents fit, embedding a whole document into one vector dilutes the signal: the relevant span gets averaged with everything else. Passage-sized chunks win on precision.

Chunking strategies, roughly best to worst

  • Semantic / topic-aware — split where the document changes topic. Either via heuristics (heading boundaries, paragraph breaks) or via a small model that detects topic shifts. Best results, more setup.
  • Recursive character splitter — pick a target size (~500 tokens), break on the largest available delimiter (paragraph, then sentence, then word). Most popular default. Reasonable.
  • Fixed-size with overlap — every N tokens, with M tokens of overlap so cross-boundary spans don’t get lost. Simple, robust, fine for many corpuses.
  • Sentence-level — every chunk is one sentence. Maximum precision but loses local context, more chunks to manage.
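
As a rough illustration of the recursive approach in the list above, here is a minimal sketch. It assumes whitespace-separated words are an acceptable stand-in for tokens (a real pipeline would count with the embedding model's tokenizer), and the function name `recursive_split` and the delimiter order are hypothetical choices for this example, not any particular library's API.

```python
def recursive_split(text, max_tokens=500, delimiters=("\n\n", ". ", " ")):
    """Greedily pack pieces split on the largest delimiter; recurse on oversized pieces."""

    def n_tokens(s):
        # Assumption: word count approximates token count.
        return len(s.split())

    if n_tokens(text) <= max_tokens:
        return [text] if text.strip() else []

    if not delimiters:
        # No delimiter left to try: fall back to a hard cut on words.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]

    head, rest = delimiters[0], delimiters[1:]
    pieces = [p for p in text.split(head) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + head + piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate          # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
            current = ""
        if n_tokens(piece) <= max_tokens:
            current = piece              # start a new chunk with this piece
        else:
            # Piece is too large on its own: retry with the next, smaller delimiter.
            chunks.extend(recursive_split(piece, max_tokens, rest))
    if current:
        chunks.append(current)
    return chunks
```

Note that the delimiter is dropped wherever a chunk boundary falls, so a sentence that ends a chunk loses its trailing period in this sketch; a production splitter would usually keep it.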

Knobs that actually matter

  • Chunk size — typical 256-1024 tokens. Larger = more context per chunk, fewer chunks total. Smaller = more precise retrieval, less context per hit.
  • Overlap — typical 10-20% of chunk size. Prevents key info from being split.
  • Section anchors — preserve heading hierarchy so the chunk knows it’s “Chapter 3 → Section 4 → Subsection 2”. Hugely improves reranker quality.
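
To make the chunk size and overlap knobs concrete, the following sketch implements a fixed-size splitter over a list of tokens that has already been produced. The function name and default values are illustrative assumptions; the 64-token overlap on a 512-token chunk is 12.5%, inside the typical range quoted above.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Slice a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Usage: a 2,000-token document with the defaults above.
doc_tokens = list(range(2000))
doc_chunks = fixed_size_chunks(doc_tokens)
```

Each chunk repeats the last `overlap` tokens of its predecessor, so a span that straddles a boundary appears whole in at least one chunk as long as it is no longer than the overlap.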

What goes wrong

  • Cutting mid-table or mid-code — devastating for technical content. Add code/table-aware splitting.
  • Losing document-level context — if a chunk says “this provision applies”, the reader of just that chunk has no idea what provision. Solution: prepend a brief document summary or section breadcrumb to each chunk.
  • Over-chunking — too many tiny chunks bloats the index and dilutes the relevance signal.
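
One way to avoid cutting mid-code is to split the document into blocks first and treat each fenced code block as indivisible. The sketch below assumes Markdown-style input with triple-backtick fences; the function name `split_blocks` is hypothetical, and tables would need a similar rule of their own.

```python
FENCE = "`" * 3  # Markdown code-fence marker (three backticks)


def split_blocks(markdown_text):
    """Split on blank lines, but keep every fenced code block in one piece."""
    blocks, buffer, in_code = [], [], False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            if in_code:
                # Closing fence: emit the whole code block as a single unit.
                buffer.append(line)
                blocks.append("\n".join(buffer))
                buffer, in_code = [], False
            else:
                # Opening fence: flush any pending prose, then start the code block.
                if buffer:
                    blocks.append("\n".join(buffer))
                buffer, in_code = [line], True
        elif not in_code and not line.strip():
            # A blank line outside code ends the current prose block.
            if buffer:
                blocks.append("\n".join(buffer))
                buffer = []
        else:
            buffer.append(line)
    if buffer:
        blocks.append("\n".join(buffer))
    return blocks
```

A size-based chunker can then pack these blocks into chunks, letting a code block exceed the target size rather than splitting it.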

Embedding models are trained on a particular distribution of input lengths — typically passage-shaped chunks of 100-500 tokens. If you embed a 50-token snippet, you’re operating below the training distribution; the embedding under-uses the model’s representational capacity and the resulting vector is noisier than it should be. Embed a 4000-token document into a model trained on 256-token passages and you get the opposite failure: the embedding is the average of many topically-distinct spans, and the cosine similarity to any specific query degrades because the relevant span is diluted.
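
The dilution effect can be illustrated with toy vectors. This sketch assumes that a long input's embedding behaves roughly like the mean of its spans' embeddings, which is a simplification rather than a description of any specific model; the four-dimensional vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean, used here as a crude model of a multi-topic embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


query = [1.0, 0.0, 0.0, 0.0]
relevant_span = [1.0, 0.0, 0.0, 0.0]
unrelated_spans = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

focused_score = cosine(query, relevant_span)  # chunk holds only the relevant span
diluted_score = cosine(query, mean_vector([relevant_span] + unrelated_spans))
print(focused_score, diluted_score)
```

In this toy setup the relevant span on its own matches the query perfectly, while the same span averaged with three unrelated ones scores only half as high.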

The practical consequence is that swapping embedding models almost always means re-tuning chunk size. A model trained for long passages (Cohere embed-v3, gte-large with extended context) tolerates 1024-token chunks fine; a model trained for retrieval over Wikipedia-shaped passages (e5-small, bge-small) wants 256-512. The instinct to “set chunk size once and forget it” is exactly wrong — chunk size is a hyperparameter of the chunker-plus-embedder pair, not the chunker alone.

A second-order effect: chunk length distribution should be relatively uniform. Mix 50-token and 2000-token chunks in the same index and the cosine-similarity comparisons across them are no longer apples-to-apples. Normalization gives every vector the same magnitude, so the difference shows up in direction instead: a long chunk embeds more topics at once, and its similarity to any focused query tends to be lower than that of a short, single-topic chunk.
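
A quick way to check whether an index mixes very short and very long chunks is to summarize chunk lengths before embedding. This is a minimal sketch: it counts whitespace-separated words as a proxy for tokens, and the `low` and `high` thresholds are hypothetical values to tune for your embedding model.

```python
import statistics


def length_report(chunks, low=100, high=1024):
    """Summarize chunk lengths and list the indices of chunks outside [low, high]."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    # Assumption: word count approximates token count.
    lengths = [len(chunk.split()) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "outliers": [i for i, n in enumerate(lengths) if n < low or n > high],
    }


# Usage: three synthetic chunks of 50, 300, and 2000 words.
report = length_report(["w " * 50, "w " * 300, "w " * 2000])
```

Outlier chunks are candidates for merging with a neighbor (too short) or re-splitting (too long) before the index is built.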

Go further

What chunk size should I actually start with?

256-512 tokens with ~15% overlap is a safe default for most prose. Push larger (768-1024) for technical reference docs where context per chunk matters; smaller (128-256) for FAQ-shaped content where each chunk is a self-contained answer.

How do I keep document-level context inside each chunk?

Prepend a breadcrumb — document title plus heading hierarchy — to every chunk before embedding. The bi-encoder gets the global anchor for free, and the reranker gets a much cleaner signal because chunks like 'this provision applies' aren't context-free anymore.
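
A breadcrumb prefix can be as simple as the sketch below. The function name, the " > " separator, and the example title are assumptions made for illustration; the only requirement is that the same format is applied to every chunk before it is embedded.

```python
def with_breadcrumb(chunk_text, doc_title, headings, sep=" > "):
    """Prefix a chunk with its document title and heading trail."""
    trail = sep.join([doc_title, *headings])
    return f"{trail}\n\n{chunk_text}"


# Usage: the chunk now carries its own location in the document.
anchored = with_breadcrumb(
    "This provision applies.",
    "Master Services Agreement",
    ["Chapter 3", "Section 4", "Subsection 2"],
)
```

The breadcrumb counts toward the chunk's token budget, so it is worth keeping short when the target chunk size is small.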

Does chunking still matter for long-context embedding models?

Yes. Even with a 32K context window, encoding a whole document into one vector dilutes the relevant span — the embedding becomes the average of everything in the doc, which is rarely what you're searching for. Passage-level chunks still win on precision.
