Embedding models are trained on a particular distribution of input lengths, typically passage-shaped chunks of 100-500 tokens. Embed a 50-token snippet and you’re operating below that distribution: the model has little context to work with, so the resulting vector tends to be noisier and less discriminative than one built from a full passage. Embed a 4000-token document with a model trained on 256-token passages and you get the opposite failure. If the model accepts the whole input at all (many truncate at their maximum sequence length), the embedding is effectively an average over many topically distinct spans, and the cosine similarity to any specific query degrades because the relevant span is diluted.
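The dilution effect can be illustrated with a toy model. The sketch below assumes each topic is a one-hot (mutually orthogonal) vector and that a chunk embedding is the plain mean of its topic vectors; real embedding models are far less tidy, and the helper names here are made up for illustration. Under those assumptions the similarity between a single-topic query and a chunk covering k topics works out to 1/sqrt(k).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors):
    """Element-wise mean of a list of vectors (toy stand-in for pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def one_hot(index, dim):
    """Unit vector for one topic; distinct topics are orthogonal by construction."""
    return [1.0 if j == index else 0.0 for j in range(dim)]


def diluted_similarity(num_topics, dim=16):
    """Similarity of a topic-0 query to a chunk covering num_topics topics equally."""
    query = one_hot(0, dim)
    chunk = mean_pool([one_hot(i, dim) for i in range(num_topics)])
    return cosine(query, chunk)


for k in (1, 2, 4, 8, 16):
    # Prints 1.0, 0.707, 0.5, 0.354, 0.25: similarity falls as 1/sqrt(k).
    print(k, round(diluted_similarity(k), 3))
```

The relevant span is still in the chunk at k = 16, but its best possible score has dropped to a quarter of what a focused chunk would get.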
The practical consequence is that swapping embedding models almost always means re-tuning chunk size. A model trained for long passages (for example, a gte-large variant with extended context) tolerates 1024-token chunks fine; a model trained for retrieval over Wikipedia-shaped passages (e5-small, bge-small) wants 256-512. Check each model’s documented maximum input length before settling on a size, since text past that limit is typically truncated rather than embedded. The instinct to “set chunk size once and forget it” is exactly wrong: chunk size is a hyperparameter of the chunker-plus-embedder pair, not the chunker alone.
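One way to make the pairing explicit in code is to key the chunk size on the embedding model, so a model swap cannot silently reuse the old setting. This is a minimal sketch, not a recommended configuration: the sizes are placeholder starting points to be tuned on your own retrieval evaluation, `hypothetical-long-context-model` is an invented name, and whitespace splitting stands in for the model's real tokenizer.

```python
# Placeholder starting points, keyed by embedder; tune each on your own data.
CHUNK_TOKENS_BY_MODEL = {
    "e5-small": 256,
    "bge-small": 256,
    "hypothetical-long-context-model": 1024,  # invented name for illustration
}


def chunk_for_model(text, model_name):
    """Split text into chunks sized for the given embedding model."""
    if model_name not in CHUNK_TOKENS_BY_MODEL:
        # Fail loudly: a new embedder needs its own tuned chunk size.
        raise KeyError(f"no tuned chunk size for embedder {model_name!r}")
    size = CHUNK_TOKENS_BY_MODEL[model_name]
    tokens = text.split()  # assumption: whitespace tokens approximate model tokens
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


document = " ".join(f"w{i}" for i in range(1000))
print(len(chunk_for_model(document, "bge-small")))                        # 4 chunks
print(len(chunk_for_model(document, "hypothetical-long-context-model")))  # 1 chunk
```

Raising on an unknown model is deliberate: it turns "forgot to re-tune" into an error at indexing time instead of a quiet drop in retrieval quality.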
A second-order effect: chunk lengths within one index should fall in a fairly narrow band. Mix 50-token and 2000-token chunks in the same index and the cosine-similarity comparisons across them are no longer apples-to-apples. The difference is not one of magnitude, since after L2 normalization every vector has unit length. Rather, a long chunk spreads that unit length across many topics at once, which caps its similarity to any single focused query below what a short, on-topic chunk can reach.
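A quick audit of chunk lengths before indexing can catch this kind of mix. The sketch below is one possible check, assuming whitespace token counts are a good enough proxy and that "too short" and "too long" mean less than half or more than twice the median; both thresholds and the function name are arbitrary choices for illustration.

```python
import statistics


def length_report(chunks, low_ratio=0.5, high_ratio=2.0):
    """Summarize chunk lengths and flag chunks far from the median length."""
    lengths = [len(chunk.split()) for chunk in chunks]  # whitespace token proxy
    median = statistics.median(lengths)
    outliers = [
        index
        for index, n in enumerate(lengths)
        if n < low_ratio * median or n > high_ratio * median
    ]
    return {
        "min": min(lengths),
        "median": median,
        "max": max(lengths),
        "outliers": outliers,  # indices of chunks outside the accepted band
    }


# Synthetic chunks with 300, 280, 320, 50, and 2000 tokens.
chunks = [" ".join(["tok"] * n) for n in (300, 280, 320, 50, 2000)]
print(length_report(chunks))
```

Flagged chunks can then be merged with neighbors or split further so the index ends up with a tighter length band.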