Context Rot

Also known as: lost in the middle, lost-in-the-middle, positional attention bias, middle-context degradation, U-shaped attention, context dilution, effective-context degradation

TL;DR

Context rot is the empirical degradation of an LLM's effective recall and instruction-following as its context window fills. The canonical case is the U-shaped position bias first quantified by Liu et al. (2023) as 'lost in the middle' — facts near the start and end of a long prompt are used, facts buried in the middle are often ignored — but the phenomenon generalizes to attention dilution and instruction drift across long contexts.

Context rot is the umbrella term for the empirical degradation of an LLM’s effective recall and instruction-following as its context window fills. The canonical case is the U-shaped position bias first quantified by Liu et al. (2023) in a paper titled Lost in the Middle: facts placed near the beginning or end of a long prompt get used, while facts buried in the middle are frequently ignored, even when they are the correct answer.

The original “lost in the middle” finding is a specific instance of context rot. The broader phenomenon also includes cumulative attention dilution (overall recall drops as the prompt grows, not just for middle positions), instruction drift (a system prompt’s directives are followed less faithfully as the conversation lengthens), and citation slip (later sentences in a long answer point to wrong document indices). All share the same operational signature: a model that performs well at small context lengths degrades non-uniformly as the context grows.

What the original experiments showed

The Liu et al. setup was a multi-document QA task: place K documents in the context, one of which contains the answer, and ask the question. By varying the position of the gold document from slot 1 to slot K, the researchers traced an accuracy curve that consistently looked like a U — high at the boundaries, sagging in the middle, often dropping 20+ absolute points.
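
A position sweep of this kind can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the prompt template, the substring-match scoring, and the `ask_model` callable (a stand-in for whatever LLM client you use) are all assumptions.

```python
from typing import Callable, List, Sequence


def build_prompt(distractors: Sequence[str], gold: str,
                 gold_slot: int, question: str) -> str:
    """Place the gold document at gold_slot (0-based) among the distractors."""
    docs = list(distractors)
    docs.insert(gold_slot, gold)
    numbered = [f"Document [{i + 1}]: {d}" for i, d in enumerate(docs)]
    return "\n\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


def position_sweep(distractors: Sequence[str], gold: str, question: str,
                   expected: str,
                   ask_model: Callable[[str], str]) -> List[float]:
    """Score one question at every gold position (1.0 = correct, 0.0 = wrong).

    ask_model is a hypothetical callable: prompt in, answer text out.
    Scoring is a crude case-insensitive substring match.
    """
    scores = []
    for slot in range(len(distractors) + 1):
        answer = ask_model(build_prompt(distractors, gold, slot, question))
        scores.append(1.0 if expected.lower() in answer.lower() else 0.0)
    return scores
```

Averaging the per-slot scores over many questions gives the accuracy-by-position curve; a U-shape shows up as high values at the first and last slots and lower values in between.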

The result held across GPT-3.5, GPT-4, Claude, and the open-source models of the time. Larger context windows did not fix it. Models advertised as “100K context” still showed strong position bias when the relevant fact sat, say, halfway through a 30K-token prompt.

An LLM’s context window has high-attention zones at the boundaries and a low-attention zone in the middle. A document that is technically in the prompt is not automatically a document the model will use.

Why it happens

The mechanisms are partly architectural and partly training-driven:

  • Positional encoding decay. Most long-context models use RoPE or similar relative-position encodings whose effective attention sharpness decays at large relative distances. Tokens far from both the start (system prompt) and the end (user query) receive less attention bandwidth.
  • Training-data shape. Pretraining and instruction-tuning corpora over-represent short documents with important content at the start; the model implicitly learns “look at the start.” Recent instruction examples often place the user query at the end, training “look at the end” too. The middle has no analogous prior.
  • Recency bias of autoregressive generation. The next-token distribution is dominated by tokens close to the generation point, which favors the end of the prompt.

The U-shape is the sum of these forces. Newer long-context training recipes (needle-in-a-haystack training, position-balanced fine-tuning) flatten the curve but do not eliminate it.

Practical consequences for RAG

Where context rot matters in production
  • A reranker correctly retrieves the relevant chunk but ranks it 7th of 10, and the LLM ignores it because slot 7 sits in the middle of the context.
  • A long system prompt pushes retrieved context into the middle of the window where attention is weakest.
  • Multi-document summarization quietly drops content from documents in the middle of the input order.
  • Few-shot examples placed in the middle of a long prompt influence outputs less than examples at the start or end.
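
One way to check whether a pipeline is exposed to these cases is to compute where each retrieved chunk lands in the assembled prompt. The helper below is a rough sketch under stated assumptions: `chunk_depths` is a hypothetical name, character counts stand in for token counts, and separators between chunks are ignored.

```python
from typing import List, Sequence


def chunk_depths(system_prompt: str, chunks: Sequence[str],
                 user_query: str) -> List[float]:
    """Return each chunk's midpoint as a fraction of total prompt length.

    0.0 is the very start of the prompt, 1.0 the very end. Characters are
    used as a crude proxy for tokens (an assumption of this sketch).
    """
    lengths = [len(c) for c in chunks]
    total = len(system_prompt) + sum(lengths) + len(user_query)
    if total == 0:
        raise ValueError("prompt is empty")
    depths = []
    offset = len(system_prompt)  # chunks start after the system prompt
    for n in lengths:
        depths.append((offset + n / 2) / total)
        offset += n
    return depths
```

Chunks whose depth falls near 0.5 sit in the region where position bias is expected to be strongest; a long system prompt shifts every chunk's depth toward that region.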

The standard fixes:

  1. Reorder context by confidence. Place the highest-scoring retrieved chunks at the start and end of the context block, lower-confidence chunks in the middle. This needs calibrated reranker scores to work.
  2. Shrink the context. Context rot is a long-context problem; if the answer can be assembled from 4K tokens instead of 32K, the U-shape barely matters.
  3. Re-prompt with focused context. Run retrieval, identify the top 1-3 chunks, run a second LLM call with only those chunks in the prompt. Adds latency but eliminates position bias on the final answer.
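
Fixes 2 and 3 both come down to a selection step before the final LLM call: keep only the best few chunks that fit a small budget. The sketch below assumes chunks arrive as (text, score) pairs from a reranker; `focus_context`, `approx_tokens`, the default limits, and the whitespace-based token estimate are illustrative choices, not part of any particular library.

```python
from typing import List, Sequence, Tuple


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated word count (an assumption)."""
    return max(1, len(text.split()))


def focus_context(scored_chunks: Sequence[Tuple[str, float]],
                  max_chunks: int = 3,
                  token_budget: int = 4000) -> List[str]:
    """Keep the highest-scoring chunks that fit max_chunks and token_budget."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    kept: List[str] = []
    used = 0
    for text, _score in ranked:
        if len(kept) == max_chunks:
            break
        cost = approx_tokens(text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; a smaller chunk may fit
        kept.append(text)
        used += cost
    return kept
```

The returned chunks would then go into the second, shorter prompt. In practice the word-count estimate should be swapped for the tokenizer of the model being called.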

None of this means long context is worse than short context across the board. Longer context is strictly more capable when the relevant information is at the boundaries; it is worse than equivalent shorter context only when the relevant information is buried in the middle. The honest summary is that long-context models have non-uniform recall across position, not lower recall everywhere.

For pipelines that can position their context strategically (RAG with reranking, structured prompts where the user query is always last), long context is usually a net win. For pipelines that drop a wall of unstructured context into the middle of the window and hope, short focused context wins.

Go further

What was the original Liu et al. finding?

Liu et al. (2023), in the paper titled 'Lost in the Middle', ran multi-document question answering and key-value retrieval tasks in which the relevant document (or key-value pair) was placed at different positions in a long context. Accuracy was high when the relevant item was near the beginning or end and dropped sharply when it sat in the middle: a U-shaped curve that held across model sizes and across both open-source and proprietary models at the time. The paper title is where the term 'lost in the middle' comes from; context rot is the broader umbrella that includes this position effect alongside attention dilution and instruction drift.

Has this gotten better with newer long-context models?

Somewhat. Frontier models with native long-context training have flatter position curves, but the U-shape never fully disappears: even at 200K+ context lengths, recall on middle-positioned facts trails start-and-end recall by 5-20 absolute points on adversarial probes. Cumulative effects (instruction drift, faithfulness drop) also worsen as the prompt grows. The practical advice stays the same: position important context at the boundaries, and shorten the prompt when possible.
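
To track this gap on your own model and data, one simple summary statistic is boundary accuracy minus middle accuracy. The function below is a sketch; `boundary_middle_gap` and the `edge` parameter are hypothetical names, and the input is whatever per-position accuracy curve your own probe produces.

```python
from typing import Sequence


def boundary_middle_gap(accuracy_by_slot: Sequence[float], edge: int = 1) -> float:
    """Mean accuracy of the first/last `edge` slots minus the mean of the rest.

    A positive value means boundary positions are recalled better than
    middle positions; values near zero indicate a flat position curve.
    """
    acc = list(accuracy_by_slot)
    if edge < 1 or len(acc) < 2 * edge + 1:
        raise ValueError("need at least one middle slot between the edges")
    boundary = acc[:edge] + acc[-edge:]
    middle = acc[edge:-edge]
    return sum(boundary) / len(boundary) - sum(middle) / len(middle)
```

For example, with made-up accuracies of 80, 60, 60, 60, 80 (in percentage points) across five slots, the function returns a gap of 20 points.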

What's the practical fix for RAG pipelines?

Reorder retrieved chunks so the highest-confidence documents sit at the start and end of the context block, padding the middle with lower-priority context. Some pipelines duplicate the top chunk at both boundaries. The trick assumes calibrated relevance scores from your reranker — uncalibrated scores produce arbitrary orderings that can place the wrong doc at a high-attention boundary.
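
The reordering described above can be sketched as an interleave over the score-sorted list. This is one possible layout, not a canonical algorithm: `reorder_for_boundaries` and `duplicate_top` are hypothetical names, and the input is assumed to be (text, score) pairs with scores that are comparable across chunks.

```python
from typing import List, Sequence, Tuple


def reorder_for_boundaries(scored_chunks: Sequence[Tuple[str, float]],
                           duplicate_top: bool = False) -> List[str]:
    """Put the best chunks at the start and end, the weakest in the middle.

    Ranks alternate between the front and the back of the block, so rank 1
    is first, rank 2 is last, and the lowest ranks meet in the middle.
    """
    ranked = [text for text, _score in
              sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)]
    front: List[str] = []
    back: List[str] = []
    for i, text in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    ordered = front + back[::-1]
    if duplicate_top and len(ranked) > 1:
        ordered.append(ranked[0])  # optional: repeat the top chunk at the end
    return ordered
```

With five chunks ranked A > B > C > D > E, the resulting order is A, C, E, D, B. Setting `duplicate_top=True` appends A again at the end, at the cost of extra tokens.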
