Latency Tail (P95, P99)

Also known as: P99, P95, tail latency, latency percentiles

TL;DR

P50 is the median; P95 and P99 are the 95th and 99th percentile latencies. The tail is what wakes oncall, not the median — a 200ms median with a 5s P99 means 1% of users see your system as broken.

In production retrieval, latency is not a single number. It’s a distribution, and the shape of the tail is what determines whether your system feels fast or broken. Two systems with identical 200ms median latency can feel completely different: one with a 250ms P99, one with a 5s P99. The first feels rock-solid; the second has 1% of requests that hang.

What the percentiles mean

  • P50 (median). Half of requests are faster than this. The “average user experience” — but only loosely.
  • P95. 95% of requests are faster. 5% are slower. At 100 requests/sec, that’s 5/sec hitting the slow path.
  • P99. 99% of requests are faster. 1% are slower. At enterprise scale (10K rps), 100/sec are slow.
  • P999 (P99.9). 0.1% are slower. Still meaningful at internet scale.

A “good” system has the tail close to the median. A 200ms P50 with a 250ms P99 means almost every user has a similar experience. A 200ms P50 with a 5s P99 means a meaningful fraction of users are seeing a different, broken-feeling system.
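To make the definitions concrete, here is a minimal sketch of computing percentiles from raw latency samples with the nearest-rank method. The `percentile` helper and the sample data are illustrative assumptions, not something a particular monitoring system prescribes; real metrics backends often interpolate or use histogram buckets instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [200] * 98 + [5000] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"P50={p50}ms P95={p95}ms P99={p99}ms")
```

With this sample, P50 and P95 both sit at 200ms while P99 jumps to 5000ms, which is the "same median, broken tail" shape described above.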

The fan-out amplification

A retrieval pipeline composed of 4 sequential stages, each with P50 = 100ms and P99 = 500ms, has an end-to-end P50 of roughly 400ms, but its end-to-end P99 cannot be read off the medians: it lands well above 4× the P50 and can approach the sum of the per-stage P99s when slow paths coincide. The tail compounds. This is why you can’t reason about compound system latency from medians.

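A simple model: if each stage’s slow path fires independently with probability 1%, the chance that at least one of the 4 stages hits its slow path on a given request is roughly $1 - (1 - 0.01)^4 \approx 3.9\%$. Your effective tail percentile shifts: what was a 1-in-100 event per stage becomes roughly a 1-in-25 event end to end.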
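The independence model above is easy to check numerically. The sketch below assumes every stage has the same slow-path probability and that stages are independent; `p_any_slow` is a hypothetical helper name. Real stages are often correlated (shared hosts, shared GC pauses), so treat the output as a rough guide rather than a prediction.

```python
def p_any_slow(p_slow: float, n_stages: int) -> float:
    """Probability that at least one of n independent stages hits its slow path."""
    return 1.0 - (1.0 - p_slow) ** n_stages


four_stage = p_any_slow(0.01, 4)

for n in (1, 2, 4, 8):
    print(f"{n} stage(s): {p_any_slow(0.01, n):.2%} of requests touch a slow path")
```

For four stages this gives about 3.9%, so a slow path that is a P99 event within one stage is closer to a P96 event for the pipeline as a whole.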

Latency distributions are typically log-normal or even heavier-tailed: most requests cluster near the median, but a small fraction extend far into the tail. For a heavy-tailed distribution, the mean is dragged up by the rare slow paths and stops reflecting any user’s actual experience; almost no one waited the mean amount of time. Worse, two operationally different distributions can share an identical mean: a system with a constant 320ms latency and a system with a 280ms median plus a 5s P99 can both report a 320ms mean. Engineers who chase the mean end up optimizing the wrong target. Percentiles stay meaningful under a heavy-tailed shape, which is why production observability dashboards typically report P50, P95, and P99 as first-class.
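A small worked example of two distributions that share a mean but not a tail. The sample values are hypothetical, chosen so that a constant-latency system and a spiky one (280ms median, 5s P99) both average exactly 320ms; `percentile` is an illustrative nearest-rank helper.

```python
import math
from statistics import mean, median


def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds.
steady_ms = [320] * 100                          # every request takes 320ms
spiky_ms = [144] * 40 + [280] * 58 + [5000] * 2  # mostly fast, 2% hang for 5s

for name, data in (("steady", steady_ms), ("spiky", spiky_ms)):
    print(
        f"{name}: mean={mean(data)}ms "
        f"median={median(data)}ms p99={percentile(data, 99)}ms"
    )
```

A dashboard that only plots the mean cannot tell these two systems apart; the P99 line separates them immediately.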

Optimize the median for cost. Optimize the tail for trust.

What lives in the tail

In production retrieval pipelines, tail events are typically:

Sources of P99 spikes
  • Cold cache — first request for an embedding or document hits disk; subsequent reads are RAM-fast.
  • GC and index merges — periodic background work in the index (Lucene segment merges, ANN graph rebuilds) blocks foreground reads.
  • Reranker batch wait — continuous batching has variable wait time depending on traffic patterns.
  • Cross-AZ network — most calls are intra-AZ; about 0.1% take the cross-AZ path and are roughly an order of magnitude slower.
  • Token-by-token LLM jitter — output tokens come at variable speed. A query that needs 100 output tokens is fine; one that needs 1000 spikes the latency.
  • Auto-scaling cold start — the first request that lands on a freshly provisioned replica pays the model-load cost.

How to operate against the tail

  • Measure percentiles, not averages. Your dashboard should show P50, P95, P99 as first-class. Hide the mean — it’s misleading for fat-tailed distributions.
  • Budget per stage. If your end-to-end P99 budget is 1s, allocate explicit per-stage budgets that sum to less. Track each stage’s P99 separately.
  • Hedging. For idempotent reads (vector search, reranker calls), issue a duplicate request after a short delay if the first hasn’t returned, and take whichever response arrives first. This can cut P99 substantially for a modest increase in load.
  • Timeouts and fallbacks. Every external call has a timeout shorter than your budget. When it fires, fall back to a degraded result rather than blocking.
  • Caching. A cache hit eliminates not just the median latency but the tail: cached results don’t have GC pauses or cold-cache surprises.

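The hedging and timeout-with-fallback items above can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not a production client: `hedged`, `with_fallback`, and `fake_search` are hypothetical names, the delays are arbitrary, and the wrapped call must be idempotent because it may run twice.

```python
import asyncio


async def hedged(make_call, hedge_delay_s):
    """Start make_call(); if it has not finished after hedge_delay_s, start a
    second copy and return whichever finishes first. Unfinished copies are
    cancelled on the way out."""
    tasks = [asyncio.ensure_future(make_call())]
    try:
        done, _ = await asyncio.wait(tasks, timeout=hedge_delay_s)
        if not done:
            tasks.append(asyncio.ensure_future(make_call()))
            done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()


async def with_fallback(awaitable, timeout_s, fallback):
    """Enforce a per-call timeout and return a degraded result when it fires."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


calls = 0


async def fake_search():
    """Hypothetical backend: the first call hits a slow path, later calls are fast."""
    global calls
    calls += 1
    call_id = calls
    await asyncio.sleep(0.5 if call_id == 1 else 0.01)
    return f"result from call {call_id}"


async def demo():
    return await with_fallback(
        hedged(fake_search, hedge_delay_s=0.05),
        timeout_s=0.3,
        fallback="degraded result",
    )


result = asyncio.run(demo())
print(result)
```

In this run the primary call stalls, the hedge fires after 50ms, and the duplicate answers well inside the 300ms timeout. One reasonable way to pick the hedge delay is from the call's own observed latency distribution, so that only a small share of requests are ever duplicated.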
Go further

Why does the tail get worse at scale?

If a single request has 1% chance of slow path, a fan-out request that depends on 10 sub-requests has roughly 10% chance — tail amplification. Retrieval pipelines (BM25 + vector + rerank + LLM) compose 4+ stages; each stage's P99 contributes to the end-to-end P99, not the median. This is why median latency is a near-useless production metric for compound systems.
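A small simulation makes the amplification visible at the latency level. It assumes, purely for illustration, that each sub-request's latency is log-normal with a 100ms median and independent of the others, and that a fan-out request waits for its slowest sub-request. `MU`, `SIGMA`, and `simulate_fan_out` are hypothetical names and parameters.

```python
import math
import random
from statistics import NormalDist, median

MU = math.log(100.0)  # assumed: 100ms median per sub-request
SIGMA = 0.5           # assumed log-normal shape parameter

# Exact P99 of a single sub-request under the assumed distribution.
SUB_P99_MS = math.exp(MU + SIGMA * NormalDist().inv_cdf(0.99))


def simulate_fan_out(fan_out, trials=20_000, seed=0):
    """Return (median latency, share of requests slower than one sub-request's P99)
    for requests that wait on the slowest of `fan_out` sub-requests."""
    rng = random.Random(seed)
    totals = [
        max(rng.lognormvariate(MU, SIGMA) for _ in range(fan_out))
        for _ in range(trials)
    ]
    slow_share = sum(t > SUB_P99_MS for t in totals) / trials
    return median(totals), slow_share


p50_1, slow_1 = simulate_fan_out(1)
p50_10, slow_10 = simulate_fan_out(10)
print(f"fan-out 1:  P50={p50_1:.0f}ms, slower than sub-request P99: {slow_1:.1%}")
print(f"fan-out 10: P50={p50_10:.0f}ms, slower than sub-request P99: {slow_10:.1%}")
```

Under these assumptions the share of requests that exceed a single sub-request's P99 rises from about 1% to roughly 10%, and even the median of the fan-out request moves up, because it is now a maximum over ten draws.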

How tight should retrieval P99 be?

Depends on the surface. Interactive chat: end-to-end P99 < 3s, with retrieval allocated ~500ms budget. Voice / agent loops: end-to-end P99 < 1s, retrieval < 200ms. Background batch: P99 in seconds is fine. Allocate budget per stage and measure each stage's P99 separately.
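One way to keep per-stage budgets honest is to encode them and check them against observed P99s. In the sketch below only the 3s end-to-end figure and the 500ms retrieval figure come from the text; the other stage names, their budgets, and the observed values are hypothetical. Summing per-stage P99 budgets is a deliberately conservative rule, since stages rarely all hit their P99 on the same request.

```python
E2E_P99_BUDGET_MS = 3000  # interactive chat target from the text

# Retrieval budget from the text; the other stages are hypothetical examples.
stage_budgets_ms = {
    "retrieval": 500,
    "rerank": 300,
    "llm_generation": 2000,
}


def headroom_ms(stage_budgets, e2e_budget):
    """End-to-end budget left over after all per-stage budgets are allocated."""
    return e2e_budget - sum(stage_budgets.values())


def stages_over_budget(observed_p99_ms, stage_budgets):
    """Names of stages whose observed P99 exceeds their allocated budget."""
    return sorted(
        stage
        for stage, p99 in observed_p99_ms.items()
        if p99 > stage_budgets[stage]
    )


# Hypothetical P99s read from a dashboard.
observed_p99_ms = {"retrieval": 620, "rerank": 250, "llm_generation": 1900}

headroom = headroom_ms(stage_budgets_ms, E2E_P99_BUDGET_MS)
offenders = stages_over_budget(observed_p99_ms, stage_budgets_ms)
print(f"headroom: {headroom}ms, over budget: {offenders}")
```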

What causes tail spikes in retrieval pipelines?

GC pauses in the index. Cold-cache reads from disk. Cross-AZ network blips. Embedding model batch backpressure. ANN index merges. Reranker batch starvation when traffic is bursty. Each is silent when measured by averages; each is a P99 tail event. Observability that captures per-stage percentiles is the only way to find them.
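Per-stage percentile capture can start very small. The following sketch records wall-clock time per named stage and reports P50, P95, and P99 using the standard library; `StageTimer` and the stage names are hypothetical, and a real deployment would more likely export histograms to a metrics backend than keep raw samples in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles


class StageTimer:
    """Collects latency samples per stage and summarizes them as percentiles."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    def record(self, name, elapsed_ms):
        self.samples_ms[name].append(elapsed_ms)

    @contextmanager
    def stage(self, name):
        """Time the enclosed block and record it under `name`, even on error."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, (time.perf_counter() - start) * 1000)

    def report(self):
        summary = {}
        for name, samples in self.samples_ms.items():
            if len(samples) < 2:  # too few samples to compute cut points
                continue
            cuts = quantiles(samples, n=100, method="inclusive")
            summary[name] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
        return summary


timer = StageTimer()

# Synthetic samples: a stage that is usually 20ms but occasionally 900ms.
for ms in [20] * 97 + [900] * 3:
    timer.record("vector_search", ms)

# Timing real code with the context manager.
for _ in range(3):
    with timer.stage("rerank"):
        sum(range(1000))

report = timer.report()
for name, stats in report.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

For the synthetic `vector_search` stage the mean is about 46ms, which looks healthy, while the reported P99 of 900ms exposes the spike.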
