Latency Tail (P95, P99)

Q: What causes tail spikes in retrieval pipelines?

GC pauses in the index. Cold-cache reads from disk. Cross-AZ network blips. Embedding model batch backpressure. ANN index merges. Reranker batch starvation when traffic is bursty. Each is silent when measured by averages; each is a P99 tail event. Observability that captures per-stage percentiles is the only way to find them.

Also known as: P99, P95, tail latency, latency percentiles

TL;DR

P50 is the median; P95 and P99 are the 95th and 99th percentile latencies. The tail is what wakes oncall, not the median: a 200ms median with a 5s P99 means about 1 in 100 requests makes your system look broken.

In production retrieval, latency is not a single number. It’s a distribution, and the shape of the tail is what determines whether your system feels fast or broken. Two systems with identical 200ms median latency can feel completely different: one with a 250ms P99, one with a 5s P99. The first feels rock-solid; the second has 1% of requests that hang.

What the percentiles mean

P50 (median). Half of requests are faster than this. The “average user experience” — but only loosely.
P95. 95% of requests are faster. 5% are slower. At 100 requests/sec, that’s 5/sec hitting the slow path.
P99. 99% of requests are faster. 1% are slower. At enterprise scale (10K rps), 100/sec are slow.
P999. 0.1% are slower. Still meaningful at internet scale.

A “good” system has the tail close to the median. A 200ms P50 with a 250ms P99 means almost every user has a similar experience. A 200ms P50 with a 5s P99 means a meaningful fraction of users are seeing a different, broken-feeling system.
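As a rough illustration of the definitions above, the sketch below computes percentiles with the nearest-rank method. The `percentile` helper and the sample latencies are made up for this example; production systems usually rely on a metrics library or histogram sketches rather than sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: mostly fast, a few slow, two very slow.
latencies_ms = [200] * 94 + [250] * 4 + [5000] * 2

print(percentile(latencies_ms, 50))  # 200
print(percentile(latencies_ms, 95))  # 250
print(percentile(latencies_ms, 99))  # 5000
```

With these invented numbers the median and P95 look healthy, and only the P99 exposes the two 5s requests.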

The fan-out amplification

A retrieval pipeline composed of 4 stages, each with P50 = 100ms and P99 = 500ms, has an end-to-end P50 of roughly 400ms, but its end-to-end P99 sits far above 4× the P50 and, when slow paths coincide, approaches the sum of the stage P99s (2s). The tail compounds. This is why you can’t reason about compound system latency from medians.

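A simple model: if each stage’s slow path fires independently with probability 1%, the chance that at least one of the 4 stages hits its slow path on a given request is roughly $1 - 0.99^4 \approx 3.9\%$. Your effective tail percentile shifts: what is a P99 event for a single stage becomes closer to a P96 event end to end.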
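Under the independence assumption in the simple model above, the chance that at least one of n stages hits its slow path is 1 − (1 − p)^n. A minimal sketch (the function name `prob_any_slow` is ours, chosen for illustration):

```python
def prob_any_slow(p_slow, n_stages):
    """Probability that at least one of n independent stages hits its slow path."""
    return 1 - (1 - p_slow) ** n_stages

# A 1% per-stage slow path, composed across 1, 4, and 10 stages.
for n in (1, 4, 10):
    print(n, round(prob_any_slow(0.01, n), 4))
```

For a 1% slow path this comes to about 3.9% across 4 stages and about 9.6% across 10. Real slow paths are often correlated (a shared GC pause, a shared network blip), so treat this as a first-order estimate rather than a prediction.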

Latency distributions are typically log-normal or even heavier-tailed: most requests cluster near the median, but a small fraction extend far into the tail. For a heavy-tailed distribution, the mean is dragged up by the rare slow paths and stops reflecting any user’s actual experience; almost no one waited the mean amount of time. Worse, two operationally different distributions can share an identical mean: a system with a uniform 320ms latency and a system with a 280ms median plus a 5s P99 can both report a mean of roughly 320ms. Engineers who chase the mean end up optimizing the wrong target. Percentiles are the summary statistics that survive the heavy-tail shape, which is why production observability dashboards typically report P50, P95, and P99 as first-class.
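To make the "same mean, different system" point concrete, here is a small sketch with two invented latency samples. The values are chosen only so that the two means match exactly; they are not measurements, and the tail value is smaller than the 5s used in the prose so the arithmetic stays clean.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (illustrative helper)."""
    ordered = sorted(samples)
    return ordered[math.ceil(p * len(ordered) / 100) - 1]

def mean(samples):
    return sum(samples) / len(samples)

# Hypothetical samples in milliseconds.
steady_ms = [320] * 100               # every request takes 320 ms
spiky_ms = [280] * 98 + [2280] * 2    # 98 fast requests, 2 tail events

for name, sample in (("steady", steady_ms), ("spiky", spiky_ms)):
    print(name, mean(sample), percentile(sample, 50), percentile(sample, 99))
```

Both samples report a 320ms mean, but the P99 is 320ms for the steady system and 2280ms for the spiky one. A dashboard showing only the mean could not tell them apart.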

Optimize the median for cost. Optimize the tail for trust.

What lives in the tail

In production retrieval pipelines, tail events are typically:

Sources of P99 spikes

Cold cache — first request for an embedding or document hits disk; subsequent reads are RAM-fast.
GC and index merges — periodic background work in the index (Lucene segment merges, ANN graph rebuilds) blocks foreground reads.
Reranker batch wait — continuous batching has variable wait time depending on traffic patterns.
Cross-AZ network — most calls are intra-AZ; about 0.1% take the cross-AZ path and are roughly an order of magnitude slower.
Token-by-token LLM jitter — output tokens come at variable speed. A query that needs 100 output tokens is fine; one that needs 1000 spikes the latency.
Auto-scaling cold start — the first request that lands on a freshly provisioned replica pays the model-load cost.

How to operate against the tail

Measure percentiles, not averages. Your dashboard should show P50, P95, P99 as first-class. Hide the mean — it’s misleading for fat-tailed distributions.
Budget per stage. If your end-to-end P99 budget is 1s, allocate explicit per-stage budgets that sum to less. Track each stage’s P99 separately.
Hedging. For idempotent reads (vector search, reranker calls), issue a duplicate request after a short delay if the first hasn’t returned, and take whichever response arrives first. This can cut P99 substantially for a modest increase in load.
Timeouts and fallbacks. Every external call has a timeout shorter than your budget. When it fires, fall back to a degraded result rather than blocking.
Caching. A cache hit eliminates not just the median latency but the tail: cached results don’t have GC pauses or cold-cache surprises.
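The hedging and timeout-with-fallback items above can be combined in one small wrapper. The following is a sketch using Python's `asyncio`, not a production client: `hedged_call`, its parameters, and the simulated backends are all hypothetical, and the hedge delay is left as a parameter to tune because the right value depends on your own latency distribution.

```python
import asyncio

async def hedged_call(make_request, hedge_delay_s, timeout_s, fallback):
    """Run make_request(); if it has not finished after hedge_delay_s, start a
    duplicate and return whichever finishes first. If nothing finishes within
    timeout_s overall, return the fallback value instead of blocking.
    Only safe for idempotent requests. Exceptions from a request propagate."""
    async def race():
        tasks = [asyncio.create_task(make_request())]
        try:
            done, _ = await asyncio.wait(tasks, timeout=hedge_delay_s)
            if not done:
                # The first attempt is slow: send the hedge and race the two.
                tasks.append(asyncio.create_task(make_request()))
                done, _ = await asyncio.wait(
                    tasks, return_when=asyncio.FIRST_COMPLETED
                )
            return next(iter(done)).result()
        finally:
            for task in tasks:
                task.cancel()  # no-op for tasks that already finished

    try:
        return await asyncio.wait_for(race(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback

async def demo():
    calls = 0

    async def flaky_search():
        # Simulated backend: the first attempt hits a tail event, the duplicate is fast.
        nonlocal calls
        calls += 1
        attempt = calls
        await asyncio.sleep(1.0 if attempt == 1 else 0.01)
        return f"attempt {attempt}"

    async def stuck_search():
        # Simulated backend that is slow on every attempt.
        await asyncio.sleep(1.0)
        return "late"

    hedged = await hedged_call(flaky_search, 0.05, 0.5, "degraded")
    timed_out = await hedged_call(stuck_search, 0.02, 0.1, "degraded")
    return hedged, timed_out

print(asyncio.run(demo()))  # ('attempt 2', 'degraded')
```

In the first call the duplicate wins the race; in the second, both attempts are slow, so the overall timeout fires and the caller receives the degraded fallback. Note that hedging adds load exactly when the backend is already slow, so the fraction of hedged requests is worth capping and monitoring.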

Go further

Why does the tail get worse at scale?

If a single request has a 1% chance of hitting the slow path, a fan-out request that depends on 10 sub-requests has roughly a 10% chance (1 − 0.99^10 ≈ 9.6%, assuming independent sub-requests). This is tail amplification. Retrieval pipelines (BM25 + vector + rerank + LLM) compose 4+ stages; each stage's P99 contributes to the end-to-end P99, not the median. This is why median latency is a near-useless production metric for compound systems.
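The amplification can be checked with a small Monte Carlo simulation. The sketch below assumes four independent stages with log-normal latency, tuned so each stage has a median near 100ms and a P99 near 500ms. Both the distribution and the independence are modelling assumptions, and all names are illustrative.

```python
import math
import random

STAGE_P50_MS = 100.0
STAGE_P99_MS = 500.0
Z_99 = 2.326  # approximate 99th-percentile z-score of the standard normal

def percentile(samples, p):
    """Nearest-rank percentile (illustrative helper)."""
    ordered = sorted(samples)
    return ordered[math.ceil(p * len(ordered) / 100) - 1]

def simulate(n_requests=20000, n_stages=4, seed=7):
    """Return (end-to-end P50, end-to-end P99, fraction of requests in which
    at least one stage exceeded its own P99)."""
    rng = random.Random(seed)
    mu = math.log(STAGE_P50_MS)
    sigma = math.log(STAGE_P99_MS / STAGE_P50_MS) / Z_99
    totals = []
    any_slow = 0
    for _ in range(n_requests):
        stages = [rng.lognormvariate(mu, sigma) for _ in range(n_stages)]
        totals.append(sum(stages))
        any_slow += any(s > STAGE_P99_MS for s in stages)
    return percentile(totals, 50), percentile(totals, 99), any_slow / n_requests

e2e_p50, e2e_p99, frac_any_slow = simulate()
print(f"end-to-end P50: {e2e_p50:.0f} ms")
print(f"end-to-end P99: {e2e_p99:.0f} ms")
print(f"requests with at least one slow stage: {frac_any_slow:.1%}")
```

Under these assumptions the end-to-end P99 lands well above four times the stage median, and close to 4% of requests touch at least one stage's slow path. Correlated slow paths (a shared host, a shared GC pause) would push the end-to-end tail higher than this independent model suggests.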

How tight should retrieval P99 be?

Depends on the surface. Interactive chat: end-to-end P99 < 3s, with retrieval allocated ~500ms budget. Voice / agent loops: end-to-end P99 < 1s, retrieval < 200ms. Background batch: P99 in seconds is fine. Allocate budget per stage and measure each stage's P99 separately.
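One way to keep per-stage budgets honest is a small check that runs against your measured P99s. The sketch below is illustrative: the 3s end-to-end target and 500ms retrieval budget come from the interactive-chat guidance above, while the `rerank` and `llm` budgets and all observed values are invented for the example.

```python
def check_budgets(end_to_end_budget_ms, stage_budgets_ms, observed_p99_ms):
    """Return a list of human-readable violations; an empty list means the
    allocation is consistent and every stage is within its P99 budget."""
    problems = []
    allocated = sum(stage_budgets_ms.values())
    if allocated >= end_to_end_budget_ms:
        problems.append(
            f"stage budgets sum to {allocated}ms, "
            f"not less than the {end_to_end_budget_ms}ms end-to-end budget"
        )
    for stage, budget in stage_budgets_ms.items():
        observed = observed_p99_ms.get(stage)
        if observed is None:
            problems.append(f"{stage}: no P99 measurement")
        elif observed > budget:
            problems.append(f"{stage}: P99 {observed}ms exceeds budget {budget}ms")
    return problems

# Interactive chat: 3s end to end, ~500ms for retrieval; other numbers are hypothetical.
budgets = {"retrieval": 500, "rerank": 300, "llm": 2000}
observed = {"retrieval": 620, "rerank": 180, "llm": 1900}

print(check_budgets(3000, budgets, observed))
# ['retrieval: P99 620ms exceeds budget 500ms']
```

Requiring the stage budgets to sum to less than the end-to-end target is a conservative rule of thumb; it leaves headroom for queueing and network hops that no single stage owns.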
