Also known as: P99, P95, tail latency, latency percentiles
TL;DR
P50 is the median; P95 and P99 are the 95th and 99th percentile latencies. The tail is what wakes oncall, not the median — a 200ms median with a 5s P99 means 1% of users see your system as broken.
In production retrieval, latency is not a single number. It’s a distribution, and the shape of the tail is what determines whether your system feels fast or broken. Two systems with identical 200ms median latency can feel completely different: one with a 250ms P99, one with a 5s P99. The first feels rock-solid; the second has 1% of requests that hang.
What the percentiles mean
P50 (median). Half of requests are faster than this. The “average user experience” — but only loosely.
P95. 95% of requests are faster. 5% are slower. At 100 requests/sec, that’s 5/sec hitting the slow path.
P99. 99% of requests are faster. 1% are slower. At enterprise scale (10K rps), 100/sec are slow.
P999. 0.1% are slower. Still meaningful at internet scale.
A “good” system has the tail close to the median. A 200ms P50 with a 250ms P99 means almost every user has a similar experience. A 200ms P50 with a 5s P99 means a meaningful fraction of users are seeing a different, broken-feeling system.
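To make the definitions concrete, here is a minimal sketch that computes percentiles from raw samples with the nearest-rank method. The `percentile` helper and the sample latencies are hypothetical, chosen only for illustration; production systems usually compute percentiles from histograms or sketches rather than by sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 that hit a slow path.
latencies_ms = [200] * 98 + [5000] * 2

p50 = percentile(latencies_ms, 50)  # median
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"P50={p50}ms P95={p95}ms P99={p99}ms")
```

With these made-up samples the median and P95 are both 200ms, while P99 lands on the 5s slow path: the tail is invisible until you look at a high enough percentile.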
The fan-out amplification
A retrieval pipeline composed of 4 stages, each with P50 = 100ms and P99 = 500ms, has an end-to-end P50 of roughly 400ms. Its end-to-end P99, however, sits far above the 400ms that 4× the P50 would suggest, and it can approach the 2s sum of the stage P99s when slow paths are correlated. The tail compounds. This is why you can’t reason about compound system latency from medians.
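A simple model: if each stage’s slow path fires independently with probability 1%, the chance that at least one of the 4 stages hits its slow path on a given request is roughly $1 - 0.99^4 \approx 3.9\%$. Your effective tail percentile shifts: a slow path that is a P99 event for a single stage is closer to a P96 event end to end.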
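The simple model can be checked numerically. The sketch below assumes a deliberately crude two-point latency model (every stage takes either a fast 100ms or a slow 500ms, independently); the function names and parameters are hypothetical, and real stage latencies are continuous and often correlated.

```python
import random

def p_any_slow(p_slow, n_stages):
    """Chance that at least one of n independent stages hits its slow path."""
    return 1 - (1 - p_slow) ** n_stages

def simulate_pipeline(n_requests, n_stages=4, p_slow=0.01,
                      fast_ms=100, slow_ms=500, seed=0):
    """Two-point model: each stage takes fast_ms, or slow_ms with prob p_slow."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_requests):
        total = sum(slow_ms if rng.random() < p_slow else fast_ms
                    for _ in range(n_stages))
        totals.append(total)
    return sorted(totals)

totals = simulate_pipeline(100_000)
median = totals[len(totals) // 2]
p99 = totals[int(0.99 * len(totals))]
# Fraction of requests where at least one stage was slow.
slow_fraction = sum(t > 400 for t in totals) / len(totals)

print(f"closed form: {p_any_slow(0.01, 4):.4f}, simulated: {slow_fraction:.4f}")
print(f"end-to-end median={median}ms, P99={p99}ms")
```

Under these independence assumptions, about 4% of requests include a slow stage, so the end-to-end P99 already sits on the slow path even though each stage is slow only 1% of the time. Correlated slow paths (several stages slow on the same request) would push the end-to-end tail higher still.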
Latency distributions are typically log-normal or even heavier-tailed: most requests cluster near the median, but a small fraction extend far into the tail. For a heavy-tailed distribution, the mean is dragged up by the rare slow paths and stops reflecting any user’s actual experience; no one waited the mean amount of time. Worse, two operationally different distributions can share an identical mean: a system with uniform 320ms latency and a system with a 280ms median plus a 5s P99 can both report a 320ms mean. Engineers who chase the mean end up optimizing the wrong target. Percentiles stay meaningful under the heavy-tail shape, which is why production observability dashboards typically report P50, P95, and P99 as first-class.
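A small numeric sketch of the identical-mean problem. Both sample sets are hypothetical and were picked so the arithmetic is exact: one system is uniformly 320ms, the other has a 280ms median with 2% of requests at 2280ms.

```python
import math
import statistics

def p99(samples):
    """Nearest-rank P99 over raw samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(99 * len(ordered) / 100) - 1]

# Hypothetical latencies in ms for two systems, 100 requests each.
steady = [320] * 100
spiky = [280] * 98 + [2280] * 2

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(f"{name}: mean={statistics.mean(samples)}ms "
          f"median={statistics.median(samples)}ms P99={p99(samples)}ms")
```

Both systems report a 320ms mean, yet the second one's P99 is 2280ms. A dashboard that shows only the mean cannot tell them apart.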
Optimize the median for cost. Optimize the tail for trust.
What lives in the tail
In production retrieval pipelines, tail events are typically:
Sources of P99 spikes
Cold cache — first request for an embedding or document hits disk; subsequent reads are RAM-fast.
GC and index merges — periodic background work in the index (Lucene segment merges, ANN graph rebuilds) blocks foreground reads.
Reranker batch wait — continuous batching has variable wait time depending on traffic patterns.
Cross-AZ network — most calls are intra-AZ; about 0.1% take the cross-AZ path and are roughly an order of magnitude slower.
Token-by-token LLM jitter — output tokens come at variable speed. A query that needs 100 output tokens is fine; one that needs 1000 spikes the latency.
Auto-scaling cold start — the first request that lands on a freshly provisioned replica pays the model-load cost.
How to operate against the tail
Measure percentiles, not averages. Your dashboard should show P50, P95, P99 as first-class. Hide the mean — it’s misleading for fat-tailed distributions.
Budget per stage. If your end-to-end P99 budget is 1s, allocate explicit per-stage budgets that sum to less. Track each stage’s P99 separately.
Hedging. For idempotent reads (vector search, reranker calls), issue a duplicate request after a short delay if the first hasn’t returned, and take whichever response arrives first. This can cut P99 substantially for a modest increase in request volume.
Timeouts and fallbacks. Every external call has a timeout shorter than your budget. When it fires, fall back to a degraded result rather than blocking.
Caching. A cache hit eliminates not just the median latency but the tail: cached results don’t have GC pauses or cold-cache surprises.
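The hedging and timeout-with-fallback practices above can be combined in one small wrapper. This is a minimal asyncio sketch, not a production client: `hedged_call`, `flaky_search`, and the delay values are all hypothetical, and it assumes the wrapped call is an idempotent read that is safe to issue twice.

```python
import asyncio

async def hedged_call(make_call, hedge_after_s, timeout_s, fallback):
    """Run make_call(); send one duplicate if it is still pending after hedge_after_s.

    Returns the first result to arrive, or `fallback` once timeout_s has elapsed.
    If the first call to finish raised, that exception propagates to the caller.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    tasks = {asyncio.ensure_future(make_call())}
    try:
        # Give the primary a head start before spending a second request.
        done, _ = await asyncio.wait(tasks, timeout=min(hedge_after_s, timeout_s))
        remaining = deadline - loop.time()
        if not done and remaining > 0:
            tasks.add(asyncio.ensure_future(make_call()))  # the hedge
            done, _ = await asyncio.wait(
                tasks, timeout=remaining, return_when=asyncio.FIRST_COMPLETED)
        if done:
            return done.pop().result()
        return fallback  # budget exhausted: degrade instead of blocking
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished

async def _demo():
    calls = 0

    async def flaky_search():
        # Hypothetical backend: the first call stalls, later calls are fast.
        nonlocal calls
        calls += 1
        n = calls
        await asyncio.sleep(1.0 if n == 1 else 0.01)
        return f"result from call {n}"

    return await hedged_call(flaky_search, hedge_after_s=0.05,
                             timeout_s=0.5, fallback="degraded result")

result = asyncio.run(_demo())
print(result)
```

In the demo the primary request stalls, the hedge fires after 50ms, and the duplicate's answer is returned well inside the 500ms budget. If both requests were slow, the caller would get the fallback value at the deadline. The hedge delay is a tuning choice: a shorter delay trims more of the tail but sends more duplicate traffic.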
Go further
Why does the tail get worse at scale?
If a single request has a 1% chance of hitting the slow path, a fan-out request that depends on 10 independent sub-requests has roughly a 10% chance (1 − 0.99^10 ≈ 9.6%). This is tail amplification. Retrieval pipelines (BM25 + vector + rerank + LLM) compose 4+ stages, and it is each stage's P99, not its median, that drives the end-to-end P99. This is why median latency is a near-useless production metric for compound systems.
GC pauses in the index. Cold-cache reads from disk. Cross-AZ network blips. Embedding model batch backpressure. ANN index merges. Reranker batch starvation when traffic is bursty. Each is silent when measured by averages; each is a P99 tail event. Observability that captures per-stage percentiles is the only way to find them.