Throughput (Tokens per Second)

Also known as: tokens per second, tps, tok/s

TL;DR

Tokens per second per GPU is the production planning metric for LLM serving. Throughput scales with batch size up to a memory-bound ceiling, then plateaus. It is the key number for capacity planning, autoscaling, and unit-economic analysis.

Concretely, throughput here means tokens generated per second per GPU. How many GPUs you need, how much each token costs, and how long users wait all derive from this number.

What’s a “good” number?

Order of magnitude, on H100s in 2026:

Reference numbers

7B dense model, fp16, batch 1. ~80-120 tok/s. This is the floor: with no batching, every token pays for a full pass over the weights.
7B dense, fp16, large batch. 2,000-5,000 tok/s with continuous batching, depending on prompt/generation balance.
70B dense, fp16, batch 1. ~20-30 tok/s.
70B dense, large batch. 600-1,500 tok/s.
Quantized (int8/int4) variants. Roughly 1.5-2× of fp16.
Speculative decoding. Another 2-4× on top of batched throughput.

Different stacks (vLLM, TGI, TensorRT-LLM, SGLang) hit different points; benchmarks vary by ~30% between them on identical hardware.

Why batching helps so much

LLM inference is memory-bound. For each forward pass, the GPU has to load tens of GB of weights from HBM into compute units; the actual matrix multiplies are fast in comparison. With batch 1, you load all weights to produce one token. With batch 32, you load the same weights and produce 32 tokens — same memory traffic, 32× the work.

This is why throughput scales nearly linearly with batch size up to a ceiling, then plateaus. The ceiling is set by:

VRAM for the KV cache. Each in-flight request has a KV cache proportional to its current sequence length. With 70B models, KV cache is the bottleneck — typical limits are 32-256 concurrent sequences.
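Compute saturation. At very large batches, you stop being memory-bound and start being compute-bound. Each additional sequence in the batch then adds compute time roughly in proportion, so the marginal token is no longer nearly free.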
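To make the KV-cache limit concrete, here is a rough sizing sketch. The model shape (80 layers, 8 KV heads, head dimension 128, fp16 cache) is an assumed Llama-2-70B-like configuration, and the 100 GB KV budget is a made-up figure for illustration; neither comes from a specific deployment.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector
    per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(free_vram_bytes: float, seq_len: int,
                             bytes_per_token: int) -> int:
    """How many sequences of `seq_len` tokens fit in the VRAM left for KV cache."""
    return int(free_vram_bytes // (seq_len * bytes_per_token))


# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

# Assumed budget: 100 GB of VRAM left for KV cache after loading the weights.
free_vram = 100e9

for seq_len in (2048, 8192):
    n = max_concurrent_sequences(free_vram, seq_len, per_token)
    print(f"seq_len={seq_len}: {per_token} B/token, ~{n} concurrent sequences")
```

Under these assumptions the result is about 149 sequences at 2,048 tokens and 37 at 8,192, which is why longer contexts shrink the achievable batch so quickly.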

The honest method is empirical: spin up the inference engine, ramp concurrent requests, and watch the throughput curve. The shape is consistent — a roughly linear region, a knee, then a plateau (sometimes followed by a small dip as scheduler overhead grows).
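Once you have a ramp of measurements, locating the knee can be automated. The sketch below is one simple heuristic, not a standard method: it flags the first concurrency level after which the marginal gain drops below a chosen fraction of the initial slope. The `find_knee` helper, the `min_gain` threshold, and the sample measurements are all hypothetical.

```python
def find_knee(points, min_gain=0.5):
    """Return the concurrency after which adding load stops paying off.

    points: list of (concurrency, tokens_per_second), sorted by concurrency.
    min_gain: fraction of the initial slope below which a step counts as plateau.
    """
    (c0, t0), (c1, t1) = points[0], points[1]
    initial_slope = (t1 - t0) / (c1 - c0)
    for (ca, ta), (cb, tb) in zip(points[1:], points[2:]):
        slope = (tb - ta) / (cb - ca)
        if slope < min_gain * initial_slope:
            return ca
    return points[-1][0]  # never flattened within the measured range


# Made-up measurements shaped like the curve described above:
# linear region, knee, plateau, small dip.
measurements = [(1, 25), (2, 50), (4, 98), (8, 190), (16, 360),
                (32, 640), (64, 900), (128, 960), (256, 940)]

print("knee at concurrency", find_knee(measurements))  # prints: knee at concurrency 32
```

Running close to the knee is usually the interesting operating point: past it, extra concurrency mostly adds latency rather than throughput.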

The analytic shortcut: model bandwidth-bound throughput as $T(b) \approx \dfrac{b \cdot \mathrm{BW}}{W + b \cdot c}$, where $b$ is the batch size, $\mathrm{BW}$ is HBM bandwidth (~3 TB/s on H100), $W$ is the weight memory load per forward pass, and $c$ is the per-position cost. For a 70B fp16 model $W$ is ~140 GB; at batch 1 you’re loading 140 GB per generated token, which is exactly why batch-1 throughput is so dismal. At batch 32 you’ve amortized that to ~4.4 GB per token, but the KV-cache term has grown.

The cleanest way to push the ceiling is to lower $W$ directly (quantization) or to amortize $W$ across more in-flight tokens (continuous batching, speculative decoding, longer generations). Bigger machines with more HBM raise the ceiling but never bend the curve’s shape.
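The bandwidth-bound picture can be turned into a few lines of arithmetic. This is a sketch under stated assumptions: each decode step reads the weights once plus one KV cache per in-flight sequence, and emits one token per sequence. The per-sequence KV figure (0.67 GB, roughly 2k tokens of context at about 320 KiB per token) is an assumption, and the model ignores compute limits, VRAM capacity, and multi-GPU sharding, so treat the output as an upper bound rather than a prediction.

```python
def bandwidth_bound_tps(batch, weight_bytes, kv_bytes_per_seq, hbm_bytes_per_s):
    """Upper bound on decode tokens/s when every step is limited by HBM reads.

    Each decode step reads the weights once plus each sequence's KV cache,
    and emits one token per sequence in the batch.
    """
    bytes_per_step = weight_bytes + batch * kv_bytes_per_seq
    step_seconds = bytes_per_step / hbm_bytes_per_s
    return batch / step_seconds


WEIGHT_BYTES = 140e9       # 70B parameters at 2 bytes each (fp16)
HBM_BYTES_PER_S = 3e12     # ~3 TB/s
KV_BYTES_PER_SEQ = 0.67e9  # assumption: ~2k tokens of context per sequence

for batch in (1, 8, 32, 128, 512):
    tps = bandwidth_bound_tps(batch, WEIGHT_BYTES, KV_BYTES_PER_SEQ, HBM_BYTES_PER_S)
    weights_per_token_gb = WEIGHT_BYTES / batch / 1e9
    print(f"batch={batch:4d}  weights/token={weights_per_token_gb:7.2f} GB  "
          f"bound={tps:8.1f} tok/s")

# As batch grows, the bound approaches HBM_BYTES_PER_S / KV_BYTES_PER_SEQ: the plateau.
```

At batch 1 the bound is about 21 tok/s, in line with the 70B batch-1 figure quoted earlier. Lowering `WEIGHT_BYTES` (quantization) lifts the low-batch end of the curve, while the plateau is governed by the KV term.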

Prefill vs decode

A request has two phases:

Prefill. Process the entire prompt in parallel — fast, compute-bound. ~1000s of tok/s/GPU.
Decode. Generate output tokens one at a time — slow, memory-bound. ~30-100 tok/s/GPU at batch 1.

Production throughput numbers usually mean decode throughput, since prefill is rarely the bottleneck. But long prompts with short outputs are prefill-dominated, and you need different tuning. Vendors increasingly report prefill and decode separately.
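A quick way to see which phase dominates a workload is to split a request's time into its two parts. The rates below (5,000 tok/s for prefill, 50 tok/s for decode) are assumed values picked from within the ranges above, and the two workload shapes are hypothetical.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds) for a single request."""
    prefill = prompt_tokens / prefill_tps
    decode = output_tokens / decode_tps
    return prefill, decode


PREFILL_TPS = 5000.0  # assumed prefill rate, tokens/s
DECODE_TPS = 50.0     # assumed per-request decode rate, tokens/s

workloads = {
    "chat (200 in, 500 out)": (200, 500),
    "long prompt (8000 in, 20 out)": (8000, 20),
}

for name, (prompt, output) in workloads.items():
    prefill, decode = request_seconds(prompt, output, PREFILL_TPS, DECODE_TPS)
    share = prefill / (prefill + decode)
    print(f"{name}: prefill {prefill:.2f}s, decode {decode:.2f}s, "
          f"prefill share {share:.0%}")
```

With these numbers the chat-style request spends well under 1% of its time in prefill, while the long-prompt request spends about 80% there, so the same decode-focused tuning will not serve both.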

Throughput vs latency

Bigger batches improve throughput but worsen per-request latency — each request waits in queue longer, and competes for the same KV cache memory. The fundamental tradeoff:

Small batch (low concurrency). Low latency, low throughput. Expensive per token.
Large batch (high concurrency). High throughput, higher latency. Cheap per token.

Continuous batching (vLLM-style) finds a much better Pareto frontier than naive static batching by letting requests join and leave the in-flight batch dynamically.
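A toy simulation shows where the gain comes from. This is a simplified sketch, not vLLM's actual scheduler: it counts decode steps only, assumes every active slot emits one token per step, and ignores prefill and memory limits. The request lengths are made up.

```python
from collections import deque


def static_batch_steps(lengths, slots):
    """Static batching: each group of `slots` requests runs until its
    longest member finishes, so short requests leave slots idle."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batch_steps(lengths, slots):
    """Continuous batching: a finished request frees its slot immediately
    and the next queued request joins on the following step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1  # one decode step: every active request emits one token
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical output lengths: a mix of long and short generations.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
total_tokens = sum(lengths)

for name, fn in (("static", static_batch_steps), ("continuous", continuous_batch_steps)):
    steps = fn(lengths, slots=4)
    print(f"{name}: {steps} steps, {total_tokens / steps:.2f} tokens/step")
```

On this input static batching needs 200 steps and continuous batching 110 for the same 260 tokens, because the second long request starts as soon as a slot frees up instead of waiting for the first group to drain.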

Why it matters for rerankers

A reranker is called on every retrieval, so its tokens/sec/GPU sets the number of concurrent users per GPU dollar. Sub-1B specialized cross-encoders sit in the sweet spot: big enough to encode strong relevance signal, small enough that an H100 sustains thousands of relevance judgments per second.

Go further

Why is tokens per second per GPU the right unit, not requests per second?

Different requests have different token counts (prompt + generation). A query with a 100-token answer is much cheaper than one with 1000. Tokens/sec normalizes against this; requests/sec doesn't. For capacity planning and unit economics, tokens/sec is what aligns with cost (you pay for GPU time, which produces tokens).
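The link between tokens/sec and cost is simple arithmetic. The sketch below assumes a hypothetical GPU price of $2.50 per hour, a hypothetical peak demand, and a 70% utilization headroom; none of these come from a real price list or deployment.

```python
import math


def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Dollars per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


def gpus_needed(peak_tokens_per_second, tokens_per_second_per_gpu, headroom=0.7):
    """GPUs required to serve peak demand while running each GPU at
    `headroom` of its measured throughput."""
    return math.ceil(peak_tokens_per_second / (tokens_per_second_per_gpu * headroom))


GPU_PRICE = 2.50  # hypothetical $/GPU-hour

for label, tps in (("batch 1", 25), ("large batch", 1000)):
    print(f"{label}: ${cost_per_million_tokens(GPU_PRICE, tps):.2f} per 1M tokens")

print("GPUs for 20,000 tok/s peak:", gpus_needed(20_000, 1000))
```

Under these assumptions the same GPU costs about $27.78 per million tokens at 25 tok/s and about $0.69 at 1,000 tok/s, a 40× gap that requests/sec alone would hide.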


What sets the ceiling on throughput?

Memory bandwidth, almost always. LLM inference is memory-bound — most GPU time is spent moving weights from HBM to compute units, not computing. Batching amortizes this across requests, but eventually you saturate the bandwidth or run out of VRAM for the KV cache. That's the ceiling.


Throughput vs latency — can you have both?

Partially. Bigger batches give more throughput but worse per-request latency (each request waits longer in the batch). Continuous batching and speculative decoding are the techniques that improve one without ruining the other. Production serving stacks (vLLM, TGI, TensorRT-LLM) put substantial engineering into the Pareto frontier.

