Also known as: GPU memory, HBM SRAM hierarchy, memory hierarchy
TL;DR
Modern GPUs have three relevant memory levels: HBM (slow, abundant), SRAM (fast, tiny), and registers (fastest, tiniest). HBM bandwidth is roughly 1-3 TB/s; on-chip SRAM bandwidth is closer to 20 TB/s.
A GPU is not a uniform pile of memory feeding a uniform pile of compute. It’s a three-tier hierarchy (HBM, SRAM, registers) with bandwidths that differ by roughly an order of magnitude per step and capacities that differ by roughly three orders of magnitude. Almost every interesting LLM-serving optimization in the last five years is a story about which tier the hot data lives in and how to keep it there. Reading the rest of the performance-engineering catalogue without this mental model is reading symptoms without the cause.
The three tiers, with numbers
H100 reference numbers (2026)
HBM3. ~80 GB capacity. ~3.35 TB/s bandwidth. The “main memory” of the GPU. Holds model weights, the KV cache, optimizer state during training. Slow relative to compute.
A100 numbers are the same shape, lower magnitudes (HBM ~2 TB/s, SRAM ~19 TB/s). Newer Hopper variants and Blackwell (B200) push HBM to 8 TB/s and add another tier of NVLink-coherent memory, but the basic ratio — SRAM ~6-10x faster than HBM, with ~1000x less capacity — has held for a decade.
Why the ratio matters
Divide an H100’s 989 FP16 TFLOPS by its 3.35 TB/s of HBM bandwidth and you get roughly 300 FLOPs per byte: a kernel has to perform about 300 floating-point operations on every byte it pulls from HBM before the tensor cores, rather than memory, become the limit. If the data is in SRAM at ~20 TB/s, the same compute needs only about 50 FLOPs per byte, roughly 6x more headroom.
Translation: a kernel that streams its operands from HBM with little reuse (every multiply-add waits on global memory) runs at a small fraction of peak FLOPS. A kernel that fits its operands in SRAM and reuses them many times can saturate the tensor cores. The art of GPU kernel writing is mostly the art of keeping the working set in SRAM long enough.
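The ceiling described above is the roofline bound: attainable throughput is the smaller of peak compute and arithmetic intensity times bandwidth. Here is a minimal sketch using the approximate H100 figures quoted in this section. The function names and the sample intensities are illustrative assumptions, not measurements.

```python
# Roofline sketch: attainable throughput = min(peak, intensity * bandwidth).
# Hardware figures are the approximate H100 numbers quoted in the text.
PEAK_TFLOPS = 989.0   # FP16 tensor-core peak, TFLOPS
HBM_TBPS = 3.35       # HBM3 bandwidth, TB/s
SRAM_TBPS = 20.0      # approximate on-chip SRAM bandwidth, TB/s


def attainable_tflops(flops_per_byte: float, bandwidth_tbps: float) -> float:
    """Upper bound on sustained TFLOPS for a kernel with the given intensity."""
    # (FLOPs / byte) * (TB / s) = TFLOPS
    return min(PEAK_TFLOPS, flops_per_byte * bandwidth_tbps)


def ridge_point(bandwidth_tbps: float) -> float:
    """FLOPs per byte a kernel needs before compute, not memory, is the limit."""
    return PEAK_TFLOPS / bandwidth_tbps


if __name__ == "__main__":
    print(f"HBM ridge point:  {ridge_point(HBM_TBPS):.0f} FLOPs/byte")
    print(f"SRAM ridge point: {ridge_point(SRAM_TBPS):.0f} FLOPs/byte")
    # Hypothetical kernel intensities, chosen only to span the roofline.
    for intensity in (1, 30, 300):
        frac = attainable_tflops(intensity, HBM_TBPS) / PEAK_TFLOPS
        print(f"intensity {intensity:>3} FLOPs/byte: {frac:.1%} of peak from HBM")
```

A kernel below the ridge point is memory-bound no matter how well its math is scheduled. Raising data reuse is the only way to move it right along the roofline.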
At decode, you process one new token at a time per sequence. The forward pass loads every weight in the model (140 GB at fp16 for a 70B model) to produce one token’s worth of output. Each 2-byte weight takes part in one multiply-add (2 FLOPs), so arithmetic intensity is ~1 FLOP per byte. You’re spending ~99 percent of the time waiting on HBM and ~1 percent doing math.
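To put a number on that wait, here is a back-of-the-envelope sketch of the latency floor for a single decode step. It assumes every weight is read from HBM exactly once per token, and it treats the model as if it sat behind one 3.35 TB/s memory system. That is a simplification: a 140 GB model would in practice be sharded across several GPUs, which raises aggregate bandwidth. The function name is hypothetical.

```python
def decode_floor_ms(params: float, bytes_per_param: float, hbm_tbps: float) -> float:
    """Lower bound on milliseconds per decode step when every weight is read once."""
    weight_bytes = params * bytes_per_param
    seconds = weight_bytes / (hbm_tbps * 1e12)  # TB/s -> bytes/s
    return seconds * 1e3


if __name__ == "__main__":
    # 70B parameters at fp16 (2 bytes each), H100-class HBM bandwidth.
    floor = decode_floor_ms(70e9, 2.0, 3.35)
    print(f"memory floor: {floor:.1f} ms per token")
    print(f"ceiling:      {1e3 / floor:.0f} tokens/s for a single sequence")
```

No amount of extra compute lowers this floor. Only moving fewer bytes or sharing each load across more tokens does.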
Batching is the rescue: with 32 sequences in flight, you load weights once and produce 32 tokens. Arithmetic intensity rises by 32x; the GPU starts doing useful compute. This is why continuous batching and PagedAttention are so load-bearing — they exist to amortize the HBM trip across many concurrent sequences. The KV cache is the limit: each in-flight sequence consumes its own KV bytes, and at long context that runs out before compute does.
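The amortization argument can be made concrete with a rough sketch of arithmetic intensity as a function of batch size. It counts one multiply-add (2 FLOPs) per weight per sequence and ignores attention FLOPs. The per-sequence KV figure is an assumed value for illustration (about 2.7 GB, roughly an 8K-token context on a 70B-class model with grouped-query attention), not a measurement.

```python
def decode_intensity(batch: int, params: float,
                     bytes_per_param: float = 2.0,
                     kv_bytes_per_seq: float = 0.0) -> float:
    """FLOPs per HBM byte for one decode step over `batch` sequences.

    Weights are loaded once and shared by the whole batch; each sequence
    additionally reads its own KV cache, which is not shared.
    """
    flops = 2.0 * params * batch
    bytes_moved = params * bytes_per_param + batch * kv_bytes_per_seq
    return flops / bytes_moved


if __name__ == "__main__":
    params = 70e9
    kv_per_seq = 2.68e9  # assumed KV bytes per sequence, illustrative only
    for batch in (1, 8, 32, 128):
        weights_only = decode_intensity(batch, params)
        with_kv = decode_intensity(batch, params, kv_bytes_per_seq=kv_per_seq)
        print(f"batch {batch:>3}: {weights_only:6.1f} FLOPs/byte (weights only), "
              f"{with_kv:5.1f} with KV reads")
```

With KV traffic included, intensity grows more slowly than batch size, because the weight load is amortized but the per-sequence KV reads are not.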
The asymmetry between prefill and decode comes from the same root. Prefill processes hundreds of tokens in parallel, hits arithmetic intensity in the hundreds, and runs near peak FLOPS. Decode is one-token-at-a-time and runs at HBM-bandwidth ceiling. Any production stack reports them separately because they live on opposite sides of the roofline.
What this implies for the stack
Almost every modern serving optimization is a memory-hierarchy trick translated into kernels.
FlashAttention. Tiles Q, K, V into SRAM-sized blocks; never materializes the N x N attention matrix in HBM. The entire 2-4x speedup is HBM traffic saved.
Kernel fusion. Combines two operations into one kernel so the intermediate doesn’t round-trip through HBM. Standard for layernorm + linear, GeLU + linear, RMSNorm + scaling.
KV-cache quantization (FP8, INT8). Shrinks HBM bytes per cached token by 2x relative to FP16; in the memory-bound regime, decode throughput can rise by up to the same factor when KV reads dominate the traffic.
MoE expert grouping. Loads only active experts’ weights from HBM per token, keeping the unused experts cold.
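The tiling idea behind the FlashAttention item in the list above rests on an online softmax: scores are processed one block at a time while a running maximum, a running normalizer, and a running output are rescaled as needed, so the full score row never has to be stored. What follows is a pure-Python sketch of that arithmetic for a single query. It is not a GPU kernel, it omits the usual 1/sqrt(d) scaling, and the function names and block size are illustrative assumptions.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(q, keys, values))
    print(tiled_attention(q, keys, values, block=2))
```

On a GPU the block would be sized to fit in shared memory, so each tile of K and V is read from HBM once and the intermediate scores never leave the chip.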
How to think about it
When a question is “why is X slow on a GPU,” the first model to reach for is: where does the data live, and how often does it cross between tiers? That single question answers most of the catalog. Arithmetic intensity formalizes it; the roofline model visualizes it; MFU quantifies how close you got to the ceiling. The hierarchy is the cause; everything else is consequence.
Go further
Why is HBM the bottleneck and not compute?
An H100 offers ~989 TFLOPS at FP16 and ~3.35 TB/s of HBM bandwidth. Per byte loaded from HBM, you have ~300 FLOPs to spend before compute becomes the bottleneck. A standard GEMM in LLM decode loads the entire weight matrix per token but only does one multiply-add per weight, for an arithmetic intensity around 1-2. You spend almost all the time waiting on HBM reads while the tensor cores idle. This is the memory-bound regime.
An H100 has roughly 50 MB of L2 cache shared across the chip and ~228 KB of shared memory (programmer-managed SRAM) per streaming multiprocessor. Tiny by host-RAM standards, but with 20 TB/s of bandwidth — 6-10x faster than HBM. Any computation you can rearrange to keep its hot data in SRAM rather than HBM gets a multiplicative speedup. FlashAttention is the canonical example.
The KV cache lives in HBM. It’s far too large for SRAM: gigabytes per long-context sequence. The decode-step bottleneck is reading the cached K and V tensors from HBM into the compute units to score against the new query. Any optimization that shrinks the KV footprint (GQA, FP8 KV quantization, paged sharing of prefix blocks) shows up as decode-throughput gain because it reduces the HBM bytes moved per token.
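A rough footprint calculation shows why the cache cannot sit in SRAM and how GQA and FP8 shrink it. The model shape below (80 layers, 64 query heads, 8 KV heads, head dimension 128) is an assumed 70B-class configuration chosen for illustration, and the function name is hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of K and V cached for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V


if __name__ == "__main__":
    layers, head_dim = 80, 128   # assumed model shape, illustrative only
    context = 32_768             # assumed context length in tokens
    configs = {
        "MHA, fp16 (64 KV heads)": kv_bytes_per_token(layers, 64, head_dim, 2),
        "GQA, fp16 (8 KV heads)": kv_bytes_per_token(layers, 8, head_dim, 2),
        "GQA, fp8  (8 KV heads)": kv_bytes_per_token(layers, 8, head_dim, 1),
    }
    for name, per_token in configs.items():
        per_seq_gb = per_token * context / 1e9
        print(f"{name}: {per_token / 1024:.0f} KiB per token, "
              f"{per_seq_gb:.1f} GB per {context}-token sequence")
```

Every one of those bytes is read from HBM at each decode step for that sequence, so any reduction in per-token footprint is a direct reduction in bytes moved.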