The transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), is the neural-network architecture that underlies essentially every modern language model — LLMs, embedding models, rerankers, code models, vision-language models. The 2017 paper proposed it as a translation architecture; within five years it had displaced every prior approach across language tasks and most of computer vision.
What’s in the box
A transformer is a stack of identical “blocks”. Each block does two things, in order:
Self-attention. Every token’s representation gets updated by reading from every other token’s representation, weighted by a learned similarity (attention). This is the long-range information transfer step.
Feed-forward. Each token’s updated representation is then run through a small per-position MLP — the same MLP applied independently at each position. This adds non-linearity and depth.
Both are wrapped in residual connections and layer normalization. Stack 12 to 200 of these blocks, train on huge data, and you get GPT, Claude, BERT, T5, Llama — they all use this pattern with mostly cosmetic variations.
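As a rough illustration of the block described above, here is a minimal NumPy sketch: single-head self-attention, a per-position MLP, and residual connections with layer normalization. Several details are assumptions made for brevity rather than anything the text specifies: the pre-norm ordering (many recent models use it; the 2017 paper normalized after the residual), the single attention head, the ReLU activation, the layer norm without learned scale and bias, and the helper names (`transformer_block`, `init_params`) and toy dimensions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # (Learned scale and bias are omitted in this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv, Wo):
    # x has shape (seq_len, d_model). One head only, for brevity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v @ Wo

def feed_forward(x, W1, b1, W2, b2):
    # The same two-layer MLP is applied independently at every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # Pre-norm residual layout (an assumption of this sketch).
    x = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + feed_forward(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

def init_params(d_model, d_ff, rng, scale=0.02):
    # Hypothetical initializer: small random weights, zero biases.
    shapes = {
        "Wq": (d_model, d_model), "Wk": (d_model, d_model),
        "Wv": (d_model, d_model), "Wo": (d_model, d_model),
        "W1": (d_model, d_ff), "W2": (d_ff, d_model),
    }
    p = {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}
    p["b1"] = np.zeros(d_ff)
    p["b2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_blocks = 5, 16, 64, 3
x = rng.normal(size=(seq_len, d_model))

# "Identical" blocks share a structure; each has its own weights.
blocks = [init_params(d_model, d_ff, rng) for _ in range(n_blocks)]
for p in blocks:
    x = transformer_block(x, p)

print(x.shape)
```

The output keeps the input shape `(seq_len, d_model)`, which is what lets blocks be stacked to arbitrary depth.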
Why it works
Three reasons it dominated:
Long-range dependency. A pronoun at position 1000 can directly attend to the noun at position 7. RNNs had to pass that information through 993 hidden-state updates.
Parallelism. Every position’s representation can be computed in parallel during training. RNNs are inherently sequential.
Capacity at scale. Stacking transformers with more parameters and more data keeps producing better models — the empirical scaling laws have held for years.
Variants you’ll encounter
Decoder-only (GPT, Claude, Llama) — the LLM workhorse. Generates left-to-right; each token attends only to previous tokens (causal mask).
Encoder-only (BERT, embedding models) — produces representations of an input. Bidirectional attention; doesn’t generate.
Encoder-decoder (T5, original transformer) — encode source, decode target. Common in translation, summarization.
Cross-encoder (rerankers) — encoder-only, but the input is [query, document] concatenated. Scores the joint pair.
Vision transformer (ViT) — same blocks, but the input tokens are flattened image patches. Underpins most modern vision-language models.
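The decoder-only and encoder-only variants above differ mainly in which positions each token may attend to. The sketch below illustrates that with NumPy, using random scores in place of real query-key products; the helper name `attention_weights` and the toy sequence length are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, mask):
    # mask[i, j] is True when position i may attend to position j.
    # Disallowed pairs get -inf, so softmax assigns them zero weight.
    masked = np.where(mask, scores, -np.inf)
    return softmax(masked, axis=-1)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # stand-in for q @ k.T / sqrt(d)

# Encoder-style: every token sees every token.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)
# Decoder-style: token i sees only tokens 0..i (lower triangle).
causal = np.tril(bidirectional)

w_enc = attention_weights(scores, bidirectional)
w_dec = attention_weights(scores, causal)

print(np.round(w_enc, 2))
print(np.round(w_dec, 2))
```

In `w_dec` everything above the diagonal is exactly zero, so no token reads from a later position; every row of both matrices still sums to 1.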
What’s wrong with them
The killer is the quadratic cost of self-attention in sequence length. A million-token context isn’t free; computed naively, it’s a million-by-million attention matrix for every head in every layer. Production work-arounds (sparse attention, sliding windows, FlashAttention, MQA/GQA) cut memory traffic, shrink the KV cache, or skip most token pairs, but full attention still costs quadratic compute, which is why long-context LLMs are still expensive even when they’re available.
The fundamental cost is unchanged. What’s changed is the engineering around it: FlashAttention reorders the computation so the N x N matrix never materializes in slow memory, dropping memory cost from quadratic to linear; grouped-query attention shrinks the KV cache by sharing K/V heads across query heads; sliding-window or sparse-attention variants make most attention pairs zero by construction; and prompt-caching reuses the K/V state across calls so only the new tokens pay attention’s cost.
Stack these and a 1M-token context becomes feasible, but per-token compute and memory still scale roughly with the number of active attention pairs you keep, which is why long-context inference can remain 10-100x more expensive per generated token than short-context inference. The wall is bent, not broken.
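Some back-of-the-envelope arithmetic makes these costs concrete. The sketch below computes the size of one naively materialized attention matrix and of the KV cache with and without grouped-query attention. The model configuration (32 layers, 32 query heads, 8 K/V heads, head dimension 128, 2-byte elements) is hypothetical and chosen only for illustration; it does not describe any particular model.

```python
def attention_matrix_bytes(seq_len, bytes_per_elem=2):
    # One seq_len x seq_len score matrix, for a single head in a single layer.
    return seq_len * seq_len * bytes_per_elem

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (the factor of 2) stored for every layer,
    # every K/V head, and every token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

N = 1_000_000  # a 1M-token context

# Hypothetical configuration, for illustration only.
n_layers, n_query_heads, n_kv_heads, head_dim = 32, 32, 8, 128

naive_scores = attention_matrix_bytes(N)
# Standard multi-head attention: one K/V head per query head.
mha_cache = kv_cache_bytes(N, n_layers, n_query_heads, head_dim)
# Grouped-query attention: query heads share a smaller set of K/V heads.
gqa_cache = kv_cache_bytes(N, n_layers, n_kv_heads, head_dim)

print(f"naive score matrix, one head, one layer: {naive_scores / 1e12:.1f} TB")
print(f"KV cache, multi-head attention:          {mha_cache / 1e9:.1f} GB")
print(f"KV cache, grouped-query attention:       {gqa_cache / 1e9:.1f} GB")
```

The score matrix grows with the square of the context length, which is the term FlashAttention avoids writing to slow memory. The KV cache grows only linearly, and grouped-query attention shrinks it by the ratio of query heads to K/V heads (4x in this configuration).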
Go further
Why did transformers replace RNNs and LSTMs?
RNNs process tokens sequentially — token N can only see tokens 1..N-1 through a single hidden-state bottleneck. Transformers let every token attend directly to every other, which both captures long-range dependencies better and parallelizes across the sequence at training time. Modern hardware loves the second part.
What's the difference between encoder, decoder, and encoder-decoder transformers?
Encoders (BERT, embedding models, [bi-encoders](/concepts/bi-encoder/)) read a sequence and produce representations; they look both ways. Decoders (GPT, Claude, most LLMs) generate token-by-token, left to right. Encoder-decoders (T5, the original transformer) read a source and then generate a target, which is common in translation. All three are built from the same primitives; they differ mainly in masking, and encoder-decoders add cross-attention from the decoder to the encoder output.
Why do transformers struggle with very long sequences?
Naive self-attention is O(n²) in sequence length: both compute and memory grow quadratically. Various tricks (sliding-window attention, sparse attention, FlashAttention, grouped-query attention) reduce the constant or change the shape of this scaling, but it remains the dominant constraint on practical context window length.
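To see how a sliding window changes the shape of the scaling, the sketch below counts the query-key pairs that full causal attention and a causal sliding window each keep as the sequence grows. The window size of 256 and the helper names are assumptions for illustration.

```python
import numpy as np

def causal_mask(n):
    # Token i may attend to tokens 0..i.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Token i may attend only to the `window` most recent tokens, itself included.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

for n in (1_000, 2_000, 4_000):
    full_pairs = int(causal_mask(n).sum())
    local_pairs = int(sliding_window_mask(n, window=256).sum())
    print(f"n={n}: full causal pairs={full_pairs}, sliding-window pairs={local_pairs}")
```

Doubling `n` roughly quadruples the full causal count but only about doubles the sliding-window count. The trade-off is that tokens outside the window are no longer directly visible within a single layer.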