Mechanistic Interpretability

Also known as: mech interp, circuits, MI

TL;DR

Reverse-engineering neural networks at the level of circuits — small subgraphs of attention heads and MLP neurons that implement specific, identifiable computations.

Mechanistic interpretability is the project of reverse-engineering neural networks — taking a trained transformer and producing a human-readable description of how it computes its outputs, not just what its outputs are. The unit of explanation is the circuit: a small subgraph of attention heads and MLP neurons across a few layers, working together to implement a specific identifiable computation.

The bet is that despite the apparent complexity of a 70B-parameter model, much of its computation is factorizable into small, named subroutines that compose to produce its behavior. The bet is partially supported: the field has produced concrete circuits for concrete behaviors, and those circuits hold up under causal intervention.

Concrete results

Circuits the field has actually identified

Induction heads (Olsson et al., 2022). A pair of attention heads in two different layers that together implement “if the pattern [A][B] appeared earlier in the context, predict [B] after [A]”. The first, a previous-token head in the earlier layer, copies information about each token’s predecessor into that token’s position; the second, the induction head in the later layer, uses that information to attend to the token that followed the earlier occurrence of the current token. Induction heads emerge during training in a sharp phase transition that coincides with the emergence of in-context learning, which is strong evidence that they’re the mechanism for short-range in-context learning.
Indirect Object Identification (IOI) (Wang et al., 2022). The GPT-2 small circuit that completes “When John and Mary went to the store, John gave a drink to ___” with “Mary”. Involves ~26 heads across 5 layers, with named roles: name mover heads attend to candidate names; S-inhibition heads suppress the subject so it doesn’t get re-selected; duplicate-token heads detect that “John” appears twice. The circuit is testable: ablate the S-inhibition heads and the model starts predicting “John”.
Modular addition (Nanda et al., 2023). A small transformer trained to compute (a + b) mod p doesn’t learn a conventional arithmetic algorithm. It learns a Fourier representation, mapping each input x to sin(2πkx/p) and cos(2πkx/p) at a handful of frequencies k and using trig identities to add them. The circuit is fully reverse-engineered down to the weights.
Refusal directions (Arditi et al., 2024). In a fine-tuned chat model, the decision to refuse an unsafe request is mediated by a single direction in the residual stream. Project this direction out and the model stops refusing; clamp it high and the model starts refusing harmless requests. This is mechanistic alignment evaluation, not behavioral.
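
The modular-addition result above can be made concrete with a few lines of arithmetic. What follows is a minimal sketch of the trig-identity algorithm itself, not of the trained network: the function name, the modulus 113, and the frequency list are illustrative assumptions, and a real model spreads these steps across its embedding, attention, MLP, and unembedding weights.

```python
import math

def fourier_mod_add(a: int, b: int, p: int, freqs: list[int]) -> int:
    """Compute (a + b) mod p using only sines, cosines, and trig identities."""
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # "Embedding": each input becomes a point on the unit circle.
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities give cos(w(a+b)) and sin(w(a+b)).
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        # "Unembedding": this sum equals cos(w(a+b-c)), which reaches 1
        # only when c == (a + b) mod p (for prime p and k not divisible by p).
        for c in range(p):
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # The candidate with the largest summed score is the answer.
    return max(range(p), key=logits.__getitem__)

P = 113              # assumed modulus, chosen for illustration
FREQS = [1, 5, 17]   # arbitrary illustrative frequencies

print(fourier_mod_add(100, 50, P, FREQS))  # prints 37, i.e. 150 mod 113
```

No step ever adds the integers a and b directly; the sum only exists as a rotation angle, which is why the learned solution looks nothing like carry-based addition.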

The toolkit

A few specific techniques do most of the work:

Activation patching. Run the model on a clean prompt; cache every activation. Run on a corrupted prompt where the answer should change; cache again. Now replay the corrupted run, but at one specific layer/position swap in the cached clean activation. If the model recovers the clean answer, that activation was causally responsible. Iterate across layers and positions for a causal map.
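Attribution patching. A linear (first-order, gradient-based) approximation of activation patching that needs a constant number of network passes (two forward and one backward) instead of one forward pass per patched activation. Cheaper but lossy; useful for first-pass localization before zooming in with full patching.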
Path patching. Activation patching restricted to a specific path through the network — patch a head’s output only when it flows through a specific downstream component. Lets you decompose causality at the edge level, not just the node level.
Sparse autoencoders. SAEs give you a basis of monosemantic features in which to describe the activations being patched. Pre-SAE, you’d say “patching position 7 layer 8 changes the output”; post-SAE, you’d say “the ‘famous tourist landmark’ feature at position 7 layer 8 routes through the layer-9 attention head that writes to the prediction logits”.
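
To see how activation patching and its linear approximation relate, here is a minimal sketch on a hypothetical three-unit toy model (not a transformer). The weights, inputs, and function names are all assumptions made up for illustration, and the gradient is written out analytically where a real setup would get it from one backward pass.

```python
import math

# Hypothetical toy "model": 3 hidden units feeding a scalar output.
W = [[1.0, -0.5], [0.2, 0.8], [-0.7, 0.3]]  # hidden weights (assumed)
V = [1.5, -1.0, 0.5]                         # readout weights (assumed)

def hidden(x):
    """Pre-activation of each hidden unit: the activations we cache and patch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def readout(h):
    """Map (possibly patched) hidden activations to the scalar output."""
    return sum(v * math.tanh(hi) for v, hi in zip(V, h))

def activation_patch(h_clean, h_corrupt):
    """Exact effect of swapping in one clean activation at a time."""
    base = readout(h_corrupt)
    effects = []
    for i in range(len(h_corrupt)):
        patched = list(h_corrupt)
        patched[i] = h_clean[i]          # the swap: one unit, one rerun
        effects.append(readout(patched) - base)
    return effects

def attribution_patch(h_clean, h_corrupt):
    """First-order estimate: (clean - corrupt) * d(output)/d(activation)."""
    # Gradient of the output w.r.t. each activation, at the corrupted run.
    grads = [v * (1 - math.tanh(hi) ** 2) for v, hi in zip(V, h_corrupt)]
    return [(hc - hk) * g for hc, hk, g in zip(h_clean, h_corrupt, grads)]

h_clean = hidden([1.0, 0.5])
h_corrupt = hidden([0.8, 0.5])   # a small edit to the input
exact = activation_patch(h_clean, h_corrupt)
approx = attribution_patch(h_clean, h_corrupt)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"unit {i}: exact {e:+.4f}  linear estimate {a:+.4f}")
```

The exact version reruns the readout once per unit, while the estimate reuses a single set of gradients. The two agree closely here because the clean and corrupted activations are near each other; the approximation degrades as that gap grows or the nonlinearity saturates.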

SAEs are the latest unlock. Activation patching told you which components mattered; SAEs let you describe those components in human-readable feature terms instead of raw activation soup.
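
For readers who have not seen one, a sparse autoencoder is structurally simple. The sketch below shows one common formulation (ReLU encoder, linear decoder, reconstruction loss plus an L1 sparsity penalty) with a hand-set, hypothetical dictionary of three features over a two-dimensional activation. Real SAEs learn these weights by training and use thousands to millions of features; the tied encoder and decoder weights here are an assumption for brevity.

```python
# Hypothetical hand-set dictionary: 3 features for a 2-d activation (overcomplete).
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one decoder direction per feature
W_ENC = W_DEC                                   # tied weights (assumption)
B_ENC = [0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

def sae_encode(x):
    """Activation -> sparse, non-negative feature activations."""
    centered = [xi - b for xi, b in zip(x, B_DEC)]
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(W_ENC, B_ENC)]
    return [max(0.0, p) for p in pre]           # ReLU keeps features non-negative

def sae_decode(f):
    """Feature activations -> reconstructed activation."""
    return [B_DEC[j] + sum(fi * row[j] for fi, row in zip(f, W_DEC))
            for j in range(len(B_DEC))]

def sae_loss(x, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    f = sae_encode(x)
    x_hat = sae_decode(f)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(f)            # f >= 0, so sum(f) is its L1 norm

features = sae_encode([-1.0, 3.0])
print(features)              # [0.0, 3.0, 1.0]
print(sae_decode(features))  # [-1.0, 3.0]
```

The feature vector is what you inspect or patch: instead of a raw activation, you get a short list of which dictionary entries fired and how strongly.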

The standard methodology

Pick a behavior you want to explain. Construct a clean / corrupted dataset pair — clean prompts that elicit the behavior, corrupted prompts (often single-token edits) that should suppress it. Run activation patching across all layers and positions to find the components most causally responsible. Validate by ablating those components: does removing them break the behavior? Validate again by patching only those components: is that enough to reproduce it? Once a candidate circuit is identified, characterize each component’s role with targeted attention-pattern analysis or SAE feature inspection.
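
The two validation steps can be phrased as one number each. This is a minimal sketch under made-up assumptions: a toy model whose four hypothetical components each write a scalar into a shared sum, a tanh of that sum standing in for a logit-difference metric, and a candidate circuit picked by hand.

```python
import math

# Hypothetical toy model: four "components" each write a scalar into a shared sum.
WEIGHTS = {"head_a": 2.0, "head_b": 1.5, "mlp_c": 0.1, "mlp_d": -0.05}

def component_outputs(x):
    """Cache every component's output for a given input."""
    return {name: w * x for name, w in WEIGHTS.items()}

def metric(outputs):
    """Stand-in for a logit difference; any scalar behavior score would do."""
    return math.tanh(sum(outputs.values()))

def run_with(base, donor, names):
    """Run on `base` activations, but take the components in `names` from `donor`."""
    mixed = dict(base)
    mixed.update({n: donor[n] for n in names})
    return metric(mixed)

def fraction_recovered(score, clean_score, corrupt_score):
    """0.0 means 'looks like the corrupted run', 1.0 means 'looks like the clean run'."""
    return (score - corrupt_score) / (clean_score - corrupt_score)

clean, corrupt = component_outputs(1.0), component_outputs(0.0)
circuit = ["head_a", "head_b"]            # candidate circuit (assumed)
m_clean, m_corrupt = metric(clean), metric(corrupt)

# Sufficiency: corrupted run plus clean circuit activations.
sufficiency = fraction_recovered(run_with(corrupt, clean, circuit), m_clean, m_corrupt)
# Necessity: clean run with the circuit's activations replaced by corrupted ones.
necessity = fraction_recovered(run_with(clean, corrupt, circuit), m_clean, m_corrupt)

print(f"sufficiency: {sufficiency:.3f}")  # close to 1: the circuit alone is enough
print(f"necessity:   {necessity:.3f}")    # close to 0: removing it breaks the behavior
```

A candidate circuit that scores high on the first number and low on the second has passed both checks; a circuit that passes only one usually means a backup pathway or a missing component.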

The whole pipeline is empirical and replicable. You don’t need new training runs; everything happens at inference time on a frozen model. A good interpretability project on a 1B-parameter model can be done on a single GPU in days.

Where the field is going

Three trends worth tracking:

Scaling SAEs to frontier models. Anthropic’s Scaling Monosemanticity trained SAEs on Claude 3 Sonnet and recovered millions of features at frontier scale. The recipe is no longer toy-model-only.
Automating interpretability. Using LLMs to describe SAE features and to propose circuits — closing the loop where the model under analysis helps generate the explanations. Early but promising.
Connecting circuits to alignment. If “the model is being deceptive” corresponds to a specific feature direction or circuit, alignment evaluation becomes a mechanistic question instead of a behavioral one. Behavioral evaluation has fundamental limits (the model can pass tests it would fail in deployment); mechanistic evaluation, in principle, doesn’t.
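
If a behavior really does correspond to a single direction, as in the refusal-direction result, then reading it, removing it, and forcing it are all small linear operations on an activation vector. The sketch below assumes you already have such a direction; the function names and the two-dimensional example are hypothetical, and finding a trustworthy direction is the hard part that this code does not address.

```python
import math

def unit(v):
    """Normalize a direction to length 1."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def projection(x, direction):
    """Scalar readout: how strongly activation x points along the direction."""
    return sum(a * b for a, b in zip(x, unit(direction)))

def ablate(x, direction):
    """Remove the component of x along the direction (directional ablation)."""
    r = unit(direction)
    coeff = sum(a * b for a, b in zip(x, r))
    return [a - coeff * b for a, b in zip(x, r)]

def clamp(x, direction, value):
    """Set the component of x along the direction to a fixed value."""
    r = unit(direction)
    cleaned = ablate(x, direction)
    return [a + value * b for a, b in zip(cleaned, r)]

activation = [5.0, 0.0]   # toy residual-stream vector (assumed)
direction = [3.0, 4.0]    # toy behavior direction (assumed)

print(projection(activation, direction))                        # 3.0
print(projection(ablate(activation, direction), direction))     # approximately 0
print(projection(clamp(activation, direction, 10.0), direction))  # approximately 10
```

Used as a monitor, `projection` turns "is the model representing this behavior?" into a number you can threshold; used as an intervention, `ablate` and `clamp` test whether that number is causally connected to the output.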

Mech interp has two structural advantages over older interpretability efforts.

First, transformers are modular in a way CNNs and RNNs weren’t. Attention heads are explicit information-routing components with discrete, inspectable patterns. The residual stream is a linear bus that every layer reads and writes — additive contributions are separable in a way they aren’t in deeply non-linear architectures. The architecture itself is more amenable to decomposition.
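
The "linear bus" point is easiest to see numerically. In this minimal sketch, with hypothetical component names and made-up vectors, every component adds a vector to the residual stream, so a logit read off the final stream splits exactly into one term per component. Real transformers apply a normalization layer before the unembedding, which makes the split approximate rather than exact; that step is omitted here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy: a 2-d residual stream and three components that each ADD to it.
embed = [0.5, -0.2]
writes = {
    "attn_0": [1.0, 0.0],
    "mlp_0": [0.2, 0.3],
    "attn_1": [-0.1, 0.9],
}
unembed_row = [2.0, 1.0]   # the direction one output logit reads from the stream

# The final residual stream is just the running sum of everything written to it.
residual = list(embed)
for vec in writes.values():
    residual = [r + w for r, w in zip(residual, vec)]

logit = dot(unembed_row, residual)

# Because the readout is linear, each writer's share of the logit is separable.
contributions = {"embed": dot(unembed_row, embed)}
contributions.update({name: dot(unembed_row, vec) for name, vec in writes.items()})

for name, value in contributions.items():
    print(f"{name:>7}: {value:+.2f}")
print(f"  total: {sum(contributions.values()):+.2f}  (logit = {logit:+.2f})")
```

The per-component terms sum to the logit, which is what lets you ask "how much did this head contribute to this prediction?" without rerunning the model.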

Second, the field changed its question. Older interpretability asked “what does this neuron mean?”, a question that has no good answer because polysemantic neurons respond to many unrelated things at once. Mech interp asks “what computation does this subgraph implement on this specific input distribution?”, a much narrower, much more answerable question. SAEs then change the unit from neurons to features, sidestepping much of the polysemanticity problem.

The combination — a more decomposable architecture plus a more tractable question — is why concrete results have started to accumulate where for decades there were none.

This is a nascent field with real footholds. If you’re building production AI systems and you want to know why your model does what it does — not just that it does it — this is the discipline that will matter increasingly over the next several years.

Go further

What's a circuit, concretely?

A circuit is a small subgraph of the network (a handful of attention heads and MLP neurons across a few layers) whose components together implement a specific computation. The Indirect Object Identification (IOI) circuit in GPT-2 small involves about 26 attention heads across 5 layers cooperating to predict 'Mary' as the next token in 'When John and Mary went to the store, John gave a drink to ___'. Each head has a specific role: name movers, S-inhibition heads, duplicate-token detectors. The whole subgraph is human-readable once you've identified it.

What's activation patching and why is it the workhorse tool?

Run the model on a clean prompt, cache every activation. Run on a corrupted prompt where the answer should change, also cache. Now replay the corrupted run, but at one specific layer/position, swap in the clean activation. If the model now produces the clean answer, that activation was causally responsible. Iterating across layers and positions gives a causal map of which components matter for which computations. Path patching extends this to multi-step paths through the network.
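
The difference between patching a node and patching a path shows up even in a three-component toy. The sketch below is hypothetical throughout: an upstream component A feeds two downstream readers B and C, and the optional arguments let you decide, per reader, whether it sees A's clean output or the output from the current run.

```python
import math

def forward(x, a_seen_by_b=None, a_seen_by_c=None):
    """Toy model: A feeds B and C; each edge out of A can be overridden separately."""
    a = 2.0 * x                                   # upstream component A
    a_b = a if a_seen_by_b is None else a_seen_by_b
    a_c = a if a_seen_by_c is None else a_seen_by_c
    b = math.tanh(a_b)                            # downstream component B reads A
    c = 0.5 * a_c                                 # downstream component C reads A
    return b + c

x_clean, x_corrupt = 1.0, 0.0
a_clean = 2.0 * x_clean                           # cached from the clean run
base = forward(x_corrupt)

# Node-level patch: every reader of A sees the clean value.
node_effect = forward(x_corrupt, a_clean, a_clean) - base
# Path patches: only one edge out of A carries the clean value.
edge_a_to_b = forward(x_corrupt, a_seen_by_b=a_clean) - base
edge_a_to_c = forward(x_corrupt, a_seen_by_c=a_clean) - base

print(f"node A:   {node_effect:+.4f}")
print(f"A -> B:   {edge_a_to_b:+.4f}")
print(f"A -> C:   {edge_a_to_c:+.4f}")
```

Node patching reports only that A matters; the two path patches show how much of that effect travels through B and how much through C. The edge effects add up to the node effect here only because B and C are summed at the output, which is a property of this toy rather than of networks in general.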

Why should someone building production AI systems care about interpretability?

Three reasons. Debugging: when your model misbehaves on a specific input distribution, circuit-level tools tell you why — which features fire, which heads route them, where the failure happens. Safety: alignment evaluation increasingly leans on whether models have internal representations of deception, sycophancy, or refusal triggers, which you can only check mechanistically. Capability forecasting: knowing which circuits exist in a 7B model and don't exist in a 1B model lets you predict what scaling will and won't unlock. The field is nascent, but production teams that wait for it to mature will be debugging blind for years.
