Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

AI Safety

Mechanistic Interpretability

Understanding what individual neurons and circuits compute inside neural networks: sparse autoencoders, superposition, induction heads, probing, and the limits of interpretability.

Research · Tier 2 · Frontier · ~55 min

Why This Matters

We deploy neural networks that make consequential decisions, but we do not understand what they compute internally. Mechanistic interpretability tries to reverse-engineer neural networks: identify what individual neurons represent, how they compose into circuits, and how those circuits implement algorithms.

This is not an academic exercise. If we cannot understand what a model computes, we cannot verify that it is safe. We cannot predict its behavior on novel inputs. We cannot diagnose failure modes. Interpretability is the scientific foundation for AI safety: without it, safety is just empirical testing and hope.

Mental Model

A neural network is a learned program. Its weights encode an algorithm for transforming inputs to outputs. Mechanistic interpretability treats the network as an object of scientific study: we form hypotheses about what specific components compute, design experiments to test those hypotheses, and build a mechanistic understanding of the computation.

The central challenge is superposition: models represent far more features than they have dimensions, encoding multiple concepts in overlapping patterns across neurons. This makes interpretation difficult because no single neuron corresponds to a single concept.

The Superposition Hypothesis

Definition

Superposition

A model exhibits superposition when it represents more features than it has dimensions by encoding features as nearly orthogonal directions in activation space. If a model has $d$ dimensions but needs to represent $m \gg d$ features, it can do so (approximately) as long as most features are sparse (rarely active).

Proposition

Geometry of Superposition

Statement

Consider a model that represents $m$ features as unit vectors $\{v_1, \ldots, v_m\}$ in $\mathbb{R}^d$ with $m > d$. If features have sparsity $s$ (each feature is active with probability $s$), the model can tolerate interference between features proportional to $s$. Specifically, the expected squared interference for feature $i$ when it is active is:

$$\text{interference}_i = \sum_{j \neq i} (v_i \cdot v_j)^2 \cdot s_j$$

For the model to represent features accurately, it needs $\sum_{j \neq i} (v_i \cdot v_j)^2 \cdot s_j \ll 1$. When features are sufficiently sparse, this is achievable even with $m \gg d$ by using nearly orthogonal (but not exactly orthogonal) directions.
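A quick numerical check of this bound (a sketch using random unit vectors as stand-ins for learned feature directions; the dimension, feature count, and sparsity are illustrative choices, not values from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, s = 64, 512, 0.01  # 512 features in 64 dimensions, each active 1% of the time

# Random unit vectors in R^d are nearly orthogonal with high probability.
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Expected squared interference on feature i: sum_{j != i} (v_i . v_j)^2 * s_j
G = V @ V.T                            # Gram matrix of pairwise dot products
np.fill_diagonal(G, 0.0)               # exclude j = i
interference = (G**2).sum(axis=1) * s  # uniform sparsity s_j = s

print(f"max |v_i . v_j| (i != j): {np.abs(G).max():.3f}")
print(f"mean interference:        {interference.mean():.4f}")  # well below 1
```

Because interference scales linearly with $s$, rerunning this with a larger $s$ (say 0.3) pushes the sum toward 1 and representation quality degrades, matching the sparsity condition above.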

Intuition

Orthogonality is not binary. In $d$ dimensions, you can have at most $d$ exactly orthogonal vectors, but you can have exponentially many nearly orthogonal vectors. If most features are rarely active, the interference from non-orthogonal directions is tolerable: on any given input, only a few features are active, and their representations are approximately independent.

Why It Matters

Superposition explains why individual neurons are hard to interpret. A neuron does not represent a single concept. It participates in the representation of many features. To find interpretable features, we must look at directions in activation space, not individual neurons. This motivates sparse autoencoders.

Failure Mode

When too many features are active simultaneously (low sparsity), interference degrades representation quality. The model must choose between representing fewer features accurately (monosemantic) or more features approximately (polysemantic). This tradeoff depends on the distribution of feature importances and sparsities.

Sparse Autoencoders for Feature Extraction

Since features are directions in activation space (not individual neurons), we need a method to discover these directions. Sparse autoencoders (SAEs) learn to decompose model activations into interpretable features.

Definition

Sparse Autoencoder

Given model activations $x \in \mathbb{R}^d$ at some layer, a sparse autoencoder learns an encoder $f: \mathbb{R}^d \to \mathbb{R}^m$ (with $m \gg d$) and a decoder $g: \mathbb{R}^m \to \mathbb{R}^d$ to minimize:

$$\mathcal{L} = \|x - g(f(x))\|^2 + \lambda \|f(x)\|_1$$

The encoder typically has the form $f(x) = \text{ReLU}(W_{\text{enc}} x + b_{\text{enc}})$ and the decoder $g(z) = W_{\text{dec}} z + b_{\text{dec}}$.

The $L_1$ penalty encourages the hidden representation $z = f(x)$ to be sparse: only a few features are active for any given input.

How SAEs find features. The columns of $W_{\text{dec}}$ are the learned feature directions. Each column corresponds to a feature, and the corresponding entry of $f(x)$ is the activation of that feature on input $x$. If the SAE is well trained, each feature direction corresponds to an interpretable concept (a topic, a syntactic pattern, a factual association).
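A minimal numpy sketch of the SAE forward pass and loss (the weights here are random and untrained, purely to make the shapes and the $L_1$ term concrete; real SAEs are trained by gradient descent on millions of activation vectors, and the unit-norm decoder columns are one common convention rather than a requirement):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 256          # activation dimension, number of SAE features (m >> d)
lam = 3e-3              # L1 coefficient (illustrative value)

W_enc = rng.normal(scale=0.1, size=(m, d)); b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(d, m)); b_dec = np.zeros(d)
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm feature directions

def sae(x):
    z = np.maximum(0.0, W_enc @ x + b_enc)   # encoder: f(x) = ReLU(W_enc x + b_enc)
    x_hat = W_dec @ z + b_dec                # decoder: g(z) = W_dec z + b_dec
    loss = np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(z))
    return z, x_hat, loss

x = rng.normal(size=d)                       # stand-in for a model activation vector
z, x_hat, loss = sae(x)
print(f"{(z > 0).mean():.0%} of features active")
```

Training drives the active fraction far lower than the roughly 50% that ReLU alone gives at random initialization; the columns of `W_dec` for frequently co-active features are then the candidate interpretable directions.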

Scaling results. Anthropic's research has trained SAEs on Claude's residual stream and identified features corresponding to specific concepts: cities, programming languages, emotional tones, mathematical operations, and more abstract patterns. The reported finding is that many SAE features are more monosemantic than individual neurons. Absolute monosemanticity is not established: Templeton et al. (2024) and Chanin et al. (2024) document feature splitting, where a feature that looks monosemantic at one SAE width decomposes into several features at higher width. Treat "monosemantic" here as an empirical direction of improvement, not a proved property.

Circuits: Induction Heads and Beyond

Definition

Circuit

A circuit is a subgraph of the computational graph of a neural network that implements a specific, identifiable algorithm. Finding a circuit means identifying which attention heads, MLP neurons, and residual stream directions work together to produce a specific behavior.

Induction heads. One of the best-studied circuits in transformers (alongside the IOI circuit below). An induction head implements the algorithm: "if the pattern $[A][B] \ldots [A]$ has appeared, predict $[B]$." This is a two-head circuit:

  1. A previous-token head writes information about the previous token into the residual stream
  2. An induction head attends to positions where the current token appeared before and copies what followed
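The algorithm these two heads implement can be restated as explicit pattern matching (a toy sketch; the real circuit computes this with attention over residual-stream representations, not a Python loop):

```python
def induction_predict(tokens):
    """Predict the next token via the induction-head algorithm:
    if the pattern [A][B] ... [A] has appeared, predict [B]."""
    current = tokens[-1]
    # Scan backward for the most recent earlier occurrence of the current token
    # (the previous-token head + induction head do this jointly via attention).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]   # copy the token that followed last time
    return None                    # no earlier occurrence: no induction prediction

print(induction_predict(["the", "cat", "sat", "the"]))  # -> "cat"
```

This also makes the generalization claim concrete: the algorithm works for arbitrary tokens $[A]$ and $[B]$, including pairs never seen in training, which is why induction heads are a candidate mechanism for in-context pattern copying.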

Olsson et al. (2022) present induction heads as a leading candidate mechanism for a substantial fraction of in-context pattern copying, supported by loss bumps, transfer studies, and ablations. Later work shows other heads and circuits also contribute, so "induction heads implement all of in-context learning" is an overclaim. See the induction heads page for the causal vs correlational distinction.

Indirect object identification (IOI). Another well-studied circuit, reverse-engineered end-to-end by Wang, Variengien, Conmy, Shlegeris, and Steinhardt (2023, arXiv:2211.00593) in GPT-2 small. Given a sentence like "When Mary and John went to the store, John gave a drink to", the model must predict "Mary." The identified circuit uses Duplicate Token Heads and Induction Heads to detect the repeated name, S-Inhibition Heads to suppress it, and Name Mover Heads (with Negative and Backup Name Movers) to copy the remaining candidate into the final position.

Probing

Definition

Linear Probe

A linear probe trains a linear classifier $f(x) = Wx + b$ on the internal representations $x$ of a neural network to predict some property (e.g., part of speech, sentiment, factual correctness). If the probe achieves high accuracy, the information is linearly accessible in the representation.
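A minimal sketch of probe training on synthetic data (the "representations" and labels are fabricated so that the property is linearly encoded by construction; a real probe would be fit on activations extracted from a model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 16
X = rng.normal(size=(n, d))            # stand-in for model representations
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)     # property linearly encoded in X

# Fit a logistic-regression probe p(y|x) = sigmoid(Wx + b) by gradient descent
W, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ W + b)))
    grad = p - y                       # d(cross-entropy)/d(logits)
    W -= 0.1 * (X.T @ grad) / n
    b -= 0.1 * grad.mean()

acc = ((X @ W + b > 0) == (y == 1)).mean()
print(f"probe accuracy: {acc:.2f}")    # high: the property is linearly accessible
```

Replacing `y` with labels unrelated to `X` drops accuracy to chance, which is the standard control for whether a probe is reading real structure or fitting noise.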

Proposition

Probing Accuracy and Representation Quality

Statement

Let $\mathcal{V}$ be the minimum achievable cross-entropy loss for a linear probe predicting label $y$ from representation $x$:

$$\mathcal{V} = \min_{W,b} \mathbb{E}[-\log p_{W,b}(y \mid x)]$$

Then $\mathcal{V} \geq H(Y \mid X)$ (the conditional entropy of $y$ given $x$), because cross-entropy is always lower-bounded by the true conditional entropy. The probe achieves $\mathcal{V} = H(Y \mid X)$ only when the Bayes-optimal predictor $p(y \mid x)$ is representable by the linear probe class (rare, except when $x$ linearly encodes $\log p(y \mid x)$ up to a constant). A low $\mathcal{V}$, close to $H(Y \mid X)$, indicates that the representation linearly encodes (most of) the information about $y$ that is accessible at all.

Intuition

Probing answers the question: is this information present and accessible in the representation? A linear probe can only extract information that is already encoded in linearly separable directions. High probe accuracy means the network explicitly represents the concept; low accuracy means the concept is either absent or encoded in a nonlinearly entangled way.

Why It Matters

Probing is the simplest and most widely used interpretability technique. It has revealed that transformer representations encode syntactic structure, semantic roles, world knowledge, and even spatial relationships in approximately linear subspaces.

Failure Mode

Probing has a fundamental limitation: high probe accuracy does not mean the model uses the information. A representation might linearly encode a concept that the downstream computation ignores. Probing is descriptive (what information is present) but not causal (what information is used). To establish causality, you need intervention experiments (e.g., activation patching).

Causal Intervention: Activation Patching

Probing tells us what information is in a representation. Activation patching tells us what information the model actually uses.

The method has two standard directions. Noising (resample ablation): run the model on a clean input, patch in activations from a corrupted run at one site, and measure how much the clean-run output degrades. A large degradation means the patched site carries information the clean prediction relies on. Denoising: run on a corrupted input, patch in the clean activation at one site, and measure how much the clean prediction is restored. A large restoration means the patched site is sufficient for the behavior. The two variants answer subtly different questions (necessity vs. sufficiency) and often give different localizations; see Heimersheim and Nanda (2024, arXiv:2404.15255) for a careful treatment. Meng, Bau, Andonian, and Belinkov (2022, ROME, arXiv:2202.05262) use causal tracing (a denoising variant) to localize factual associations to mid-layer MLPs in GPT.
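A toy illustration of the noising direction (the two-layer "model" and its weights are contrived so that, by construction, only one hidden unit carries the decision-relevant difference between the clean and corrupted inputs):

```python
import numpy as np

# A toy 2-layer model: by construction, only hidden unit 0 carries the
# information that distinguishes the clean from the corrupted input.
W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 hidden units, 2 inputs
W2 = np.array([2.0, 0.0, 0.0])                        # output reads only unit 0

def forward(x, patch=None):
    h = np.maximum(0.0, W1 @ x)
    if patch is not None:
        site, value = patch
        h = h.copy(); h[site] = value                 # intervene at one site
    return W2 @ h

clean, corrupted = np.array([1.0, 0.5]), np.array([0.0, 0.5])
h_corr = np.maximum(0.0, W1 @ corrupted)              # corrupted-run activations

base = forward(clean)
for site in range(3):      # noising: patch corrupted activations into the clean run
    patched = forward(clean, patch=(site, h_corr[site]))
    print(f"patch unit {site}: output {base:.1f} -> {patched:.1f}")
# Only patching unit 0 changes the output: that site is causally relied upon.
```

In a real transformer the same loop runs over layers and token positions via forward hooks, and the "output" is typically a logit difference between the correct and incorrect next token.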

Limitations of Current Interpretability

Scale. SAEs and circuit analysis have been demonstrated on individual behaviors and features. A full mechanistic understanding of a large language model (billions of parameters, millions of features) does not yet exist.

Descriptive vs. predictive. Current interpretability is mostly descriptive: we can explain what a circuit does after the fact. We cannot yet reliably predict model behavior on novel inputs from mechanistic understanding alone.

Completeness. Identifying a circuit for one behavior does not mean we understand the model. Models have thousands of behaviors, and the interactions between circuits are poorly understood.

The interpretability illusion. A danger: interpretable-looking explanations may not be faithful to the actual computation. A feature labeled "this neuron detects dogs" might also activate on other stimuli in ways we have not tested.

Common Confusions

Watch Out

Neurons are not features

The fundamental insight of superposition: features are directions in activation space, not individual neurons. A neuron may participate in many features, and a feature may be distributed across many neurons. Looking at individual neurons gives a misleading picture of what the network represents.

Watch Out

Probing is not causal

High probing accuracy means information is present in the representation. It does not mean the model uses that information for its output. A model might linearly encode the color of an object but not use color information for its prediction. Only intervention experiments (activation patching, causal tracing) establish causal relevance.

Watch Out

Interpretability is not safety

Understanding what a model computes is necessary but not sufficient for safety. Even if we perfectly understood every circuit, we would still need to decide what behaviors are acceptable, how to modify undesirable behaviors, and how to verify that modifications work. Interpretability provides the scientific foundation, but policy and engineering are also required.

Summary

  • Superposition: models encode $m \gg d$ features as nearly orthogonal directions in $d$ dimensions
  • Sparse autoencoders decompose activations into sparser, more interpretable features (perfect monosemanticity is not established)
  • Circuits are subgraphs implementing specific algorithms (e.g., induction heads as a candidate mechanism for in-context pattern copying)
  • Probing reveals what information is present in representations (descriptive)
  • Activation patching reveals what information the model actually uses (causal)
  • Current interpretability is descriptive and partial: a complete mechanistic understanding of large models does not yet exist

Exercises

ExerciseCore

Problem

In $\mathbb{R}^2$, you can have at most 2 orthogonal unit vectors. How many unit vectors can you have with pairwise dot products at most 0.1 in absolute value? Give a lower bound.

ExerciseAdvanced

Problem

Suppose you have a trained SAE with decoder weights $W_{\text{dec}} \in \mathbb{R}^{d \times m}$. A feature $j$ is active on an input $x$ with activation $z_j > 0$. Describe mathematically how you would test whether feature $j$ causally affects the model's prediction of token $t$.

ExerciseResearch

Problem

The superposition hypothesis predicts a phase transition: as feature sparsity decreases (features become more frequently active), the model should transition from a superposition regime (many features, approximate representation) to a non-superposition regime (fewer features, exact representation). Describe an experiment to test this prediction in a toy model.

References

Canonical:

  • Elhage et al., "Toy Models of Superposition" (2022). Anthropic
  • Olsson et al., "In-context Learning and Induction Heads" (2022). Anthropic

Current:

  • Cunningham et al., "Sparse Autoencoders Find Highly Interpretable Features in Language Models" (2023)
  • Conmy et al., "Towards Automated Circuit Discovery for Mechanistic Interpretability" (2023)
  • Templeton et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024). Anthropic transformer-circuits.pub
  • Wang, Variengien, Conmy, Shlegeris, Steinhardt, "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" (ICLR 2023, arXiv:2211.00593)
  • Chanin et al., "A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders" (2024)
  • Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023). Anthropic
  • Meng, Bau, Andonian, Belinkov, "Locating and Editing Factual Associations in GPT" (ROME, NeurIPS 2022, arXiv:2202.05262)
  • Heimersheim, Nanda, "How to use and interpret activation patching" (2024, arXiv:2404.15255)

Next Topics

The natural next steps from mechanistic interpretability:

Last reviewed: April 2026
