AI Safety
Mechanistic Interpretability
Understanding what individual neurons and circuits compute inside neural networks: sparse autoencoders, superposition, induction heads, probing, and the limits of interpretability.
Why This Matters
We deploy neural networks that make consequential decisions, but we do not understand what they compute internally. Mechanistic interpretability tries to reverse-engineer neural networks: identify what individual neurons represent, how they compose into circuits, and how those circuits implement algorithms.
This is not an academic exercise. If we cannot understand what a model computes, we cannot verify that it is safe. We cannot predict its behavior on novel inputs. We cannot diagnose failure modes. Interpretability is the scientific foundation for AI safety: without it, safety is just empirical testing and hope.
Mental Model
A neural network is a learned program. Its weights encode an algorithm for transforming inputs to outputs. Mechanistic interpretability treats the network as an object of scientific study: we form hypotheses about what specific components compute, design experiments to test those hypotheses, and build a mechanistic understanding of the computation.
The central challenge is superposition: models represent far more features than they have dimensions, encoding multiple concepts in overlapping patterns across neurons. This makes interpretation difficult because no single neuron corresponds to a single concept.
The Superposition Hypothesis
Superposition
A model exhibits superposition when it represents more features than it has dimensions by encoding features as nearly orthogonal directions in activation space. If a model has $d$ dimensions but needs to represent $n > d$ features, it can do so (approximately) as long as most features are sparse (rarely active).
Geometry of Superposition
Statement
Consider a model that represents $n$ features as unit vectors $W_1, \dots, W_n \in \mathbb{R}^d$ with $n > d$. If features have sparsity $p$ (each feature is active with probability $p$), the model can tolerate total pairwise interference $\sum_{j \neq i} (W_i \cdot W_j)^2$ on the order of $1/p$. Specifically, the expected squared interference for feature $i$ when it is active is:

$$\mathbb{E}[\text{interference}_i] = p \sum_{j \neq i} (W_i \cdot W_j)^2$$

For the model to represent features accurately, it needs $p \sum_{j \neq i} (W_i \cdot W_j)^2 \ll 1$. When features are sufficiently sparse, this is achievable even with $n \gg d$ by using nearly orthogonal (but not exactly orthogonal) directions.
Intuition
Orthogonality is not binary. In $d$ dimensions, you can have at most $d$ exactly orthogonal vectors, but you can have exponentially many nearly orthogonal vectors. If most features are rarely active, the interference from non-orthogonal directions is tolerable: on any given input, only a few features are active, and their representations are approximately independent.
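This intuition is easy to check numerically. The sketch below (illustrative sizes, random rather than optimized directions) draws many more random unit vectors than dimensions and measures their pairwise dot products:

```python
import numpy as np

# Random unit vectors in R^d: with n >> d, pairwise dot products stay small.
rng = np.random.default_rng(0)
d, n = 256, 2048                     # illustrative sizes: n features, d dimensions
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize to unit vectors

dots = V @ V.T                       # all pairwise dot products
np.fill_diagonal(dots, 0.0)          # ignore self-similarity
max_dot = np.abs(dots).max()         # worst-case pairwise interference
# Typical |dot| is about 1/sqrt(d) ~ 0.06: eight times more vectors than
# dimensions, yet every pair is nearly orthogonal.
```

A trained model can do better than random directions, but even this naive construction shows how cheap near-orthogonality is in high dimensions.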
Why It Matters
Superposition explains why individual neurons are hard to interpret. A neuron does not represent a single concept. It participates in the representation of many features. To find interpretable features, we must look at directions in activation space, not individual neurons. This motivates sparse autoencoders.
Failure Mode
When too many features are active simultaneously (low sparsity), interference degrades representation quality. The model must choose between representing fewer features accurately (monosemantic) or more features approximately (polysemantic). This tradeoff depends on the distribution of feature importances and sparsities.
Sparse Autoencoders for Feature Extraction
Since features are directions in activation space (not individual neurons), we need a method to discover these directions. Sparse autoencoders (SAEs) learn to decompose model activations into interpretable features.
Sparse Autoencoder
Given model activations $x \in \mathbb{R}^d$ at some layer, a sparse autoencoder learns an encoder producing feature activations $f(x) \in \mathbb{R}^m$ (with $m \gg d$) and a decoder $W_d \in \mathbb{R}^{d \times m}$ to minimize:

$$\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1$$

The encoder typically has the form $f(x) = \mathrm{ReLU}(W_e x + b_e)$ and the decoder $\hat{x} = W_d f(x) + b_d$.
The $\ell_1$ penalty encourages the hidden representation $f(x)$ to be sparse: only a few features are active for any given input.
How SAEs find features. The columns of $W_d$ are the learned feature directions. Each column corresponds to a feature, and the corresponding entry $f_i(x)$ is the activation of that feature on input $x$. If the SAE is well-trained, each feature direction corresponds to an interpretable concept (a topic, a syntactic pattern, a factual association).
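The forward pass and loss above are small enough to write out directly. This is a minimal NumPy sketch with untrained random weights and illustrative sizes, not a training recipe (the sparsity coefficient and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 512          # activation dim d, dictionary size m >> d (illustrative)
lam = 1e-3              # L1 sparsity coefficient (assumed hyperparameter)

W_e = rng.normal(scale=0.1, size=(m, d))   # encoder weights
b_e = np.zeros(m)
W_d = rng.normal(scale=0.1, size=(d, m))   # decoder: columns are feature directions
b_d = np.zeros(d)
W_d /= np.linalg.norm(W_d, axis=0, keepdims=True)  # unit-norm feature directions

def sae_forward(x):
    """Encode activation x into sparse feature activations f, then reconstruct."""
    f = np.maximum(0.0, W_e @ x + b_e)     # ReLU encoder
    x_hat = W_d @ f + b_d                  # linear decoder
    return f, x_hat

def sae_loss(x):
    """Reconstruction error plus L1 sparsity penalty on the feature activations."""
    f, x_hat = sae_forward(x)
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f))

x = rng.normal(size=d)                     # stand-in for a model activation
f, x_hat = sae_forward(x)
```

In a real SAE, gradient descent on this loss over millions of cached activations is what makes the decoder columns converge to interpretable directions.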
Scaling results. Anthropic's research has trained SAEs on Claude's residual stream and identified features corresponding to specific concepts: cities, programming languages, emotional tones, mathematical operations, and more abstract patterns. The reported finding is that many SAE features are more monosemantic than individual neurons. Absolute monosemanticity is not established: Templeton et al. (2024) and Chanin et al. (2024) document feature splitting, where a feature that looks monosemantic at one SAE width decomposes into several features at higher width. Treat "monosemantic" here as an empirical direction of improvement, not a proved property.
Circuits: Induction Heads and Beyond
Circuit
A circuit is a subgraph of the computational graph of a neural network that implements a specific, identifiable algorithm. Finding a circuit means identifying which attention heads, MLP neurons, and residual stream directions work together to produce a specific behavior.
Induction heads. One of the best-studied circuits in transformers, alongside the IOI circuit. An induction head implements the algorithm: "if the pattern $[A][B]$ has appeared earlier in the context, then on seeing $[A]$ again, predict $[B]$." This is a two-head circuit:
- A previous-token head writes information about the previous token into the residual stream
- An induction head attends to positions where the current token appeared before and copies what followed
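The algorithm the two heads jointly implement can be written as a few lines of plain Python. This is the abstract behavior, not the attention-weight mechanics:

```python
def induction_predict(tokens):
    """The induction algorithm: if the current token A appeared earlier
    followed by B, predict B (using the most recent earlier occurrence)."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for a prior A
        if tokens[i] == current:
            return tokens[i + 1]               # copy what followed last time
    return None                                # no prior occurrence: no prediction

# On "A B C A", the prediction is "B".
```

What the circuit analysis establishes is that specific attention heads approximate this lookup-and-copy behavior inside the transformer.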
Olsson et al. (2022) present induction heads as a leading candidate mechanism for a substantial fraction of in-context pattern copying, supported by loss bumps, transfer studies, and ablations. Later work shows other heads and circuits also contribute, so "induction heads implement all of in-context learning" is an overclaim. See the induction heads page for the causal vs correlational distinction.
Indirect object identification (IOI). Another well-studied circuit, reverse-engineered end-to-end by Wang, Variengien, Conmy, Shlegeris, and Steinhardt (2023, arXiv:2211.00593) in GPT-2 small. Given a sentence like "When Mary and John went to the store, John gave a drink to", the model must predict "Mary." The identified circuit uses Duplicate Token Heads and Induction Heads to detect the repeated name, S-Inhibition Heads to suppress it, and Name Mover Heads (with Negative and Backup Name Movers) to copy the remaining candidate into the final position.
Probing
Linear Probe
A linear probe trains a linear classifier on the internal representations of a neural network to predict some property (e.g., part of speech, sentiment, factual correctness). If the probe achieves high accuracy, the information is linearly accessible in the representation.
Probing Accuracy and Representation Quality
Statement
Let $L^*$ be the minimum achievable cross-entropy loss for a linear probe predicting label $y$ from representation $h$:

$$L^* = \min_{W, b} \; \mathbb{E}\big[ -\log \operatorname{softmax}(W h + b)_y \big]$$

Then $L^* \ge H(y \mid h)$ (the conditional entropy of $y$ given $h$), because cross-entropy is always lower-bounded by the true conditional entropy. The probe achieves $L^* = H(y \mid h)$ only when the Bayes-optimal predictor is representable by the linear probe class (rare except when $h$ linearly encodes $y$ up to a constant). A low $L^*$, close to $H(y \mid h)$, indicates that the representation linearly encodes (most of) the information about $y$ that is present at all.
Intuition
Probing answers the question: is this information present and accessible in the representation? A linear probe can only extract information that is already encoded in linearly separable directions. High probe accuracy means the network explicitly represents the concept; low accuracy means the concept is either absent or encoded in a nonlinearly entangled way.
Why It Matters
Probing is the simplest and most widely used interpretability technique. It has revealed that transformer representations encode syntactic structure, semantic roles, world knowledge, and even spatial relationships in approximately linear subspaces.
Failure Mode
Probing has a fundamental limitation: high probe accuracy does not mean the model uses the information. A representation might linearly encode a concept that the downstream computation ignores. Probing is descriptive (what information is present) but not causal (what information is used). To establish causality, you need intervention experiments (e.g., activation patching).
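A probing experiment fits in a few lines. The sketch below uses synthetic "representations" with a linearly encoded binary property and trains a logistic-regression probe by plain gradient descent (all sizes and the learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 32                       # samples, representation dim (illustrative)
w_true = rng.normal(size=d)
H = rng.normal(size=(n, d))           # stand-in "representations"
y = (H @ w_true > 0).astype(float)    # a linearly encoded binary property

# Train a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(H @ w)))     # probe's predicted probabilities
    w -= 0.5 * (H.T @ (p - y)) / n         # gradient of mean cross-entropy

acc = np.mean((H @ w > 0) == (y == 1))
# High accuracy here means the property is linearly accessible in H --
# it says nothing about whether a downstream computation would use it.
```

On a real model, `H` would be activations cached at one layer and `y` an annotated property of the inputs; the logic is otherwise identical.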
Causal Intervention: Activation Patching
Probing tells us what information is in a representation. Activation patching tells us what information the model actually uses.
The method has two standard directions. Noising (resample ablation): run the model on a clean input, patch in activations from a corrupted run at one site, and measure how much the clean-run output degrades. A large degradation means the patched site carries information the clean prediction relies on. Denoising: run on a corrupted input, patch in the clean activation at one site, and measure how much the clean prediction is restored. A large restoration means the patched site is sufficient for the behavior. The two variants answer subtly different questions (necessity vs. sufficiency) and often give different localizations; see Heimersheim and Nanda (2024, arXiv:2404.15255) for a careful treatment. Meng, Bau, Andonian, and Belinkov (2022, ROME, arXiv:2202.05262) use causal tracing (a denoising variant) to localize factual associations to mid-layer MLPs in GPT.
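The noising direction can be demonstrated on a toy network. This sketch (random two-layer model, illustrative sizes) caches the hidden activation from a corrupted run and patches it into a clean run:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(size=(d, d))          # toy 2-layer network (illustrative)
W2 = rng.normal(size=(1, d))

def forward(x, patch_hidden=None):
    """Run the toy model; optionally overwrite the hidden layer (the patch site)."""
    h = np.maximum(0.0, W1 @ x)
    if patch_hidden is not None:
        h = patch_hidden              # activation patching: swap in a cached h
    return float(W2 @ h)

x_clean = rng.normal(size=d)
x_corrupt = rng.normal(size=d)

h_corrupt = np.maximum(0.0, W1 @ x_corrupt)   # cache hidden state of corrupted run
y_clean = forward(x_clean)
y_noised = forward(x_clean, patch_hidden=h_corrupt)  # noising: clean run, corrupt h
effect = abs(y_clean - y_noised)      # large effect => the site carries needed info
```

In transformer interpretability the same pattern is applied per attention head, per MLP, and per layer, usually via forward hooks rather than an explicit `patch_hidden` argument.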
Limitations of Current Interpretability
Scale. SAEs and circuit analysis have been demonstrated on individual behaviors and features. A full mechanistic understanding of a large language model (billions of parameters, thousands of features) does not yet exist.
Descriptive vs. predictive. Current interpretability is mostly descriptive: we can explain what a circuit does after the fact. We cannot yet reliably predict model behavior on novel inputs from mechanistic understanding alone.
Completeness. Identifying a circuit for one behavior does not mean we understand the model. Models have thousands of behaviors, and the interactions between circuits are poorly understood.
The interpretability illusion. A danger: interpretable-looking explanations may not be faithful to the actual computation. A feature labeled "this neuron detects dogs" might also activate on other stimuli in ways we have not tested.
Common Confusions
Neurons are not features
The fundamental insight of superposition: features are directions in activation space, not individual neurons. A neuron may participate in many features, and a feature may be distributed across many neurons. Looking at individual neurons gives a misleading picture of what the network represents.
Probing is not causal
High probing accuracy means information is present in the representation. It does not mean the model uses that information for its output. A model might linearly encode the color of an object but not use color information for its prediction. Only intervention experiments (activation patching, causal tracing) establish causal relevance.
Interpretability is not safety
Understanding what a model computes is necessary but not sufficient for safety. Even if we perfectly understood every circuit, we would still need to decide what behaviors are acceptable, how to modify undesirable behaviors, and how to verify that modifications work. Interpretability provides the scientific foundation, but policy and engineering are also required.
Summary
- Superposition: models encode $n > d$ features as nearly orthogonal directions in $d$ dimensions
- Sparse autoencoders decompose activations into sparser, more interpretable (though not perfectly monosemantic) features
- Circuits are subgraphs implementing specific algorithms (e.g., induction heads as a candidate mechanism for in-context pattern copying)
- Probing reveals what information is present in representations (descriptive)
- Activation patching reveals what information the model actually uses (causal)
- Current interpretability is descriptive and partial: a complete mechanistic understanding of large models does not yet exist
Exercises
Problem
In $\mathbb{R}^2$, you can have at most 2 orthogonal unit vectors. How many unit vectors can you have in $\mathbb{R}^d$ with pairwise dot products at most 0.1 in absolute value? Give a lower bound.
Problem
Suppose you have a trained SAE with decoder weights $W_d$. A feature $i$ is active on an input $x$ with activation $f_i(x) > 0$. Describe mathematically how you would test whether feature $i$ causally affects the model's prediction of token $t$.
Problem
The superposition hypothesis predicts a phase transition: as feature sparsity decreases (features become more frequently active), the model should transition from a superposition regime (many features, approximate representation) to a non-superposition regime (fewer features, exact representation). Describe an experiment to test this prediction in a toy model.
References
Canonical:
- Elhage et al., "Toy Models of Superposition" (2022). Anthropic
- Olsson et al., "In-context Learning and Induction Heads" (2022). Anthropic
Current:
- Cunningham et al., "Sparse Autoencoders Find Highly Interpretable Features in Language Models" (2023)
- Conmy et al., "Towards Automated Circuit Discovery for Mechanistic Interpretability" (2023)
- Templeton et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024). Anthropic, transformer-circuits.pub
- Wang, Variengien, Conmy, Shlegeris, Steinhardt, "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" (ICLR 2023, arXiv:2211.00593)
- Chanin et al., "A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders" (2024)
- Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023). Anthropic
- Meng, Bau, Andonian, Belinkov, "Locating and Editing Factual Associations in GPT" (ROME, NeurIPS 2022, arXiv:2202.05262)
- Heimersheim, Nanda, "How to use and interpret activation patching" (2024, arXiv:2404.15255)
Next Topics
The natural next steps from mechanistic interpretability:
- Hallucination theory: can we find the circuits responsible for confabulation?
- RLHF and alignment: what does RLHF actually change inside the model?
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Principal Component Analysis (Layer 1)
- Eigenvalues and Eigenvectors (Layer 0A)
- Singular Value Decomposition (Layer 0A)