LLM Construction
Sparse Autoencoders for Interpretability
Sparse autoencoders decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with sparsity constraints. They are the primary tool for extracting monosemantic features from polysemantic neurons.
Why This Matters
Individual neurons in neural networks are often polysemantic: a single neuron activates for multiple unrelated concepts. Neuron 3047 might fire for both "the color blue" and "legal terminology." This makes individual neurons poor units of analysis for understanding what a network has learned.
The superposition hypothesis explains why: networks represent more features than they have neurons by encoding features as directions in activation space, not as individual neuron axes. When features are sparse (each feature is active on only a small fraction of inputs), many more features can coexist in a lower-dimensional space through nearly-orthogonal packing.
Sparse autoencoders (SAEs) are the tool for recovering these features. They learn an overcomplete dictionary of feature directions from activation data, producing features that are more interpretable than raw neurons.
The Superposition Hypothesis
Superposition Hypothesis
Statement
A neural network with activation dimension $d$ can represent $m \gg d$ features by encoding each feature as a direction in $\mathbb{R}^d$. Elhage et al. 2022's toy-model analysis shows that in certain regimes the number of representable features scales roughly as $m \sim d/p$, where $p$ is the per-feature activation probability. In the very-sparse limit ($p \to 0$), $m$ can scale even faster, approaching the Kabatyanskii-Levenshtein bound of $\exp(O(d\epsilon^2))$ nearly-orthogonal directions (pairwise inner product at most $\epsilon$). The $m \sim d/p$ approximation should be read as an informal scaling observation from a toy model, not a tight theoretical bound. See Scherlis, Sachan, Jermyn, Benton, Shlegeris 2022 ("Polysemanticity and Capacity in Neural Networks", arXiv:2210.01892) for a more rigorous capacity analysis.
The features are packed as nearly-orthogonal directions. The interference between any two features is on the order of $1/\sqrt{d}$ for random packing, which is small but nonzero. The network tolerates this interference because, by sparsity, it is rare for two interfering features to be active simultaneously.
Intuition
In $\mathbb{R}^d$, you can fit exactly $d$ orthogonal directions. But if you allow small angles between directions (pairwise interference at most $\epsilon$), you can fit exponentially more: roughly $\exp(O(d\epsilon^2))$ nearly-orthogonal directions. This is the Johnson-Lindenstrauss phenomenon applied to feature representation. The network exploits sparsity: if two features that interfere are rarely active at the same time, the interference rarely causes errors.
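The packing claim is easy to check numerically. A small NumPy sketch (dimensions chosen for illustration): sampling far more random unit vectors than dimensions and measuring pairwise interference shows that all inner products stay small.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2048          # 2048 directions in a 512-dimensional space

# Random unit vectors are nearly orthogonal in high dimensions.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise interference = |cosine similarity| between distinct directions.
G = np.abs(V @ V.T)
np.fill_diagonal(G, 0.0)

print(f"directions: {n}, dimensions: {d}")
print(f"mean interference: {G.mean():.3f}")   # roughly 1/sqrt(d)
print(f"max interference:  {G.max():.3f}")    # small, though nonzero
```

Even with four times as many directions as dimensions, no pair interferes strongly, which is exactly the regime superposition exploits.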
Why It Matters
Superposition explains why looking at individual neurons is misleading. The meaningful unit is not a neuron axis but a feature direction. This is why you need sparse autoencoders: they recover the feature directions from the superposed representation, producing a dictionary of monosemantic features from a polysemantic activation space.
Failure Mode
The hypothesis is well-supported empirically in toy models and small transformers. It is less clear how precisely it applies to frontier models with billions of parameters, where the relationship between features, neurons, and directions may be more complex. The toy model results (Elhage et al., 2022) demonstrate superposition convincingly in controlled settings, but scaling to GPT-4-class models is ongoing work.
The SAE Objective
Sparse Autoencoder Objective
Statement
A sparse autoencoder learns an encoder $f$ and a decoder $\hat{x}$ that minimize:

$$\mathcal{L}(x) = \lVert x - \hat{x}(f(x)) \rVert_2^2 + \lambda \lVert f(x) \rVert_1$$

where:
- $f(x) = \mathrm{ReLU}(W_e (x - b_d) + b_e)$, the encoder with pre-centering by the decoder bias
- $\hat{x}(f) = W_d f + b_d$, the linear decoder
- $W_e \in \mathbb{R}^{m \times d}$, $W_d \in \mathbb{R}^{d \times m}$, with dictionary size $m \gg d$
- $\lambda > 0$ controls the sparsity-reconstruction tradeoff
The columns of $W_d$ are the learned feature directions. The encoder activation $f_i(x)$ indicates how much feature $i$ is present in input $x$.
Intuition
The SAE is solving dictionary learning: find a set of basis vectors (the decoder columns) such that each activation can be well-approximated as a sparse linear combination of these basis vectors. The L1 penalty ensures that only a few features are active for any given input, which is what makes the features interpretable (each feature corresponds to a specific, identifiable concept).
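The objective can be sketched numerically. This is a minimal NumPy illustration with randomly initialized (untrained) parameters; the dimensions and the $\lambda$ value are arbitrary, and a real SAE trains $W_e$, $W_d$, and the biases by SGD on activation data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 512            # activation dim and dictionary size (8x overcomplete)
lam = 0.01                # illustrative sparsity coefficient

# Random (untrained) parameters; training drives the codes toward sparsity.
W_e = rng.standard_normal((m, d)) / np.sqrt(d)   # encoder weights
W_d = rng.standard_normal((d, m)) / np.sqrt(m)   # decoder: columns are feature directions
b_e = np.zeros(m)
b_d = np.zeros(d)

def sae_loss(x):
    f = np.maximum(0.0, W_e @ (x - b_d) + b_e)   # encoder: ReLU(W_e (x - b_d) + b_e)
    x_hat = W_d @ f + b_d                        # linear decoder
    recon = np.sum((x - x_hat) ** 2)             # L2 reconstruction error
    sparsity = lam * np.sum(np.abs(f))           # L1 penalty on feature activations
    return recon + sparsity, f

x = rng.standard_normal(d)                       # stand-in for a model activation
loss, f = sae_loss(x)
print(f"loss {loss:.2f}, active features {np.count_nonzero(f)} / {m}")
```

With untrained weights roughly half the features fire; it is the L1 term, minimized over many activations, that pushes the code toward a few active features per input.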
The overcomplete factor $m/d$ is typically 4x to 64x. A 64x SAE on a 4096-dimensional activation space learns $64 \times 4096 = 262{,}144$ feature directions.
Why It Matters
The trained decoder columns become the interpretable feature dictionary. Each column represents a direction in activation space that corresponds to a specific concept, behavior, or pattern. Researchers can then study which features activate on which inputs, how features compose, and how the network uses features to produce outputs. This is the primary tool for understanding what transformer layers compute.
Failure Mode
The sparsity-reconstruction tradeoff is fundamental. Higher $\lambda$ produces sparser (more interpretable) features but worse reconstruction. Lower $\lambda$ produces better reconstruction but features that are less cleanly monosemantic. There is no universal "correct" $\lambda$. Additionally, SAEs can learn dead features (dictionary elements that never activate) and feature splitting (a single concept split across multiple dictionary elements). Both require careful training techniques to mitigate.
The Sparsity-Reconstruction Tradeoff
This is the central design tension in SAE work:
| $\lambda$ setting | Sparsity (L0) | Reconstruction quality | Feature interpretability | Risk |
|---|---|---|---|---|
| Very high | Very sparse | Poor (many features suppressed) | High per-feature, but missing features | Major concepts unrepresented |
| Moderate | Moderate | Good | Good | Some polysemantic leakage |
| Very low | Dense | Near-perfect | Poor (features blend) | Defeats the purpose |
The explained variance, $1 - \lVert x - \hat{x} \rVert^2 / \lVert x - \bar{x} \rVert^2$, is the standard metric for reconstruction quality. Modern SAEs on GPT-2-scale models achieve explained variance above 90% with L0 sparsity of 50-300 active features per input, out of dictionaries of tens of thousands to millions of features (Bricken et al. 2023; Templeton et al. 2024). That is more than 99% of features zero for any given input.
What Good SAE Features Look Like
A well-trained SAE feature should:
- Activate on a coherent set of inputs. Feature 4821 fires on sentences about "legal proceedings" and nothing else.
- Have a clear causal effect when ablated. Zeroing feature 4821 removes legal-related behavior.
- Be sparse. It activates on < 1% of inputs.
- Not split. There should not be three other features that also fire on "legal proceedings."
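The ablation test in the second bullet has a simple mechanical form: because the decoder is linear, zeroing feature $i$ shifts the reconstruction by exactly $-f_i$ times decoder column $i$. A sketch with synthetic weights and codes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 512
W_d = rng.standard_normal((d, m)) / np.sqrt(m)     # decoder; columns are feature directions
b_d = np.zeros(d)
f = np.maximum(0.0, rng.standard_normal(m) - 1.5)  # a sparse feature activation vector

def decode(f):
    return W_d @ f + b_d

i = int(np.flatnonzero(f)[0])     # pick an active feature to ablate
f_ablated = f.copy()
f_ablated[i] = 0.0

# Ablation shifts the reconstruction by exactly -f[i] * (decoder column i).
delta = decode(f) - decode(f_ablated)
print(np.allclose(delta, f[i] * W_d[:, i]))   # True
```

In a real experiment the ablated reconstruction is patched back into the model's forward pass, and the behavioral change attributed to the feature is measured downstream.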
In practice, the top SAE features (by activation frequency) tend to be highly interpretable. The long tail of rare features is harder to interpret because there are fewer examples to study.
Common Confusions
SAE features are not ground truth
SAE features are learned decompositions, not discovered physical entities. Two different SAE training runs with different random seeds will produce different features. The features are useful for analysis, but they are not the "true" features of the network in any objective sense. They are the best sparse decomposition that this particular SAE architecture found.
Overcomplete does not mean you find all features
An SAE with $m$ dictionary elements does not necessarily find all the features the network uses. It finds the directions that best explain the activation distribution under the sparsity constraint. Some features may be missed, especially rare or subtle ones. Increasing $m$ helps but also increases the risk of feature splitting.
SAEs work on activations at a specific layer, not on the whole model
Each SAE is trained on the activations of one specific layer (or one specific component, like the MLP output or the residual stream at a specific position). Different layers may require different SAEs, and the features at different layers will be different. Understanding the full model requires training SAEs at multiple layers and studying how features compose across layers.
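A sketch of what "one SAE per layer" means in practice. The hook-point names are hypothetical and the weights untrained; the point is that each hook point gets its own independently trained SAE, and feature indices are not comparable across them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 512

# Hypothetical: one independently trained SAE per hook point.
hook_points = ["blocks.4.mlp_out", "blocks.4.resid_post", "blocks.8.resid_post"]
saes = {name: {"W_e": rng.standard_normal((m, d)) / np.sqrt(d),
               "b_e": np.zeros(m)}
        for name in hook_points}

def encode(name, x):
    """Encode activations from one specific hook point with that point's SAE."""
    p = saes[name]
    return np.maximum(0.0, p["W_e"] @ x + p["b_e"])

x = rng.standard_normal(d)                     # activation captured at one hook point
codes = {name: encode(name, x) for name in hook_points}
# Feature 7 in one SAE has no relation to feature 7 in another;
# cross-layer analysis must match features by behavior, not by index.
```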
Exercises
Problem
An SAE with dictionary size $m$ is trained on $d$-dimensional activations. On a typical input, 150 features are nonzero. What is the overcomplete factor, in terms of $m$ and $d$? What is the sparsity level, in terms of $m$? Why does the overcomplete factor need to be much larger than 1?
Problem
Explain the connection between the SAE L1 penalty and the LASSO regression penalty. Why does L1 produce sparsity while L2 does not?
References
Canonical:
- Olshausen & Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature 381 (1996). The original sparse coding paper that motivates dictionary learning for feature extraction.
- Elhage et al., "Toy Models of Superposition" (2022). The foundational paper on the superposition hypothesis.
- Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023). Anthropic's SAE work, demonstrating monosemantic features in a one-layer transformer.
Current:
- Cunningham, Ewart, Riggs, Huben, Sharkey (2023), "Sparse Autoencoders Find Highly Interpretable Features in Language Models," arXiv:2309.08600 (ICLR 2024).
- Templeton et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024). SAEs at frontier model scale.
- Scherlis, Sachan, Jermyn, Benton, Shlegeris, "Polysemanticity and Capacity in Neural Networks" (2022), arXiv:2210.01892. Rigorous capacity analysis for superposition.
Next Topics
- Induction heads: a specific interpretable circuit discovered using mechanistic analysis
- Residual stream and transformer internals: the architectural framework for understanding where SAE features live
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Mechanistic Interpretability (Layer 4)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Softmax and Numerical Stability (Layer 1)
- Principal Component Analysis (Layer 1)
- Eigenvalues and Eigenvectors (Layer 0A)
- Singular Value Decomposition (Layer 0A)