

Sparse Autoencoders for Interpretability

Sparse autoencoders decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with sparsity constraints. They are the primary tool for extracting monosemantic features from polysemantic neurons.

Advanced · Tier 2 · Frontier · ~50 min

Why This Matters

Figure: eight features (legal, color, math, food, music, sport, code, weather) packed as nearly-orthogonal directions in a 2D space spanned by two neurons. Interference between two features (e.g. "legal" and "food") matters only when both are active on the same input, which is rare when features are sparse. Sparsity means a low collision rate, and therefore tolerable interference.

Individual neurons in neural networks are often polysemantic: a single neuron activates for multiple unrelated concepts. Neuron 3047 might fire for both "the color blue" and "legal terminology." This makes individual neurons poor units of analysis for understanding what a network has learned.

The superposition hypothesis explains why: networks represent more features than they have neurons by encoding features as directions in activation space, not as individual neuron axes. When features are sparse (each feature is active on only a small fraction of inputs), many more features can coexist in a lower-dimensional space through nearly-orthogonal packing.

Sparse autoencoders (SAEs) are the tool for recovering these features. They learn an overcomplete dictionary of feature directions from activation data, producing features that are more interpretable than raw neurons.

The Superposition Hypothesis

Proposition

Superposition Hypothesis

Statement

A neural network with activation dimension $d$ can represent $F \gg d$ features by encoding each feature as a direction in $\mathbb{R}^d$. Elhage et al. 2022's toy-model analysis shows that in certain regimes the number of representable features scales roughly as $F \sim d/p$, where $p$ is the per-feature activation probability. In the very-sparse limit ($p \to 0$), $F$ can scale even faster, approaching the Kabatyanskii-Levenshtein bound of $\exp(\Omega(d \varepsilon^2))$ nearly-orthogonal directions (pairwise inner product $\leq \varepsilon$). The $d/p$ approximation should be read as an informal scaling observation from a toy model, not a tight theoretical bound. See Scherlis, Sachan, Jermyn, Benton, Shlegeris 2022 ("Polysemanticity and Capacity in Neural Networks", arXiv:2210.01892) for a more rigorous capacity analysis.

The features are packed as nearly-orthogonal directions. The interference between any two features $i, j$ is proportional to $|\langle f_i, f_j \rangle|$, which is small but nonzero. The network tolerates this interference because, by sparsity, it is rare for two interfering features to be active simultaneously.

Intuition

In $\mathbb{R}^d$, you can fit exactly $d$ orthogonal directions. But if you allow small angles between directions (small interference), you can fit exponentially more nearly-orthogonal directions. This is the Johnson-Lindenstrauss phenomenon applied to feature representation. The network exploits sparsity: if two features that interfere are rarely active at the same time, the interference rarely causes errors.
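This packing behavior is easy to check numerically. The sketch below (numpy; the dimensions are illustrative, not from the text) samples random unit vectors and measures the worst-case pairwise interference $|\langle f_i, f_j \rangle|$: in 2D, eight directions must collide heavily, while in 512D the same eight directions are nearly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_interference(n_features, d):
    """Sample n_features random unit vectors in R^d and return the
    largest pairwise |<f_i, f_j>|, i.e. the worst-case interference."""
    F = rng.normal(size=(n_features, d))
    F /= np.linalg.norm(F, axis=1, keepdims=True)
    G = np.abs(F @ F.T)        # matrix of |inner products|
    np.fill_diagonal(G, 0.0)   # ignore each vector's self-similarity
    return G.max()

# In 2D, 8 directions are forced to overlap (some pair is within 22.5 degrees).
print(max_interference(8, 2))
# In 512D, 8 random directions interfere only weakly (shrinks like 1/sqrt(d)).
print(max_interference(8, 512))
```

The interference for random directions scales roughly as $1/\sqrt{d}$, which is why high-dimensional activation spaces can absorb many sparse features with tolerable collision rates.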

Why It Matters

Superposition explains why looking at individual neurons is misleading. The meaningful unit is not a neuron axis but a feature direction. This is why you need sparse autoencoders: they recover the feature directions from the superposed representation, producing a dictionary of monosemantic features from a polysemantic activation space.

Failure Mode

The hypothesis is well-supported empirically in toy models and small transformers. It is less clear how precisely it applies to frontier models with billions of parameters, where the relationship between features, neurons, and directions may be more complex. The toy model results (Elhage et al., 2022) demonstrate superposition convincingly in controlled settings, but scaling to GPT-4-class models is ongoing work.

The SAE Objective

Proposition

Sparse Autoencoder Objective

Statement

A sparse autoencoder learns an encoder $f: \mathbb{R}^d \to \mathbb{R}^M$ and decoder $g: \mathbb{R}^M \to \mathbb{R}^d$ that minimize:

$$\mathcal{L} = \|x - g(f(x))\|_2^2 + \lambda \|f(x)\|_1$$

where:

  • $f(x) = \mathrm{ReLU}(W_{\text{enc}}(x - b_{\text{dec}}) + b_{\text{enc}})$, the encoder with pre-centering
  • $g(z) = W_{\text{dec}} z + b_{\text{dec}}$, the linear decoder
  • $W_{\text{enc}} \in \mathbb{R}^{M \times d}$, $W_{\text{dec}} \in \mathbb{R}^{d \times M}$
  • $\lambda > 0$ controls the sparsity-reconstruction tradeoff

The decoder columns $W_{\text{dec}}[:, j]$ are the learned feature directions. The encoder activations $f(x)_j$ indicate how much feature $j$ is present in input $x$.
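To make the objective concrete, here is a minimal numpy sketch of the encoder, decoder, and loss defined above. The weights are random and untrained (the training loop is omitted), and the dimensions and $\lambda$ value are illustrative assumptions, not values from a real SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 64, 512   # activation dim and dictionary size (8x overcomplete, for illustration)

# Randomly initialised SAE parameters.
W_enc = rng.normal(0, 0.1, size=(M, d))
b_enc = np.zeros(M)
W_dec = rng.normal(0, 0.1, size=(d, M))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d)

def encode(x):
    # f(x) = ReLU(W_enc (x - b_dec) + b_enc): pre-center, project, rectify.
    return np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)

def decode(z):
    # g(z) = W_dec z + b_dec: sparse linear combination of decoder columns.
    return W_dec @ z + b_dec

def loss(x, lam=1e-3):
    z = encode(x)
    recon = np.sum((x - decode(z)) ** 2)  # ||x - g(f(x))||_2^2
    sparsity = np.sum(np.abs(z))          # ||f(x)||_1
    return recon + lam * sparsity

x = rng.normal(size=d)
print(loss(x))
```

Normalizing the decoder columns to unit norm is a common convention (it prevents the L1 penalty from being gamed by shrinking activations while growing decoder weights), though specific SAE variants handle this differently.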

Intuition

The SAE is solving dictionary learning: find a set of $M$ basis vectors (the decoder columns) such that each activation $x$ can be well-approximated as a sparse linear combination of these basis vectors. The L1 penalty ensures that only a few features are active for any given input, which is what makes the features interpretable (each feature corresponds to a specific, identifiable concept).

The overcomplete factor $M/d$ is typically 4x to 64x. A 64x SAE on a 4096-dimensional activation space learns $\sim$262,000 feature directions.
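The arithmetic behind that figure:

```python
# Dictionary size for a 64x-overcomplete SAE on a 4096-dim activation space.
d = 4096       # activation dimension
factor = 64    # overcomplete factor M/d
M = factor * d
print(M)       # 262144 feature directions
```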

Why It Matters

The trained decoder columns become the interpretable feature dictionary. Each column represents a direction in activation space that corresponds to a specific concept, behavior, or pattern. Researchers can then study which features activate on which inputs, how features compose, and how the network uses features to produce outputs. This is the primary tool for understanding what transformer layers compute.

Failure Mode

The sparsity-reconstruction tradeoff is fundamental. Higher $\lambda$ produces sparser (more interpretable) features but worse reconstruction. Lower $\lambda$ produces better reconstruction but features that are less cleanly monosemantic. There is no universal "correct" $\lambda$. Additionally, SAEs can learn dead features (dictionary elements that never activate) and feature splitting (a single concept split across multiple dictionary elements). Both require careful training techniques to mitigate.

The Sparsity-Reconstruction Tradeoff

This is the central design tension in SAE work:

| Setting | Sparsity ($\lambda$) | Reconstruction quality | Feature interpretability | Risk |
| --- | --- | --- | --- | --- |
| Very high $\lambda$ | Very sparse | Poor (many features suppressed) | High per-feature, but missing features | Major concepts unrepresented |
| Moderate $\lambda$ | Moderate | Good | Good | Some polysemantic leakage |
| Very low $\lambda$ | Dense | Near-perfect | Poor (features blend) | Defeats the purpose |

The explained variance $R^2 = 1 - \|x - g(f(x))\|^2 / \|x\|^2$ is the standard metric for reconstruction quality. Modern SAEs on GPT-2-scale models achieve $R^2 > 0.90$ with an L0 sparsity of 50-300 active features per input out of dictionaries of tens of thousands to millions of features (Bricken et al. 2023; Templeton et al. 2024). That is, more than 99% of features are zero for any given input.
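Both metrics are straightforward to compute. A minimal numpy sketch, with synthetic data standing in for real activations and SAE codes:

```python
import numpy as np

rng = np.random.default_rng(0)

def explained_variance(X, X_hat):
    """R^2 = 1 - ||X - X_hat||^2 / ||X||^2, summed over a batch."""
    return 1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

def l0_sparsity(Z):
    """Average number of nonzero features per input (the 'L0' metric)."""
    return float(np.mean(np.count_nonzero(Z, axis=1)))

X = rng.normal(size=(100, 64))               # batch of activations
X_hat = X + 0.1 * rng.normal(size=X.shape)   # imperfect reconstruction
print(explained_variance(X, X_hat))          # close to 1: small residual error

Z = (rng.random((100, 1000)) < 0.05).astype(float)  # codes with ~5% of features active
print(l0_sparsity(Z))                        # around 50 active features per input
```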

What Good SAE Features Look Like

A well-trained SAE feature should:

  1. Activate on a coherent set of inputs. Feature 4821 fires on sentences about "legal proceedings" and nothing else.
  2. Have a clear causal effect when ablated. Zeroing feature 4821 removes legal-related behavior.
  3. Be sparse. It activates on < 1% of inputs.
  4. Not split. There should not be three other features that also fire on "legal proceedings."

In practice, the top SAE features (by activation frequency) tend to be highly interpretable. The long tail of rare features is harder to interpret because there are fewer examples to study.
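The frequency-based criteria above (coherent activation, sparsity, dead features) can be checked mechanically from a batch of SAE codes. A small numpy sketch, with synthetic codes standing in for a real SAE's outputs:

```python
import numpy as np

def feature_stats(Z):
    """Per-feature activation frequency over a batch of SAE codes Z
    (shape: n_inputs x M). A feature is 'dead' if it never activates."""
    freq = np.mean(Z > 0.0, axis=0)      # fraction of inputs on which each feature fires
    dead = np.flatnonzero(freq == 0.0)   # dictionary elements that never activate
    return freq, dead

# Synthetic codes: 1000 inputs, 200 features, each firing on ~2% of inputs,
# with the first 5 features forced dead to mimic dead dictionary elements.
rng = np.random.default_rng(0)
Z = (rng.random((1000, 200)) < 0.02) * rng.random((1000, 200))
Z[:, :5] = 0.0

freq, dead = feature_stats(Z)
print(len(dead))       # the forced-dead features show up here
print(freq.max())      # every feature is sparse: fires on roughly 2% of inputs
```

On real SAE codes, sorting `freq` descending gives the head of highly interpretable frequent features and the long tail of rare ones described above.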

Common Confusions

Watch Out

SAE features are not ground truth

SAE features are learned decompositions, not discovered physical entities. Two different SAE training runs with different random seeds will produce different features. The features are useful for analysis, but they are not the "true" features of the network in any objective sense. They are the best sparse decomposition that this particular SAE architecture found.

Watch Out

Overcomplete does not mean you find all features

An SAE with $M = 64d$ dictionary elements does not necessarily find all the features the network uses. It finds the $M$ directions that best explain the activation distribution under the sparsity constraint. Some features may be missed, especially rare or subtle ones. Increasing $M$ helps but also increases the risk of feature splitting.

Watch Out

SAEs work on activations at a specific layer, not on the whole model

Each SAE is trained on the activations of one specific layer (or one specific component, like the MLP output or the residual stream at a specific position). Different layers may require different SAEs, and the features at different layers will be different. Understanding the full model requires training SAEs at multiple layers and studying how features compose across layers.

Exercises

Exercise (Core)

Problem

An SAE with dictionary size $M = 16{,}384$ is trained on $d = 512$-dimensional activations. On a typical input, 150 features are nonzero. What is the overcomplete factor? What is the sparsity level? Why does the overcomplete factor need to be much larger than 1?

Exercise (Advanced)

Problem

Explain the connection between the SAE L1 penalty and the LASSO regression penalty. Why does L1 produce sparsity while L2 does not?

References

Canonical:

  • Olshausen & Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature 381 (1996). The original sparse coding paper that motivates dictionary learning for feature extraction.
  • Elhage et al., "Toy Models of Superposition" (2022). The foundational paper on the superposition hypothesis.
  • Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023). Anthropic's SAE work on Claude.

Current:

  • Cunningham, Ewart, Riggs, Huben, Sharkey (2023), "Sparse Autoencoders Find Highly Interpretable Features in Language Models," arXiv:2309.08600 (ICLR 2024).
  • Templeton et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024). SAEs at frontier model scale.
  • Scherlis, Sachan, Jermyn, Benton, Shlegeris, "Polysemanticity and Capacity in Neural Networks" (2022), arXiv:2210.01892. Rigorous capacity analysis for superposition.

Last reviewed: April 2026