
Autoencoders

Encoder-decoder architectures for unsupervised representation learning: undercomplete bottlenecks, sparse and denoising variants, and the connection between linear autoencoders and PCA.

Advanced · Tier 2 · Stable · ~45 min

Why This Matters

Autoencoders are the simplest framework for unsupervised representation learning: compress data through a bottleneck, then reconstruct it, and the bottleneck representation captures the most important structure. This idea is foundational. It leads directly to variational autoencoders, which connect to probabilistic generative modeling, and to modern representation learning methods used in self-supervised learning.

Mental Model

Think of an autoencoder as a compression algorithm that learns what to compress. The encoder maps high-dimensional input to a low-dimensional code. The decoder maps the code back to the original space. Training minimizes reconstruction error, so whatever structure the code must capture to reconstruct well is the learned representation.

Formal Setup

Definition

Autoencoder

An autoencoder consists of two functions:

  • Encoder: f_\theta: \mathbb{R}^D \to \mathbb{R}^d, mapping input x to a latent code z = f_\theta(x)
  • Decoder: g_\phi: \mathbb{R}^d \to \mathbb{R}^D, mapping the code back to a reconstruction \hat{x} = g_\phi(z)

Training minimizes the reconstruction loss:

\mathcal{L}(\theta, \phi) = \frac{1}{n}\sum_{i=1}^{n} \|x_i - g_\phi(f_\theta(x_i))\|^2

The composition g_\phi \circ f_\theta learns to approximate the identity function, but only through the information bottleneck of the code z.
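As a concrete sketch, the encoder/decoder pair and the reconstruction loss above can be written in a few lines of NumPy. The dimensions, the tanh nonlinearity, and the weights here are illustrative assumptions; the weights are random and untrained, so the loss is only being evaluated, not minimized:

```python
import numpy as np

rng = np.random.default_rng(0)

D, d, n = 20, 5, 100            # input dim, code dim, batch size (illustrative)
X = rng.normal(size=(n, D))     # toy data batch

# Untrained one-layer weights, just to evaluate the objective
W_enc = rng.normal(scale=0.1, size=(D, d))
W_dec = rng.normal(scale=0.1, size=(d, D))

def encode(x):
    # f_theta: R^D -> R^d with a tanh nonlinearity
    return np.tanh(x @ W_enc)

def decode(z):
    # g_phi: R^d -> R^D (linear output for real-valued data)
    return z @ W_dec

def reconstruction_loss(x):
    # mean squared reconstruction error over the batch
    x_hat = decode(encode(x))
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

print(reconstruction_loss(X))   # nonzero: untrained weights reconstruct poorly
```

In practice both maps would be deep networks trained by gradient descent; the objective is exactly the loss function computed here.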

Undercomplete Autoencoders

Definition

Undercomplete Autoencoder

An autoencoder is undercomplete when the code dimension d is smaller than the input dimension D (i.e., d < D). The bottleneck forces the encoder to learn a compressed representation that retains the most important information for reconstruction.

If d \geq D and the encoder/decoder have enough capacity, the autoencoder can learn the identity function, which captures nothing useful about the data structure. The bottleneck is what makes autoencoders nontrivial.

The Linear Autoencoder = PCA Connection

Theorem

Linear Autoencoders Recover PCA

Statement

If the encoder and decoder are both linear (no nonlinearity), then the optimal d-dimensional autoencoder (minimizing reconstruction error) recovers the subspace spanned by the top d principal components of the data. Specifically, the optimal encoder projects onto the span of the top d eigenvectors of the data covariance matrix \Sigma = \frac{1}{n}X^\top X.

Intuition

PCA finds the d-dimensional subspace that preserves the most variance; equivalently, the one that minimizes reconstruction error under orthogonal projection. A linear autoencoder with a d-dimensional bottleneck solves exactly the same optimization problem, just without the orthogonality constraint. The optimal solution finds the same subspace (though the encoder/decoder matrices may not be orthogonal individually).

Proof Sketch

Let the encoder be z = Wx with W \in \mathbb{R}^{d \times D} and the decoder be \hat{x} = W'z with W' \in \mathbb{R}^{D \times d}. The reconstruction loss is \|x - W'Wx\|^2. The optimal W'W is the rank-d projection onto the top d eigenvectors of the covariance matrix, by the Eckart-Young theorem (the best rank-d approximation in Frobenius norm comes from the truncated SVD).

Why It Matters

This theorem reveals that nonlinearity is what makes deep autoencoders more powerful than PCA. Without nonlinear activations, you are just doing PCA with extra steps. Nonlinear autoencoders can learn curved, nonlinear manifolds that PCA (restricted to linear subspaces) cannot capture.

Failure Mode

The equivalence is exact only for linear activations and squared loss. With nonlinear activations, the autoencoder can capture nonlinear structure but also has more complex loss landscapes with local minima.
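The theorem is easy to check numerically: project centered data onto the top-d covariance eigenvectors and compare with the best rank-d reconstruction from the truncated SVD. This NumPy sketch uses synthetic data with decaying per-coordinate variance (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, d = 500, 10, 3

# Synthetic centered data with decaying variance per coordinate
X = rng.normal(size=(n, D)) * np.linspace(3.0, 0.3, D)
X -= X.mean(axis=0)

# PCA: project onto the top-d eigenvectors of the covariance matrix
Sigma = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending eigenvalue order
V = eigvecs[:, ::-1][:, :d]                 # top-d eigenvectors (D x d)
pca_err = np.sum((X - X @ V @ V.T) ** 2)

# Best rank-d linear reconstruction via truncated SVD (Eckart-Young)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_err = np.sum((X - (U[:, :d] * s[:d]) @ Vt[:d]) ** 2)

# The two errors coincide: the optimal linear autoencoder finds the PCA subspace
print(pca_err, svd_err)
```

The sign of each eigenvector is arbitrary, but the projection V V^\top is unique, which is why the comparison is made on reconstructions rather than on the matrices themselves.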

Sparse Autoencoders

Definition

Sparse Autoencoder

A sparse autoencoder uses an overcomplete code (d \geq D) but adds a sparsity penalty on the activations:

\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\|x_i - g_\phi(f_\theta(x_i))\|^2 + \lambda \sum_{j=1}^{d} |z_j|

The \ell_1 penalty (or a KL divergence from a target activation level) encourages most code units to be zero for any given input. Each input activates only a small subset of features, and each feature specializes in a particular pattern.

Sparse autoencoders are useful for feature discovery: the learned features are often interpretable (edges, textures, object parts for images). They are also used in mechanistic interpretability to decompose neural network activations into interpretable directions.
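A minimal sketch of the sparse objective, assuming a ReLU encoder and random untrained weights (the \lambda value and all dimensions are illustrative; here the per-example \ell_1 penalty is averaged over the batch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, d = 64, 16, 64            # overcomplete code: d >= D
lam = 0.1                       # sparsity strength (hypothetical value)

X = rng.normal(size=(n, D))
W_enc = rng.normal(scale=0.1, size=(D, d))
W_dec = rng.normal(scale=0.1, size=(d, D))

def sparse_loss(x):
    z = np.maximum(0.0, x @ W_enc)                   # ReLU encoder: nonnegative codes
    x_hat = z @ W_dec
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))    # l1 penalty, averaged over the batch
    return recon + lam * sparsity

print(sparse_loss(X))
```

Training trades reconstruction quality against sparsity: larger \lambda drives more code units to exactly zero per input.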

Denoising Autoencoders

Definition

Denoising Autoencoder

A denoising autoencoder corrupts each input x to produce \tilde{x} (e.g., by adding Gaussian noise or randomly masking features) and trains to reconstruct the clean original:

\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\|x_i - g_\phi(f_\theta(\tilde{x}_i))\|^2

where \tilde{x}_i = x_i + \varepsilon with \varepsilon \sim \mathcal{N}(0, \sigma^2 I), or \tilde{x}_i is x_i with random entries zeroed out.

The denoising objective prevents the autoencoder from learning the identity even without a bottleneck. To reconstruct clean data from noisy input, the model must learn the structure of the data manifold: it must know what "clean" looks like. This connects to score matching: the optimal denoising function is related to the gradient of the log data density.
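The two corruption schemes and the denoising loss can be sketched as follows. The identity `model` is only a placeholder to show the shape of the objective (clean target, corrupted input), not a trained network; the noise levels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(32, 8))    # toy clean data

def corrupt_gaussian(x, sigma=0.5):
    # Additive noise: x_tilde = x + eps, eps ~ N(0, sigma^2 I)
    return x + sigma * rng.normal(size=x.shape)

def corrupt_mask(x, p=0.3):
    # Masking noise: each entry zeroed independently with probability p
    return x * (rng.random(x.shape) >= p)

def denoising_loss(x_clean, x_tilde, model=lambda t: t):
    # Reconstruct the CLEAN x from the corrupted input x_tilde
    return np.mean(np.sum((x_clean - model(x_tilde)) ** 2, axis=1))

X_noisy = corrupt_gaussian(X)
X_masked = corrupt_mask(X)
print(denoising_loss(X, X_noisy), denoising_loss(X, X_masked))
```

Note that the identity model pays the full corruption cost here; a trained denoiser lowers the loss precisely by mapping corrupted points back toward the data manifold.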

Canonical Examples

Example

Image reconstruction with convolutional autoencoder

For 28 \times 28 MNIST images (D = 784), a convolutional encoder reduces spatial dimensions via strided convolutions down to a d = 32 dimensional code vector. The decoder uses transposed convolutions to upsample back to 28 \times 28. Reconstruction error is typically MSE. The 32-dimensional code captures digit identity, stroke thickness, and orientation: the factors of variation that matter for reconstruction.

Example

Linear autoencoder on 2D data

Consider 1000 points in \mathbb{R}^{10} lying near a 2D plane (with small noise). A linear autoencoder with d = 2 learns to project onto this plane, exactly what PCA does. The reconstruction error equals the variance in the 8 discarded dimensions. A nonlinear autoencoder with d = 2 could capture a curved manifold that PCA misses.
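This example can be verified numerically: for centered data, the mean reconstruction error of the optimal rank-2 linear projection equals the sum of the discarded covariance eigenvalues. A NumPy sketch with a synthetic near-planar dataset (the plane, scales, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, D, d, sigma = 1000, 10, 2, 0.05

# Points near a random 2D plane in R^10, plus small isotropic noise
B, _ = np.linalg.qr(rng.normal(size=(D, d)))     # orthonormal basis of the plane
Z = rng.normal(size=(n, d)) * np.array([3.0, 1.5])
X = Z @ B.T + sigma * rng.normal(size=(n, D))
X -= X.mean(axis=0)

Sigma_hat = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)     # ascending eigenvalue order
V = eigvecs[:, -d:]                              # top-2 principal directions

# Mean reconstruction error of the optimal d=2 linear projection...
recon_err = np.mean(np.sum((X - X @ V @ V.T) ** 2, axis=1))

# ...equals the total variance in the 8 discarded eigendirections
discarded_variance = eigvals[:-d].sum()
print(recon_err, discarded_variance)
```

Because the noise is small, the discarded variance (and hence the reconstruction error) is tiny compared with the variance captured in the plane.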

Common Confusions

Watch Out

Overcomplete autoencoders are not useless

If d > D with no regularization, the autoencoder can learn the identity (the trivial solution). But with sparsity, denoising, or other regularization, overcomplete autoencoders learn useful representations. The constraint is not just dimensionality; it is the information bottleneck in a broader sense.
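The trivial solution is easy to exhibit in the linear case: with d > D, a decoder given by the pseudo-inverse of the encoder reproduces every input exactly, which is precisely what a regularizer must rule out. A small NumPy sketch (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
D, d = 8, 16                    # overcomplete: d > D

# The random encoder W has full column rank almost surely; the pseudo-inverse
# decoder then inverts it exactly, so decode(encode(x)) = x for every x.
W = rng.normal(size=(d, D))     # encoder (d x D, acting on column vectors)
W_dec = np.linalg.pinv(W)       # decoder (D x d); pinv(W) @ W = I_D

X = rng.normal(size=(100, D))
X_hat = X @ W.T @ W_dec.T       # row-vector form of decode(encode(x))
print(np.max(np.abs(X - X_hat)))   # ~0: exact identity, zero reconstruction loss
```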

Watch Out

Autoencoders are not generative models by default

A standard autoencoder learns to reconstruct training data, but the latent space has no particular structure. Sampling a random point in the latent space and decoding it typically produces garbage. Making autoencoders generative requires imposing structure on the latent space. This is exactly what variational autoencoders do.

Summary

  • Autoencoders learn representations by minimizing reconstruction error through a bottleneck
  • Undercomplete (d < D): the bottleneck forces compression
  • Linear autoencoder with squared loss = PCA (same subspace, possibly different basis)
  • Sparse autoencoders: an overcomplete code with an \ell_1 penalty encourages interpretable features
  • Denoising autoencoders: corrupt the input, reconstruct the clean original; this learns the data manifold without needing a bottleneck
  • Standard autoencoders are not generative models; the latent space lacks structure for sampling

Exercises

ExerciseCore

Problem

A linear autoencoder with encoder W \in \mathbb{R}^{2 \times 10} and decoder W' \in \mathbb{R}^{10 \times 2} is trained on centered data. What is the rank of the reconstruction W'Wx, and how does this relate to PCA?

ExerciseAdvanced

Problem

Why does a denoising autoencoder learn useful representations even without a bottleneck (when d = D)? What prevents it from learning the identity?

References

Canonical:

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 14
  • Vincent et al., "Extracting and Composing Robust Features with Denoising Autoencoders" (2008)
  • Bishop, Pattern Recognition and Machine Learning (2006)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Current:

  • Cunningham et al., "Sparse Autoencoders Find Highly Interpretable Features in Language Models" (2023)

Next Topics

The natural next steps from autoencoders:

  • Variational autoencoders: making the latent space generative
  • Dimensionality reduction: PCA and nonlinear alternatives

Last reviewed: April 2026
