Variational Autoencoders
Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.
Why This Matters
The VAE is where deep learning meets probabilistic modeling. It solves a fundamental problem: how to learn a generative model when the data involves latent variables that make the marginal likelihood intractable. The solution, the evidence lower bound (ELBO) and amortized inference, relies on KL divergence to measure the gap between the approximate and true posterior. This is one of the most load-bearing constructions in modern ML and underpins much of generative AI.
Mental Model
You want to learn a generative model: sample $z$ from a simple prior $p(z)$ (a Gaussian), then decode it into data $x$ via $p_\theta(x \mid z)$. The problem is inference: given an observed $x$, which $z$ likely generated it? The true posterior $p_\theta(z \mid x)$ is intractable. The VAE learns an approximate posterior $q_\phi(z \mid x)$ (the encoder) jointly with the generative model $p_\theta(x \mid z)$ (the decoder) by maximizing a lower bound on the log-likelihood.
The Generative Model
VAE Generative Model
The VAE defines a latent variable model:
- Prior: $p(z) = \mathcal{N}(0, I)$
- Likelihood (decoder): $p_\theta(x \mid z)$, parameterized by a neural network that maps $z$ to the parameters of a distribution over $x$

The marginal likelihood (evidence) is:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

This integral is intractable for nonlinear decoders because it requires integrating over all possible latent codes.
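A quick numerical illustration of why: the naive Monte Carlo estimate $p_\theta(x) \approx \frac{1}{n}\sum_i p_\theta(x \mid z_i)$ with $z_i \sim p(z)$ is dominated by rare prior samples that happen to explain $x$, so it converges painfully slowly. This sketch uses a hypothetical toy decoder (a random `tanh` layer, not from the text) just to show the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))        # hypothetical decoder weights, for illustration only
x = rng.normal(size=d)             # an "observed" data point

def decoder_mean(z):
    # Toy nonlinear "decoder": z -> mean of a Gaussian over x.
    # Any nonlinearity breaks conjugacy, so the integral has no closed form.
    return np.tanh(z @ W)

def log_lik(x, z, noise_var=0.1):
    # log N(x | decoder_mean(z), noise_var * I); z may be a batch of shape (n, d)
    diff = x - decoder_mean(z)
    return -0.5 * np.sum(diff**2, axis=-1) / noise_var \
           - 0.5 * d * np.log(2 * np.pi * noise_var)

# Naive Monte Carlo over the prior: most samples explain x terribly,
# so the estimate is driven by rare lucky draws and is very unstable.
for n in [100, 10_000, 1_000_000]:
    z = rng.normal(size=(n, d))
    log_p = log_lik(x, z)
    # log-mean-exp for numerical stability
    est = np.max(log_p) + np.log(np.mean(np.exp(log_p - np.max(log_p))))
    print(f"n={n:>9}: log p(x) estimate = {est:.2f}")
```

The estimates keep drifting as $n$ grows, which is exactly the failure the ELBO sidesteps by sampling $z$ from a learned $q_\phi(z \mid x)$ instead of the prior.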
Deriving the ELBO
The key insight: since we cannot compute $\log p_\theta(x)$ directly, we derive a tractable lower bound.
Evidence Lower Bound (ELBO)
Statement
For any distribution $q_\phi(z \mid x)$, the log marginal likelihood satisfies:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

This lower bound is called the ELBO (Evidence Lower Bound). Equality holds when $q_\phi(z \mid x) = p_\theta(z \mid x)$, the true posterior.
Intuition
The ELBO has two terms pulling in opposite directions:
- Reconstruction term: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ encourages the decoder to reconstruct $x$ from codes sampled via the encoder. Wants the encoder to be informative.
- KL term: $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$ encourages the encoder distribution to stay close to the prior $p(z)$. Wants the latent space to be structured and smooth.
The tension between these terms is the VAE tradeoff: be informative enough to reconstruct, but regular enough that the latent space has meaningful structure for generation.
Proof Sketch
Start with the log-evidence and introduce $q_\phi(z \mid x)$:

$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$

Apply Jensen's inequality ($\log$ is concave):

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$

Expand the logarithm:

$$= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
Why It Matters
The ELBO transforms an intractable maximum likelihood problem into a tractable optimization. The gap between $\log p_\theta(x)$ and the ELBO is exactly $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$, the approximation quality of the encoder. Maximizing the ELBO simultaneously fits the generative model and improves the approximate posterior.
Failure Mode
The ELBO can be loose if $q_\phi$ is too simple to approximate the true posterior (e.g., a diagonal Gaussian when the true posterior is multimodal). This leads to posterior collapse: the model ignores the latent variables ($q_\phi(z \mid x) \approx p(z)$, KL $\to 0$) and relies entirely on a powerful decoder.
An equivalent derivation shows the gap directly:

$$\log p_\theta(x) = \mathrm{ELBO}(x) + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$

Since KL divergence is non-negative, $\log p_\theta(x) \ge \mathrm{ELBO}(x)$.
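This identity can be checked numerically in a conjugate model where everything is available in closed form. The sketch below uses the linear-Gaussian model $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, \sigma^2)$ (chosen here for illustration, not from the text), where the exact marginal and posterior are standard Gaussian-conditioning results:

```python
import numpy as np

def log_normal(x, mu, var):
    # log density of N(x | mu, var), univariate
    return -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))

def kl_gauss(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) ), univariate closed form
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Conjugate model: z ~ N(0, 1), x | z ~ N(z, sigma2).
sigma2 = 0.5
x = 1.3

# Exact marginal p(x) = N(0, 1 + sigma2) and exact posterior p(z|x).
log_px = log_normal(x, 0.0, 1.0 + sigma2)
post_mean = x / (1.0 + sigma2)
post_var = sigma2 / (1.0 + sigma2)

# An arbitrary (deliberately wrong) approximate posterior q = N(m, s2).
m, s2 = 0.2, 0.8

# ELBO = E_q[log p(x|z)] - KL(q || p(z)); both terms are analytic here,
# since E_q[(x - z)^2] = (x - m)^2 + s2.
expected_loglik = -0.5 * (((x - m) ** 2 + s2) / sigma2 + np.log(2 * np.pi * sigma2))
elbo = expected_loglik - kl_gauss(m, s2, 0.0, 1.0)

gap = kl_gauss(m, s2, post_mean, post_var)   # KL( q || p(z|x) )
print(f"log p(x)   = {log_px:.6f}")
print(f"ELBO + gap = {elbo + gap:.6f}")      # matches exactly: log p(x) = ELBO + KL(q || p(z|x))
```

However bad the choice of $q$, the ELBO plus the posterior KL reconstructs $\log p_\theta(x)$ exactly; improving $q$ moves mass from the gap into the bound.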
The Reparameterization Trick
Reparameterization Trick
The reconstruction term requires sampling $z \sim q_\phi(z \mid x)$, but we cannot backpropagate through a sampling operation.

The reparameterization trick expresses the sample as a deterministic function of the parameters and an independent noise variable:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the mean and standard deviation output by the encoder network.

Now the randomness is in $\epsilon$ (which does not depend on $\phi$), and $z$ is a differentiable function of $\phi$. Standard backpropagation works.
Without reparameterization, you would need high-variance score function estimators (REINFORCE). The reparameterization trick gives low-variance gradient estimates, making VAE training practical.
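The variance difference is easy to demonstrate on a toy objective. The sketch below (an illustrative setup, not from the text) estimates $\nabla_\mu\, \mathbb{E}_{z \sim \mathcal{N}(\mu,\sigma^2)}[z^2]$, whose true value is $2\mu$, with both estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 1.0, 1.0, 10_000

# Reparameterize: z = mu + sigma * eps, eps ~ N(0, 1).
eps = rng.normal(size=n)
z = mu + sigma * eps

# Pathwise (reparameterization) estimator: differentiate through the sample,
# d(z^2)/d(mu) = 2z since dz/d(mu) = 1.
grad_reparam = 2 * z

# Score-function (REINFORCE) estimator: f(z) * d log N(z; mu, sigma^2) / d(mu).
grad_score = z**2 * (z - mu) / sigma**2

print(f"true gradient       : {2 * mu:.3f}")
print(f"reparam   mean/var  : {grad_reparam.mean():.3f} / {grad_reparam.var():.2f}")
print(f"REINFORCE mean/var  : {grad_score.mean():.3f} / {grad_score.var():.2f}")
```

Both estimators are unbiased, but the score-function estimator's variance is several times larger even in this one-dimensional example; the gap widens with dimensionality, which is why reparameterization made VAE training practical.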
The KL Term in Detail
For the standard VAE with Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and standard normal prior $p(z) = \mathcal{N}(0, I)$, the KL divergence has a closed form:

$$\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

This is computed analytically; no sampling needed. Each latent dimension contributes independently, making it easy to monitor which dimensions are active (significantly different from the prior).
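The closed form is easy to verify against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. A minimal sketch with arbitrary illustrative values for $\mu$ and $\log \sigma^2$:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ):
    # 0.5 * sum( mu^2 + sigma^2 - log sigma^2 - 1 )
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0])        # illustrative encoder outputs
log_var = np.array([0.0, -1.0, 0.3])
sigma = np.exp(0.5 * log_var)

analytic = kl_to_standard_normal(mu, log_var)

# Monte Carlo check: KL = E_{z ~ q}[ log q(z) - log p(z) ].
z = mu + sigma * rng.normal(size=(200_000, 3))
log_q = np.sum(-0.5 * (((z - mu) / sigma) ** 2 + np.log(2 * np.pi) + log_var), axis=1)
log_p = np.sum(-0.5 * (z**2 + np.log(2 * np.pi)), axis=1)
mc = np.mean(log_q - log_p)

print(f"analytic KL    : {analytic:.4f}")
print(f"Monte Carlo KL : {mc:.4f}")
```

In practice the encoder outputs $\log \sigma^2$ rather than $\sigma$ so that the variance is positive by construction, which is why the formula above is written in terms of `log_var`.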
Connection to EM
Amortized Variational EM
The classical EM algorithm for latent variable models alternates:
- E-step: compute the posterior $p_\theta(z \mid x_i)$ for each data point $x_i$
- M-step: maximize the expected complete-data log-likelihood with respect to the model parameters $\theta$
The VAE can be viewed as amortized variational EM:
- The encoder $q_\phi(z \mid x)$ replaces the E-step. It amortizes inference by learning a single network that works for all $x$, rather than running a separate optimization for each data point
- The decoder $p_\theta(x \mid z)$ corresponds to the M-step
- Both are optimized jointly via gradient descent on the ELBO
Classical variational inference computes a separate approximate posterior $q_i(z)$ for each observation $x_i$ (expensive). Amortization is what makes VAEs scalable: one forward pass through the encoder gives the approximate posterior for any $x$.
Common Confusions
The KL term is not just a penalty. It has a precise information-theoretic meaning
A common misunderstanding is that the KL term is a "regularizer" added for convenience, like weight decay. It is not. The KL term arises necessarily from the ELBO derivation. It measures how much information the encoder extracts about the specific input $x$ beyond what the prior already provides. Setting the KL weight to anything other than 1 (as in the beta-VAE) changes the objective away from a valid lower bound on $\log p_\theta(x)$.
VAEs do not optimize reconstruction plus a penalty
The ELBO looks like "reconstruction - KL", which tempts people to treat it as a penalized autoencoder. But the correct interpretation is: the ELBO is a lower bound on the log-evidence, derived from first principles. The reconstruction and KL terms are not independent objectives. They are two parts of a single variational inference procedure. Changing their relative weight changes the probabilistic semantics.
Posterior collapse is not a bug in the ELBO
When a powerful autoregressive decoder can model $p(x)$ without using $z$, the optimal solution sets $q_\phi(z \mid x) = p(z)$ (zero KL) and ignores the latent variables. This is actually the correct ELBO optimum. The model has discovered that latent variables are unnecessary. Whether this is desirable depends on whether you want meaningful latent representations (often yes) or just good likelihood (then it is fine).
Canonical Examples
VAE on MNIST
Encoder: two-layer MLP mapping $x \in \mathbb{R}^{784}$ to the posterior parameters (outputting $\mu$ and $\log \sigma^2$, each $d$-dimensional). Decoder: MLP mapping $z \in \mathbb{R}^d$ back to $\mathbb{R}^{784}$ with sigmoid output (Bernoulli likelihood). With $d = 2$, the latent space can be directly visualized: different digit classes cluster in different regions, and interpolating between two latent codes produces smooth morphing between digits.
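The pieces above can be assembled into a single ELBO computation. This is a minimal forward-pass sketch in numpy with hypothetical, untrained random weights (a real implementation would use an autodiff framework and train the weights by gradient ascent on this quantity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MNIST-style shapes; all weights are random placeholders for illustration.
x_dim, h_dim, z_dim = 784, 200, 2
We  = rng.normal(0, 0.05, size=(x_dim, h_dim))   # encoder hidden layer
Wmu = rng.normal(0, 0.05, size=(h_dim, z_dim))   # -> mu
Wlv = rng.normal(0, 0.05, size=(h_dim, z_dim))   # -> log variance
Wd  = rng.normal(0, 0.05, size=(z_dim, h_dim))   # decoder hidden layer
Wo  = rng.normal(0, 0.05, size=(h_dim, x_dim))   # -> Bernoulli logits

def elbo(x):
    # Encoder: q(z|x) = N(mu, diag(exp(log_var)))
    h = np.tanh(x @ We)
    mu, log_var = h @ Wmu, h @ Wlv
    # Reparameterized sample: z = mu + sigma * eps
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=z_dim)
    # Decoder: Bernoulli over pixels, parameterized by logits
    logits = np.tanh(z @ Wd) @ Wo
    # Reconstruction term log p(x|z): Bernoulli log-likelihood from logits,
    # x*l - log(1 + e^l), computed stably with logaddexp.
    recon = np.sum(x * logits - np.logaddexp(0.0, logits))
    # Closed-form KL to the N(0, I) prior
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)
    return recon - kl

x = (rng.random(x_dim) < 0.5).astype(float)      # stand-in for a binarized digit
value = elbo(x)
print(f"single-sample ELBO estimate: {value:.1f}")
```

Each call returns a single-sample estimate of the ELBO (the reconstruction term is stochastic through $z$); training maximizes its expectation over the data and the noise.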
Summary
- The ELBO: $\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$
- Gap between $\log p_\theta(x)$ and the ELBO is $\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$
- Reparameterization trick: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ enables backprop through sampling
- KL term has a closed form for Gaussian $q_\phi$ and Gaussian prior
- VAE = amortized variational EM: encoder amortizes the E-step
- The KL term is not a regularizer; it is part of the variational bound
Exercises
Problem
Derive the closed-form KL divergence between $q = \mathcal{N}(\mu, \sigma^2)$ (univariate) and $p = \mathcal{N}(0, 1)$.
Problem
Show that $\log p_\theta(x) = \mathrm{ELBO}(x) + \mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$. Use this to explain why maximizing the ELBO tightens the bound.
Problem
In the beta-VAE, the objective is $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \beta\, \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))$ with $\beta > 1$. This is no longer a valid lower bound on $\log p_\theta(x)$. What is the beta-VAE actually optimizing from an information-theoretic perspective?
References
Canonical:
- Kingma & Welling, "Auto-Encoding Variational Bayes" (2014). The original VAE paper
- Rezende, Mohamed, Wierstra, "Stochastic Backpropagation and Approximate Inference" (2014)
- Doersch, "Tutorial on Variational Autoencoders" (2016), arXiv:1606.05908
Current:
- Kingma, "An Introduction to Variational Autoencoders" (2019). Excellent tutorial
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
The natural next steps from VAEs:
- Diffusion models: a different approach to tractable generative modeling
- Normalizing flows: exact likelihood via invertible transformations
- Variational inference: the general framework behind the ELBO
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
Builds on This
- Diffusion Models (Layer 4)
- JEPA and Joint Embedding (Layer 4)
- Representation Learning Theory (Layer 3)