
ML Methods

Variational Autoencoders

Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.


Why This Matters

[Figure: VAE architecture. An input $x$ (e.g., an image) feeds the encoder $q(z|x)$, a neural network that outputs $\mu$ and $\sigma$. The reparameterization $z = \mu + \sigma \cdot \varepsilon$, with $\varepsilon \sim \mathcal{N}(0,1)$, produces $z$, which the decoder $p(x|z)$ maps back to a reconstruction of $x$. Loss: $\mathcal{L} = \|x - \hat{x}\|^2 + D_{\text{KL}}(q(z|x) \| p(z))$, where the KL term pushes $q$ toward $\mathcal{N}(0,1)$. The reparameterization trick makes sampling differentiable: gradients flow through $\mu$ and $\sigma$, not through the sample.]

The VAE is where deep learning meets probabilistic modeling. It solves a fundamental problem: how to learn a generative model $p(x)$ when the data involves latent variables $z$ that make the marginal likelihood $p(x) = \int p(x|z)p(z)\,dz$ intractable. The solution, built on the evidence lower bound (ELBO) and amortized inference, relies on KL divergence to measure the gap between the approximate and true posterior. This is one of the most load-bearing constructions in modern ML and underpins much of generative AI.

Mental Model

You want to learn a generative model: sample $z$ from a simple prior (Gaussian), then decode $z$ into data $x$. The problem is inference: given an observed $x$, what $z$ likely generated it? The true posterior $p(z|x)$ is intractable. The VAE learns an approximate posterior $q_\phi(z|x)$ (the encoder) jointly with the generative model $p_\theta(x|z)$ (the decoder) by maximizing a lower bound on the log-likelihood.

The Generative Model

Definition

VAE Generative Model

The VAE defines a latent variable model:

  1. Prior: $z \sim p(z) = \mathcal{N}(0, I)$
  2. Likelihood (decoder): $x \sim p_\theta(x|z)$, parameterized by a neural network that maps $z$ to the parameters of a distribution over $x$

The marginal likelihood (evidence) is:

p_\theta(x) = \int p_\theta(x|z) p(z) \, dz

This integral is intractable for nonlinear decoders because it requires integrating over all possible latent codes.
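To make this concrete: the naive Monte Carlo attack on the integral, $p(x) \approx \frac{1}{K}\sum_k p(x|z_k)$ with $z_k$ drawn from the prior, is unbiased but unusably noisy, because the prior rarely proposes latent codes that explain a given $x$. A toy sketch (a hypothetical 1-d model with invented numbers, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-d model (numbers invented): prior z ~ N(0,1),
# nonlinear decoder p(x|z) = N(x; tanh(3z), 0.1^2).
def log_lik(x, z, s=0.1):
    return -0.5 * ((x - np.tanh(3 * z)) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

x = 0.9  # one observation

# Naive Monte Carlo for p(x) = E_{p(z)}[p(x|z)] is unbiased, but most prior
# samples z land where p(x|z) is negligible, so the estimate is noisy.
spreads = {}
for k in [10, 100, 10_000]:
    runs = [np.log(np.mean(np.exp(log_lik(x, rng.standard_normal(k)))))
            for _ in range(20)]
    spreads[k] = np.std(runs)
    print(f"K={k:6d}  std of log p(x) estimate over 20 runs: {spreads[k]:.3f}")
```

The VAE's answer, developed below, is to sample $z$ from a learned $q(z|x)$ concentrated where $p(x|z)$ is large, which is exactly what the ELBO formalizes.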

Deriving the ELBO

The key insight: since we cannot compute $\log p_\theta(x)$ directly, we derive a tractable lower bound.

Theorem

Evidence Lower Bound (ELBO)

Statement

For any distribution $q_\phi(z|x)$, the log marginal likelihood satisfies:

\log p_\theta(x) \geq \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(z|x) \| p(z))}_{\text{regularization}} = \mathcal{L}(\theta, \phi; x)

This lower bound is called the ELBO (Evidence Lower Bound). Equality holds when $q_\phi(z|x) = p_\theta(z|x)$, the true posterior.

Intuition

The ELBO has two terms pulling in opposite directions:

  • Reconstruction term: encourages the decoder to reconstruct $x$ from codes sampled via the encoder. Wants the encoder to be informative.
  • KL term: encourages the encoder distribution $q_\phi(z|x)$ to stay close to the prior $p(z) = \mathcal{N}(0, I)$. Wants the latent space to be structured and smooth.

The tension between these terms is the VAE tradeoff: be informative enough to reconstruct, but regular enough that the latent space has meaningful structure for generation.
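As a concrete sketch of the two terms (the shapes and the random sigmoid "decoder" below are invented for illustration), both can be estimated in a few lines of NumPy for a diagonal-Gaussian encoder and a Bernoulli decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_terms(x, mu, logvar, decode, n_samples=64):
    """Monte Carlo estimate of the two ELBO terms for a diagonal-Gaussian
    encoder q(z|x) = N(mu, diag(exp(logvar))) and a Bernoulli decoder."""
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + std * eps                                   # reparameterized samples
    p = decode(z)                                        # Bernoulli means, one row per sample
    recon = np.mean(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=1))
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1)  # closed form vs N(0, I)
    return recon, kl

# Hypothetical setup: 2-d latent, 5-d binary data, random sigmoid "decoder".
W = rng.standard_normal((2, 5))
decode = lambda z: 1 / (1 + np.exp(-(z @ W)))
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
recon, kl = elbo_terms(x, mu=np.zeros(2), logvar=np.zeros(2), decode=decode)
print(recon - kl)  # the ELBO estimate: reconstruction minus KL
```

With $\mu = 0$ and $\log\sigma^2 = 0$ the encoder equals the prior, so the KL term is exactly zero; training moves $\mu$ and $\log\sigma^2$ away from the prior only where doing so buys reconstruction.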

Proof Sketch

Start with the log-evidence and introduce $q$:

\log p_\theta(x) = \log \int p_\theta(x, z) \, dz = \log \int \frac{p_\theta(x, z)}{q_\phi(z|x)} q_\phi(z|x) \, dz

Apply Jensen's inequality ($\log$ is concave):

\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \, dz = \mathbb{E}_q[\log p_\theta(x, z)] - \mathbb{E}_q[\log q_\phi(z|x)]

Expand $p_\theta(x, z) = p_\theta(x|z)p(z)$:

= \mathbb{E}_q[\log p_\theta(x|z)] + \mathbb{E}_q[\log p(z)] - \mathbb{E}_q[\log q_\phi(z|x)]

= \mathbb{E}_q[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))

Why It Matters

The ELBO transforms an intractable maximum likelihood problem into a tractable optimization. The gap between $\log p(x)$ and the ELBO is exactly $D_{\text{KL}}(q_\phi(z|x) \| p_\theta(z|x))$, the approximation quality of the encoder. Maximizing the ELBO simultaneously fits the generative model and improves the approximate posterior.

Failure Mode

The ELBO can be loose if $q_\phi$ is too simple to approximate the true posterior (e.g., a diagonal Gaussian when the true posterior is multimodal). A related failure is posterior collapse: the model ignores the latent variables ($q \approx p(z)$, KL $\approx 0$) and relies entirely on a powerful decoder.

An equivalent derivation shows the gap directly:

\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}(q_\phi(z|x) \| p_\theta(z|x))

Since KL divergence is non-negative, $\mathcal{L} \leq \log p_\theta(x)$.
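The identity can be checked numerically in a model where everything is tractable. The sketch below (model and numbers invented for illustration) uses a linear-Gaussian model, whose marginal and posterior are exact Gaussians, and verifies that ELBO $+$ KL$(q \| p(z|x)) = \log p(x)$ for any Gaussian $q$, with the gap vanishing exactly at the true posterior:

```python
import numpy as np

# Tractable check: linear-Gaussian model z ~ N(0,1), x|z ~ N(a z, s^2)
# (a, s, x are invented numbers). Marginal and posterior are exact Gaussians.
a, s, x = 2.0, 0.5, 1.3

log_px = -0.5 * np.log(2 * np.pi * (a**2 + s**2)) - x**2 / (2 * (a**2 + s**2))
vp = s**2 / (a**2 + s**2)          # posterior variance of z|x
mp = a * x / (a**2 + s**2)         # posterior mean of z|x

def elbo_and_gap(m, v):
    """For q = N(m, v): return (ELBO, KL(q || p(z|x)))."""
    recon = -0.5 * np.log(2 * np.pi * s**2) - ((x - a * m)**2 + a**2 * v) / (2 * s**2)
    kl_prior = 0.5 * (m**2 + v - np.log(v) - 1)                    # KL(q || N(0,1))
    gap = 0.5 * (np.log(vp / v) + (v + (m - mp)**2) / vp - 1)      # KL(q || posterior)
    return recon - kl_prior, gap

for m, v in [(0.0, 1.0), (0.7, 0.2), (mp, vp)]:
    elbo, gap = elbo_and_gap(m, v)
    assert np.isclose(elbo + gap, log_px)   # identity holds for every q
print("gap at the true posterior:", elbo_and_gap(mp, vp)[1])  # prints 0.0
```

In the real VAE this check is impossible, since $p_\theta(z|x)$ is unknown, which is precisely why the ELBO is optimized instead of the gap.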

The Reparameterization Trick

Definition

Reparameterization Trick

The reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ requires sampling $z \sim q_\phi(z|x)$, but we cannot backpropagate through a sampling operation.

The reparameterization trick expresses the sample as a deterministic function of the parameters and an independent noise variable:

z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the mean and standard deviation output by the encoder network.

Now the randomness is in $\varepsilon$ (which does not depend on $\phi$), and $z$ is a differentiable function of $\phi$. Standard backpropagation works.

Without reparameterization, you would need high-variance score function estimators (REINFORCE). The reparameterization trick gives low-variance gradient estimates, making VAE training practical.
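A minimal sketch of the estimator (the toy objective $f(z) = z^2$ and the parameter values are invented): the reparameterized estimator recovers the analytic gradients of $\mathbb{E}[f(z)]$ with respect to $\mu$ and $\sigma$ by averaging chain-rule terms through the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective (invented for illustration): f(z) = z^2 with z ~ N(mu, sigma^2).
# Analytically E[z^2] = mu^2 + sigma^2, so d/dmu = 2 mu and d/dsigma = 2 sigma.
mu, sigma = 1.5, 0.8

eps = rng.standard_normal(100_000)
z = mu + sigma * eps            # reparameterized: z is differentiable in mu, sigma

# Chain rule through the sample: df/dmu    = f'(z) * dz/dmu    = 2 z * 1
#                                df/dsigma = f'(z) * dz/dsigma = 2 z * eps
grad_mu = np.mean(2 * z)
grad_sigma = np.mean(2 * z * eps)
print(grad_mu, grad_sigma)      # close to 2*mu = 3.0 and 2*sigma = 1.6
```

In an autodiff framework the same thing happens automatically: writing `z = mu + sigma * eps` puts `mu` and `sigma` on the computation graph while `eps` stays a constant.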

The KL Term in Detail

For the standard VAE with Gaussian encoder $q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and standard normal prior $p(z) = \mathcal{N}(0, I)$, the KL divergence has a closed form:

D_{\text{KL}}(q \| p) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)

This is computed analytically, with no sampling needed. Each latent dimension contributes independently, making it easy to monitor which dimensions are active (significantly different from the prior).
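A quick cross-check of the closed form against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$ (the example $\mu$ and $\log\sigma^2$ values are invented; the third dimension sits exactly at the prior to show an inactive dimension contributing zero):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_closed_form(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), per the formula above."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1)

# Example values (invented); the third dimension equals the prior,
# so it contributes zero -- an "inactive" latent dimension.
mu = np.array([0.5, -1.0, 0.0])
logvar = np.array([0.2, -0.5, 0.0])
std = np.exp(0.5 * logvar)

# Monte Carlo cross-check: KL = E_q[log q(z) - log p(z)].
z = mu + std * rng.standard_normal((200_000, 3))
log_q = np.sum(-0.5 * ((z - mu) / std) ** 2 - np.log(std), axis=1)
log_p = np.sum(-0.5 * z**2, axis=1)   # the 2*pi constants cancel in the difference

cf, mc = kl_closed_form(mu, logvar), np.mean(log_q - log_p)
print(cf, mc)  # the two estimates should agree closely
```

The per-dimension structure is what makes activity monitoring cheap: summing the closed-form term dimension by dimension shows exactly where the encoder departs from the prior.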

Connection to EM

Definition

Amortized Variational EM

The classical EM algorithm for latent variable models alternates:

  • E-step: compute the posterior $p(z|x; \theta)$ for each data point
  • M-step: maximize the expected complete log-likelihood

The VAE can be viewed as amortized variational EM:

  • The encoder $q_\phi(z|x)$ replaces the E-step. It amortizes inference by learning a single network that works for all $x$, rather than running separate optimization for each data point
  • The decoder $p_\theta(x|z)$ corresponds to the M-step
  • Both are optimized jointly via gradient descent on the ELBO

Classical variational inference computes a separate $q(z)$ for each observation (expensive). Amortization is what makes VAEs scalable: one forward pass through the encoder gives the approximate posterior for any $x$.

Common Confusions

Watch Out

The KL term is not just a penalty. It has a precise information-theoretic meaning

A common misunderstanding is that the KL term is a "regularizer" added for convenience, like weight decay. It is not. The KL term arises necessarily from the ELBO derivation. It measures how much information the encoder extracts about the specific input $x$ beyond what the prior already provides. Setting the KL weight to anything other than 1 (as in beta-VAE) changes the objective away from a valid lower bound on $\log p(x)$.

Watch Out

VAEs do not optimize reconstruction plus a penalty

The ELBO looks like "reconstruction - KL", which tempts people to treat it as a penalized autoencoder. But the correct interpretation is: the ELBO is a lower bound on the log-evidence, derived from first principles. The reconstruction and KL terms are not independent objectives. They are two parts of a single variational inference procedure. Changing their relative weight changes the probabilistic semantics.

Watch Out

Posterior collapse is not a bug in the ELBO

When a powerful autoregressive decoder can model $p(x)$ without using $z$, the optimal solution sets $q(z|x) = p(z)$ (zero KL) and ignores the latent variables. This is actually the correct ELBO optimum. The model has discovered that latent variables are unnecessary. Whether this is desirable depends on whether you want meaningful latent representations (often yes) or just good $p(x)$ (then it is fine).

Canonical Examples

Example

VAE on MNIST

Encoder: two-layer MLP mapping $784 \to 512 \to 2d$ (outputting $\mu$ and $\log\sigma^2$, each $d$-dimensional). Decoder: MLP mapping $d \to 512 \to 784$ with sigmoid output (Bernoulli likelihood). With $d = 2$, the latent space can be directly visualized: different digit classes cluster in different regions, and interpolating between two latent codes produces smooth morphing between digits.
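The forward pass and per-example ELBO for this architecture can be sketched in NumPy (untrained random weights, and a random binary vector standing in for a binarized MNIST image; a shape-and-loss illustration, not a training script):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2  # latent dimension, matching the 2-d visualization above

# Untrained random weights with the shapes described (784 -> 512 -> 2d encoder,
# d -> 512 -> 784 decoder); a shape-and-loss sketch, not a training script.
W1, b1 = 0.01 * rng.standard_normal((784, 512)), np.zeros(512)
W_mu = 0.01 * rng.standard_normal((512, d))
W_lv = 0.01 * rng.standard_normal((512, d))
W2, b2 = 0.01 * rng.standard_normal((d, 512)), np.zeros(512)
W3, b3 = 0.01 * rng.standard_normal((512, 784)), np.zeros(784)

def forward_elbo(x):
    h = np.tanh(x @ W1 + b1)                 # encoder hidden layer
    mu, logvar = h @ W_mu, h @ W_lv          # parameters of q(z|x)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(d)   # reparameterize
    h2 = np.tanh(z @ W2 + b2)                # decoder hidden layer
    p = 1 / (1 + np.exp(-(h2 @ W3 + b3)))    # Bernoulli means over 784 pixels
    recon = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1)
    return recon - kl                        # per-example ELBO

x = (rng.random(784) > 0.5).astype(float)    # stand-in for a binarized MNIST image
elbo = forward_elbo(x)
print(elbo)  # roughly 784 * log(1/2) ~ -543 with untrained weights
```

Training would maximize this quantity (averaged over a batch) by gradient ascent on all the weights; with near-zero weights the decoder outputs pixels near 0.5, which is why the starting ELBO sits near $784 \log \tfrac{1}{2}$.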

Summary

  • The ELBO: $\log p(x) \geq \mathbb{E}_q[\log p(x|z)] - D_{\text{KL}}(q(z|x) \| p(z))$
  • The gap between $\log p(x)$ and the ELBO is $D_{\text{KL}}(q(z|x) \| p(z|x))$
  • Reparameterization trick: $z = \mu + \sigma \odot \varepsilon$ enables backprop through sampling
  • The KL term has a closed form for Gaussian $q$ and Gaussian prior
  • VAE = amortized variational EM: the encoder amortizes the E-step
  • The KL term is not a regularizer; it is part of the variational bound

Exercises

ExerciseCore

Problem

Derive the closed-form KL divergence between $q = \mathcal{N}(\mu, \sigma^2)$ (univariate) and $p = \mathcal{N}(0, 1)$.

ExerciseAdvanced

Problem

Show that $\log p(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}(q_\phi(z|x) \| p_\theta(z|x))$. Use this to explain why maximizing the ELBO tightens the bound.

ExerciseResearch

Problem

In the beta-VAE, the objective is $\mathbb{E}_q[\log p(x|z)] - \beta \cdot D_{\text{KL}}(q(z|x) \| p(z))$ with $\beta > 1$. This is no longer a valid lower bound on $\log p(x)$. What is the beta-VAE actually optimizing from an information-theoretic perspective?


References

Canonical:

  • Kingma & Welling, "Auto-Encoding Variational Bayes" (2014). The original VAE paper
  • Rezende, Mohamed, Wierstra, "Stochastic Backpropagation and Approximate Inference" (2014)
  • Doersch, "Tutorial on Variational Autoencoders" (2016), arXiv:1606.05908

Current:

  • Kingma, "An Introduction to Variational Autoencoders" (2019), an excellent tutorial
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

The natural next steps from VAEs:

  • Diffusion models: a different approach to tractable generative modeling
  • Normalizing flows: exact likelihood via invertible transformations
  • Variational inference: the general framework behind the ELBO

Last reviewed: April 2026
