
Paper breakdown

Auto-Encoding Variational Bayes

Diederik P. Kingma and Max Welling · 2013 · ICLR 2014

Introduces the variational autoencoder. Combines amortised inference with the reparameterisation trick to give a tractable, gradient-based estimator of the evidence lower bound for deep latent-variable models.

Overview

Kingma and Welling (2013) gave variational inference a way to scale to deep neural networks. The paper poses two problems that the prior literature treated separately. First, posterior inference in deep latent-variable models is intractable: $p(z \mid x) \propto p(x \mid z)\, p(z)$ has no closed form when $p(x \mid z)$ is parameterised by a neural network. Second, even with a variational approximation $q_\phi(z \mid x)$, the standard Monte Carlo estimator of the gradient of the ELBO has variance high enough to derail SGD.

The paper resolves both with two ideas applied jointly. The first is amortised inference: instead of fitting a separate $q$ for each datapoint, a single neural network $q_\phi(z \mid x)$ outputs the parameters of the variational posterior as a function of $x$. The second is the reparameterisation trick: rewrite $z \sim q_\phi(z \mid x)$ as $z = g_\phi(\epsilon, x)$ for a differentiable $g$ and an auxiliary noise variable $\epsilon$ drawn from a fixed distribution. The gradient with respect to $\phi$ then flows through $g$ rather than through a sampling operation, giving a low-variance reparameterised estimator that is just standard backpropagation.
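
As a concrete sketch of the two ideas together (PyTorch, with illustrative layer sizes and names that are not from the paper), the encoder below amortises inference and the helper implements the reparameterised sample:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Amortised inference: one network maps x to the parameters of q_phi(z | x)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z | x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log-variance of q_phi(z | x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

def reparameterise(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): sampling becomes a differentiable op."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```

Gradients with respect to the encoder weights flow through `mu` and `log_var`; the only randomness is the fixed standard normal `eps`.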

Mathematical Contributions

The evidence lower bound

For a latent-variable model with prior $p(z)$, decoder $p_\theta(x \mid z)$, and approximate posterior $q_\phi(z \mid x)$, the marginal log-likelihood decomposes as:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$$

where the evidence lower bound is:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\text{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

Because the KL divergence is non-negative, $\mathcal{L} \leq \log p_\theta(x)$. Maximising $\mathcal{L}$ jointly in $\theta$ and $\phi$ tightens the bound and trains the decoder.
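
The decomposition itself is one line of algebra: expand the KL term with Bayes' rule $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$ and note that $\log p_\theta(x)$ does not depend on $z$:

$$\begin{aligned}
D_{\text{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x) \\
  &= -\mathcal{L}(\theta, \phi; x) + \log p_\theta(x)
\end{aligned}$$

Rearranging gives the identity above.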

The reparameterisation trick

The naive Monte Carlo estimator of $\nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)]$ uses the score-function (REINFORCE) identity, which has variance that grows with the magnitude of $f$. The paper rewrites the expectation by changing variables. Suppose $z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$. Then $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)\big]$$

The expectation is now over a fixed distribution, and the gradient passes through deterministic ops, so the estimator is unbiased and has variance comparable to a normal supervised gradient.
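
A small numerical illustration of the variance gap, not from the paper: for $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, 1)$, the true gradient with respect to $\mu$ is $2\mu$, so the two estimators can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 2.0, 100_000
eps = rng.standard_normal(n)
z = mu + eps                                   # z ~ N(mu, 1)

# Score-function (REINFORCE) estimator: f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu)
score_fn = z**2 * (z - mu)

# Pathwise (reparameterised) estimator: d/dmu f(mu + eps) = 2 * (mu + eps)
pathwise = 2.0 * (mu + eps)

print(f"true gradient  : {2 * mu:.3f}")
print(f"score-function : mean {score_fn.mean():.3f}, std {score_fn.std():.2f}")
print(f"pathwise       : mean {pathwise.mean():.3f}, std {pathwise.std():.2f}")
```

Both estimators are unbiased, but even in this one-dimensional toy problem the score-function estimator's standard deviation is several times larger, and the gap widens with the magnitude of $f$.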

The closed-form KL term

For a Gaussian approximate posterior $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$ over $J$ latent dimensions, the KL term has a closed form:

$$D_{\text{KL}}(q_\phi \,\|\, p) = -\tfrac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$

This removes one of the two Monte Carlo expectations entirely, leaving only the reconstruction term to estimate by sampling. In practice, a single sample per datapoint suffices.
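
In code the closed-form term is a one-liner over the encoder outputs; a sketch assuming a `mu`/`log_var` parameterisation like the encoder sketched earlier:

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the J latent dimensions."""
    return -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```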

The training objective

Combining the reparameterised reconstruction with the closed-form KL gives the SGVB objective for one data point:

$$\tilde{\mathcal{L}}(\theta, \phi; x) = \log p_\theta\big(x \mid g_\phi(\epsilon, x)\big) - D_{\text{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big), \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is what is implemented in code. The reconstruction term is typically Gaussian for continuous data (giving an MSE loss) or Bernoulli for binary data (giving cross-entropy).
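
A minimal end-to-end sketch of that objective for binary data, negated so it can be minimised directly; the `encoder` and `decoder` callables are assumptions (the encoder returns `mu, log_var`, the decoder maps $z$ to Bernoulli logits):

```python
import torch
import torch.nn.functional as F

def negative_sgvb(x, encoder, decoder):
    """Single-sample estimate of -L~(theta, phi; x), averaged over a minibatch of binary x."""
    mu, log_var = encoder(x)                      # parameters of q_phi(z | x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps       # reparameterised sample
    logits = decoder(z)                           # Bernoulli observation model

    # -log p_theta(x | z): Bernoulli cross-entropy, summed over observed dimensions
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=-1)
    # Closed-form D_KL(q_phi(z | x) || N(0, I)), summed over latent dimensions
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
    return (recon + kl).mean()
```

Calling this once per minibatch and backpropagating through the returned scalar is essentially the whole training loop.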

Connections to TheoremPath Topics

  • Variational autoencoders — the modern presentation including $\beta$-VAE, posterior collapse, and free-bits.
  • Autoencoders — the deterministic precursor and how the VAE adds a probabilistic interpretation.
  • KL divergence — the divergence whose tractability under Gaussian families makes the closed-form KL term work.
  • Generative adversarial networks — the contemporaneous alternative for likelihood-free generation.
  • Cross-entropy loss — the standard reconstruction term for binary observation models.

Why It Matters Now

The VAE itself is no longer the dominant generative model for images or text; diffusion and autoregressive models produce sharper samples. But three contributions of the paper still shape current practice.

First, the reparameterisation trick is the foundation of every modern stochastic gradient estimator that does not use REINFORCE. Normalising flows, score-based diffusion models, and continuous control with stochastic policies all rely on it. Without reparameterisation, training would fall back on high-variance score-function and reinforcement-style estimators, which scale poorly.

Second, amortised inference — replacing per-datapoint optimisation with a feedforward network that predicts variational parameters — generalised beyond VAEs. It is now standard in stochastic-process models, in implicit attention mechanisms, and in any system that needs to map an observation to a distribution at inference time.

Third, the latent-space structure VAEs produce remains a useful representation-learning baseline. Disentanglement work, $\beta$-VAEs, and the "manifold hypothesis" experimental literature all use this paper's setup.

References

Canonical:

  • Kingma, D. P., & Welling, M. (2014). "Auto-Encoding Variational Bayes." ICLR. arXiv:1312.6114.
  • Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." ICML. arXiv:1401.4082.

Tutorials:

  • Kingma, D. P., & Welling, M. (2019). "An Introduction to Variational Autoencoders." Foundations and Trends in Machine Learning. arXiv:1906.02691.
  • Doersch, C. (2016). "Tutorial on Variational Autoencoders." arXiv:1606.05908.

Direct extensions:

  • Higgins, I. et al. (2017). "$\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." ICLR.
  • Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). "Ladder Variational Autoencoders." NeurIPS. arXiv:1602.02282.
  • van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). "Neural Discrete Representation Learning." NeurIPS. arXiv:1711.00937.

Background:

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 10 (variational inference).
  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Chapters 21, 27.


Last reviewed: May 5, 2026