Paper breakdown
Auto-Encoding Variational Bayes
Diederik P. Kingma and Max Welling · 2013 · ICLR 2014
Introduces the variational autoencoder. Combines amortised inference with the reparameterisation trick to give a tractable, gradient-based estimator of the evidence lower bound for deep latent-variable models.
Overview
Kingma and Welling (2013) gave variational inference a way to scale to deep neural networks. The paper poses two problems that the prior literature treated separately. First, posterior inference in deep latent-variable models is intractable: $p_\theta(z \mid x)$ has no closed form when the likelihood $p_\theta(x \mid z)$ is parameterised by a neural network. Second, even with a variational approximation $q_\phi(z \mid x)$, the standard Monte Carlo estimator of the gradient of the ELBO has variance high enough to derail SGD.
The paper resolves both with two ideas applied jointly. The first is amortised inference: instead of fitting a separate variational distribution for each datapoint, a single neural network (the encoder) outputs the parameters of the variational posterior $q_\phi(z \mid x)$ as a function of $x$. The second is the reparameterisation trick: rewrite $z \sim q_\phi(z \mid x)$ as $z = g_\phi(\epsilon, x)$ for a differentiable $g_\phi$ and auxiliary noise $\epsilon$ drawn from a fixed distribution. The gradient with respect to $\phi$ then flows through $g_\phi$ rather than through a sampling operation, giving a low-variance reparameterised estimator that is just standard backpropagation.
Mathematical Contributions
The evidence lower bound
For a latent-variable model with prior $p(z)$, decoder $p_\theta(x \mid z)$, and approximate posterior $q_\phi(z \mid x)$, the marginal log-likelihood factors as:

$$\log p_\theta(x) = D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + \mathcal{L}(\theta, \phi; x),$$

where the evidence lower bound is:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big).$$
Because the KL divergence is non-negative, $\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)$. Maximising $\mathcal{L}$ jointly in $\theta$ and $\phi$ tightens the bound and trains the decoder.
The reparameterisation trick
The naive Monte Carlo estimator of $\nabla_\phi \mathcal{L}$ uses the score-function (REINFORCE) identity, which has variance that grows with the magnitude of the integrand $f(z) = \log p_\theta(x \mid z)$. The paper rewrites the expectation by changing variables. Suppose $z \sim \mathcal{N}(\mu, \sigma^2)$. Then $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, and:

$$\nabla_\phi \, \mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\big[\nabla_\phi f(\mu + \sigma\epsilon)\big].$$
The expectation is now over a fixed distribution, and the gradient passes through deterministic ops, so the estimator is unbiased and has variance comparable to a normal supervised gradient.
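The variance gap between the two estimators can be seen numerically. The sketch below (not from the paper's code; the target $f(z) = z^2$ and all constants are illustrative choices) estimates $\partial/\partial\mu \; \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2] = 2\mu$ both ways:

```python
# Compare the score-function and reparameterised (pathwise) estimators of
# d/dmu E_{z ~ N(mu, sigma^2)}[z^2], whose true value is 2*mu.
# Illustrative sketch: f(z) = z^2, mu, sigma are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 100_000

eps = rng.standard_normal(n)
z = mu + sigma * eps                          # reparameterised sample

# Score-function (REINFORCE) estimator: f(z) * d/dmu log q(z)
score_est = z**2 * (z - mu) / sigma**2

# Pathwise estimator: differentiate f(mu + sigma*eps) directly w.r.t. mu
pathwise_est = 2 * (mu + sigma * eps)

# Both means approach 2*mu = 3.0, but the pathwise variance is far smaller.
print(score_est.mean(), score_est.var())
print(pathwise_est.mean(), pathwise_est.var())
```

The same contrast drives the paper's choice: the pathwise estimator's variance scales like that of an ordinary supervised gradient, while the score-function estimator's grows with the magnitude of $f$.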
The closed-form KL term
For a Gaussian approximate posterior $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$ over $J$ latent dimensions, the KL term has a closed form:

$$-D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big).$$
This removes one of the two Monte Carlo expectations entirely, leaving only the reconstruction term to estimate by sampling. In practice, a single sample per datapoint suffices.
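A quick numerical sanity check makes the closed form concrete. This sketch (the specific $\mu$ and $\log\sigma^2$ values are arbitrary test inputs, not from the paper) compares it against a Monte Carlo estimate of the same KL:

```python
# Closed-form Gaussian KL, D_KL(N(mu, sigma^2 I) || N(0, I)), checked
# against a Monte Carlo estimate. The mu/log_var values are arbitrary.
import numpy as np

def kl_closed_form(mu, log_var):
    # -0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([0.2, -0.3])
sigma = np.exp(0.5 * log_var)

# Monte Carlo: E_q[log q(z) - log p(z)] over samples z ~ q
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * (((z - mu) / sigma) ** 2 + np.log(2 * np.pi) + log_var).sum(axis=1)
log_p = -0.5 * (z**2 + np.log(2 * np.pi)).sum(axis=1)
mc_kl = (log_q - log_p).mean()

print(kl_closed_form(mu, log_var), mc_kl)  # the two should agree closely
```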
The training objective
Combining the reparameterised reconstruction with the closed-form KL gives the SGVB objective for one data point:

$$\tilde{\mathcal{L}}(\theta, \phi; x) = \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \mid z^{(l)}\big), \qquad z^{(l)} = \mu + \sigma \odot \epsilon^{(l)}, \;\; \epsilon^{(l)} \sim \mathcal{N}(0, I).$$
This is what is implemented in code. The reconstruction term is typically Gaussian for continuous data (giving an MSE loss) or Bernoulli for binary data (giving cross-entropy).
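As a minimal sketch of that objective (negated, as a loss to minimise) with $L = 1$ and a Bernoulli decoder: the encoder outputs and the linear "decoder" below are random stand-ins, not trained networks, and all shapes are illustrative.

```python
# Per-datapoint SGVB loss: closed-form Gaussian KL plus Bernoulli
# cross-entropy reconstruction, with a single reparameterised sample.
# Encoder outputs and the linear decoder are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, mu, log_var, recon_logits):
    """Negative SGVB estimate: KL(q || N(0, I)) minus log p(x | z)."""
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    # Bernoulli log-likelihood from logits: x*l - log(1 + e^l)
    log_px = np.sum(x * recon_logits - np.logaddexp(0, recon_logits))
    return kl - log_px

x = rng.integers(0, 2, size=8).astype(float)   # one binary datapoint
mu = rng.standard_normal(4)                    # encoder mean (stand-in)
log_var = 0.1 * rng.standard_normal(4)         # encoder log-variance (stand-in)

eps = rng.standard_normal(4)
z = mu + np.exp(0.5 * log_var) * eps           # reparameterised sample, L = 1
W = rng.standard_normal((8, 4))
recon_logits = W @ z                           # stand-in decoder(z)

print(vae_loss(x, mu, log_var, recon_logits))
```

In a real implementation the same expression is written in an autodiff framework so that gradients flow through $z = \mu + \sigma \odot \epsilon$ into both encoder and decoder parameters.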
Connections to TheoremPath Topics
- Variational autoencoders — the modern presentation including β-VAE, posterior collapse, and free-bits.
- Autoencoders — the deterministic precursor and how the VAE adds a probabilistic interpretation.
- KL divergence — the divergence whose tractability under Gaussian families makes the closed-form KL term work.
- Generative adversarial networks — the contemporaneous alternative for likelihood-free generation.
- Cross-entropy loss — the standard reconstruction term for binary observation models.
Why It Matters Now
The VAE itself is no longer the dominant generative model for images or text; diffusion and autoregressive models produce sharper samples. But three contributions of the paper still shape current practice.
First, the reparameterisation trick is the foundation of every modern stochastic gradient estimator that does not use REINFORCE. Normalising flows, score-based diffusion models, and continuous control with stochastic policies all rely on it. Without reparameterisation, training would collapse to high-variance Monte Carlo and reinforcement-style techniques, which scale poorly.
Second, amortised inference — replacing per-datapoint optimisation with a feedforward network that predicts variational parameters — generalised beyond VAEs. It is now standard in stochastic-process models, in implicit attention mechanisms, and in any system that needs to map an observation to a distribution at inference time.
Third, the latent-space structure VAEs produce remains a useful representation-learning baseline. Disentanglement work, β-VAEs, and the "manifold hypothesis" experimental literature all use this paper's setup.
References
Canonical:
- Kingma, D. P., & Welling, M. (2014). "Auto-Encoding Variational Bayes." ICLR. arXiv:1312.6114.
- Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." ICML. arXiv:1401.4082.
Tutorials:
- Kingma, D. P., & Welling, M. (2019). "An Introduction to Variational Autoencoders." Foundations and Trends in Machine Learning. arXiv:1906.02691.
- Doersch, C. (2016). "Tutorial on Variational Autoencoders." arXiv:1606.05908.
Direct extensions:
- Higgins, I. et al. (2017). "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." ICLR.
- Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). "Ladder Variational Autoencoders." NeurIPS. arXiv:1602.02282.
- van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). "Neural Discrete Representation Learning." NeurIPS. arXiv:1711.00937.
Background:
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 10 (variational inference).
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Chapters 21, 27.
Last reviewed: May 5, 2026