
Paper breakdown

Auto-Encoding Variational Bayes

Diederik P. Kingma and Max Welling · 2013 · ICLR 2014

Introduces the variational autoencoder. Combines amortised inference with the reparameterisation trick to give a tractable, gradient-based estimator of the evidence lower bound for deep latent-variable models.

Overview

Kingma and Welling (2013) gave variational inference a way to scale to deep neural networks. The paper poses two problems that the prior literature treated separately. First, posterior inference in deep latent-variable models is intractable: $p(z \mid x) \propto p(x \mid z)\, p(z)$ has no closed form when $p(x \mid z)$ is parameterised by a neural network. Second, even with a variational approximation $q_\phi(z \mid x)$, the standard Monte Carlo estimator of the gradient of the ELBO has variance high enough to derail SGD.

The paper resolves both with two ideas applied jointly. The first is amortised inference: instead of fitting a separate $q$ for each datapoint, a single neural network $q_\phi(z \mid x)$ outputs the parameters of the variational posterior as a function of $x$. The second is the reparameterisation trick: rewrite $z \sim q_\phi(z \mid x)$ as $z = g_\phi(\epsilon, x)$ for a differentiable $g$ and an auxiliary noise variable $\epsilon$ drawn from a fixed distribution. The gradient with respect to $\phi$ then flows through $g$ rather than through a sampling operation, giving a low-variance reparameterised estimator that is just standard backpropagation.
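
As a concrete sketch of the two ideas together (PyTorch, with illustrative layer sizes and names that are not from the paper), the encoder below amortises inference and the helper implements the reparameterised sample:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Amortised inference: one network maps x to the parameters of q_phi(z | x)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z | x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log-variance of q_phi(z | x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

def reparameterise(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): sampling becomes a differentiable op."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```

Gradients with respect to the encoder weights flow through `mu` and `log_var`; the only randomness is the fixed standard normal `eps`.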

Mathematical Contributions

The evidence lower bound

For a latent-variable model with prior $p(z)$, decoder $p_\theta(x \mid z)$, and approximate posterior $q_\phi(z \mid x)$, the marginal log-likelihood decomposes as:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{\text{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$$

where the evidence lower bound is:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\text{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

Because the KL divergence is non-negative, $\mathcal{L} \leq \log p_\theta(x)$. Maximising $\mathcal{L}$ jointly in $\theta$ and $\phi$ tightens the bound and trains the decoder.
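
The decomposition itself is one line of algebra: expand the KL term with Bayes' rule $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$ and note that $\log p_\theta(x)$ does not depend on $z$:

$$\begin{aligned}
D_{\text{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, z)\big] + \log p_\theta(x) \\
  &= -\mathcal{L}(\theta, \phi; x) + \log p_\theta(x)
\end{aligned}$$

Rearranging gives the identity above.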

The reparameterisation trick

The naive Monte Carlo estimator of $\nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)]$ uses the score-function (REINFORCE) identity, which has variance that grows with the magnitude of $f$. The paper rewrites the expectation by changing variables. Suppose $z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$. Then $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)\big]$$

The expectation is now over a fixed distribution, and the gradient passes through deterministic ops, so the estimator is unbiased and has variance comparable to a normal supervised gradient.
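
A small numerical illustration of the variance gap, not from the paper: for $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, 1)$, the true gradient with respect to $\mu$ is $2\mu$, so the two estimators can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 2.0, 100_000
eps = rng.standard_normal(n)
z = mu + eps                                   # z ~ N(mu, 1)

# Score-function (REINFORCE) estimator: f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu)
score_fn = z**2 * (z - mu)

# Pathwise (reparameterised) estimator: d/dmu f(mu + eps) = 2 * (mu + eps)
pathwise = 2.0 * (mu + eps)

print(f"true gradient  : {2 * mu:.3f}")
print(f"score-function : mean {score_fn.mean():.3f}, std {score_fn.std():.2f}")
print(f"pathwise       : mean {pathwise.mean():.3f}, std {pathwise.std():.2f}")
```

Both estimators are unbiased, but even in this one-dimensional toy problem the score-function estimator's standard deviation is several times larger, and the gap widens with the magnitude of $f$.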

The closed-form KL term

For a Gaussian approximate posterior $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$ over $J$ latent dimensions, the KL term has a closed form:

$$D_{\text{KL}}(q_\phi \,\|\, p) = -\tfrac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$

This removes one of the two Monte Carlo expectations entirely, leaving only the reconstruction term to estimate by sampling. In practice, a single sample per datapoint suffices.
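
In code the closed-form term is a one-liner over the encoder outputs; a sketch assuming a `mu`/`log_var` parameterisation like the encoder sketched earlier:

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the J latent dimensions."""
    return -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```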

The training objective

Combining the reparameterised reconstruction with the closed-form KL gives the SGVB objective for one data point:

$$\tilde{\mathcal{L}}(\theta, \phi; x) = \log p_\theta\big(x \mid g_\phi(\epsilon, x)\big) - D_{\text{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big), \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is what is implemented in code. The reconstruction term is typically Gaussian for continuous data (giving an MSE loss) or Bernoulli for binary data (giving cross-entropy).
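
A minimal end-to-end sketch of that objective for binary data, negated so it can be minimised directly; the `encoder` and `decoder` callables are assumptions (the encoder returns `mu, log_var`, the decoder maps $z$ to Bernoulli logits):

```python
import torch
import torch.nn.functional as F

def negative_sgvb(x, encoder, decoder):
    """Single-sample estimate of -L~(theta, phi; x), averaged over a minibatch of binary x."""
    mu, log_var = encoder(x)                      # parameters of q_phi(z | x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps       # reparameterised sample
    logits = decoder(z)                           # Bernoulli observation model

    # -log p_theta(x | z): Bernoulli cross-entropy, summed over observed dimensions
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=-1)
    # Closed-form D_KL(q_phi(z | x) || N(0, I)), summed over latent dimensions
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
    return (recon + kl).mean()
```

Calling this once per minibatch and backpropagating through the returned scalar is essentially the whole training loop.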

Connections to TheoremPath Topics

  • Variational autoencoders — the modern presentation including $\beta$-VAE, posterior collapse, and free-bits.
  • Autoencoders — the deterministic precursor and how the VAE adds a probabilistic interpretation.
  • KL divergence — the divergence whose tractability under Gaussian families makes the closed-form KL term work.
  • Generative adversarial networks — the contemporaneous alternative for likelihood-free generation.
  • Cross-entropy loss — the standard reconstruction term for binary observation models.

Why It Matters Now

The VAE itself is no longer the dominant generative model for images or text; diffusion and autoregressive models produce sharper samples. But three contributions of the paper still shape current practice.

First, the reparameterisation trick is the foundation of every modern stochastic gradient estimator that does not use REINFORCE. Normalising flows, score-based diffusion models, and continuous control with stochastic policies all rely on it. Without reparameterisation, training would fall back on high-variance score-function and reinforcement-style estimators, which scale poorly.

Second, amortised inference — replacing per-datapoint optimisation with a feedforward network that predicts variational parameters — generalised beyond VAEs. It is now standard in stochastic-process models, in implicit attention mechanisms, and in any system that needs to map an observation to a distribution at inference time.

Third, the latent-space structure VAEs produce remains a useful representation-learning baseline. Disentanglement work, $\beta$-VAEs, and the "manifold hypothesis" experimental literature all use this paper's setup.

References

Canonical:

  • Kingma, D. P., & Welling, M. (2014). "Auto-Encoding Variational Bayes." ICLR. arXiv:1312.6114.
  • Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." ICML. arXiv:1401.4082.

Tutorials:

  • Kingma, D. P., & Welling, M. (2019). "An Introduction to Variational Autoencoders." Foundations and Trends in Machine Learning. arXiv:1906.02691.
  • Doersch, C. (2016). "Tutorial on Variational Autoencoders." arXiv:1606.05908.

Direct extensions:

  • Higgins, I. et al. (2017). "$\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." ICLR.
  • Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). "Ladder Variational Autoencoders." NeurIPS. arXiv:1602.02282.
  • van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). "Neural Discrete Representation Learning." NeurIPS. arXiv:1711.00937.

Background:

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 10 (variational inference).
  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Chapters 21, 27.


Last reviewed: May 5, 2026