
ML Methods

EM Algorithm Variants

Variants of EM for when the standard algorithm is intractable: Monte Carlo EM, Variational EM, Stochastic EM, and ECM. Connection to VAEs as amortized variational EM.



Why This Matters

Standard EM requires computing the exact posterior $p(z \mid x, \theta)$ in the E-step and solving a closed-form maximization in the M-step. For most interesting models, at least one of these is intractable. EM variants replace exact computation with approximation, trading statistical efficiency for computational feasibility. Understanding these variants is necessary for understanding VAEs, modern Bayesian deep learning, and any latent variable model beyond Gaussian mixtures.

Mental Model

Standard EM alternates between computing the expected complete-data log-likelihood (E-step) and maximizing it (M-step). Each variant relaxes one or both steps:

  • Monte Carlo EM: approximate the E-step expectation with MCMC samples
  • Variational EM: replace the exact posterior with a tractable approximation
  • Stochastic EM: sample a single latent configuration instead of computing an expectation
  • ECM: break the M-step into simpler conditional maximizations

Formal Setup

Recall the EM objective. Given observed data $x$ and latent variables $z$, EM maximizes the ELBO:

$$\mathcal{L}(\theta, q) = \mathbb{E}_{q(z)}[\log p(x, z \mid \theta)] + H(q)$$

where $H(q)$ is the entropy of $q$. Standard EM sets $q(z) = p(z \mid x, \theta^{(t)})$ in the E-step (making the ELBO tight) and maximizes over $\theta$ in the M-step.
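The tightness claim can be checked numerically. The sketch below (illustrative; all variable names are my own) evaluates the ELBO for a two-component Gaussian mixture with one observation, plugging in the exact posterior for $q$, and confirms it equals the marginal log-likelihood:

```python
import numpy as np

def log_norm(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Two-component mixture: p(z=k) = pi[k], x | z=k ~ N(mu[k], 1)
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 2.0])
x = 0.5

# log joint p(x, z=k) and marginal log-likelihood log p(x)
log_joint = np.log(pi) + log_norm(x, mu, 1.0)
log_px = np.logaddexp(log_joint[0], log_joint[1])

# Exact posterior q(z) = p(z | x)
q = np.exp(log_joint - log_px)

# ELBO = E_q[log p(x, z)] + H(q); tight at the exact posterior
elbo = np.sum(q * log_joint) - np.sum(q * np.log(q))
print(elbo, log_px)  # identical up to floating-point error
```

With any other choice of $q$ (e.g. `q = np.array([0.5, 0.5])`), the same ELBO expression comes out strictly below `log_px`.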

Monte Carlo EM

Definition

Monte Carlo EM (MCEM)

Replace the E-step expectation with a Monte Carlo average. Draw $z^{(1)}, \ldots, z^{(M)}$ from $p(z \mid x, \theta^{(t)})$ using MCMC, then approximate:

$$Q(\theta \mid \theta^{(t)}) \approx \frac{1}{M} \sum_{m=1}^{M} \log p(x, z^{(m)} \mid \theta)$$

Maximize this approximate $Q$ in the M-step.

MCEM is useful when the posterior $p(z \mid x, \theta)$ can be sampled (via Gibbs or Metropolis-Hastings) but the expectation under it has no closed form. The number of MCMC samples $M$ must grow across iterations to guarantee convergence.
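A minimal sketch of the loop, under an illustrative toy model of my choosing: $z_i \sim N(\mu, 1)$, $x_i \mid z_i \sim N(z_i, 1)$. The E-step posterior is actually tractable here (which lets us sanity-check the answer: marginally $x_i \sim N(\mu, 2)$, so the MLE is $\bar{x}$), but we sample it with random-walk Metropolis to show the MCEM mechanics, growing $M_t$ across iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.5, np.sqrt(2.0), size=200)  # synthetic data, true mu = 1.5
mu = 0.0

def sample_posterior(xi, mu, M, step=1.0):
    """Random-walk Metropolis targeting p(z | xi, mu) ∝ N(xi; z, 1) N(z; mu, 1)."""
    z = xi
    out = np.empty(M)
    for m in range(M):
        prop = z + step * rng.normal()
        log_ratio = (-0.5 * ((xi - prop)**2 + (prop - mu)**2)
                     + 0.5 * ((xi - z)**2 + (z - mu)**2))
        if np.log(rng.uniform()) < log_ratio:
            z = prop
        out[m] = z
    return out

for t in range(20):
    M_t = 30 + 10 * t                 # growing MCMC sample size across iterations
    # Monte Carlo E-step: per-datum posterior means from MCMC samples
    zbar = np.array([sample_posterior(xi, mu, M_t).mean() for xi in x])
    # M-step: for this model, argmax_mu of the approximate Q is the sample mean
    mu = zbar.mean()

print(mu, x.mean())  # MCEM estimate lands near the exact MLE mean(x)
```

In a real application (e.g. a non-conjugate mixed-effects model) the posterior sampler is the only option; the fixed-point structure of the loop is unchanged.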

Theorem

MCEM Convergence

Statement

If the number of MCMC samples $M_t$ at iteration $t$ satisfies $M_t \to \infty$ as $t \to \infty$, then the MCEM sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ converges to a stationary point of the observed-data log-likelihood under standard regularity conditions.

Intuition

Early iterations use few samples (the parameter estimate is rough anyway). Later iterations use more samples for precision. The increasing sample size ensures that the Monte Carlo noise vanishes as the parameters approach a fixed point.

Proof Sketch

Show that the approximate $Q$ function converges uniformly to the true $Q$ as $M \to \infty$ (by the law of large numbers for MCMC). Then apply the standard EM convergence proof (monotonicity of the likelihood) with a perturbation argument for the Monte Carlo error.

Why It Matters

MCEM makes EM applicable to models where the E-step integral is intractable, such as mixed-effects models with non-conjugate priors or complex spatial models.

Failure Mode

If $M_t$ does not grow fast enough, the Monte Carlo noise can prevent convergence. If the MCMC sampler for the E-step has poor mixing, the samples are correlated and the effective sample size is much smaller than $M_t$. In practice, monitoring the MCMC convergence within each E-step is critical.

Variational EM

Definition

Variational EM

Instead of computing $p(z \mid x, \theta)$ exactly, restrict $q$ to a tractable family $\mathcal{Q}$ (e.g., fully factored distributions) and maximize the ELBO over both $q$ and $\theta$:

Variational E-step: $q^{(t+1)} = \arg\max_{q \in \mathcal{Q}} \mathcal{L}(\theta^{(t)}, q)$

M-step: $\theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(\theta, q^{(t+1)})$

The variational E-step minimizes $\text{KL}(q \,\|\, p(z \mid x, \theta))$ within the family $\mathcal{Q}$. Because $q$ is restricted, the ELBO is no longer tight, so variational EM maximizes a lower bound on the log-likelihood rather than the log-likelihood itself.

Theorem

Variational EM Monotonically Increases the ELBO

Statement

Each iteration of variational EM increases (or preserves) the ELBO:

$$\mathcal{L}(\theta^{(t+1)}, q^{(t+1)}) \geq \mathcal{L}(\theta^{(t)}, q^{(t)})$$

Intuition

The variational E-step increases the ELBO by optimizing over $q$ (holding $\theta$ fixed). The M-step increases it by optimizing over $\theta$ (holding $q$ fixed). Each step can only improve or maintain the objective.

Proof Sketch

By construction, $\mathcal{L}(\theta^{(t)}, q^{(t+1)}) \geq \mathcal{L}(\theta^{(t)}, q^{(t)})$ since $q^{(t+1)}$ maximizes over $q$. Then $\mathcal{L}(\theta^{(t+1)}, q^{(t+1)}) \geq \mathcal{L}(\theta^{(t)}, q^{(t+1)})$ since $\theta^{(t+1)}$ maximizes over $\theta$.

Why It Matters

Variational EM is the conceptual ancestor of VAEs. The VAE replaces the per-example optimization of $q$ with an amortized inference network $q_\phi(z \mid x)$ that maps any input $x$ to an approximate posterior. Training the VAE is variational EM with the E-step amortized by a neural network.

Failure Mode

The ELBO is a lower bound on the log-likelihood, not the log-likelihood itself. Variational EM can converge to a point that maximizes the ELBO but is far from the MLE if the variational family $\mathcal{Q}$ is too restrictive. Mean-field approximations can miss posterior correlations entirely.
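The missed-correlation failure has a clean closed form in the Gaussian case, which the sketch below exploits: for a target $N(0, \Sigma)$, the optimal factored Gaussian $q_i(z_i)$ has precision equal to the diagonal of $\Lambda = \Sigma^{-1}$ (a standard mean-field result, e.g. Bishop §10.1.2):

```python
import numpy as np

# Correlated bivariate Gaussian "posterior" with correlation rho
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)

# Optimal mean-field factor q_i is N(0, 1 / Lambda_ii)
mean_field_var = 1.0 / np.diag(Lambda)   # equals 1 - rho^2 here
marginal_var = np.diag(Sigma)            # true marginal variance of each z_i

print(mean_field_var, marginal_var)      # 0.19 vs 1.0 per coordinate
```

The factored approximation carries no correlation at all and is far too confident about each coordinate: its variance $1 - \rho^2 = 0.19$ versus the true marginal variance of $1$.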

Stochastic EM and ECM

Stochastic EM replaces the E-step expectation with a single draw from $p(z \mid x, \theta^{(t)})$. The M-step maximizes the complete-data log-likelihood at that single imputed $z$. This introduces noise but avoids computing expectations entirely. The parameter sequence does not converge to a point; it converges to a stationary distribution around a local maximum.
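A sketch for a two-component unit-variance Gaussian mixture (an illustrative setup of my choosing): each E-step draws one hard assignment per point from the posterior responsibilities, and the M-step is the trivial complete-data update on the imputed labels:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 30% from N(-2, 1), 70% from N(2, 1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])
pi, mu = 0.5, np.array([-1.0, 1.0])

for t in range(200):
    # E-step: posterior responsibility of component 1, then a single draw z_i
    log0 = np.log(1 - pi) - 0.5 * (x - mu[0])**2
    log1 = np.log(pi) - 0.5 * (x - mu[1])**2
    r1 = 1.0 / (1.0 + np.exp(log0 - log1))
    z = rng.uniform(size=x.size) < r1        # one sampled latent configuration

    # M-step on the imputed complete data
    pi = z.mean()
    mu = np.array([x[~z].mean(), x[z].mean()])

# Parameters fluctuate around the MLE rather than converging to a point
print(pi, mu)
```

Logging `pi` across iterations shows it never settles: it keeps jittering in a band around $0.7$, which is exactly the "stationary distribution around a local maximum" behavior described above.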

ECM (Expectation Conditional Maximization) replaces the M-step with several conditional maximization steps, each optimizing over a subset of parameters while holding the rest fixed. This is useful when the joint M-step has no closed form but the conditional M-steps do.
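The conditional-maximization mechanic is plain coordinate ascent over parameter blocks. The sketch below shows only that mechanic, on an illustrative least-squares objective with no latent variables: fitting $y \approx a e^{bx}$, where the joint maximizer over $(a, b)$ has no closed form but each conditional update is easy:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2, 50)
y = 1.5 * np.exp(0.8 * x) + 0.05 * rng.normal(size=x.size)  # true (a, b) = (1.5, 0.8)

a, b = 1.0, 0.0
for t in range(40):
    # CM-step 1: closed-form maximization over a with b held fixed
    e = np.exp(b * x)
    a = (y @ e) / (e @ e)
    # CM-step 2: 1-D maximization over b with a held fixed (grid refinement)
    bs = np.linspace(b - 0.5, b + 0.5, 201)
    sse = ((y[None, :] - a * np.exp(bs[:, None] * x[None, :]))**2).sum(axis=1)
    b = bs[np.argmin(sse)]

print(a, b)  # close to the true (1.5, 0.8)
```

In real ECM the objective being coordinate-ascended is the $Q$ function from the E-step, and each CM-step still guarantees the likelihood does not decrease, which preserves EM's monotonicity property.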

Connection to VAEs

The VAE training objective is:

$$\max_{\theta, \phi} \; \mathbb{E}_{x \sim \text{data}} \left[ \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \text{KL}(q_\phi(z \mid x) \,\|\, p(z)) \right]$$

This is variational EM where the variational family is parameterized by a neural network $q_\phi$. The E-step (optimizing $\phi$) and M-step (optimizing $\theta$) happen simultaneously via gradient descent. The reparameterization trick makes the expectation over $q_\phi$ differentiable with respect to $\phi$.
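The trick itself is one line: for $q_\phi(z \mid x) = N(\mu, \sigma^2)$, write $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$, so the expectation is over a fixed noise distribution and Monte Carlo estimates become differentiable in $(\mu, \sigma)$. The sketch below (illustrative, NumPy only, no gradients) checks that the reparameterized Monte Carlo estimate of the KL term matches its Gaussian closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.2, 0.7            # parameters of q(z) = N(mu, sigma^2); prior is N(0, 1)

# Reparameterized samples: z = mu + sigma * eps, eps ~ N(0, 1)
eps = rng.normal(size=100_000)
z = mu + sigma * eps

# Monte Carlo estimate of KL(q || p) = E_q[log q(z) - log p(z)]
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * eps**2
log_p = -0.5 * np.log(2 * np.pi) - 0.5 * z**2
kl_mc = (log_q - log_p).mean()

# Closed form: KL(N(mu, s^2) || N(0, 1)) = (mu^2 + s^2 - 1)/2 - log s
kl_exact = 0.5 * (mu**2 + sigma**2 - 1.0) - np.log(sigma)
print(kl_mc, kl_exact)          # agree to Monte Carlo accuracy
```

In an actual VAE this estimate is a function of $(\mu, \sigma)$ with `eps` held fixed, so an autodiff framework can backpropagate through it into the encoder weights $\phi$.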

Common Confusions

Watch Out

Variational EM is not the same as variational inference

Variational inference optimizes $q$ for fixed $\theta$ (or with $\theta$ integrated out). Variational EM alternates between optimizing $q$ and $\theta$. The distinction matters: variational inference is Bayesian (treats $\theta$ as random), while variational EM gives a point estimate of $\theta$.

Watch Out

MCEM does not require the posterior to have a closed form

The whole point of MCEM is that you sample from the posterior using MCMC rather than computing it analytically. You need to be able to sample from $p(z \mid x, \theta)$, not compute it in closed form.

Summary

  • MCEM: approximate E-step with MCMC; increase sample size across iterations
  • Variational EM: restrict $q$ to a tractable family; maximizes the ELBO, not the log-likelihood
  • Stochastic EM: single sample E-step; converges to a distribution, not a point
  • ECM: break M-step into conditional maximizations
  • VAE = amortized variational EM with neural network encoder

Exercises

ExerciseCore

Problem

In variational EM with a mean-field approximation $q(z) = \prod_i q_i(z_i)$, why might the approximate posterior miss important structure in the true posterior? Give a concrete example.

ExerciseAdvanced

Problem

Suppose you run MCEM with a fixed number of MCMC samples $M = 100$ at every iteration. Can you guarantee convergence to the MLE? Why or why not?

References

Canonical:

  • Wei & Tanner, "A Monte Carlo Implementation of the EM Algorithm," JASA 85(411), 1990
  • Meng & Rubin, "Maximum Likelihood Estimation via the ECM Algorithm," Biometrika 80(2), 1993

Current:

  • Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR 2014

  • Blei, Kucukelbir, McAuliffe, "Variational Inference: A Review for Statisticians," JASA 112(518), 2017

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 9-10

Next Topics

From EM variants, the natural continuations are:

  • Diffusion models: also use variational lower bounds with latent variable structure
  • Autoencoders: VAEs are the amortized variational EM connection made concrete

Last reviewed: April 2026
