ML Methods
EM Algorithm Variants
Variants of EM for when the standard algorithm is intractable: Monte Carlo EM, Variational EM, Stochastic EM, and ECM. Connection to VAEs as amortized variational EM.
Why This Matters
Standard EM requires computing the exact posterior in the E-step and solving a closed-form maximization in the M-step. For most interesting models, at least one of these is intractable. EM variants replace exact computation with approximation, trading statistical efficiency for computational feasibility. Understanding these variants is necessary for understanding VAEs, modern Bayesian deep learning, and any latent variable model beyond Gaussian mixtures.
Mental Model
Standard EM alternates between computing the expected complete-data log-likelihood (E-step) and maximizing it (M-step). Each variant relaxes one or both steps:
- Monte Carlo EM: approximate the E-step expectation with MCMC samples
- Variational EM: replace the exact posterior with a tractable approximation
- Stochastic EM: sample a single latent configuration instead of computing an expectation
- ECM: break the M-step into simpler conditional maximizations
Formal Setup
Recall the EM objective. Given observed data $x$ and latent variables $z$, EM maximizes the ELBO:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta)\big] + H(q)$$

where $H(q)$ is the entropy of $q$. Standard EM sets $q(z) = p(z \mid x, \theta^{(t)})$ in the E-step (making the ELBO tight) and maximizes $\mathcal{L}$ over $\theta$ in the M-step.
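The tightness claim is easy to check numerically. The sketch below uses a two-state toy model (all numbers invented for illustration) and confirms that the ELBO equals $\log p(x)$ exactly when $q$ is the posterior, and is strictly smaller for any other $q$:

```python
import numpy as np

# Two-state toy model: prior p(z) and likelihood p(x | z) for one observed x.
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([0.9, 0.2])

p_xz = p_z * p_x_given_z               # joint p(x, z) for each value of z
log_px = np.log(p_xz.sum())            # exact log-likelihood log p(x)

def elbo(q):
    """E_q[log p(x, z)] + H(q) for a distribution q over the two states."""
    return (q * np.log(p_xz)).sum() - (q * np.log(q)).sum()

posterior = p_xz / p_xz.sum()
tight = elbo(posterior)                # equals log p(x): the bound is tight
loose = elbo(np.array([0.5, 0.5]))     # any other q gives a strict lower bound
print(tight, log_px, loose)
```

The gap $\log p(x) - \mathcal{L}(q, \theta)$ is exactly $\mathrm{KL}(q \,\|\, p(z \mid x, \theta))$, which is why the bound closes only at the posterior.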
Monte Carlo EM
Monte Carlo EM (MCEM)
Replace the E-step expectation with a Monte Carlo average. Draw $z^{(1)}, \dots, z^{(m)}$ from $p(z \mid x, \theta^{(t)})$ using MCMC, then approximate:

$$\hat{Q}(\theta \mid \theta^{(t)}) = \frac{1}{m} \sum_{s=1}^{m} \log p(x, z^{(s)} \mid \theta)$$

Maximize this approximate $\hat{Q}$ in the M-step.
MCEM is useful when the posterior can be sampled (via Gibbs or Metropolis-Hastings) but the expectation under it has no closed form. The number of MCMC samples $m$ must grow across iterations to guarantee convergence.
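As a minimal illustration, here is an MCEM sketch for a two-component Gaussian mixture with known unit variances and equal weights, estimating only the means. In this toy case the posterior over assignments is exact, so direct sampling stands in for the MCMC step a harder model would need; the sample size grows linearly across iterations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two-component 1-D Gaussian mixture, unit variances, equal weights.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

def mcem_gmm_means(x, n_iters=50, m0=5):
    """MCEM for the two component means; m_t grows linearly across iterations."""
    mu = np.array([-1.0, 1.0])              # initial guess
    for t in range(n_iters):
        m = m0 + t                          # growing Monte Carlo sample size m_t
        # "E-step": draw m assignment vectors from p(z | x, mu).  For a GMM
        # this posterior is exact; a harder model would run MCMC here.
        logits = -0.5 * (x[:, None] - mu[None, :]) ** 2
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        u = rng.random((m, len(x)))[..., None]
        z = (u < p.cumsum(axis=1)[None, :, :]).argmax(axis=-1)   # (m, n)
        # M-step: maximize the Monte Carlo average of the complete-data
        # log-likelihood -> per-component means over all sampled assignments.
        x_rep = np.broadcast_to(x, (m, len(x)))
        mu = np.array([x_rep[z == k].mean() for k in (0, 1)])
    return np.sort(mu)

mu_hat = mcem_gmm_means(x)
print(mu_hat)   # close to the true means (-2, 3)
```

With a fixed small $m$ the final iterates would keep jittering; the growing schedule is what lets the Monte Carlo noise die out.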
MCEM Convergence
Statement
If the number of MCMC samples $m_t$ at iteration $t$ satisfies $m_t \to \infty$ as $t \to \infty$, then the MCEM sequence converges to a stationary point of the observed-data log-likelihood under standard regularity conditions.
Intuition
Early iterations use few samples (the parameter estimate is rough anyway). Later iterations use more samples for precision. The increasing sample size ensures that the Monte Carlo noise vanishes as the parameters approach a fixed point.
Proof Sketch
Show that the approximate function $\hat{Q}$ converges uniformly to the true $Q$ as $m_t \to \infty$ (by the law of large numbers for MCMC). Then apply the standard EM convergence proof (monotonicity of the likelihood) with a perturbation argument for the Monte Carlo error.
Why It Matters
MCEM makes EM applicable to models where the E-step integral is intractable, such as mixed-effects models with non-conjugate priors or complex spatial models.
Failure Mode
If $m_t$ does not grow fast enough, the Monte Carlo noise can prevent convergence. If the MCMC sampler for the E-step has poor mixing, the samples are correlated and the effective sample size is much smaller than $m_t$. In practice, monitoring MCMC convergence within each E-step is critical.
Variational EM
Variational EM
Instead of computing $p(z \mid x, \theta)$ exactly, restrict $q$ to a tractable family $\mathcal{Q}$ (e.g., fully factored distributions) and maximize the ELBO over both $q$ and $\theta$:

Variational E-step: $q^{(t+1)} = \arg\max_{q \in \mathcal{Q}} \mathcal{L}(q, \theta^{(t)})$

M-step: $\theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t+1)}, \theta)$

The variational E-step minimizes $\mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta^{(t)})\big)$ within the family $\mathcal{Q}$. Because $\mathcal{Q}$ is restricted, the ELBO is no longer tight, so variational EM maximizes a lower bound on the log-likelihood rather than the log-likelihood itself.
Variational EM Monotonically Increases the ELBO
Statement
Each iteration of variational EM increases (or preserves) the ELBO:

$$\mathcal{L}(q^{(t+1)}, \theta^{(t+1)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t)})$$
Intuition
The variational E-step increases the ELBO by optimizing over $q$ (holding $\theta$ fixed). The M-step increases it by optimizing over $\theta$ (holding $q$ fixed). Each step can only improve or maintain the objective.
Proof Sketch
By construction, $\mathcal{L}(q^{(t+1)}, \theta^{(t)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t)})$ since $q^{(t+1)}$ maximizes $\mathcal{L}(\cdot, \theta^{(t)})$ over $\mathcal{Q}$. Then $\mathcal{L}(q^{(t+1)}, \theta^{(t+1)}) \geq \mathcal{L}(q^{(t+1)}, \theta^{(t)})$ since $\theta^{(t+1)}$ maximizes $\mathcal{L}(q^{(t+1)}, \cdot)$ over $\theta$.
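Both coordinate-ascent steps, and the monotone ELBO, can be checked on a toy linear-Gaussian model $z_i \sim N(0,1)$, $x_i \mid z_i \sim N(\theta z_i, 1)$, chosen because the optimal Gaussian $q$ and the M-step both have closed forms (model and numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: z_i ~ N(0, 1), x_i | z_i ~ N(theta * z_i, 1); theta_true = 2.
n = 500
z_true = rng.normal(size=n)
x = 2.0 * z_true + rng.normal(size=n)

def elbo(theta, m, s):
    """ELBO = E_q[log p(x, z | theta)] + H(q) with q_i = N(m_i, s_i)."""
    ell = -0.5 * ((x - theta * m) ** 2 + theta**2 * s + m**2 + s) - np.log(2 * np.pi)
    ent = 0.5 * np.log(2 * np.pi * np.e * s)
    return (ell + ent).sum()

theta, m, s = 0.5, np.zeros(n), np.ones(n)
elbos = []
for _ in range(20):
    # Variational E-step: the optimal Gaussian q is the exact posterior here.
    s = np.full(n, 1.0 / (theta**2 + 1.0))
    m = theta * x * s
    # M-step: closed-form maximizer of the ELBO over theta.
    theta = (x * m).sum() / (m**2 + s).sum()
    elbos.append(elbo(theta, m, s))

print(theta)   # approaches sqrt(var(x) - 1), close to 2
```

Because the Gaussian family here contains the true posterior, the ELBO is tight and this reduces to ordinary EM; with a genuinely restricted family the same loop would climb a strictly lower bound.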
Why It Matters
Variational EM is the conceptual ancestor of VAEs. The VAE replaces the per-example optimization of $q$ with an amortized inference network that maps any input $x$ to an approximate posterior $q_\phi(z \mid x)$. Training the VAE is variational EM with the E-step amortized by a neural network.
Failure Mode
The ELBO is a lower bound on the log-likelihood, not the log-likelihood itself. Variational EM can converge to a point that maximizes the ELBO but is far from the MLE if the variational family is too restrictive. Mean-field approximations can miss posterior correlations entirely.
Stochastic EM and ECM
Stochastic EM replaces the E-step expectation with a single draw from $p(z \mid x, \theta^{(t)})$. The M-step maximizes the complete-data log-likelihood at that single imputed $z$. This introduces noise but avoids computing expectations entirely. The parameter sequence does not converge to a point; it converges to a stationary distribution around a local maximum.
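A sketch of stochastic EM on a toy two-component mixture (known unit variances, equal weights, means only) shows the characteristic behavior: the trajectory keeps fluctuating around the local maximum rather than settling:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: two well-separated components with unit variances, equal weights.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

mu = np.array([-1.0, 1.0])
trace = []
for t in range(300):
    # S-step: draw ONE latent configuration z from p(z | x, mu).
    logits = -0.5 * (x[:, None] - mu[None, :]) ** 2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = (rng.random(len(x)) > p[:, 0]).astype(int)
    # M-step: maximize the complete-data log-likelihood at this imputation.
    mu = np.array([x[z == k].mean() for k in (0, 1)])
    trace.append(mu.copy())

# The trajectory fluctuates around the local maximum instead of settling.
trace = np.array(trace[100:])                  # discard burn-in
mean_mu, sd_mu = trace.mean(axis=0), trace.std(axis=0)
print(mean_mu, sd_mu)
```

In practice one averages the post-burn-in iterates (as above) to get a point estimate from the stationary distribution.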
ECM (Expectation Conditional Maximization) replaces the M-step with several conditional maximization steps, each optimizing over a subset of parameters while holding the rest fixed. This is useful when the joint M-step has no closed form but the conditional M-steps do.
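The conditional-maximization pattern can be illustrated even without latent variables: fitting an additive model $y_{ij} = a_i + b_j + \varepsilon_{ij}$ by alternating one-line updates for $a$ given $b$ and $b$ given $a$. A full ECM would wrap these CM-steps around an E-step; the data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)

# Additive model y_ij = a_i + b_j + noise.  Maximizing over (a, b) jointly
# requires a linear solve; each conditional maximization is a one-line update.
a_true = np.array([1.0, 2.0, 3.0])
b_true = np.array([-1.5, -0.5, 0.5, 1.5])       # centered for identifiability
y = a_true[:, None] + b_true[None, :] + 0.1 * rng.normal(size=(3, 4))

a, b = np.zeros(3), np.zeros(4)
for _ in range(50):
    a = (y - b[None, :]).mean(axis=1)   # CM-step 1: maximize over a given b
    b = (y - a[:, None]).mean(axis=0)   # CM-step 2: maximize over b given a
    b -= b.mean()                        # resolve the additive indeterminacy
print(a, b)
```

As in ECM proper, each conditional step can only increase the objective, so the alternation inherits EM's monotonicity.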
Connection to VAEs
The VAE training objective is:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

This is variational EM where the variational family is parameterized by a neural network (the encoder) with weights $\phi$. The E-step (optimizing $\phi$) and M-step (optimizing $\theta$) happen simultaneously via gradient descent. The reparameterization trick makes the expectation over $q_\phi(z \mid x)$ differentiable with respect to $\phi$.
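A minimal numerical sketch of the reparameterization trick, with no neural network involved: take $f(z) = z^2$, where the exact gradients $\partial_\mu \mathbb{E}[z^2] = 2\mu$ and $\partial_\sigma \mathbb{E}[z^2] = 2\sigma$ are known in closed form, and check that the pathwise estimator recovers them:

```python
import numpy as np

rng = np.random.default_rng(4)

# Reparameterize z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1),
# so a Monte Carlo estimate of E_q[f(z)] becomes differentiable in (mu, sigma).
# With f(z) = z^2 we know E[f] = mu^2 + sigma^2 exactly.
mu, sigma = 1.5, 0.7
eps = rng.normal(size=100_000)
z = mu + sigma * eps

grad_mu = (2 * z).mean()            # d/dmu of z^2 is 2z * dz/dmu = 2z * 1
grad_sigma = (2 * z * eps).mean()   # d/dsigma of z^2 is 2z * dz/dsigma = 2z * eps
print(grad_mu, grad_sigma)          # approx 2*mu = 3.0 and 2*sigma = 1.4
```

In a VAE, autodiff applies exactly this pathwise differentiation through $z = \mu_\phi(x) + \sigma_\phi(x)\,\epsilon$, which is what makes the ELBO trainable by gradient descent on $\phi$.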
Common Confusions
Variational EM is not the same as variational inference
Variational inference optimizes $q$ for fixed $\theta$ (or with $\theta$ integrated out). Variational EM alternates between optimizing $q$ and $\theta$. The distinction matters: variational inference is Bayesian (it treats $\theta$ as random), while variational EM gives a point estimate of $\theta$.
MCEM does not require the posterior to have a closed form
The whole point of MCEM is that you sample from the posterior using MCMC rather than computing it analytically. You need to be able to sample from $p(z \mid x, \theta)$, not compute it in closed form.
Summary
- MCEM: approximate E-step with MCMC; increase sample size across iterations
- Variational EM: restrict $q$ to a tractable family; maximizes the ELBO, not the log-likelihood
- Stochastic EM: single sample E-step; converges to a distribution, not a point
- ECM: break M-step into conditional maximizations
- VAE = amortized variational EM with neural network encoder
Exercises
Problem
In variational EM with a mean-field approximation $q(z) = \prod_i q_i(z_i)$, why might the approximate posterior miss important structure in the true posterior? Give a concrete example.
Problem
Suppose you run MCEM with a fixed number of MCMC samples at every iteration. Can you guarantee convergence to the MLE? Why or why not?
References
Canonical:
- Wei & Tanner, "A Monte Carlo Implementation of the EM Algorithm," JASA 85(411), 1990
- Meng & Rubin, "Maximum Likelihood Estimation via the ECM Algorithm," Biometrika 80(2), 1993
Current:
- Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR 2014
- Blei, Kucukelbir & McAuliffe, "Variational Inference: A Review for Statisticians," JASA 112(518), 2017
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 9-10
Next Topics
From EM variants, the natural continuations are:
- Diffusion models: also use variational lower bounds with latent variable structure
- Autoencoders: VAEs are the amortized variational EM connection made concrete
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- The EM Algorithm (Layer 2)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)