
ML Methods

EM Algorithm Variants

Variants of EM for when the standard algorithm is intractable: Monte Carlo EM, Variational EM, Stochastic EM, and ECM. Connection to VAEs as amortized variational EM.



Why This Matters

Standard EM requires computing the exact posterior $p(z \mid x, \theta)$ in the E-step and solving a closed-form maximization in the M-step. For most interesting models, at least one of these is intractable. EM variants replace exact computation with approximation, trading statistical efficiency for computational feasibility. Understanding these variants is necessary for understanding VAEs, modern Bayesian deep learning, and any latent variable model beyond Gaussian mixtures.

Mental Model

Standard EM alternates between computing the expected complete-data log-likelihood (E-step) and maximizing it (M-step). Each variant relaxes one or both steps:

  • Monte Carlo EM: approximate the E-step expectation with MCMC samples
  • Variational EM: replace the exact posterior with a tractable approximation
  • Stochastic EM: sample a single latent configuration instead of computing an expectation
  • ECM: break the M-step into simpler conditional maximizations

Formal Setup

Recall the EM objective. Given observed data $x$ and latent variables $z$, EM maximizes the ELBO:

$$\mathcal{L}(\theta, q) = \mathbb{E}_{q(z)}[\log p(x, z \mid \theta)] + H(q)$$

where $H(q)$ is the entropy of $q$. Standard EM sets $q(z) = p(z \mid x, \theta^{(t)})$ in the E-step (making the ELBO tight) and maximizes over $\theta$ in the M-step.
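The tightness claim can be checked numerically. The sketch below (illustrative; all variable names are my own) evaluates the ELBO for a two-component Gaussian mixture with one observation, plugging in the exact posterior for $q$, and confirms it equals the marginal log-likelihood:

```python
import numpy as np

def log_norm(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Two-component mixture: p(z=k) = pi[k], x | z=k ~ N(mu[k], 1)
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 2.0])
x = 0.5

# log joint p(x, z=k) and marginal log-likelihood log p(x)
log_joint = np.log(pi) + log_norm(x, mu, 1.0)
log_px = np.logaddexp(log_joint[0], log_joint[1])

# Exact posterior q(z) = p(z | x)
q = np.exp(log_joint - log_px)

# ELBO = E_q[log p(x, z)] + H(q); tight at the exact posterior
elbo = np.sum(q * log_joint) - np.sum(q * np.log(q))
print(elbo, log_px)  # identical up to floating-point error
```

With any other choice of $q$ (e.g. `q = np.array([0.5, 0.5])`), the same ELBO expression comes out strictly below `log_px`.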

Monte Carlo EM

Definition

Monte Carlo EM (MCEM)

Replace the E-step expectation with a Monte Carlo average. Draw $z^{(1)}, \ldots, z^{(M)}$ from $p(z \mid x, \theta^{(t)})$ using MCMC, then approximate:

$$Q(\theta \mid \theta^{(t)}) \approx \frac{1}{M} \sum_{m=1}^{M} \log p(x, z^{(m)} \mid \theta)$$

Maximize this approximate $Q$ in the M-step.

MCEM is useful when the posterior $p(z \mid x, \theta)$ can be sampled (via Gibbs or Metropolis-Hastings) but the expectation under it has no closed form. The number of MCMC samples $M$ must grow across iterations to guarantee convergence.
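A minimal sketch of the loop, under an illustrative toy model of my choosing: $z_i \sim N(\mu, 1)$, $x_i \mid z_i \sim N(z_i, 1)$. The E-step posterior is actually tractable here (which lets us sanity-check the answer: marginally $x_i \sim N(\mu, 2)$, so the MLE is $\bar{x}$), but we sample it with random-walk Metropolis to show the MCEM mechanics, growing $M_t$ across iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.5, np.sqrt(2.0), size=200)  # synthetic data, true mu = 1.5
mu = 0.0

def sample_posterior(xi, mu, M, step=1.0):
    """Random-walk Metropolis targeting p(z | xi, mu) ∝ N(xi; z, 1) N(z; mu, 1)."""
    z = xi
    out = np.empty(M)
    for m in range(M):
        prop = z + step * rng.normal()
        log_ratio = (-0.5 * ((xi - prop)**2 + (prop - mu)**2)
                     + 0.5 * ((xi - z)**2 + (z - mu)**2))
        if np.log(rng.uniform()) < log_ratio:
            z = prop
        out[m] = z
    return out

for t in range(20):
    M_t = 30 + 10 * t                 # growing MCMC sample size across iterations
    # Monte Carlo E-step: per-datum posterior means from MCMC samples
    zbar = np.array([sample_posterior(xi, mu, M_t).mean() for xi in x])
    # M-step: for this model, argmax_mu of the approximate Q is the sample mean
    mu = zbar.mean()

print(mu, x.mean())  # MCEM estimate lands near the exact MLE mean(x)
```

In a real application (e.g. a non-conjugate mixed-effects model) the posterior sampler is the only option; the fixed-point structure of the loop is unchanged.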

Theorem

MCEM Convergence

Statement

If the number of MCMC samples $M_t$ at iteration $t$ satisfies $M_t \to \infty$ as $t \to \infty$, then the MCEM sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ converges to a stationary point of the observed-data log-likelihood under standard regularity conditions.

Intuition

Early iterations use few samples (the parameter estimate is rough anyway). Later iterations use more samples for precision. The increasing sample size ensures that the Monte Carlo noise vanishes as the parameters approach a fixed point.

Proof Sketch

Show that the approximate $Q$ function converges uniformly to the true $Q$ as $M \to \infty$ (by the law of large numbers for MCMC). Then apply the standard EM convergence proof (monotonicity of the likelihood) with a perturbation argument for the Monte Carlo error.

Why It Matters

MCEM makes EM applicable to models where the E-step integral is intractable, such as mixed-effects models with non-conjugate priors or complex spatial models.

Failure Mode

If $M_t$ does not grow fast enough, the Monte Carlo noise can prevent convergence. If the MCMC sampler for the E-step has poor mixing, the samples are correlated and the effective sample size is much smaller than $M_t$. In practice, monitoring the MCMC convergence within each E-step is critical.

Variational EM

Definition

Variational EM

Instead of computing $p(z \mid x, \theta)$ exactly, restrict $q$ to a tractable family $\mathcal{Q}$ (e.g., fully factored distributions) and maximize the ELBO over both $q$ and $\theta$:

Variational E-step: $q^{(t+1)} = \arg\max_{q \in \mathcal{Q}} \mathcal{L}(\theta^{(t)}, q)$

M-step: $\theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(\theta, q^{(t+1)})$

The variational E-step minimizes $\text{KL}(q \,\|\, p(z \mid x, \theta))$ within the family $\mathcal{Q}$. Because $q$ is restricted, the ELBO is no longer tight, so variational EM maximizes a lower bound on the log-likelihood rather than the log-likelihood itself.

Theorem

Variational EM Monotonically Increases the ELBO

Statement

Each iteration of variational EM increases (or preserves) the ELBO:

$$\mathcal{L}(\theta^{(t+1)}, q^{(t+1)}) \geq \mathcal{L}(\theta^{(t)}, q^{(t)})$$

Intuition

The variational E-step increases the ELBO by optimizing over $q$ (holding $\theta$ fixed). The M-step increases it by optimizing over $\theta$ (holding $q$ fixed). Each step can only improve or maintain the objective.

Proof Sketch

By construction, $\mathcal{L}(\theta^{(t)}, q^{(t+1)}) \geq \mathcal{L}(\theta^{(t)}, q^{(t)})$ since $q^{(t+1)}$ maximizes over $q$. Then $\mathcal{L}(\theta^{(t+1)}, q^{(t+1)}) \geq \mathcal{L}(\theta^{(t)}, q^{(t+1)})$ since $\theta^{(t+1)}$ maximizes over $\theta$.

Why It Matters

Variational EM is the conceptual ancestor of VAEs. The VAE replaces the per-example optimization of $q$ with an amortized inference network $q_\phi(z \mid x)$ that maps any input $x$ to an approximate posterior. Training the VAE is variational EM with the E-step amortized by a neural network.

Failure Mode

The ELBO is a lower bound on the log-likelihood, not the log-likelihood itself. Variational EM can converge to a point that maximizes the ELBO but is far from the MLE if the variational family $\mathcal{Q}$ is too restrictive. Mean-field approximations can miss posterior correlations entirely.
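The missed-correlation failure has a clean closed form in the Gaussian case, which the sketch below exploits: for a target $N(0, \Sigma)$, the optimal factored Gaussian $q_i(z_i)$ has precision equal to the diagonal of $\Lambda = \Sigma^{-1}$ (a standard mean-field result, e.g. Bishop §10.1.2):

```python
import numpy as np

# Correlated bivariate Gaussian "posterior" with correlation rho
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)

# Optimal mean-field factor q_i is N(0, 1 / Lambda_ii)
mean_field_var = 1.0 / np.diag(Lambda)   # equals 1 - rho^2 here
marginal_var = np.diag(Sigma)            # true marginal variance of each z_i

print(mean_field_var, marginal_var)      # 0.19 vs 1.0 per coordinate
```

The factored approximation carries no correlation at all and is far too confident about each coordinate: its variance $1 - \rho^2 = 0.19$ versus the true marginal variance of $1$.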

Stochastic EM and ECM

Stochastic EM replaces the E-step expectation with a single draw from $p(z \mid x, \theta^{(t)})$. The M-step maximizes the complete-data log-likelihood at that single imputed $z$. This introduces noise but avoids computing expectations entirely. The parameter sequence does not converge to a point; it converges to a stationary distribution around a local maximum.
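A sketch for a two-component unit-variance Gaussian mixture (an illustrative setup of my choosing): each E-step draws one hard assignment per point from the posterior responsibilities, and the M-step is the trivial complete-data update on the imputed labels:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 30% from N(-2, 1), 70% from N(2, 1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])
pi, mu = 0.5, np.array([-1.0, 1.0])

for t in range(200):
    # E-step: posterior responsibility of component 1, then a single draw z_i
    log0 = np.log(1 - pi) - 0.5 * (x - mu[0])**2
    log1 = np.log(pi) - 0.5 * (x - mu[1])**2
    r1 = 1.0 / (1.0 + np.exp(log0 - log1))
    z = rng.uniform(size=x.size) < r1        # one sampled latent configuration

    # M-step on the imputed complete data
    pi = z.mean()
    mu = np.array([x[~z].mean(), x[z].mean()])

# Parameters fluctuate around the MLE rather than converging to a point
print(pi, mu)
```

Logging `pi` across iterations shows it never settles: it keeps jittering in a band around $0.7$, which is exactly the "stationary distribution around a local maximum" behavior described above.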

ECM (Expectation Conditional Maximization) replaces the M-step with several conditional maximization steps, each optimizing over a subset of parameters while holding the rest fixed. This is useful when the joint M-step has no closed form but the conditional M-steps do.
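The conditional-maximization mechanic is plain coordinate ascent over parameter blocks. The sketch below shows only that mechanic, on an illustrative least-squares objective with no latent variables: fitting $y \approx a e^{bx}$, where the joint maximizer over $(a, b)$ has no closed form but each conditional update is easy:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 2, 50)
y = 1.5 * np.exp(0.8 * x) + 0.05 * rng.normal(size=x.size)  # true (a, b) = (1.5, 0.8)

a, b = 1.0, 0.0
for t in range(40):
    # CM-step 1: closed-form maximization over a with b held fixed
    e = np.exp(b * x)
    a = (y @ e) / (e @ e)
    # CM-step 2: 1-D maximization over b with a held fixed (grid refinement)
    bs = np.linspace(b - 0.5, b + 0.5, 201)
    sse = ((y[None, :] - a * np.exp(bs[:, None] * x[None, :]))**2).sum(axis=1)
    b = bs[np.argmin(sse)]

print(a, b)  # close to the true (1.5, 0.8)
```

In real ECM the objective being coordinate-ascended is the $Q$ function from the E-step, and each CM-step still guarantees the likelihood does not decrease, which preserves EM's monotonicity property.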

Connection to VAEs

The VAE training objective is:

$$\max_{\theta, \phi} \; \mathbb{E}_{x \sim \text{data}} \left[ \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \text{KL}(q_\phi(z \mid x) \,\|\, p(z)) \right]$$

This is variational EM where the variational family is parameterized by a neural network $q_\phi$. The E-step (optimizing $\phi$) and M-step (optimizing $\theta$) happen simultaneously via gradient descent. The reparameterization trick makes the expectation over $q_\phi$ differentiable with respect to $\phi$.
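The trick itself is one line: for $q_\phi(z \mid x) = N(\mu, \sigma^2)$, write $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$, so the expectation is over a fixed noise distribution and Monte Carlo estimates become differentiable in $(\mu, \sigma)$. The sketch below (illustrative, NumPy only, no gradients) checks that the reparameterized Monte Carlo estimate of the KL term matches its Gaussian closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.2, 0.7            # parameters of q(z) = N(mu, sigma^2); prior is N(0, 1)

# Reparameterized samples: z = mu + sigma * eps, eps ~ N(0, 1)
eps = rng.normal(size=100_000)
z = mu + sigma * eps

# Monte Carlo estimate of KL(q || p) = E_q[log q(z) - log p(z)]
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * eps**2
log_p = -0.5 * np.log(2 * np.pi) - 0.5 * z**2
kl_mc = (log_q - log_p).mean()

# Closed form: KL(N(mu, s^2) || N(0, 1)) = (mu^2 + s^2 - 1)/2 - log s
kl_exact = 0.5 * (mu**2 + sigma**2 - 1.0) - np.log(sigma)
print(kl_mc, kl_exact)          # agree to Monte Carlo accuracy
```

In an actual VAE this estimate is a function of $(\mu, \sigma)$ with `eps` held fixed, so an autodiff framework can backpropagate through it into the encoder weights $\phi$.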

Common Confusions

Watch Out

Variational EM is not the same as variational inference

Variational inference optimizes $q$ for fixed $\theta$ (or with $\theta$ integrated out). Variational EM alternates between optimizing $q$ and $\theta$. The distinction matters: variational inference is Bayesian (treats $\theta$ as random), while variational EM gives a point estimate of $\theta$.

Watch Out

MCEM does not require the posterior to have a closed form

The whole point of MCEM is that you sample from the posterior using MCMC rather than computing it analytically. You need to be able to sample from $p(z \mid x, \theta)$, not compute it in closed form.

Summary

  • MCEM: approximate E-step with MCMC; increase sample size across iterations
  • Variational EM: restrict $q$ to a tractable family; maximizes the ELBO, not the log-likelihood
  • Stochastic EM: single sample E-step; converges to a distribution, not a point
  • ECM: break M-step into conditional maximizations
  • VAE = amortized variational EM with neural network encoder

Exercises

ExerciseCore

Problem

In variational EM with a mean-field approximation $q(z) = \prod_i q_i(z_i)$, why might the approximate posterior miss important structure in the true posterior? Give a concrete example.

ExerciseAdvanced

Problem

Suppose you run MCEM with a fixed number of MCMC samples $M = 100$ at every iteration. Can you guarantee convergence to the MLE? Why or why not?

References

Canonical:

  • Wei & Tanner, "A Monte Carlo Implementation of the EM Algorithm," JASA 85(411), 1990
  • Meng & Rubin, "Maximum Likelihood Estimation via the ECM Algorithm," Biometrika 80(2), 1993

Current:

  • Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR 2014

  • Blei, Kucukelbir, McAuliffe, "Variational Inference: A Review for Statisticians," JASA 112(518), 2017

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 9-10

Next Topics

From EM variants, the natural continuations are:

  • Diffusion models: also use variational lower bounds with latent variable structure
  • Autoencoders: VAEs are the amortized variational EM connection made concrete

Last reviewed: April 2026
