Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

EM and Variants

The EM Algorithm

Expectation-Maximization: the principled way to do maximum likelihood when some variables are unobserved. Derives the ELBO, proves monotonic convergence, and shows why EM is the backbone of latent variable models.

Core · Tier 1 · Stable · ~80 min

Why This Matters

Figure: log-likelihood vs. EM iteration — each EM update increases the log-likelihood until convergence. Key: EM never decreases the log-likelihood. E-step: compute posterior over latent variables; M-step: maximize expected complete-data log-likelihood.

The EM algorithm is one of the most important algorithms in statistical machine learning. Whenever your model contains latent (unobserved) variables, you cannot simply write down the log-likelihood and differentiate: the latent variables couple the parameters in a way that makes direct optimization intractable. EM sidesteps this by alternating between inferring the latent variables and updating the parameters.

Gaussian mixture models, hidden Markov models, factor analysis, probabilistic PCA, topic models: all of these are classically fitted with EM. And the core idea, maximizing a lower bound on the log-likelihood, is exactly the principle behind variational autoencoders and modern variational inference.

If you understand EM deeply, you understand the conceptual engine behind a huge fraction of probabilistic machine learning.

Mental Model

You want to maximize the log-likelihood $\log p(x \mid \theta)$, but there are hidden variables $z$ that make this hard. Direct marginalization $\log \sum_z p(x, z \mid \theta)$ puts a sum inside a log, which is analytically and computationally painful.

EM's trick: instead of maximizing the log-likelihood directly, construct a lower bound (the ELBO) that is easy to compute and maximize. Iterate: tighten the bound (E-step), then maximize it (M-step). Each iteration is guaranteed to increase (or not decrease) the true log-likelihood.

The Missing-Data Formulation

Definition

Complete-Data Log-Likelihood

The complete-data log-likelihood is $\log p(x, z \mid \theta)$, where $x$ is the observed data and $z$ is the latent (missing) data. If we knew $z$, optimization over $\theta$ would typically be tractable: for exponential-family models it reduces to sufficient-statistic matching with a closed-form solution.

Definition

Incomplete-Data Log-Likelihood

The incomplete-data log-likelihood (or marginal log-likelihood) is:

$$\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

This is what we actually want to maximize, but the sum (or integral) inside the log makes it hard.

The fundamental problem: we want to maximize $\log p(x \mid \theta)$, but we can only easily work with $\log p(x, z \mid \theta)$. EM bridges this gap.
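To make the marginalization concrete, here is a minimal Python sketch that evaluates the incomplete-data log-likelihood of a two-component 1D Gaussian mixture by summing the complete-data likelihood over $z$. The function names (`gaussian_pdf`, `incomplete_loglik`) are illustrative, not from any library:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def incomplete_loglik(x, pis, mus, sigmas):
    """log p(x | theta) = log sum_z p(x, z | theta) for a 1D mixture."""
    # Joint p(x, z=k | theta) = pi_k * N(x | mu_k, sigma_k^2); the sum
    # over z sits inside the log -- the source of the difficulty.
    joint = np.array([pi * gaussian_pdf(x, mu, s)
                      for pi, mu, s in zip(pis, mus, sigmas)])
    return np.log(joint.sum())

ll = incomplete_loglik(0.0, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigmas=[1.0, 1.0])
```

Note that the log applies only after the sum over components, which is exactly why no single closed-form update for the means falls out of differentiating this expression.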

The ELBO Derivation

This is the mathematical heart of EM. For any distribution $q(z)$ over the latent variables:

$$\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

$$= \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)}$$

$$\geq \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}$$

where the inequality is Jensen's inequality applied to the concave function $\log(\cdot)$.

Definition

Evidence Lower Bound (ELBO)

The ELBO (Evidence Lower BOund) is:

$$\mathcal{L}(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} = \mathbb{E}_{q}[\log p(x, z \mid \theta)] - \mathbb{E}_{q}[\log q(z)]$$

It is a lower bound on $\log p(x \mid \theta)$ for any $q$. Equivalently:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q}[\log p(x, z \mid \theta)] + H(q)$$

where $H(q) = -\sum_z q(z) \log q(z)$ is the entropy of $q$.

Theorem

ELBO Decomposition

Statement

For any distribution $q(z)$ and parameters $\theta$:

$$\log p(x \mid \theta) = \mathcal{L}(q, \theta) + D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta))$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence.

Intuition

The log-likelihood decomposes exactly into the ELBO plus the KL divergence between $q(z)$ and the true posterior $p(z \mid x, \theta)$. Since KL divergence is always $\geq 0$, the ELBO is always $\leq \log p(x \mid \theta)$. The bound is tight (equality holds) when $q(z) = p(z \mid x, \theta)$, that is, when $q$ equals the true posterior.

Proof Sketch

Start from the ELBO definition and expand:

$$\mathcal{L}(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} = \sum_z q(z) \log \frac{p(z \mid x, \theta)\, p(x \mid \theta)}{q(z)}$$

$$= \sum_z q(z) \log p(x \mid \theta) + \sum_z q(z) \log \frac{p(z \mid x, \theta)}{q(z)} = \log p(x \mid \theta) - D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta))$$

Rearranging gives the decomposition.
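The decomposition can be checked numerically on a toy discrete model. The joint probabilities below are made up purely for illustration:

```python
import numpy as np

# Toy discrete model at a fixed observation x: joint p(x, z) for z in {0, 1}.
p_xz = np.array([0.12, 0.28])
p_x = p_xz.sum()                 # marginal likelihood p(x)
posterior = p_xz / p_x           # true posterior p(z | x)

q = np.array([0.7, 0.3])         # an arbitrary distribution over z

elbo = np.sum(q * np.log(p_xz / q))
kl = np.sum(q * np.log(q / posterior))

# log p(x) = ELBO + KL holds exactly, for any q.
assert np.isclose(np.log(p_x), elbo + kl)

# With q equal to the true posterior, the KL term vanishes: the bound is tight.
elbo_tight = np.sum(posterior * np.log(p_xz / posterior))
assert np.isclose(elbo_tight, np.log(p_x))
```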

Why It Matters

This decomposition is the raison d'être of EM. It tells you that to make the ELBO as tight as possible (i.e., equal to the log-likelihood), you should set $q(z) = p(z \mid x, \theta)$. That is exactly what the E-step does. Then maximizing the ELBO over $\theta$ (the M-step) increases the log-likelihood.

Failure Mode

If the true posterior $p(z \mid x, \theta)$ is intractable to compute, you cannot run the E-step exactly. This is the setting where you need variational EM (restrict $q$ to a tractable family) or Monte Carlo EM (approximate the expectation by sampling).

The EM Algorithm

With the ELBO in hand, EM is simply coordinate ascent on $\mathcal{L}(q, \theta)$:

Initialize $\theta^{(0)}$.

E-step (iteration $t$): Set $q^{(t)}(z) = p(z \mid x, \theta^{(t)})$.

This makes the bound tight: $\mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$.

M-step (iteration $t$): Set $\theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t)}, \theta)$.

Since $q^{(t)}$ is now fixed, this is equivalent to:

$$\theta^{(t+1)} = \arg\max_{\theta} \mathbb{E}_{q^{(t)}}[\log p(x, z \mid \theta)]$$

because $H(q^{(t)})$ does not depend on $\theta$.

The quantity $Q(\theta, \theta^{(t)}) = \mathbb{E}_{p(z \mid x, \theta^{(t)})}[\log p(x, z \mid \theta)]$ is called the Q-function in the classical EM literature.
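As a concrete instance of the two steps, here is a minimal sketch of EM for a hypothetical mixture of two biased coins: each trial secretly picks coin $z \in \{0, 1\}$ with equal probability and flips it $m = 10$ times, and only the heads counts are observed. The data values, initialization, and function name `em_step` are illustrative assumptions, not from the text:

```python
import numpy as np

# Hypothetical observed data: heads out of m flips, one count per trial.
heads = np.array([5, 9, 8, 4, 7])
m = 10

def em_step(theta):
    """One EM iteration for the coin biases theta = (p0, p1)."""
    # E-step: responsibilities q(z_i = k) proportional to
    # p_k^h_i * (1 - p_k)^(m - h_i); the binomial coefficient cancels
    # in the normalization, so it is omitted.
    lik = np.array([[p ** h * (1 - p) ** (m - h) for p in theta]
                    for h in heads])
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximizing Q(theta, theta_old) in closed form gives a
    # responsibility-weighted fraction of heads for each coin.
    p0 = resp[:, 0] @ heads / (resp[:, 0].sum() * m)
    p1 = resp[:, 1] @ heads / (resp[:, 1].sum() * m)
    return p0, p1

theta = (0.6, 0.5)
for _ in range(25):
    theta = em_step(theta)
```

The E-step never hard-assigns a trial to a coin; it keeps the full posterior as soft weights, which the M-step then uses as fractional counts.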

Monotonic Convergence

Theorem

EM Monotonically Increases the Likelihood

Statement

At each iteration of EM:

$$\log p(x \mid \theta^{(t+1)}) \geq \log p(x \mid \theta^{(t)})$$

with equality if and only if $\theta^{(t)}$ is already a fixed point of EM.

Intuition

The E-step makes the ELBO equal to the log-likelihood. The M-step increases the ELBO (or keeps it the same). Since the ELBO was equal to the log-likelihood at $\theta^{(t)}$, and we increased it, the new log-likelihood at $\theta^{(t+1)}$ must be at least as large.

Proof Sketch

After the E-step: $\mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$.

After the M-step: $\mathcal{L}(q^{(t)}, \theta^{(t+1)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t)})$ by definition of maximization.

But $\mathcal{L}(q^{(t)}, \theta^{(t+1)}) \leq \log p(x \mid \theta^{(t+1)})$ because the ELBO is always a lower bound.

Chaining:

$$\log p(x \mid \theta^{(t+1)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t+1)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$$

Why It Matters

Monotonic convergence is a strong guarantee: EM never makes things worse. Combined with the fact that the log-likelihood is bounded above (for well-defined models), this ensures that EM converges to some stationary point. But it does not guarantee convergence to the global maximum.

Failure Mode

If the M-step only partially maximizes the Q-function (generalized EM), you still get monotonic convergence as long as $Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)})$. But convergence can be very slow, especially near saddle points.

Why Jensen's Inequality Is the Key

Jensen's inequality states that for a concave function $f$:

$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$

In the ELBO derivation, we apply this to $f = \log$, which is concave. The "random variable" is $p(x, z \mid \theta) / q(z)$, and the expectation is with respect to $q(z)$.

Without Jensen's inequality, we would have no lower bound, no ELBO, and no EM. The entire algorithm rests on the concavity of the logarithm.
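The inequality is easy to verify numerically for exactly the "random variable" used in the derivation. The toy joint probabilities below are assumed for illustration:

```python
import numpy as np

# The random variable from the derivation: X = p(x, z | theta) / q(z),
# with expectation taken under q. Toy numbers for a binary latent z.
p_xz = np.array([0.12, 0.28])   # joint p(x, z) at the observed x
q = np.array([0.7, 0.3])        # any distribution over z

lhs = np.log(np.sum(q * (p_xz / q)))   # log E_q[X] = log p(x)
rhs = np.sum(q * np.log(p_xz / q))     # E_q[log X] = the ELBO
assert lhs >= rhs                      # Jensen: log is concave
```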

Canonical Example: Gaussian Mixture Models

Example

EM for Gaussian Mixture Models

A GMM models data as arising from $K$ Gaussian components:

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ and the $\pi_k$ are mixing weights with $\sum_k \pi_k = 1$.

The latent variable $z_i \in \{1, \ldots, K\}$ indicates which component generated datapoint $x_i$.

E-step: Compute responsibilities (posterior over component assignments):

$$\gamma_{ik} = p(z_i = k \mid x_i, \theta^{(t)}) = \frac{\pi_k^{(t)} \, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} \, \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$$

M-step: Update parameters using the responsibilities as soft assignments:

$$N_k = \sum_{i=1}^n \gamma_{ik}$$

$$\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} \, x_i$$

$$\Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^\top$$

$$\pi_k^{(t+1)} = \frac{N_k}{n}$$

Each iteration is guaranteed to increase (or maintain) the log-likelihood. The algorithm converges to a local maximum of the GMM likelihood.
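The updates above translate almost line-for-line into NumPy. The following is a minimal sketch, not a production implementation: there is no convergence check, a small ridge term is added to the covariances for numerical stability, and the synthetic data and initialization scheme are assumptions for illustration. It also records the log-likelihood at each iteration so the monotonicity guarantee can be observed directly:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_em(X, K, n_iter=40, seed=0):
    """Minimal EM for a K-component GMM, following the updates above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]      # init means at data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    trace = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_joint = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
             for k in range(K)], axis=1)              # shape (n, K)
        log_px = logsumexp(log_joint, axis=1)         # log p(x_i | theta)
        gamma = np.exp(log_joint - log_px[:, None])
        trace.append(log_px.sum())
        # M-step: soft-count updates for pi, mu, Sigma.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                       + 1e-6 * np.eye(d)
        pi = Nk / n
    return pi, mu, Sigma, np.array(trace)

# Two well-separated synthetic 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
pi, mu, Sigma, trace = gmm_em(X, K=2)
assert np.all(np.diff(trace) >= -1e-8)   # log-likelihood never decreases
```

The final assertion is the monotonic convergence theorem made executable: the recorded log-likelihood sequence is non-decreasing (up to floating-point tolerance).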

HMMs and the Baum-Welch Algorithm

The Baum-Welch algorithm for training hidden Markov models is a special case of EM. The E-step uses the forward-backward algorithm to compute posterior marginals $p(z_t \mid x_{1:T}, \theta)$ and pairwise posteriors $p(z_t, z_{t+1} \mid x_{1:T}, \theta)$ over the hidden states. The M-step re-estimates transition probabilities and emission parameters using these posteriors as soft counts.

The structure of the HMM makes the E-step tractable despite an exponential number of latent sequences, because the forward-backward algorithm exploits the chain structure via dynamic programming.
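Here is a compact sketch of the scaled forward-backward recursions (the E-step of Baum-Welch) for a tiny discrete HMM, plus the soft-count M-step update for the transition matrix. The model parameters and observation sequence are made up for illustration:

```python
import numpy as np

def forward_backward(obs, A, B, pi0):
    """Scaled forward-backward: the E-step of Baum-Welch (sketch)."""
    T, S = len(obs), len(pi0)
    alpha = np.zeros((T, S))   # scaled forward messages
    beta = np.zeros((T, S))    # scaled backward messages
    c = np.zeros(T)            # per-step normalizers
    alpha[0] = pi0 * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                     # p(z_t | x_{1:T})
    xi = np.zeros((T - 1, S, S))             # p(z_t, z_{t+1} | x_{1:T})
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]
    return gamma, xi

# Made-up 2-state, 2-symbol HMM and one short observation sequence.
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probabilities
pi0 = np.array([0.5, 0.5])                 # initial state distribution
obs = np.array([0, 0, 1, 1, 0])

gamma, xi = forward_backward(obs, A, B, pi0)
# M-step for transitions: expected transition counts, row-normalized.
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
```

The scaling constants `c[t]` keep the messages in a numerically safe range; the dynamic programming is what makes the E-step linear in $T$ despite the exponentially many latent sequences.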

Common Confusions

Watch Out

EM does NOT find global optima

EM finds local optima (more precisely, stationary points) of the log-likelihood. The monotonic convergence theorem guarantees that the likelihood never decreases, but it says nothing about reaching the global maximum. For multi-modal likelihoods (which GMMs almost always have), the solution depends heavily on initialization. In practice, people run EM with multiple random restarts and take the solution with highest likelihood.

Watch Out

The E-step is expectation, not sampling

A common misconception is that the E-step involves sampling from the posterior $p(z \mid x, \theta)$. It does not. The E-step computes $\mathbb{E}_{p(z \mid x, \theta^{(t)})}[\log p(x, z \mid \theta)]$ as a function of $\theta$. For discrete latent variables, this means computing exact posterior probabilities. For continuous latent variables, it means computing posterior moments analytically. If you sample instead of computing exact expectations, you are doing Monte Carlo EM, which is a different (stochastic) algorithm with different convergence properties.

Watch Out

EM is not just for mixtures

While GMMs are the canonical example, EM applies to any model with latent variables: factor analysis, probabilistic PCA, hidden Markov models, latent Dirichlet allocation (with variational E-step), missing data imputation, and many more. The missing-data formulation is the general framework.

Summary

  • EM maximizes a lower bound (the ELBO) on the log-likelihood
  • The E-step sets $q(z) = p(z \mid x, \theta^{(t)})$, making the bound tight
  • The M-step maximizes $\mathbb{E}_q[\log p(x, z \mid \theta)]$ over $\theta$
  • Each iteration monotonically increases the log-likelihood
  • Convergence is to a local optimum, not necessarily the global optimum
  • Jensen's inequality on the concave $\log$ function is the foundational step
  • For exponential family models, the M-step often has closed-form solutions

Exercises

Exercise (Core)

Problem

Consider a two-component GMM in 1D with equal mixing weights $\pi_1 = \pi_2 = 0.5$, known variance $\sigma^2 = 1$, and unknown means $\mu_1, \mu_2$. You observe a single datapoint $x = 0$.

(a) Write the E-step: compute $\gamma_1 = p(z = 1 \mid x = 0, \mu_1^{(t)}, \mu_2^{(t)})$.

(b) Write the M-step update for $\mu_1$ and $\mu_2$ given $n$ datapoints.

(c) If $\mu_1^{(0)} = -1$ and $\mu_2^{(0)} = 1$, what are $\gamma_1$ and $\gamma_2$ for $x = 0$?

Exercise (Advanced)

Problem

Prove that monotonic convergence still holds if the M-step only partially maximizes the Q-function, that is, if $Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)})$ but $\theta^{(t+1)}$ is not necessarily the global maximizer. This is the Generalized EM (GEM) algorithm.

References

Canonical:

  • Dempster, Laird, Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977). The original paper
  • Bishop, Pattern Recognition and Machine Learning, Chapter 9
  • McLachlan & Krishnan, The EM Algorithm and Extensions (2008)

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 8

  • Neal & Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants" (1998)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)

Next Topics

The natural next steps from EM:

  • EM algorithm variants: Monte Carlo EM, variational EM, stochastic EM, and generalized EM for intractable E-steps and M-steps

Last reviewed: April 2026
