EM and Variants
The EM Algorithm
Expectation-Maximization: the principled way to do maximum likelihood when some variables are unobserved. Derives the ELBO, proves monotonic convergence, and shows why EM is the backbone of latent variable models.
Why This Matters
The EM algorithm is one of the most important algorithms in statistical machine learning. Whenever your model contains latent (unobserved) variables, you cannot simply write down the log-likelihood and differentiate: the latent variables couple the parameters in a way that makes direct optimization intractable. EM sidesteps this by alternating between inferring the latent variables and updating the parameters.
Gaussian mixture models, hidden Markov models, factor analysis, probabilistic PCA, topic models: all of these are classically fitted with EM. And the core idea, maximizing a lower bound on the log-likelihood, is exactly the principle behind variational autoencoders and modern variational inference.
If you understand EM deeply, you understand the conceptual engine behind a huge fraction of probabilistic machine learning.
Mental Model
You want to maximize the log-likelihood $\log p(x \mid \theta)$, but there are hidden variables $z$ that make this hard. Direct marginalization puts a sum inside a log, which is analytically and computationally painful.
EM's trick: instead of maximizing the log-likelihood directly, construct a lower bound (the ELBO) that is easy to compute and maximize. Iterate: tighten the bound (E-step), then maximize it (M-step). Each iteration is guaranteed to increase (or not decrease) the true log-likelihood.
The Missing-Data Formulation
Complete-Data Log-Likelihood
The complete-data log-likelihood is $\log p(x, z \mid \theta)$, where $x$ is the observed data and $z$ is the latent (missing) data. If we knew $z$, optimization over $\theta$ would typically be tractable: for exponential-family models it reduces to sufficient-statistic matching with a closed-form solution.
Incomplete-Data Log-Likelihood
The incomplete-data log-likelihood (or marginal log-likelihood) is:

$$\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$
This is what we actually want to maximize, but the sum (or integral) inside the log makes it hard.
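The "sum inside the log" is concrete in code. The sketch below, with made-up parameters for a two-component 1-D Gaussian mixture, evaluates the marginal log-likelihood with the log-sum-exp trick; note how the per-component terms never decouple:

```python
import numpy as np

# Illustrative only: a two-component 1-D Gaussian mixture with made-up
# parameters, showing why the sum sits *inside* the log.
x = np.array([-1.2, 0.4, 2.1])       # observed data
pis = np.array([0.5, 0.5])           # mixing weights
mus = np.array([-1.0, 2.0])          # component means
sigma = 1.0                          # shared, known std dev

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# log p(x_n | theta) = log sum_k pi_k N(x_n | mu_k, sigma^2),
# computed stably via log-sum-exp: the log of a sum does not split
# into a sum of logs, which is what blocks direct optimization.
log_joint = np.log(pis) + log_gauss(x[:, None], mus[None, :], sigma)   # (N, K)
m = log_joint.max(axis=1, keepdims=True)
log_marginal = m.ravel() + np.log(np.exp(log_joint - m).sum(axis=1))
total_loglik = log_marginal.sum()
```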
The fundamental problem: we want to maximize $\log p(x \mid \theta)$, but we can only easily work with $\log p(x, z \mid \theta)$. EM bridges this gap.
The ELBO Derivation
This is the mathematical heart of EM. For any distribution $q(z)$ over the latent variables:

$$\log p(x \mid \theta) = \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \;\ge\; \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}$$

where the inequality is Jensen's inequality applied to the concave function $\log$.
Evidence Lower Bound (ELBO)
The ELBO (Evidence Lower BOund) is:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)}\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]$$

It is a lower bound on $\log p(x \mid \theta)$ for any $q$. Equivalently:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)}[\log p(x, z \mid \theta)] + H(q)$$

where $H(q) = -\mathbb{E}_{q(z)}[\log q(z)]$ is the entropy of $q$.
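The lower-bound property is easy to verify numerically. A toy check with made-up numbers, for a latent variable $z$ taking two values: any valid $q$ gives an ELBO below $\log p(x)$, and the posterior makes it tight:

```python
import numpy as np

# Toy discrete-latent model with illustrative numbers.
prior = np.array([0.3, 0.7])     # p(z)
lik = np.array([0.9, 0.2])       # p(x | z) at the observed x
joint = prior * lik              # p(x, z)
log_px = np.log(joint.sum())     # log p(x)

def elbo(q):
    # ELBO(q) = E_q[log p(x, z)] + H(q)
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

q_arbitrary = np.array([0.5, 0.5])   # any valid q gives a lower bound
q_posterior = joint / joint.sum()    # the true posterior makes it tight
```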
ELBO Decomposition
Statement
For any distribution $q(z)$ and parameters $\theta$:

$$\log p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\left(q(z) \,\|\, p(z \mid x, \theta)\right)$$

where $\mathrm{KL}$ is the Kullback-Leibler divergence.
Intuition
The log-likelihood decomposes exactly into the ELBO plus the KL divergence between $q(z)$ and the true posterior $p(z \mid x, \theta)$. Since KL divergence is always $\ge 0$, the ELBO is always $\le \log p(x \mid \theta)$. The bound is tight (equality holds) when $\mathrm{KL}(q \,\|\, p) = 0$, that is, when $q$ equals the true posterior.
Proof Sketch
Start from the ELBO definition and expand, using $p(x, z \mid \theta) = p(z \mid x, \theta)\, p(x \mid \theta)$:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)}[\log p(x, z \mid \theta) - \log q(z)] = \mathbb{E}_{q(z)}[\log p(x \mid \theta)] + \mathbb{E}_{q(z)}\left[\log \frac{p(z \mid x, \theta)}{q(z)}\right] = \log p(x \mid \theta) - \mathrm{KL}\left(q(z) \,\|\, p(z \mid x, \theta)\right)$$
Rearranging gives the decomposition.
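The decomposition holds exactly, not just as a bound, which a quick numerical check makes vivid. The sketch below uses arbitrary random numbers for the joint and an arbitrary $q$:

```python
import numpy as np

# Numerical check of: log p(x) = ELBO(q) + KL(q || p(z|x)), for ANY q.
rng = np.random.default_rng(0)
K = 4
joint = rng.random(K)                 # p(x, z) at the observed x (made up)
log_px = np.log(joint.sum())          # log p(x)
posterior = joint / joint.sum()       # p(z | x)

q = rng.random(K)
q /= q.sum()                          # an arbitrary distribution over z

elbo = np.sum(q * (np.log(joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))
```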
Why It Matters
This decomposition is the raison d'être of EM. It tells you that to make the ELBO as tight as possible (i.e., equal to the log-likelihood), you should set $q(z) = p(z \mid x, \theta)$. That is exactly what the E-step does. Then maximizing the ELBO over $\theta$ (the M-step) increases the log-likelihood.
Failure Mode
If the true posterior $p(z \mid x, \theta)$ is intractable to compute, you cannot run the E-step exactly. This is the setting where you need variational EM (restrict $q$ to a tractable family) or Monte Carlo EM (approximate the expectation by sampling).
The EM Algorithm
With the ELBO in hand, EM is simply coordinate ascent on $\mathcal{L}(q, \theta)$:
Initialize $\theta^{(0)}$.
E-step (iteration $t$): Set $q^{(t)}(z) = p(z \mid x, \theta^{(t)})$.
This makes the bound tight: $\mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$.
M-step (iteration $t$): Set $\theta^{(t+1)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta)$.
Since $q^{(t)}$ is now fixed, this is equivalent to:

$$\theta^{(t+1)} = \arg\max_\theta \; \mathbb{E}_{q^{(t)}(z)}[\log p(x, z \mid \theta)]$$

because the entropy term $H(q^{(t)})$ does not depend on $\theta$.
The quantity $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p(z \mid x, \theta^{(t)})}[\log p(x, z \mid \theta)]$ is called the Q-function in the classical EM literature.
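The E/M alternation can be sketched end to end on a small example not taken from the text: a mixture of two biased coins. Each session a hidden coin is chosen uniformly and flipped $m$ times; only head counts are observed. The E-step is an exact posterior by Bayes' rule; the M-step is a responsibility-weighted MLE. All numbers below are illustrative:

```python
import numpy as np

# Hypothetical two-coin mixture: N sessions, hidden coin z_n chosen
# uniformly, flipped m times; only head counts are observed.
rng = np.random.default_rng(1)
m, N = 20, 200
true_p = np.array([0.35, 0.80])              # true biases (for simulation)
heads = rng.binomial(m, true_p[rng.integers(0, 2, size=N)])

p = np.array([0.40, 0.60])                   # initial guess for the biases
for _ in range(50):
    # E-step: q(z_n) = p(z_n | heads_n, p) via Bayes' rule (uniform prior)
    log_lik = (heads[:, None] * np.log(p)
               + (m - heads[:, None]) * np.log1p(-p))
    log_lik -= log_lik.max(axis=1, keepdims=True)
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)  # responsibilities
    # M-step: argmax of E_q[log p(heads, z | p)] = weighted MLE per coin
    p = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * m)
```

The M-step here has a closed form precisely because the complete-data likelihood is exponential-family (binomial).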
Monotonic Convergence
EM Monotonically Increases the Likelihood
Statement
At each iteration of EM:

$$\log p(x \mid \theta^{(t+1)}) \ge \log p(x \mid \theta^{(t)})$$

with equality if and only if $\theta^{(t)}$ is already a fixed point of EM.
Intuition
The E-step makes the ELBO equal to the log-likelihood. The M-step increases the ELBO (or keeps it the same). Since the ELBO was equal to the log-likelihood at $\theta^{(t)}$, and we increased it, the new log-likelihood at $\theta^{(t+1)}$ must be at least as large.
Proof Sketch
After the E-step: $\mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$.
After the M-step: $\mathcal{L}(q^{(t)}, \theta^{(t+1)}) \ge \mathcal{L}(q^{(t)}, \theta^{(t)})$ by definition of maximization.
But $\log p(x \mid \theta^{(t+1)}) \ge \mathcal{L}(q^{(t)}, \theta^{(t+1)})$ because the ELBO is always a lower bound.
Chaining:

$$\log p(x \mid \theta^{(t+1)}) \ge \mathcal{L}(q^{(t)}, \theta^{(t+1)}) \ge \mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$$
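Monotonicity can be observed directly by tracking the marginal log-likelihood across iterations. A minimal sketch on a 1-D two-component GMM with equal weights and unit variance, updating only the means (all numbers illustrative):

```python
import numpy as np

# Track log p(x | mus) across EM iterations; it should never decrease.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])

def loglik(mus):
    comp = np.exp(-0.5 * (x[:, None] - mus) ** 2) / np.sqrt(2 * np.pi)
    return np.log(0.5 * comp.sum(axis=1)).sum()

mus = np.array([-0.5, 0.5])
lls = [loglik(mus)]
for _ in range(30):
    w = np.exp(-0.5 * (x[:, None] - mus) ** 2)     # E-step (equal weights
    resp = w / w.sum(axis=1, keepdims=True)        # and variances cancel)
    mus = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)   # M-step
    lls.append(loglik(mus))
```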
Why It Matters
Monotonic convergence is a strong guarantee: EM never makes things worse. Combined with the fact that the log-likelihood is bounded above (for well-defined models), this ensures that EM converges to some stationary point. But it does not guarantee convergence to the global maximum.
Failure Mode
If the M-step only partially maximizes the Q-function (generalized EM), you still get monotonic convergence as long as $Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta^{(t)} \mid \theta^{(t)})$. But convergence can be very slow, especially near saddle points.
Why Jensen's Inequality Is the Key
Jensen's inequality states that for a concave function $f$ and any random variable $X$:

$$f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$$

In the ELBO derivation, we apply this to $f = \log$, which is concave. The "random variable" is the ratio $p(x, z \mid \theta)/q(z)$, and the expectation is with respect to $q(z)$.
Without Jensen's inequality, we would have no lower bound, no ELBO, and no EM. The entire algorithm rests on the concavity of the logarithm.
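The inequality is easy to check numerically for the log. The sketch below uses arbitrary made-up weights and values standing in for $q(z)$ and the ratio $p(x, z \mid \theta)/q(z)$:

```python
import numpy as np

# Jensen's inequality for the concave log: log E_q[R] >= E_q[log R].
rng = np.random.default_rng(3)
q = rng.random(5)
q /= q.sum()                       # a distribution q(z)
r = rng.uniform(0.1, 5.0, 5)       # stands in for p(x, z | theta) / q(z)

lhs = np.log(np.sum(q * r))        # log of the expectation
rhs = np.sum(q * np.log(r))        # expectation of the log
```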
Canonical Example: Gaussian Mixture Models
EM for Gaussian Mixture Models
A GMM models data as arising from $K$ Gaussian components:

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ and the $\pi_k \ge 0$ are mixing weights with $\sum_{k=1}^{K} \pi_k = 1$.
The latent variable $z_n \in \{1, \dots, K\}$ indicates which component generated datapoint $x_n$.
E-step: Compute responsibilities (posterior over component assignments):

$$\gamma_{nk} = p(z_n = k \mid x_n, \theta) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

M-step: Update parameters using the responsibilities as soft assignments:

$$N_k = \sum_{n=1}^{N} \gamma_{nk}, \qquad \pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top$$
Each iteration is guaranteed to increase (or maintain) the log-likelihood. The algorithm converges to a local maximum of the GMM likelihood.
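A minimal 1-D version of these updates can be written in a few lines of numpy. This is a sketch with illustrative data, not production code: a real implementation should compute responsibilities in log space and guard against collapsing variances:

```python
import numpy as np

# Minimal 1-D GMM fitted by EM; all data and initial values illustrative.
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-3.0, 0.7, 200), rng.normal(2.0, 1.2, 300)])
N, K = len(x), 2

pis = np.full(K, 1.0 / K)          # initial mixing weights
mus = np.array([-1.0, 1.0])        # initial means
vars_ = np.full(K, 1.0)            # initial variances

for _ in range(100):
    # E-step: gamma_{nk} proportional to pi_k * N(x_n | mu_k, var_k)
    dens = (pis * np.exp(-0.5 * (x[:, None] - mus) ** 2 / vars_)
            / np.sqrt(2 * np.pi * vars_))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: soft-count updates, exactly the formulas above
    Nk = resp.sum(axis=0)
    pis = Nk / N
    mus = (resp * x[:, None]).sum(axis=0) / Nk
    vars_ = (resp * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
```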
HMMs and the Baum-Welch Algorithm
The Baum-Welch algorithm for training hidden Markov models is a special case of EM. The E-step uses the forward-backward algorithm to compute posterior marginals and pairwise posteriors over the hidden states. The M-step re-estimates transition probabilities and emission parameters using these posteriors as soft counts.
The structure of the HMM makes the E-step tractable despite an exponential number of latent sequences, because the forward-backward algorithm exploits the chain structure via dynamic programming.
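The forward-backward E-step can be sketched directly. This is a standard scaled implementation (per-step rescaling rather than log space), with hypothetical variable names; real code should also handle numerical edge cases:

```python
import numpy as np

def forward_backward(pi0, A, B, obs):
    """Exact E-step of Baum-Welch for a discrete HMM (scaled version).
    pi0: (S,) initial distribution; A: (S, S) transitions A[i, j] = p(j | i);
    B: (S, V) emission probabilities; obs: (T,) observed symbol indices.
    Returns posterior marginals gamma (T, S) and log p(obs)."""
    T, S = len(obs), len(pi0)
    alpha = np.zeros((T, S))
    c = np.zeros(T)                          # per-step scaling factors
    alpha[0] = pi0 * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):                    # forward pass (dynamic prog.)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta = np.zeros((T, S))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                     # posterior marginals p(z_t | obs)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma, np.log(c).sum()            # log p(obs) = sum of log scales
```

The chain structure keeps the cost at $O(T S^2)$ even though the number of latent state sequences is $S^T$.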
Common Confusions
EM does NOT find global optima
EM finds local optima (more precisely, stationary points) of the log-likelihood. The monotonic convergence theorem guarantees that the likelihood never decreases, but it says nothing about reaching the global maximum. For multi-modal likelihoods (which GMMs almost always have), the solution depends heavily on initialization. In practice, people run EM with multiple random restarts and take the solution with highest likelihood.
The E-step is expectation, not sampling
A common misconception is that the E-step involves sampling from the posterior $p(z \mid x, \theta^{(t)})$. It does not. The E-step computes $q^{(t)}(z) = p(z \mid x, \theta^{(t)})$ as a function of $z$. For discrete latent variables, this means computing exact posterior probabilities. For continuous latent variables, it means computing posterior moments analytically. If you sample instead of computing exact expectations, you are doing Monte Carlo EM, which is a different (stochastic) algorithm with different convergence properties.
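The contrast is easy to see in code. With illustrative GMM parameters, the exact E-step is a closed-form posterior; a Monte Carlo "E-step" draws samples and only approximates that same quantity, with noise:

```python
import numpy as np

# Exact E-step vs. Monte Carlo approximation (illustrative parameters).
rng = np.random.default_rng(5)
pis = np.array([0.5, 0.5])
mus = np.array([-1.0, 1.0])
sigma = 1.0
x = 0.3                                   # a single observed datapoint

# Exact E-step: closed-form posterior over the component assignment.
dens = pis * np.exp(-0.5 * (x - mus) ** 2 / sigma**2)
gamma_exact = dens / dens.sum()           # p(z | x, theta), no sampling

# Monte Carlo "E-step": sample z and use empirical frequencies instead.
z = rng.choice(2, size=100_000, p=gamma_exact)
gamma_mc = np.bincount(z, minlength=2) / len(z)
```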
EM is not just for mixtures
While GMMs are the canonical example, EM applies to any model with latent variables: factor analysis, probabilistic PCA, hidden Markov models, latent Dirichlet allocation (with variational E-step), missing data imputation, and many more. The missing-data formulation is the general framework.
Summary
- EM maximizes a lower bound (the ELBO) on the log-likelihood
- The E-step sets $q(z) = p(z \mid x, \theta)$, making the bound tight
- The M-step maximizes $\mathbb{E}_{q(z)}[\log p(x, z \mid \theta)]$ over $\theta$
- Each iteration monotonically increases the log-likelihood
- Convergence is to a local optimum, not necessarily the global optimum
- Jensen's inequality applied to the concave function $\log$ is the foundational step
- For exponential family models, the M-step often has closed-form solutions
Exercises
Problem
Consider a two-component GMM in 1D with equal mixing weights $\pi_1 = \pi_2 = \frac{1}{2}$, known shared variance $\sigma^2$, and unknown means $\mu_1, \mu_2$. You observe a single datapoint $x$.
(a) Write the E-step: compute the responsibilities $\gamma_1 = p(z = 1 \mid x)$ and $\gamma_2 = p(z = 2 \mid x)$.
(b) Write the M-step update for $\mu_1$ and $\mu_2$ given $N$ datapoints.
(c) For a concrete choice of $x$ and initial means $\mu_1^{(0)}, \mu_2^{(0)}$, compute $\gamma_1$ and $\gamma_2$ numerically.
Problem
Prove that if the M-step only partially maximizes the Q-function, that is, $Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta^{(t)} \mid \theta^{(t)})$ but $\theta^{(t+1)}$ is not necessarily the global maximizer, the monotonic convergence property still holds. This is the Generalized EM (GEM) algorithm.
References
Canonical:
- Dempster, Laird, Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977). The original paper
- Bishop, Pattern Recognition and Machine Learning, Chapter 9
- McLachlan & Krishnan, The EM Algorithm and Extensions (2008)
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 8
- Neal & Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants" (1998)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)
Next Topics
The natural next steps from EM:
- EM algorithm variants: Monte Carlo EM, variational EM, stochastic EM, and generalized EM for intractable E-steps and M-steps
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Differentiation in R^n (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- EM Algorithm Variants (Layer 3)
- Gaussian Mixture Models and EM (Layer 2)