Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

EM and Variants

The EM Algorithm

Expectation-Maximization: the principled way to do maximum likelihood when some variables are unobserved. Derives the ELBO, proves monotonic convergence, and shows why EM is the backbone of latent variable models.

Core · Tier 1 · Stable · ~80 min

Why This Matters

Figure: log-likelihood vs. EM iteration — each EM update increases the log-likelihood until convergence. Key: EM never decreases the log-likelihood. E-step: compute posterior over latent variables; M-step: maximize expected complete-data log-likelihood.

The EM algorithm is one of the most important algorithms in statistical machine learning. Whenever your model contains latent (unobserved) variables, you cannot simply write down the log-likelihood and differentiate: the latent variables couple the parameters in a way that makes direct optimization intractable. EM sidesteps this by alternating between inferring the latent variables and updating the parameters.

Gaussian mixture models, hidden Markov models, factor analysis, probabilistic PCA, topic models: all of these are classically fitted with EM. And the core idea, maximizing a lower bound on the log-likelihood, is exactly the principle behind variational autoencoders and modern variational inference.

If you understand EM deeply, you understand the conceptual engine behind a huge fraction of probabilistic machine learning.

Mental Model

You want to maximize the log-likelihood $\log p(x \mid \theta)$, but there are hidden variables $z$ that make this hard. Direct marginalization $\log \sum_z p(x, z \mid \theta)$ puts a sum inside a log, which is analytically and computationally painful.

EM's trick: instead of maximizing the log-likelihood directly, construct a lower bound (the ELBO) that is easy to compute and maximize. Iterate: tighten the bound (E-step), then maximize it (M-step). Each iteration is guaranteed to increase (or not decrease) the true log-likelihood.

The Missing-Data Formulation

Definition

Complete-Data Log-Likelihood

The complete-data log-likelihood is $\log p(x, z \mid \theta)$, where $x$ is the observed data and $z$ is the latent (missing) data. If we knew $z$, optimization over $\theta$ would typically be tractable: for exponential-family models it reduces to sufficient-statistic matching with a closed-form solution.

Definition

Incomplete-Data Log-Likelihood

The incomplete-data log-likelihood (or marginal log-likelihood) is:

$$\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

This is what we actually want to maximize, but the sum (or integral) inside the log makes it hard.

The fundamental problem: we want to maximize $\log p(x \mid \theta)$, but we can only easily work with $\log p(x, z \mid \theta)$. EM bridges this gap.
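To make the marginalization concrete, here is a minimal Python sketch that evaluates the incomplete-data log-likelihood of a two-component 1D Gaussian mixture by summing the complete-data likelihood over $z$. The function names (`gaussian_pdf`, `incomplete_loglik`) are illustrative, not from any library:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def incomplete_loglik(x, pis, mus, sigmas):
    """log p(x | theta) = log sum_z p(x, z | theta) for a 1D mixture."""
    # Joint p(x, z=k | theta) = pi_k * N(x | mu_k, sigma_k^2); the sum
    # over z sits inside the log -- the source of the difficulty.
    joint = np.array([pi * gaussian_pdf(x, mu, s)
                      for pi, mu, s in zip(pis, mus, sigmas)])
    return np.log(joint.sum())

ll = incomplete_loglik(0.0, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigmas=[1.0, 1.0])
```

Note that the log applies only after the sum over components, which is exactly why no single closed-form update for the means falls out of differentiating this expression.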

The ELBO Derivation

This is the mathematical heart of EM. For any distribution $q(z)$ over the latent variables:

$$\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

$$= \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)}$$

$$\geq \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}$$

where the inequality is Jensen's inequality applied to the concave function $\log(\cdot)$.

Definition

Evidence Lower Bound (ELBO)

The ELBO (Evidence Lower BOund) is:

$$\mathcal{L}(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} = \mathbb{E}_{q}[\log p(x, z \mid \theta)] - \mathbb{E}_{q}[\log q(z)]$$

It is a lower bound on $\log p(x \mid \theta)$ for any $q$. Equivalently:

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q}[\log p(x, z \mid \theta)] + H(q)$$

where $H(q) = -\sum_z q(z) \log q(z)$ is the entropy of $q$.

Theorem

ELBO Decomposition

Statement

For any distribution $q(z)$ and parameters $\theta$:

$$\log p(x \mid \theta) = \mathcal{L}(q, \theta) + D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta))$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence.

Intuition

The log-likelihood decomposes exactly into the ELBO plus the KL divergence between $q(z)$ and the true posterior $p(z \mid x, \theta)$. Since KL divergence is always $\geq 0$, the ELBO is always $\leq \log p(x \mid \theta)$. The bound is tight (equality holds) when $q(z) = p(z \mid x, \theta)$, that is, when $q$ equals the true posterior.

Proof Sketch

Start from the ELBO definition and expand:

$$\mathcal{L}(q, \theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} = \sum_z q(z) \log \frac{p(z \mid x, \theta)\, p(x \mid \theta)}{q(z)}$$

$$= \sum_z q(z) \log p(x \mid \theta) + \sum_z q(z) \log \frac{p(z \mid x, \theta)}{q(z)} = \log p(x \mid \theta) - D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x, \theta))$$

Rearranging gives the decomposition.
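The decomposition can be checked numerically on a toy discrete model. The joint probabilities below are made up purely for illustration:

```python
import numpy as np

# Toy discrete model at a fixed observation x: joint p(x, z) for z in {0, 1}.
p_xz = np.array([0.12, 0.28])
p_x = p_xz.sum()                 # marginal likelihood p(x)
posterior = p_xz / p_x           # true posterior p(z | x)

q = np.array([0.7, 0.3])         # an arbitrary distribution over z

elbo = np.sum(q * np.log(p_xz / q))
kl = np.sum(q * np.log(q / posterior))

# log p(x) = ELBO + KL holds exactly, for any q.
assert np.isclose(np.log(p_x), elbo + kl)

# With q equal to the true posterior, the KL term vanishes: the bound is tight.
elbo_tight = np.sum(posterior * np.log(p_xz / posterior))
assert np.isclose(elbo_tight, np.log(p_x))
```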

Why It Matters

This decomposition is the raison d'être of EM. It tells you that to make the ELBO as tight as possible (i.e., equal to the log-likelihood), you should set $q(z) = p(z \mid x, \theta)$. That is exactly what the E-step does. Then maximizing the ELBO over $\theta$ (the M-step) increases the log-likelihood.

Failure Mode

If the true posterior $p(z \mid x, \theta)$ is intractable to compute, you cannot run the E-step exactly. This is the setting where you need variational EM (restrict $q$ to a tractable family) or Monte Carlo EM (approximate the expectation by sampling).

The EM Algorithm

With the ELBO in hand, EM is simply coordinate ascent on $\mathcal{L}(q, \theta)$:

Initialize $\theta^{(0)}$.

E-step (iteration $t$): Set $q^{(t)}(z) = p(z \mid x, \theta^{(t)})$.

This makes the bound tight: $\mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$.

M-step (iteration $t$): Set $\theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t)}, \theta)$.

Since $q^{(t)}$ is now fixed, this is equivalent to:

$$\theta^{(t+1)} = \arg\max_{\theta} \mathbb{E}_{q^{(t)}}[\log p(x, z \mid \theta)]$$

because $H(q^{(t)})$ does not depend on $\theta$.

The quantity $Q(\theta, \theta^{(t)}) = \mathbb{E}_{p(z \mid x, \theta^{(t)})}[\log p(x, z \mid \theta)]$ is called the Q-function in the classical EM literature.
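As a concrete instance of the two steps, here is a minimal sketch of EM for a hypothetical mixture of two biased coins: each trial secretly picks coin $z \in \{0, 1\}$ with equal probability and flips it $m = 10$ times, and only the heads counts are observed. The data values, initialization, and function name `em_step` are illustrative assumptions, not from the text:

```python
import numpy as np

# Hypothetical observed data: heads out of m flips, one count per trial.
heads = np.array([5, 9, 8, 4, 7])
m = 10

def em_step(theta):
    """One EM iteration for the coin biases theta = (p0, p1)."""
    # E-step: responsibilities q(z_i = k) proportional to
    # p_k^h_i * (1 - p_k)^(m - h_i); the binomial coefficient cancels
    # in the normalization, so it is omitted.
    lik = np.array([[p ** h * (1 - p) ** (m - h) for p in theta]
                    for h in heads])
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximizing Q(theta, theta_old) in closed form gives a
    # responsibility-weighted fraction of heads for each coin.
    p0 = resp[:, 0] @ heads / (resp[:, 0].sum() * m)
    p1 = resp[:, 1] @ heads / (resp[:, 1].sum() * m)
    return p0, p1

theta = (0.6, 0.5)
for _ in range(25):
    theta = em_step(theta)
```

The E-step never hard-assigns a trial to a coin; it keeps the full posterior as soft weights, which the M-step then uses as fractional counts.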

Monotonic Convergence

Theorem

EM Monotonically Increases the Likelihood

Statement

At each iteration of EM:

$$\log p(x \mid \theta^{(t+1)}) \geq \log p(x \mid \theta^{(t)})$$

with equality if and only if $\theta^{(t)}$ is already a fixed point of EM.

Intuition

The E-step makes the ELBO equal to the log-likelihood. The M-step increases the ELBO (or keeps it the same). Since the ELBO was equal to the log-likelihood at $\theta^{(t)}$, and we increased it, the new log-likelihood at $\theta^{(t+1)}$ must be at least as large.

Proof Sketch

After the E-step: $\mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$.

After the M-step: $\mathcal{L}(q^{(t)}, \theta^{(t+1)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t)})$ by definition of maximization.

But $\mathcal{L}(q^{(t)}, \theta^{(t+1)}) \leq \log p(x \mid \theta^{(t+1)})$ because the ELBO is always a lower bound.

Chaining:

$$\log p(x \mid \theta^{(t+1)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t+1)}) \geq \mathcal{L}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$$

Why It Matters

Monotonic convergence is a strong guarantee: EM never makes things worse. Combined with the fact that the log-likelihood is bounded above (for well-defined models), this ensures that EM converges to some stationary point. But it does not guarantee convergence to the global maximum.

Failure Mode

If the M-step only partially maximizes the Q-function (generalized EM), you still get monotonic convergence as long as $Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)})$. But convergence can be very slow, especially near saddle points.

Why Jensen's Inequality Is the Key

Jensen's inequality states that for a concave function $f$:

$$f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$$

In the ELBO derivation, we apply this to $f = \log$, which is concave. The "random variable" is $p(x, z \mid \theta) / q(z)$, and the expectation is with respect to $q(z)$.

Without Jensen's inequality, we would have no lower bound, no ELBO, and no EM. The entire algorithm rests on the concavity of the logarithm.
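The inequality is easy to verify numerically for exactly the "random variable" used in the derivation. The toy joint probabilities below are assumed for illustration:

```python
import numpy as np

# The random variable from the derivation: X = p(x, z | theta) / q(z),
# with expectation taken under q. Toy numbers for a binary latent z.
p_xz = np.array([0.12, 0.28])   # joint p(x, z) at the observed x
q = np.array([0.7, 0.3])        # any distribution over z

lhs = np.log(np.sum(q * (p_xz / q)))   # log E_q[X] = log p(x)
rhs = np.sum(q * np.log(p_xz / q))     # E_q[log X] = the ELBO
assert lhs >= rhs                      # Jensen: log is concave
```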

Canonical Example: Gaussian Mixture Models

Example

EM for Gaussian Mixture Models

A GMM models data as arising from $K$ Gaussian components:

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ and the $\pi_k$ are mixing weights with $\sum_k \pi_k = 1$.

The latent variable $z_i \in \{1, \ldots, K\}$ indicates which component generated datapoint $x_i$.

E-step: Compute responsibilities (posterior over component assignments):

$$\gamma_{ik} = p(z_i = k \mid x_i, \theta^{(t)}) = \frac{\pi_k^{(t)} \, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} \, \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$$

M-step: Update parameters using the responsibilities as soft assignments:

$$N_k = \sum_{i=1}^n \gamma_{ik}$$

$$\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} \, x_i$$

$$\Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^\top$$

$$\pi_k^{(t+1)} = \frac{N_k}{n}$$

Each iteration is guaranteed to increase (or maintain) the log-likelihood. The algorithm converges to a local maximum of the GMM likelihood.
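The updates above translate almost line-for-line into NumPy. The following is a minimal sketch, not a production implementation: there is no convergence check, a small ridge term is added to the covariances for numerical stability, and the synthetic data and initialization scheme are assumptions for illustration. It also records the log-likelihood at each iteration so the monotonicity guarantee can be observed directly:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_em(X, K, n_iter=40, seed=0):
    """Minimal EM for a K-component GMM, following the updates above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]      # init means at data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    trace = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_joint = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
             for k in range(K)], axis=1)              # shape (n, K)
        log_px = logsumexp(log_joint, axis=1)         # log p(x_i | theta)
        gamma = np.exp(log_joint - log_px[:, None])
        trace.append(log_px.sum())
        # M-step: soft-count updates for pi, mu, Sigma.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                       + 1e-6 * np.eye(d)
        pi = Nk / n
    return pi, mu, Sigma, np.array(trace)

# Two well-separated synthetic 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
pi, mu, Sigma, trace = gmm_em(X, K=2)
assert np.all(np.diff(trace) >= -1e-8)   # log-likelihood never decreases
```

The final assertion is the monotonic convergence theorem made executable: the recorded log-likelihood sequence is non-decreasing (up to floating-point tolerance).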

HMMs and the Baum-Welch Algorithm

The Baum-Welch algorithm for training hidden Markov models is a special case of EM. The E-step uses the forward-backward algorithm to compute posterior marginals $p(z_t \mid x_{1:T}, \theta)$ and pairwise posteriors $p(z_t, z_{t+1} \mid x_{1:T}, \theta)$ over the hidden states. The M-step re-estimates transition probabilities and emission parameters using these posteriors as soft counts.

The structure of the HMM makes the E-step tractable despite an exponential number of latent sequences, because the forward-backward algorithm exploits the chain structure via dynamic programming.
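Here is a compact sketch of the scaled forward-backward recursions (the E-step of Baum-Welch) for a tiny discrete HMM, plus the soft-count M-step update for the transition matrix. The model parameters and observation sequence are made up for illustration:

```python
import numpy as np

def forward_backward(obs, A, B, pi0):
    """Scaled forward-backward: the E-step of Baum-Welch (sketch)."""
    T, S = len(obs), len(pi0)
    alpha = np.zeros((T, S))   # scaled forward messages
    beta = np.zeros((T, S))    # scaled backward messages
    c = np.zeros(T)            # per-step normalizers
    alpha[0] = pi0 * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                     # p(z_t | x_{1:T})
    xi = np.zeros((T - 1, S, S))             # p(z_t, z_{t+1} | x_{1:T})
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]
    return gamma, xi

# Made-up 2-state, 2-symbol HMM and one short observation sequence.
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probabilities
pi0 = np.array([0.5, 0.5])                 # initial state distribution
obs = np.array([0, 0, 1, 1, 0])

gamma, xi = forward_backward(obs, A, B, pi0)
# M-step for transitions: expected transition counts, row-normalized.
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
```

The scaling constants `c[t]` keep the messages in a numerically safe range; the dynamic programming is what makes the E-step linear in $T$ despite the exponentially many latent sequences.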

Common Confusions

Watch Out

EM does NOT find global optima

EM finds local optima (more precisely, stationary points) of the log-likelihood. The monotonic convergence theorem guarantees that the likelihood never decreases, but it says nothing about reaching the global maximum. For multi-modal likelihoods (which GMMs almost always have), the solution depends heavily on initialization. In practice, people run EM with multiple random restarts and take the solution with highest likelihood.

Watch Out

The E-step is expectation, not sampling

A common misconception is that the E-step involves sampling from the posterior $p(z \mid x, \theta)$. It does not. The E-step computes $\mathbb{E}_{p(z \mid x, \theta^{(t)})}[\log p(x, z \mid \theta)]$ as a function of $\theta$. For discrete latent variables, this means computing exact posterior probabilities. For continuous latent variables, it means computing posterior moments analytically. If you sample instead of computing exact expectations, you are doing Monte Carlo EM, which is a different (stochastic) algorithm with different convergence properties.

Watch Out

EM is not just for mixtures

While GMMs are the canonical example, EM applies to any model with latent variables: factor analysis, probabilistic PCA, hidden Markov models, latent Dirichlet allocation (with variational E-step), missing data imputation, and many more. The missing-data formulation is the general framework.

Summary

  • EM maximizes a lower bound (the ELBO) on the log-likelihood
  • The E-step sets $q(z) = p(z \mid x, \theta^{(t)})$, making the bound tight
  • The M-step maximizes $\mathbb{E}_q[\log p(x, z \mid \theta)]$ over $\theta$
  • Each iteration monotonically increases the log-likelihood
  • Convergence is to a local optimum, not necessarily the global optimum
  • Jensen's inequality on the concave $\log$ function is the foundational step
  • For exponential family models, the M-step often has closed-form solutions

Exercises

Exercise (Core)

Problem

Consider a two-component GMM in 1D with equal mixing weights $\pi_1 = \pi_2 = 0.5$, known variance $\sigma^2 = 1$, and unknown means $\mu_1, \mu_2$. You observe a single datapoint $x = 0$.

(a) Write the E-step: compute $\gamma_1 = p(z = 1 \mid x = 0, \mu_1^{(t)}, \mu_2^{(t)})$.

(b) Write the M-step update for $\mu_1$ and $\mu_2$ given $n$ datapoints.

(c) If $\mu_1^{(0)} = -1$ and $\mu_2^{(0)} = 1$, what are $\gamma_1$ and $\gamma_2$ for $x = 0$?

Exercise (Advanced)

Problem

Prove that monotonic convergence still holds if the M-step only partially maximizes the Q-function, that is, if $Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)})$ but $\theta^{(t+1)}$ is not necessarily the global maximizer. This is the Generalized EM (GEM) algorithm.

References

Canonical:

  • Dempster, Laird, Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977). The original paper
  • Bishop, Pattern Recognition and Machine Learning, Chapter 9
  • McLachlan & Krishnan, The EM Algorithm and Extensions (2008)

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 8

  • Neal & Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants" (1998)

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)

Next Topics

The natural next steps from EM:

  • EM algorithm variants: Monte Carlo EM, variational EM, stochastic EM, and generalized EM for intractable E-steps and M-steps

Last reviewed: April 2026
