
ML Methods

Gaussian Mixture Models and EM

GMMs as soft clustering with per-component Gaussians: EM derivation (E-step responsibilities, M-step parameter updates), convergence guarantees, model selection with BIC/AIC, and the connection to k-means as the hard-assignment limit.

Core · Tier 2 · Stable · ~60 min

Why This Matters

Gaussian mixture models are the canonical example of latent variable models and the primary application of the EM algorithm. When you learn EM, you learn it through GMMs. When you learn variational inference, the GMM is the first test case.

GMMs are also practically important. They provide soft clustering (each point has a probability of belonging to each cluster, not a hard assignment), they can model non-convex cluster shapes through superposition, and they give a principled probabilistic framework for density estimation. Every Bayesian who works with mixtures uses GMMs as the starting point.

The connection to k-means -- which is GMMs in the limit of zero covariance -- unifies the two most important clustering methods.

Mental Model

K-means assigns each point to exactly one cluster (hard assignment). GMMs soften this: each point has a responsibility -- the probability it belongs to each cluster. The E-step computes these responsibilities. The M-step updates the cluster parameters (means, covariances, weights) using the responsibilities as soft counts. This is EM.

The key insight: the GMM log-likelihood has a sum inside the log, $\log \sum_k \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, which is intractable to optimize directly. EM avoids this by alternating between inference (E-step) and optimization (M-step), each of which is tractable.

The GMM Model

Definition

Gaussian Mixture Model

A Gaussian mixture model with $K$ components defines the density:

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where:

  • $\pi_k \geq 0$ are the mixing weights with $\sum_{k=1}^K \pi_k = 1$
  • $\mu_k \in \mathbb{R}^d$ is the mean of component $k$
  • $\Sigma_k \in \mathbb{R}^{d \times d}$ is the covariance matrix of component $k$ (symmetric positive definite)
  • $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ is the full parameter set

Each data point $x_i$ is generated by:

  1. Sample the component $z_i \sim \text{Categorical}(\pi_1, \ldots, \pi_K)$
  2. Sample $x_i \sim \mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$

The latent variable $z_i \in \{1, \ldots, K\}$ indicates which component generated $x_i$. We observe $x_i$ but not $z_i$.
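This two-step generative process is easy to simulate. A minimal numpy sketch, using made-up parameters for a 2-component GMM in 2D (the weights, means, and covariances here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-component GMM in 2D (made up for this sketch)
pis = np.array([0.3, 0.7])                             # mixing weights, sum to 1
mus = np.array([[0.0, 0.0], [5.0, 5.0]])               # component means
Sigmas = np.array([np.eye(2) * 0.5, np.eye(2) * 2.0])  # component covariances

def sample_gmm(n):
    """Generate n points: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pis), size=n, p=pis)            # latent component labels
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

x, z = sample_gmm(1000)
print(np.bincount(z) / len(z))   # empirical label frequencies, close to [0.3, 0.7]
```

Because we simulate, we see the latent labels $z_i$; a real dataset would give us only the $x_i$, which is exactly what makes inference necessary.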

The Log-Likelihood Problem

Given $n$ i.i.d. observations $x_1, \ldots, x_n$, the log-likelihood is:

$$\ell(\theta) = \sum_{i=1}^n \log \left(\sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right)$$

This is hard to optimize directly because:

  1. The sum is inside the log: you cannot separate the contributions of different components
  2. Non-convexity: the log-likelihood is non-convex in $\theta$, with multiple local maxima (permuting the component labels alone gives the same likelihood)
  3. Singularities: if $\mu_k = x_i$ for some $i$ and $\Sigma_k \to 0$, the likelihood goes to infinity (a degenerate component that collapses onto a single point)

EM addresses the first issue by alternating between tractable subproblems. Non-convexity remains: EM converges only to a stationary point, which is why multiple restarts are standard practice. The third issue requires regularization (e.g., adding a small $\epsilon I$ to each covariance).
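Evaluating $\ell(\theta)$ is straightforward even though optimizing it is not. A 1D sketch (scalar variances assumed), using the log-sum-exp trick so small component densities do not underflow:

```python
import numpy as np

def gmm_log_likelihood(x, pis, mus, sigmas):
    """Log-likelihood of 1D data under a GMM. The inner sum over components
    is computed with the log-sum-exp trick for numerical stability."""
    x = np.asarray(x, dtype=float)[:, None]                 # shape (n, 1)
    # log pi_k + log N(x_i | mu_k, sigma_k^2), shape (n, K)
    log_comp = (np.log(pis)
                - 0.5 * np.log(2 * np.pi * sigmas ** 2)
                - (x - mus) ** 2 / (2 * sigmas ** 2))
    m = log_comp.max(axis=1, keepdims=True)                 # per-point max
    return float((m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum())

ll = gmm_log_likelihood([1, 2, 3, 7, 8, 9],
                        pis=np.array([0.5, 0.5]),
                        mus=np.array([2.0, 8.0]),
                        sigmas=np.array([1.0, 1.0]))
print(ll)
```

The sum over $k$ stays inside the log, which is why no closed-form maximizer exists; but as a function evaluation it is a few vectorized lines.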

EM for GMMs

The EM algorithm specializes to GMMs as follows.

Initialize $\theta^{(0)} = \{\pi_k^{(0)}, \mu_k^{(0)}, \Sigma_k^{(0)}\}_{k=1}^K$. Common strategies: k-means initialization (run k-means, use the resulting centroids and within-cluster covariances) or random initialization with multiple restarts.

E-Step: Compute Responsibilities

Definition

Responsibility

The responsibility of component $k$ for data point $x_i$ is the posterior probability that $x_i$ was generated by component $k$:

$$\gamma_{ik} = p(z_i = k \mid x_i, \theta^{(t)}) = \frac{\pi_k^{(t)} \, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} \, \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$$

The responsibilities satisfy $\gamma_{ik} \in [0, 1]$ and $\sum_{k=1}^K \gamma_{ik} = 1$ for each $i$.

The E-step is simply evaluating Bayes' rule. The responsibilities are soft assignments: $\gamma_{ik} = 0.7$ means data point $i$ belongs to component $k$ with probability 0.7.
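The Bayes'-rule computation is one normalization. A minimal 1D E-step sketch, assuming scalar variances; each row of `gamma` is normalized to sum to 1:

```python
import numpy as np

def e_step(x, pis, mus, sigmas):
    """Responsibilities gamma[i, k] for a 1D GMM via Bayes' rule."""
    x = np.asarray(x, dtype=float)[:, None]
    dens = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    weighted = pis * dens                      # numerator: pi_k * N(x_i | mu_k, sigma_k^2)
    return weighted / weighted.sum(axis=1, keepdims=True)

gamma = e_step([1, 2, 3, 7, 8, 9],
               pis=np.array([0.5, 0.5]),
               mus=np.array([2.0, 7.0]),
               sigmas=np.array([1.0, 1.0]))
print(gamma[2])   # x = 3: responsibility mass sits overwhelmingly on component 1
```

In production code the densities would be computed in log space (as in the log-likelihood sketch above) to avoid underflow; the direct form is kept here for readability.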

M-Step: Update Parameters

Given the responsibilities, the M-step updates are closed-form. Define the effective number of points assigned to component $k$:

$$N_k = \sum_{i=1}^n \gamma_{ik}$$

Then the parameter updates are:

Mixing weights: $\pi_k^{(t+1)} = \frac{N_k}{n}$

Means: $\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} \, x_i$

Covariances: $\Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^\top$

These are exactly the weighted maximum likelihood estimates, using the responsibilities as weights.
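The same updates in code, sketched for the 1D case (scalar variances), with the responsibilities passed in from an E-step:

```python
import numpy as np

def m_step(x, gamma):
    """Closed-form M-step for a 1D GMM: weighted maximum likelihood
    estimates using the responsibilities gamma[i, k] as soft counts."""
    x = np.asarray(x, dtype=float)
    Nk = gamma.sum(axis=0)                                      # effective counts N_k
    pis = Nk / len(x)                                           # mixing weights
    mus = (gamma * x[:, None]).sum(axis=0) / Nk                 # weighted means
    vars_ = (gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk  # weighted variances
    return pis, mus, vars_

# With hard (0/1) responsibilities, the updates reduce to per-cluster statistics
gamma = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
pis, mus, vars_ = m_step([1, 2, 3, 7, 8, 9], gamma)
print(pis, mus, vars_)   # weights 0.5 each, means 2 and 8, variances about 0.67
```

The hard-responsibility check illustrates the "soft counts" reading: when every $\gamma_{ik}$ is 0 or 1, the weighted estimates are exactly the ordinary per-cluster sample statistics.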

Convergence

Proposition

EM for GMMs Monotonically Increases the Log-Likelihood

Statement

At each iteration of EM for GMMs:

$$\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})$$

The log-likelihood is non-decreasing. Since it is bounded above (for non-degenerate models), the sequence $\ell(\theta^{(t)})$ converges.

Intuition

This is a special case of the general EM convergence theorem. The E-step makes the ELBO equal to the log-likelihood. The M-step increases the ELBO. Therefore the log-likelihood does not decrease.

Proof Sketch

This follows directly from the general EM convergence theorem (see the EM algorithm topic). The GMM is an exponential family mixture, so the M-step has closed-form solutions that globally maximize the Q-function. Both conditions of the general theorem -- exact E-step and global M-step -- are satisfied.

Why It Matters

Monotonic convergence means EM is a safe algorithm: it never makes things worse. Combined with the fact that the GMM log-likelihood is bounded above, EM converges to a stationary point. However, this stationary point is typically a local maximum, not the global maximum. Multiple random restarts are essential.

Failure Mode

EM can converge to poor local optima, especially with bad initialization. It can also converge slowly when components overlap heavily (the responsibilities are all close to $1/K$, giving little information). In extreme cases, a component can collapse onto a single data point ($\Sigma_k \to 0$), producing an unbounded likelihood. Regularizing the covariance (e.g., $\Sigma_k + \epsilon I$) prevents this.
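Putting the pieces together: a minimal 1D EM loop with a variance floor guarding against the collapse just described, tracking the log-likelihood to confirm it never decreases. Initializing the means at data quantiles is one simple deterministic choice made for this sketch; k-means initialization is more common in practice:

```python
import numpy as np

def fit_gmm_1d(x, K, iters=100, var_floor=1e-6):
    """Minimal EM for a 1D GMM. var_floor regularizes against the
    variance -> 0 singularity. Returns parameters and a log-likelihood trace."""
    x = np.asarray(x, dtype=float)
    pis = np.full(K, 1.0 / K)
    mus = np.quantile(x, np.linspace(0.1, 0.9, K))   # deterministic spread-out init
    vars_ = np.full(K, x.var())
    trace = []
    for _ in range(iters):
        # E-step: responsibilities via Bayes' rule
        dens = np.exp(-(x[:, None] - mus) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
        weighted = pis * dens
        trace.append(float(np.log(weighted.sum(axis=1)).sum()))
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates, with a variance floor
        Nk = gamma.sum(axis=0)
        pis = Nk / len(x)
        mus = (gamma * x[:, None]).sum(axis=0) / Nk
        vars_ = np.maximum((gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk, var_floor)
    return pis, mus, vars_, trace

pis, mus, vars_, trace = fit_gmm_1d([1, 2, 3, 7, 8, 9], K=2)
print(np.round(np.sort(mus), 2))                             # means settle near 2 and 8
print(all(b >= a - 1e-9 for a, b in zip(trace, trace[1:])))  # monotone log-likelihood
```

On this well-separated toy data the trace climbs and flattens within a few iterations; on harder data the same loop would be wrapped in multiple restarts, keeping the run with the highest final log-likelihood.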

Choosing K: Model Selection

The number of components $K$ is a hyperparameter. More components fit the data better (higher likelihood) but risk overfitting. Standard model selection criteria penalize model complexity:

Definition

Bayesian Information Criterion (BIC)

$$\text{BIC} = -2\ell(\hat\theta) + p \log n$$

where $p$ is the number of free parameters and $n$ is the sample size. For a GMM with $K$ components in $d$ dimensions:

$$p = K\left(\frac{d(d+1)}{2} + d + 1\right) - 1$$

(each component has $d(d+1)/2$ covariance parameters, $d$ mean parameters, and 1 weight parameter, minus 1 for the constraint $\sum_k \pi_k = 1$).

Choose $K$ minimizing BIC. BIC penalizes complexity more heavily than AIC and tends to select simpler models.

Definition

Akaike Information Criterion (AIC)

$$\text{AIC} = -2\ell(\hat\theta) + 2p$$

AIC replaces $\log n$ with 2, penalizing complexity less than BIC. AIC tends to select larger $K$ and is preferred when the goal is prediction rather than model identification.

Silhouette score: a non-probabilistic alternative. For each point, compare the average distance to its own cluster with the average distance to the nearest other cluster. Silhouette values range from $-1$ to $+1$; higher is better. Choose $K$ maximizing the average silhouette score.
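The parameter count and both criteria are simple arithmetic once the fitted log-likelihood is in hand. A stdlib-only sketch of the formulas above:

```python
import math

def n_params(K, d):
    """Free parameters of a full-covariance GMM: K covariances with d(d+1)/2
    entries each, K means with d entries, and K weights minus one constraint."""
    return K * (d * (d + 1) // 2 + d + 1) - 1

def bic(loglik, K, d, n):
    """BIC = -2 * log-likelihood + p * log(n); lower is better."""
    return -2.0 * loglik + n_params(K, d) * math.log(n)

def aic(loglik, K, d):
    """AIC = -2 * log-likelihood + 2p; lower is better."""
    return -2.0 * loglik + 2 * n_params(K, d)

print(n_params(2, 1))   # 5: two 1D components (2 means + 2 variances + 1 free weight)
print(n_params(3, 2))   # 17: three full-covariance components in 2D
```

For $n > e^2 \approx 7.4$ the BIC penalty per parameter exceeds AIC's, which is the formal version of "BIC is more conservative." Library implementations (e.g., scikit-learn's `GaussianMixture.bic` and `.aic`) wrap this same arithmetic.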

Connection to K-Means

Proposition

K-Means as the Hard-Assignment Limit of GMMs

Statement

Consider a GMM with isotropic covariances $\Sigma_k = \sigma^2 I$ for all $k$. As $\sigma \to 0$, the EM algorithm for this GMM reduces to the k-means algorithm (Lloyd's algorithm):

  • The E-step responsibilities become hard assignments: $\gamma_{ik} \to 1$ for the nearest centroid and $\gamma_{ik} \to 0$ otherwise
  • The M-step mean updates become the k-means centroid updates

Intuition

When $\sigma$ is very small, each Gaussian component is a narrow spike at $\mu_k$. The responsibility $\gamma_{ik}$ is determined almost entirely by the distance $\|x_i - \mu_k\|$: the closest component gets responsibility $\approx 1$, all others get $\approx 0$. This is exactly hard assignment to the nearest centroid. The M-step becomes the centroid update $\mu_k = \frac{1}{|C_k|}\sum_{i \in C_k} x_i$.

Proof Sketch

With $\Sigma_k = \sigma^2 I$, the Gaussian density is:

$$\mathcal{N}(x_i \mid \mu_k, \sigma^2 I) \propto \exp\!\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma^2}\right)$$

The responsibility ratio for components $k$ and $j$ is:

$$\frac{\gamma_{ik}}{\gamma_{ij}} = \frac{\pi_k}{\pi_j} \exp\!\left(-\frac{\|x_i - \mu_k\|^2 - \|x_i - \mu_j\|^2}{2\sigma^2}\right)$$

As $\sigma \to 0$, the exponential drives the ratio to $\infty$ if $\mu_k$ is closer, or $0$ if $\mu_j$ is closer. The assignment becomes hard: $\gamma_{ik} = 1$ for $k = \arg\min_j \|x_i - \mu_j\|$, and $\gamma_{ik} = 0$ otherwise. This is exactly the k-means assignment step.
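The hardening of responsibilities as $\sigma$ shrinks can be checked numerically. A 1D sketch with equal weights and two made-up centroids:

```python
import numpy as np

def responsibilities_isotropic(x, mus, sigma):
    """Equal-weight responsibilities with shared variance sigma^2 (1D)."""
    log_w = -(np.asarray(x, dtype=float)[:, None] - mus) ** 2 / (2 * sigma ** 2)
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

mus = np.array([2.0, 8.0])
for sigma in [3.0, 1.0, 0.3]:
    # x = 4 is closer to mu_1 = 2; its responsibility hardens toward (1, 0)
    print(sigma, np.round(responsibilities_isotropic([4.0], mus, sigma)[0], 4))
```

At large $\sigma$ the point is genuinely shared between components; by $\sigma = 0.3$ the assignment is effectively hard, matching the limit argument above.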

Why It Matters

This connection explains why k-means works well when clusters are approximately spherical and equally sized: it is the correct algorithm for isotropic equal-weight GMMs. It also explains when k-means fails: clusters with very different sizes, shapes (non-spherical), or densities violate the isotropic equal-weight assumption. GMMs handle these cases because they learn a separate $\Sigma_k$ and $\pi_k$ for each component.

Failure Mode

The limit assumes equal mixing weights and a shared isotropic covariance. If different components have different covariances, the $\sigma \to 0$ limit of the full GMM does not reduce to standard k-means but to a weighted variant. K-means implicitly assumes all clusters have the same shape and size.

Canonical Examples

Example

EM for a 2-component 1D GMM

Data: $n = 6$ points at $x = \{1, 2, 3, 7, 8, 9\}$. Fit a 2-component GMM with known variance $\sigma^2 = 1$ and equal weights $\pi_1 = \pi_2 = 0.5$.

Initialize: $\mu_1^{(0)} = 2$, $\mu_2^{(0)} = 7$.

E-step: For $x = 3$: $\gamma_1 \propto \exp(-(3-2)^2/2) = e^{-0.5} \approx 0.607$ and $\gamma_2 \propto \exp(-(3-7)^2/2) = e^{-8} \approx 0.000335$, so $\gamma_1 \approx 0.999$: point 3 belongs overwhelmingly to component 1.

For $x = 7$: symmetrically, $\gamma_2 \approx 0.999$.

M-step: $\mu_1^{(1)} \approx (1 + 2 + 3)/3 = 2$ (near-hard assignment for well-separated components). Similarly $\mu_2^{(1)} \approx 8$.

After a few iterations, EM converges to $\mu_1 \approx 2$ and $\mu_2 \approx 8$.
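The arithmetic in this example can be reproduced in a few lines of numpy; a sketch of one E/M cycle with the example's initialization:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
mus = np.array([2.0, 7.0])                    # initial means from the example
# E-step with sigma^2 = 1 and equal weights: gamma proportional to exp(-(x - mu_k)^2 / 2)
w = np.exp(-(x[:, None] - mus) ** 2 / 2)
gamma = w / w.sum(axis=1, keepdims=True)
print(round(float(gamma[2, 0]), 3))           # responsibility of component 1 for x = 3
# M-step mean update with responsibilities as soft counts
mus_new = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
print(np.round(mus_new, 2))                   # means move to roughly 2 and 8
```

The first print recovers the $\gamma_1 \approx 0.999$ from the E-step above, and the mean update lands at roughly 2 and 8 after a single iteration, as claimed.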

Example

When k-means fails but GMMs succeed

Consider two clusters: one small and tight (50 points near the origin, $\sigma = 0.1$), one large and diffuse (200 points spread over a wide area, $\sigma = 5$). K-means assigns points based purely on distance, so it splits the large cluster to equalize sizes. A GMM learns different covariances $\Sigma_1 \approx 0.01 I$ and $\Sigma_2 \approx 25 I$, correctly identifying the two clusters despite their size difference.

Common Confusions

Watch Out

GMMs can model non-Gaussian shapes via superposition

A single Gaussian component is always ellipsoidal. But a mixture of Gaussians can approximate any continuous density to arbitrary accuracy (this is a universal approximation result for mixtures). A ring-shaped cluster can be modeled by many small Gaussians arranged in a ring. The power of GMMs is in the mixture, not in the individual components.

Watch Out

EM does not guarantee finding the right K

EM fits a GMM with a given $K$. It does not tell you what $K$ to use. Choosing $K$ requires model selection (BIC, AIC, cross-validation, or Bayesian nonparametric approaches like Dirichlet process mixtures). Running EM with the wrong $K$ gives a perfectly valid local maximum of the wrong model.

Watch Out

Responsibilities are not cluster assignments

Responsibilities $\gamma_{ik}$ are posterior probabilities, not binary assignments. A point with $\gamma_{i1} = 0.6$ and $\gamma_{i2} = 0.4$ is genuinely uncertain: it sits between two overlapping clusters. Treating GMMs as hard clusterers by assigning each point to its highest-responsibility component loses information. The soft assignments are a feature, not a bug.

Summary

  • GMM: each cluster is a Gaussian with its own mean $\mu_k$, covariance $\Sigma_k$, and weight $\pi_k$
  • The log-likelihood has a sum inside the log, making direct optimization intractable
  • E-step: compute responsibilities $\gamma_{ik}$ (posterior cluster probabilities) via Bayes' rule
  • M-step: update $\mu_k, \Sigma_k, \pi_k$ using responsibilities as soft counts; all updates are closed-form
  • EM monotonically increases the log-likelihood but converges to local optima
  • K-means is the limit of GMMs with isotropic covariance as $\sigma \to 0$ (soft assignments become hard in the zero-variance limit)
  • Choose $K$ with BIC (conservative) or AIC (liberal), not with the log-likelihood alone

Exercises

ExerciseCore

Problem

Write the E-step update for a 2-component GMM in 1D with parameters $\pi_1, \pi_2, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2$. Compute the responsibility $\gamma_{i1}$ for a data point $x_i = 5$ when $\pi_1 = 0.3$, $\mu_1 = 3$, $\sigma_1 = 1$, $\pi_2 = 0.7$, $\mu_2 = 7$, $\sigma_2 = 2$.

ExerciseAdvanced

Problem

Prove that k-means is a special case of EM for GMMs. Specifically, show that with isotropic covariances $\Sigma_k = \sigma^2 I$ and equal weights, the E-step reduces to hard assignment and the M-step reduces to centroid updates as $\sigma \to 0$.

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning, Chapter 9 -- the definitive textbook treatment
  • McLachlan & Peel, Finite Mixture Models (2000)
  • Dempster, Laird, Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977)

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 8
  • Schwarz, "Estimating the Dimension of a Model" (1978) -- the BIC paper
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 8 and 14

Next Topics

Building on GMMs and EM:

  • Variational autoencoders: replacing the E-step with neural network amortized inference and extending to continuous latent spaces

Last reviewed: April 2026
