
ML Methods

Gaussian Mixture Models and EM

GMMs as soft clustering with per-component Gaussians: EM derivation (E-step responsibilities, M-step parameter updates), convergence guarantees, model selection with BIC/AIC, and the connection to k-means as the hard-assignment limit.

Core · Tier 2 · Stable · ~60 min

Why This Matters

Gaussian mixture models are the canonical example of latent variable models and the primary application of the EM algorithm. When you learn EM, you learn it through GMMs. When you learn variational inference, the GMM is the first test case.

GMMs are also practically important. They provide soft clustering (each point has a probability of belonging to each cluster, not a hard assignment), they can model non-convex cluster shapes through superposition, and they give a principled probabilistic framework for density estimation. Every Bayesian who works with mixtures uses GMMs as the starting point.

The connection to k-means -- which is GMMs in the limit of zero covariance -- unifies the two most important clustering methods.

Mental Model

K-means assigns each point to exactly one cluster (hard assignment). GMMs soften this: each point has a responsibility -- the probability it belongs to each cluster. The E-step computes these responsibilities. The M-step updates the cluster parameters (means, covariances, weights) using the responsibilities as soft counts. This is EM.

The key insight: the GMM log-likelihood has a sum inside the log, $\log \sum_k \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, which is intractable to optimize directly. EM avoids this by alternating between inference (E-step) and optimization (M-step), each of which is tractable.

The GMM Model

Definition

Gaussian Mixture Model

A Gaussian mixture model with $K$ components defines the density:

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where:

  • $\pi_k \geq 0$ are the mixing weights with $\sum_{k=1}^K \pi_k = 1$
  • $\mu_k \in \mathbb{R}^d$ is the mean of component $k$
  • $\Sigma_k \in \mathbb{R}^{d \times d}$ is the covariance matrix of component $k$ (symmetric positive definite)
  • $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ is the full parameter set

Each data point $x_i$ is generated by:

  1. Sample the component $z_i \sim \text{Categorical}(\pi_1, \ldots, \pi_K)$
  2. Sample $x_i \sim \mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$

The latent variable $z_i \in \{1, \ldots, K\}$ indicates which component generated $x_i$. We observe $x_i$ but not $z_i$.
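This two-step generative process is easy to simulate. A minimal numpy sketch, using made-up parameters for a 2-component GMM in 2D (the weights, means, and covariances here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-component GMM in 2D (made up for this sketch)
pis = np.array([0.3, 0.7])                             # mixing weights, sum to 1
mus = np.array([[0.0, 0.0], [5.0, 5.0]])               # component means
Sigmas = np.array([np.eye(2) * 0.5, np.eye(2) * 2.0])  # component covariances

def sample_gmm(n):
    """Generate n points: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pis), size=n, p=pis)            # latent component labels
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

x, z = sample_gmm(1000)
print(np.bincount(z) / len(z))   # empirical label frequencies, close to [0.3, 0.7]
```

Because we simulate, we see the latent labels $z_i$; a real dataset would give us only the $x_i$, which is exactly what makes inference necessary.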

The Log-Likelihood Problem

Given $n$ i.i.d. observations $x_1, \ldots, x_n$, the log-likelihood is:

$$\ell(\theta) = \sum_{i=1}^n \log \left(\sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right)$$

This is hard to optimize directly because:

  1. The sum is inside the log: you cannot separate the contributions of different components
  2. Non-convexity: the log-likelihood is non-convex in $\theta$, with multiple local maxima (permuting the component labels alone gives the same likelihood)
  3. Singularities: if $\mu_k = x_i$ for some $i$ and $\Sigma_k \to 0$, the likelihood goes to infinity (a degenerate component that collapses onto a single point)

EM addresses the first issue by alternating between tractable subproblems. Non-convexity remains: EM converges only to a stationary point, which is why multiple restarts are standard practice. The third issue requires regularization (e.g., adding a small $\epsilon I$ to each covariance).
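Evaluating $\ell(\theta)$ is straightforward even though optimizing it is not. A 1D sketch (scalar variances assumed), using the log-sum-exp trick so small component densities do not underflow:

```python
import numpy as np

def gmm_log_likelihood(x, pis, mus, sigmas):
    """Log-likelihood of 1D data under a GMM. The inner sum over components
    is computed with the log-sum-exp trick for numerical stability."""
    x = np.asarray(x, dtype=float)[:, None]                 # shape (n, 1)
    # log pi_k + log N(x_i | mu_k, sigma_k^2), shape (n, K)
    log_comp = (np.log(pis)
                - 0.5 * np.log(2 * np.pi * sigmas ** 2)
                - (x - mus) ** 2 / (2 * sigmas ** 2))
    m = log_comp.max(axis=1, keepdims=True)                 # per-point max
    return float((m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum())

ll = gmm_log_likelihood([1, 2, 3, 7, 8, 9],
                        pis=np.array([0.5, 0.5]),
                        mus=np.array([2.0, 8.0]),
                        sigmas=np.array([1.0, 1.0]))
print(ll)
```

The sum over $k$ stays inside the log, which is why no closed-form maximizer exists; but as a function evaluation it is a few vectorized lines.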

EM for GMMs

The EM algorithm specializes to GMMs as follows.

Initialize $\theta^{(0)} = \{\pi_k^{(0)}, \mu_k^{(0)}, \Sigma_k^{(0)}\}_{k=1}^K$. Common strategies: k-means initialization (run k-means, use the resulting centroids and within-cluster covariances) or random initialization with multiple restarts.

E-Step: Compute Responsibilities

Definition

Responsibility

The responsibility of component $k$ for data point $x_i$ is the posterior probability that $x_i$ was generated by component $k$:

$$\gamma_{ik} = p(z_i = k \mid x_i, \theta^{(t)}) = \frac{\pi_k^{(t)} \, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} \, \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$$

The responsibilities satisfy $\gamma_{ik} \in [0, 1]$ and $\sum_{k=1}^K \gamma_{ik} = 1$ for each $i$.

The E-step is simply evaluating Bayes' rule. The responsibilities are soft assignments: $\gamma_{ik} = 0.7$ means data point $i$ belongs to component $k$ with probability 0.7.
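The Bayes'-rule computation is one normalization. A minimal 1D E-step sketch, assuming scalar variances; each row of `gamma` is normalized to sum to 1:

```python
import numpy as np

def e_step(x, pis, mus, sigmas):
    """Responsibilities gamma[i, k] for a 1D GMM via Bayes' rule."""
    x = np.asarray(x, dtype=float)[:, None]
    dens = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    weighted = pis * dens                      # numerator: pi_k * N(x_i | mu_k, sigma_k^2)
    return weighted / weighted.sum(axis=1, keepdims=True)

gamma = e_step([1, 2, 3, 7, 8, 9],
               pis=np.array([0.5, 0.5]),
               mus=np.array([2.0, 7.0]),
               sigmas=np.array([1.0, 1.0]))
print(gamma[2])   # x = 3: responsibility mass sits overwhelmingly on component 1
```

In production code the densities would be computed in log space (as in the log-likelihood sketch above) to avoid underflow; the direct form is kept here for readability.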

M-Step: Update Parameters

Given the responsibilities, the M-step updates are closed-form. Define the effective number of points assigned to component $k$:

$$N_k = \sum_{i=1}^n \gamma_{ik}$$

Then the parameter updates are:

Mixing weights: $\pi_k^{(t+1)} = \frac{N_k}{n}$

Means: $\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} \, x_i$

Covariances: $\Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^\top$

These are exactly the weighted maximum likelihood estimates, using the responsibilities as weights.
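The same updates in code, sketched for the 1D case (scalar variances), with the responsibilities passed in from an E-step:

```python
import numpy as np

def m_step(x, gamma):
    """Closed-form M-step for a 1D GMM: weighted maximum likelihood
    estimates using the responsibilities gamma[i, k] as soft counts."""
    x = np.asarray(x, dtype=float)
    Nk = gamma.sum(axis=0)                                      # effective counts N_k
    pis = Nk / len(x)                                           # mixing weights
    mus = (gamma * x[:, None]).sum(axis=0) / Nk                 # weighted means
    vars_ = (gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk  # weighted variances
    return pis, mus, vars_

# With hard (0/1) responsibilities, the updates reduce to per-cluster statistics
gamma = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
pis, mus, vars_ = m_step([1, 2, 3, 7, 8, 9], gamma)
print(pis, mus, vars_)   # weights 0.5 each, means 2 and 8, variances about 0.67
```

The hard-responsibility check illustrates the "soft counts" reading: when every $\gamma_{ik}$ is 0 or 1, the weighted estimates are exactly the ordinary per-cluster sample statistics.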

Convergence

Proposition

EM for GMMs Monotonically Increases the Log-Likelihood

Statement

At each iteration of EM for GMMs:

$$\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})$$

The log-likelihood is non-decreasing. Since it is bounded above (for non-degenerate models), the sequence $\ell(\theta^{(t)})$ converges.

Intuition

This is a special case of the general EM convergence theorem. The E-step makes the ELBO equal to the log-likelihood. The M-step increases the ELBO. Therefore the log-likelihood does not decrease.

Proof Sketch

This follows directly from the general EM convergence theorem (see the EM algorithm topic). The GMM is an exponential family mixture, so the M-step has closed-form solutions that globally maximize the Q-function. Both conditions of the general theorem -- exact E-step and global M-step -- are satisfied.

Why It Matters

Monotonic convergence means EM is a safe algorithm: it never makes things worse. Combined with the fact that the GMM log-likelihood is bounded above, EM converges to a stationary point. However, this stationary point is typically a local maximum, not the global maximum. Multiple random restarts are essential.

Failure Mode

EM can converge to poor local optima, especially with bad initialization. It can also converge slowly when components overlap heavily (the responsibilities are all close to $1/K$, giving little information). In extreme cases, a component can collapse onto a single data point ($\Sigma_k \to 0$), producing an unbounded likelihood. Regularizing the covariance (e.g., $\Sigma_k + \epsilon I$) prevents this.
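Putting the pieces together: a minimal 1D EM loop with a variance floor guarding against the collapse just described, tracking the log-likelihood to confirm it never decreases. Initializing the means at data quantiles is one simple deterministic choice made for this sketch; k-means initialization is more common in practice:

```python
import numpy as np

def fit_gmm_1d(x, K, iters=100, var_floor=1e-6):
    """Minimal EM for a 1D GMM. var_floor regularizes against the
    variance -> 0 singularity. Returns parameters and a log-likelihood trace."""
    x = np.asarray(x, dtype=float)
    pis = np.full(K, 1.0 / K)
    mus = np.quantile(x, np.linspace(0.1, 0.9, K))   # deterministic spread-out init
    vars_ = np.full(K, x.var())
    trace = []
    for _ in range(iters):
        # E-step: responsibilities via Bayes' rule
        dens = np.exp(-(x[:, None] - mus) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
        weighted = pis * dens
        trace.append(float(np.log(weighted.sum(axis=1)).sum()))
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates, with a variance floor
        Nk = gamma.sum(axis=0)
        pis = Nk / len(x)
        mus = (gamma * x[:, None]).sum(axis=0) / Nk
        vars_ = np.maximum((gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk, var_floor)
    return pis, mus, vars_, trace

pis, mus, vars_, trace = fit_gmm_1d([1, 2, 3, 7, 8, 9], K=2)
print(np.round(np.sort(mus), 2))                             # means settle near 2 and 8
print(all(b >= a - 1e-9 for a, b in zip(trace, trace[1:])))  # monotone log-likelihood
```

On this well-separated toy data the trace climbs and flattens within a few iterations; on harder data the same loop would be wrapped in multiple restarts, keeping the run with the highest final log-likelihood.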

Choosing K: Model Selection

The number of components $K$ is a hyperparameter. More components fit the data better (higher likelihood) but risk overfitting. Standard model selection criteria penalize model complexity:

Definition

Bayesian Information Criterion (BIC)

$$\text{BIC} = -2\ell(\hat\theta) + p \log n$$

where $p$ is the number of free parameters and $n$ is the sample size. For a GMM with $K$ components in $d$ dimensions:

$$p = K\left(\frac{d(d+1)}{2} + d + 1\right) - 1$$

(each component has $d(d+1)/2$ covariance parameters, $d$ mean parameters, and 1 weight parameter, minus 1 for the constraint $\sum_k \pi_k = 1$).

Choose $K$ minimizing BIC. BIC penalizes complexity more heavily than AIC and tends to select simpler models.

Definition

Akaike Information Criterion (AIC)

$$\text{AIC} = -2\ell(\hat\theta) + 2p$$

AIC replaces $\log n$ with 2, penalizing complexity less than BIC. AIC tends to select larger $K$ and is preferred when the goal is prediction rather than model identification.

Silhouette score: a non-probabilistic alternative. For each point, compare the average distance to its own cluster with the average distance to the nearest other cluster. Silhouette values range from $-1$ to $+1$; higher is better. Choose $K$ maximizing the average silhouette score.
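The parameter count and both criteria are simple arithmetic once the fitted log-likelihood is in hand. A stdlib-only sketch of the formulas above:

```python
import math

def n_params(K, d):
    """Free parameters of a full-covariance GMM: K covariances with d(d+1)/2
    entries each, K means with d entries, and K weights minus one constraint."""
    return K * (d * (d + 1) // 2 + d + 1) - 1

def bic(loglik, K, d, n):
    """BIC = -2 * log-likelihood + p * log(n); lower is better."""
    return -2.0 * loglik + n_params(K, d) * math.log(n)

def aic(loglik, K, d):
    """AIC = -2 * log-likelihood + 2p; lower is better."""
    return -2.0 * loglik + 2 * n_params(K, d)

print(n_params(2, 1))   # 5: two 1D components (2 means + 2 variances + 1 free weight)
print(n_params(3, 2))   # 17: three full-covariance components in 2D
```

For $n > e^2 \approx 7.4$ the BIC penalty per parameter exceeds AIC's, which is the formal version of "BIC is more conservative." Library implementations (e.g., scikit-learn's `GaussianMixture.bic` and `.aic`) wrap this same arithmetic.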

Connection to K-Means

Proposition

K-Means as the Hard-Assignment Limit of GMMs

Statement

Consider a GMM with isotropic covariances $\Sigma_k = \sigma^2 I$ for all $k$. As $\sigma \to 0$, the EM algorithm for this GMM reduces to the k-means algorithm (Lloyd's algorithm):

  • The E-step responsibilities become hard assignments: $\gamma_{ik} \to 1$ for the nearest centroid and $\gamma_{ik} \to 0$ otherwise
  • The M-step mean updates become the k-means centroid updates

Intuition

When $\sigma$ is very small, each Gaussian component is a narrow spike at $\mu_k$. The responsibility $\gamma_{ik}$ is determined almost entirely by the distance $\|x_i - \mu_k\|$: the closest component gets responsibility $\approx 1$, all others get $\approx 0$. This is exactly hard assignment to the nearest centroid. The M-step becomes the centroid update $\mu_k = \frac{1}{|C_k|}\sum_{i \in C_k} x_i$.

Proof Sketch

With $\Sigma_k = \sigma^2 I$, the Gaussian density is:

$$\mathcal{N}(x_i \mid \mu_k, \sigma^2 I) \propto \exp\!\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma^2}\right)$$

The responsibility ratio for components $k$ and $j$ is:

$$\frac{\gamma_{ik}}{\gamma_{ij}} = \frac{\pi_k}{\pi_j} \exp\!\left(-\frac{\|x_i - \mu_k\|^2 - \|x_i - \mu_j\|^2}{2\sigma^2}\right)$$

As $\sigma \to 0$, the exponential drives the ratio to $\infty$ if $\mu_k$ is closer, or $0$ if $\mu_j$ is closer. The assignment becomes hard: $\gamma_{ik} = 1$ for $k = \arg\min_j \|x_i - \mu_j\|$, and $\gamma_{ik} = 0$ otherwise. This is exactly the k-means assignment step.
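The hardening of responsibilities as $\sigma$ shrinks can be checked numerically. A 1D sketch with equal weights and two made-up centroids:

```python
import numpy as np

def responsibilities_isotropic(x, mus, sigma):
    """Equal-weight responsibilities with shared variance sigma^2 (1D)."""
    log_w = -(np.asarray(x, dtype=float)[:, None] - mus) ** 2 / (2 * sigma ** 2)
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

mus = np.array([2.0, 8.0])
for sigma in [3.0, 1.0, 0.3]:
    # x = 4 is closer to mu_1 = 2; its responsibility hardens toward (1, 0)
    print(sigma, np.round(responsibilities_isotropic([4.0], mus, sigma)[0], 4))
```

At large $\sigma$ the point is genuinely shared between components; by $\sigma = 0.3$ the assignment is effectively hard, matching the limit argument above.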

Why It Matters

This connection explains why k-means works well when clusters are approximately spherical and equally sized: it is the correct algorithm for isotropic equal-weight GMMs. It also explains when k-means fails: clusters with very different sizes, shapes (non-spherical), or densities violate the isotropic equal-weight assumption. GMMs handle these cases because they learn a separate $\Sigma_k$ and $\pi_k$ for each component.

Failure Mode

The limit assumes equal mixing weights and a shared isotropic covariance. If different components have different covariances, the $\sigma \to 0$ limit of the full GMM does not reduce to standard k-means but to a weighted variant. K-means implicitly assumes all clusters have the same shape and size.

Canonical Examples

Example

EM for a 2-component 1D GMM

Data: $n = 6$ points at $x = \{1, 2, 3, 7, 8, 9\}$. Fit a 2-component GMM with known variance $\sigma^2 = 1$ and equal weights $\pi_1 = \pi_2 = 0.5$.

Initialize: $\mu_1^{(0)} = 2$, $\mu_2^{(0)} = 7$.

E-step: For $x = 3$: $\gamma_1 \propto \exp(-(3-2)^2/2) = e^{-0.5} \approx 0.607$ and $\gamma_2 \propto \exp(-(3-7)^2/2) = e^{-8} \approx 0.000335$, so $\gamma_1 \approx 0.999$: point 3 belongs overwhelmingly to component 1.

For $x = 7$: symmetrically, $\gamma_2 \approx 0.999$.

M-step: $\mu_1^{(1)} \approx (1 + 2 + 3)/3 = 2$ (near-hard assignment for well-separated components). Similarly $\mu_2^{(1)} \approx 8$.

After a few iterations, EM converges to $\mu_1 \approx 2$ and $\mu_2 \approx 8$.
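The arithmetic in this example can be reproduced in a few lines of numpy; a sketch of one E/M cycle with the example's initialization:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
mus = np.array([2.0, 7.0])                    # initial means from the example
# E-step with sigma^2 = 1 and equal weights: gamma proportional to exp(-(x - mu_k)^2 / 2)
w = np.exp(-(x[:, None] - mus) ** 2 / 2)
gamma = w / w.sum(axis=1, keepdims=True)
print(round(float(gamma[2, 0]), 3))           # responsibility of component 1 for x = 3
# M-step mean update with responsibilities as soft counts
mus_new = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
print(np.round(mus_new, 2))                   # means move to roughly 2 and 8
```

The first print recovers the $\gamma_1 \approx 0.999$ from the E-step above, and the mean update lands at roughly 2 and 8 after a single iteration, as claimed.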

Example

When k-means fails but GMMs succeed

Consider two clusters: one small and tight (50 points near the origin, $\sigma = 0.1$), one large and diffuse (200 points spread over a wide area, $\sigma = 5$). K-means assigns points based purely on distance, so it splits the large cluster to equalize sizes. A GMM learns different covariances $\Sigma_1 \approx 0.01 I$ and $\Sigma_2 \approx 25 I$, correctly identifying the two clusters despite their size difference.

Common Confusions

Watch Out

GMMs can model non-Gaussian shapes via superposition

A single Gaussian component is always ellipsoidal. But a mixture of Gaussians can approximate any continuous density to arbitrary accuracy (this is a universal approximation result for mixtures). A ring-shaped cluster can be modeled by many small Gaussians arranged in a ring. The power of GMMs is in the mixture, not in the individual components.

Watch Out

EM does not guarantee finding the right K

EM fits a GMM with a given $K$. It does not tell you what $K$ to use. Choosing $K$ requires model selection (BIC, AIC, cross-validation, or Bayesian nonparametric approaches like Dirichlet process mixtures). Running EM with the wrong $K$ gives a perfectly valid local maximum of the wrong model.

Watch Out

Responsibilities are not cluster assignments

Responsibilities $\gamma_{ik}$ are posterior probabilities, not binary assignments. A point with $\gamma_{i1} = 0.6$ and $\gamma_{i2} = 0.4$ is genuinely uncertain: it sits between two overlapping clusters. Treating GMMs as hard clusterers by assigning each point to its highest-responsibility component loses information. The soft assignments are a feature, not a bug.

Summary

  • GMM: each cluster is a Gaussian with its own mean $\mu_k$, covariance $\Sigma_k$, and weight $\pi_k$
  • The log-likelihood has a sum inside the log, making direct optimization intractable
  • E-step: compute responsibilities $\gamma_{ik}$ (posterior cluster probabilities) via Bayes' rule
  • M-step: update $\mu_k, \Sigma_k, \pi_k$ using responsibilities as soft counts; all updates are closed-form
  • EM monotonically increases the log-likelihood but converges to local optima
  • K-means is the limit of GMMs with isotropic covariance as $\sigma \to 0$ (soft assignments become hard in the zero-variance limit)
  • Choose $K$ with BIC (conservative) or AIC (liberal), not with the log-likelihood alone

Exercises

ExerciseCore

Problem

Write the E-step update for a 2-component GMM in 1D with parameters $\pi_1, \pi_2, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2$. Compute the responsibility $\gamma_{i1}$ for a data point $x_i = 5$ when $\pi_1 = 0.3$, $\mu_1 = 3$, $\sigma_1 = 1$, $\pi_2 = 0.7$, $\mu_2 = 7$, $\sigma_2 = 2$.

ExerciseAdvanced

Problem

Prove that k-means is a special case of EM for GMMs. Specifically, show that with isotropic covariances $\Sigma_k = \sigma^2 I$ and equal weights, the E-step reduces to hard assignment and the M-step reduces to centroid updates as $\sigma \to 0$.

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning, Chapter 9 -- the definitive textbook treatment
  • McLachlan & Peel, Finite Mixture Models (2000)
  • Dempster, Laird, Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977)

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 8
  • Schwarz, "Estimating the Dimension of a Model" (1978) -- the BIC paper
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 8 and 14

Next Topics

Building on GMMs and EM:

  • Variational autoencoders: replacing the E-step with neural network amortized inference and extending to continuous latent spaces

Last reviewed: April 2026
