ML Methods
Gaussian Mixture Models and EM
GMMs as soft clustering with per-component Gaussians: EM derivation (E-step responsibilities, M-step parameter updates), convergence guarantees, model selection with BIC/AIC, and the connection to k-means as the hard-assignment limit.
Why This Matters
Gaussian mixture models are the canonical example of latent variable models and the primary application of the EM algorithm. When you learn EM, you learn it through GMMs. When you learn variational inference, the GMM is the first test case.
GMMs are also practically important. They provide soft clustering (each point has a probability of belonging to each cluster, not a hard assignment), they can model non-convex cluster shapes through superposition, and they give a principled probabilistic framework for density estimation. Every Bayesian who works with mixtures uses GMMs as the starting point.
The connection to k-means -- which is GMMs in the limit of zero covariance -- unifies the two most important clustering methods.
Mental Model
K-means assigns each point to exactly one cluster (hard assignment). GMMs soften this: each point has a responsibility -- the probability it belongs to each cluster. The E-step computes these responsibilities. The M-step updates the cluster parameters (means, covariances, weights) using the responsibilities as soft counts. This is EM.
The key insight: the GMM log-likelihood has a sum inside the log ($\log \sum_k \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$), which is intractable to optimize directly. EM avoids this by alternating between inference (E-step) and optimization (M-step), each of which is tractable.
The GMM Model
Gaussian Mixture Model
A Gaussian mixture model with $K$ components defines the density:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where:
- $\pi_k \ge 0$ are the mixing weights with $\sum_{k=1}^{K} \pi_k = 1$
- $\mu_k \in \mathbb{R}^d$ is the mean of component $k$
- $\Sigma_k \in \mathbb{R}^{d \times d}$ is the covariance matrix of component $k$ (symmetric positive definite)
- $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ is the full parameter set
Each data point is generated by:
- Sample the component $z \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K)$
- Sample $x \mid z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$
The latent variable $z$ indicates which component generated $x$. We observe $x$ but not $z$.
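The two-step generative story translates directly into code. A minimal NumPy sketch of ancestral sampling from a 1D GMM; the weights, means, and variances below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-component 1D GMM (assumed, not from the text).
pi = np.array([0.3, 0.7])      # mixing weights, sum to 1
mu = np.array([-2.0, 3.0])     # component means
sigma = np.array([0.5, 1.0])   # component standard deviations

def sample_gmm(n):
    """Ancestral sampling: z ~ Categorical(pi), then x | z = k ~ N(mu_k, sigma_k^2)."""
    z = rng.choice(len(pi), size=n, p=pi)   # latent component labels
    x = rng.normal(mu[z], sigma[z])         # observed values
    return x, z

x, z = sample_gmm(1000)
```

Keeping `z` here is only possible because we are simulating; in the inference problem below, `z` is exactly what is hidden.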
The Log-Likelihood Problem
Given $n$ i.i.d. observations $x_1, \dots, x_n$, the log-likelihood is:
$$\ell(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
This is hard to optimize directly because:
- The sum is inside the log: you cannot separate the contributions of different components
- Non-convexity: the log-likelihood is non-convex in $\theta$, with multiple local maxima (permuting the component labels gives the same likelihood)
- Singularities: if $\mu_k = x_i$ for some $k$ and $i$, the likelihood goes to infinity as $\Sigma_k \to 0$ (a degenerate component that collapses onto a single point)
EM addresses the first two issues. The third requires regularization (e.g., adding a small $\epsilon I$ to each covariance).
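Even just evaluating this log-likelihood naively can underflow, since the per-component densities can be tiny; the standard remedy is to work in log space with log-sum-exp. A sketch for the 1D case using SciPy (the function name `gmm_loglik` is illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_loglik(x, pi, mu, sigma):
    """Log-likelihood of a 1D GMM: sum_i log sum_k pi_k N(x_i | mu_k, sigma_k^2).

    logsumexp handles the sum-inside-the-log without underflow.
    """
    # log pi_k + log N(x_i | mu_k, sigma_k^2), broadcast to shape (n, K)
    log_terms = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)
    return logsumexp(log_terms, axis=1).sum()
```

For a single point at 0 under components at $\pm 1$ with unit variance and equal weights, both components contribute equally, so the result is just the log-density of a standard normal at distance 1.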
EM for GMMs
The EM algorithm specializes to GMMs as follows.
Initialize $\theta^{(0)} = \{\pi_k^{(0)}, \mu_k^{(0)}, \Sigma_k^{(0)}\}$. Common strategies: k-means initialization (run k-means, use the resulting centroids and within-cluster covariances) or random initialization with multiple restarts.
E-Step: Compute Responsibilities
Responsibility
The responsibility $\gamma_{ik}$ of component $k$ for data point $x_i$ is the posterior probability that $x_i$ was generated by component $k$:
$$\gamma_{ik} = p(z_i = k \mid x_i, \theta) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
The responsibilities satisfy $\gamma_{ik} \ge 0$ and $\sum_{k=1}^{K} \gamma_{ik} = 1$ for each $i$.
The E-step is simply evaluating Bayes' rule. The responsibilities are soft assignments: $\gamma_{ik} = 0.7$ means data point $x_i$ belongs to component $k$ with probability 0.7.
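The Bayes'-rule computation can be sketched for a 1D GMM (the helper name `e_step` is illustrative); the normalization is done in log space to avoid underflow for points far from all components:

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi, mu, sigma):
    """Responsibilities gamma[i, k] proportional to pi_k * N(x_i | mu_k, sigma_k^2),
    normalized over k. Rows of the result sum to 1."""
    log_w = np.log(pi) + norm.logpdf(x[:, None], mu, sigma)  # (n, K)
    log_w -= log_w.max(axis=1, keepdims=True)                # stabilize before exp
    gamma = np.exp(log_w)
    return gamma / gamma.sum(axis=1, keepdims=True)
```

For a point at 3 with equal-weight unit-variance components at 1 and $-1$, the first component's responsibility is $1/(1 + e^{-6}) \approx 0.9975$: a near-hard assignment.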
M-Step: Update Parameters
Given the responsibilities, the M-step updates are closed-form. Define the effective number of points assigned to component $k$:
$$N_k = \sum_{i=1}^{n} \gamma_{ik}$$
Then the parameter updates are:
Mixing weights: $\pi_k^{\text{new}} = N_k / n$
Means: $\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, x_i$
Covariances: $\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \, (x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^\top$
These are exactly the weighted maximum likelihood estimates, using the responsibilities as weights.
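The soft-count updates translate line by line into code. A 1D sketch (the name `m_step` and the small variance floor are illustrative choices, the floor standing in for the $\epsilon I$ covariance regularization mentioned earlier):

```python
import numpy as np

def m_step(x, gamma):
    """Weighted maximum-likelihood updates for a 1D GMM, with the
    responsibilities gamma (shape (n, K)) acting as soft counts."""
    n = len(x)
    N_k = gamma.sum(axis=0)                       # effective count per component
    pi = N_k / n                                  # mixing weights
    mu = (gamma * x[:, None]).sum(axis=0) / N_k   # responsibility-weighted means
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / N_k
    # Small floor on the variance guards against component collapse.
    return pi, mu, np.sqrt(var + 1e-8)
```

With one-hot (hard) responsibilities, these updates reduce to per-cluster sample fractions, means, and variances, which is the k-means connection made precise below.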
Convergence
EM for GMMs Monotonically Increases the Log-Likelihood
Statement
At each iteration of EM for GMMs:
$$\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$$
The log-likelihood is non-decreasing. Since it is bounded above (for non-degenerate models), the sequence $\{\ell(\theta^{(t)})\}$ converges.
Intuition
This is a special case of the general EM convergence theorem. The E-step makes the ELBO equal to the log-likelihood. The M-step increases the ELBO. Therefore the log-likelihood does not decrease.
Proof Sketch
This follows directly from the general EM convergence theorem (see the EM algorithm topic). The GMM is an exponential family mixture, so the M-step has closed-form solutions that globally maximize the Q-function. Both conditions of the general theorem -- exact E-step and global M-step -- are satisfied.
Why It Matters
Monotonic convergence means EM is a safe algorithm: it never makes things worse. Combined with the fact that the GMM log-likelihood is bounded above, EM converges to a stationary point. However, this stationary point is typically a local maximum, not the global maximum. Multiple random restarts are essential.
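In practice, the restarts are usually delegated to the library: scikit-learn's `GaussianMixture` runs EM from several initializations via `n_init` and keeps the highest-likelihood fit. A sketch on synthetic well-separated data (all numbers illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic 1D clusters.
X = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

# n_init=10 runs EM from 10 initializations and keeps the best local optimum.
gm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)
```

After fitting, `gm.converged_` reports whether EM reached its tolerance, and `gm.means_` holds the estimated component means.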
Failure Mode
EM can converge to poor local optima, especially with bad initialization. It can also converge slowly when components overlap heavily (the responsibilities are all close to $1/K$, giving little information). In extreme cases, a component can collapse onto a single data point ($\Sigma_k \to 0$), producing an unbounded likelihood. Regularization of the covariance (e.g., $\Sigma_k + \epsilon I$) prevents this.
Choosing K: Model Selection
The number of components $K$ is a hyperparameter. More components fit the data better (higher likelihood) but risk overfitting. Standard model selection criteria penalize model complexity:
Bayesian Information Criterion (BIC)
$$\mathrm{BIC} = -2\,\ell(\hat\theta) + p \log n$$
where $p$ is the number of free parameters and $n$ is the sample size. For a GMM with $K$ components in $d$ dimensions:
$$p = K \cdot \frac{d(d+1)}{2} + K d + (K - 1)$$
(each component has $d(d+1)/2$ covariance parameters, $d$ mean parameters, and 1 weight parameter, minus 1 for the constraint $\sum_k \pi_k = 1$).
Choose $K$ minimizing BIC. BIC penalizes complexity more heavily than AIC and tends to select simpler models.
Akaike Information Criterion (AIC)
$$\mathrm{AIC} = -2\,\ell(\hat\theta) + 2p$$
AIC replaces the $\log n$ factor with 2, penalizing complexity less than BIC. AIC tends to select larger $K$ and is preferred when the goal is prediction rather than model identification.
Silhouette score: a non-probabilistic alternative. For each point, compare the average distance to its own cluster with the average distance to the nearest other cluster. Silhouette values range from $-1$ to $1$; higher is better. Choose $K$ maximizing the average silhouette score.
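A typical selection loop fits GMMs over a range of candidate $K$ and keeps the BIC minimizer; a sketch using scikit-learn's built-in `bic` method on synthetic three-cluster data (all numbers illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic 1D data drawn from 3 well-separated clusters.
X = np.concatenate([rng.normal(c, 0.5, 100) for c in (-5.0, 0.0, 5.0)]).reshape(-1, 1)

# Fit a GMM for each candidate K and keep the one minimizing BIC.
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
```

On data this clearly clustered, the likelihood gain from $K > 3$ is small relative to the $\log n$ penalty per extra parameter, so BIC recovers $K = 3$.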
Connection to K-Means
K-Means as the Hard-Assignment Limit of GMMs
Statement
Consider a GMM with equal mixing weights $\pi_k = 1/K$ and isotropic covariances $\Sigma_k = \sigma^2 I$ for all $k$. As $\sigma^2 \to 0$, the EM algorithm for this GMM reduces to the k-means algorithm (Lloyd's algorithm):
- The E-step responsibilities become hard assignments: $\gamma_{ik} = 1$ for the nearest centroid and $\gamma_{ik} = 0$ otherwise
- The M-step mean updates become the k-means centroid updates
Intuition
When $\sigma^2$ is very small, each Gaussian component is a narrow spike at $\mu_k$. The responsibility is determined almost entirely by the squared distance $\|x_i - \mu_k\|^2$: the closest component gets responsibility $\approx 1$, all others get $\approx 0$. This is exactly hard assignment to the nearest centroid. The M-step becomes the centroid update $\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$, the mean of the points assigned to cluster $C_k$.
Proof Sketch
With $\Sigma_k = \sigma^2 I$, the Gaussian density is:
$$\mathcal{N}(x_i \mid \mu_k, \sigma^2 I) \propto \exp\!\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma^2}\right)$$
The responsibility ratio for components $k$ and $j$ (with equal weights):
$$\frac{\gamma_{ik}}{\gamma_{ij}} = \exp\!\left(\frac{\|x_i - \mu_j\|^2 - \|x_i - \mu_k\|^2}{2\sigma^2}\right)$$
As $\sigma^2 \to 0$, the exponential drives the ratio to $\infty$ if $\mu_k$ is closer, or to $0$ if $\mu_j$ is closer. The assignment becomes hard: $\gamma_{ik} = 1$ for $k = \arg\min_j \|x_i - \mu_j\|^2$, and $\gamma_{ik} = 0$ otherwise. This is exactly the k-means assignment step.
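The hardening of the responsibilities can be checked numerically. A sketch with two centroids and an illustrative query point, shrinking $\sigma^2$:

```python
import numpy as np

def responsibilities_isotropic(x, mus, sigma2):
    """Responsibilities of an equal-weight isotropic 1D GMM for a single point x."""
    logits = -(x - mus) ** 2 / (2 * sigma2)   # log-density up to shared constants
    logits -= logits.max()                     # stabilize before exponentiating
    w = np.exp(logits)
    return w / w.sum()

mus = np.array([0.0, 1.0])
x = 0.4                                        # slightly closer to mu_0 than mu_1
soft = responsibilities_isotropic(x, mus, 1.0)     # sigma^2 = 1: a soft split
hard = responsibilities_isotropic(x, mus, 1e-4)    # sigma^2 -> 0: effectively hard
```

At $\sigma^2 = 1$ the point is split roughly 0.52/0.48 between the components; at $\sigma^2 = 10^{-4}$ the nearest centroid receives essentially all of the responsibility.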
Why It Matters
This connection explains why k-means works well when clusters are approximately spherical and equally sized: it is the correct algorithm for isotropic equal-weight GMMs. It also explains when k-means fails: clusters with very different sizes, shapes (non-spherical), or densities violate the isotropic equal-weight assumption. GMMs handle these cases because they learn separate $\Sigma_k$ and $\pi_k$ for each component.
Failure Mode
The limit assumes equal mixing weights and a shared isotropic covariance. If different components have different covariances, the $\sigma^2 \to 0$ limit of the full GMM does not reduce to standard k-means but to a weighted version. K-means implicitly assumes all clusters have the same shape and size.
Canonical Examples
EM for a 2-component 1D GMM
Data: points at $x = -3, -2, 2, 3$. Fit a 2-component GMM with known variance $\sigma^2 = 1$ and equal weights $\pi_1 = \pi_2 = 1/2$.
Initialize: $\mu_1 = 1$, $\mu_2 = -1$.
E-step: For $x = 3$: $\gamma_1 = \frac{\exp(-(3-1)^2/2)}{\exp(-(3-1)^2/2) + \exp(-(3+1)^2/2)} = \frac{1}{1 + e^{-6}} \approx 0.998$. So the point at 3 belongs overwhelmingly to component 1.
For $x = -3$: symmetrically, $\gamma_2 \approx 0.998$.
M-step: $\mu_1^{\text{new}} \approx \frac{2 + 3}{2} = 2.5$ (near-hard assignment for well-separated components). Similarly $\mu_2^{\text{new}} \approx -2.5$.
After a few iterations, EM converges to $\mu_1 \approx 2.5$ and $\mu_2 \approx -2.5$.
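This kind of fixed-variance iteration can be run end-to-end. A sketch with illustrative well-separated 1D data at $-3, -2, 2, 3$, unit variance, equal weights, and initial means $\pm 1$ (only the means are updated, matching the known-variance setup):

```python
import numpy as np
from scipy.stats import norm

x = np.array([-3.0, -2.0, 2.0, 3.0])   # illustrative well-separated data
mu = np.array([1.0, -1.0])             # initial means; sigma^2 = 1, pi = 1/2 fixed

for _ in range(20):
    # E-step: responsibilities under equal weights and unit variance.
    log_w = norm.logpdf(x[:, None], mu, 1.0)
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: update only the means, using responsibilities as soft counts.
    mu = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
```

Because the clusters are well separated, the responsibilities are near-hard from the first iteration and the means settle close to the within-cluster averages $\pm 2.5$.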
When k-means fails but GMMs succeed
Consider two clusters: one small and tight (50 points concentrated near the origin), one large and diffuse (200 points spread over a wide area). K-means assigns points based purely on distance, so its boundary falls roughly midway between the centroids and slices off part of the diffuse cluster. GMMs learn different covariances $\Sigma_1$ and $\Sigma_2$, correctly identifying the two clusters regardless of their size difference.
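The contrast can be reproduced on synthetic data by fitting scikit-learn's `GaussianMixture` to one tight and one diffuse cluster and inspecting the learned per-component variances (all numbers below are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# One tight cluster (50 points, sd 0.1) and one diffuse cluster (200 points, sd 2.0).
X = np.concatenate([rng.normal(0.0, 0.1, 50),
                    rng.normal(5.0, 2.0, 200)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
variances = np.sort(gm.covariances_.ravel())   # per-component variances (1D case)
```

The fitted variances differ by orders of magnitude (roughly 0.01 vs. 4), which is exactly the structure a shared-isotropic method like k-means cannot express.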
Common Confusions
GMMs can model non-Gaussian shapes via superposition
A single Gaussian component is always ellipsoidal. But a mixture of Gaussians can approximate any continuous density to arbitrary accuracy (this is a universal approximation result for mixtures). A ring-shaped cluster can be modeled by many small Gaussians arranged in a ring. The power of GMMs is in the mixture, not in the individual components.
EM does not guarantee finding the right K
EM fits a GMM with a given . It does not tell you what to use. Choosing requires model selection (BIC, AIC, cross-validation, or Bayesian nonparametric approaches like Dirichlet process mixtures). Running EM with the wrong gives a perfectly valid local maximum of the wrong model.
Responsibilities are not cluster assignments
Responsibilities are posterior probabilities, not binary assignments. A point whose responsibilities are close to $0.5$ for two components is genuinely uncertain -- it sits between two overlapping clusters. Treating GMMs as hard clusterers by assigning each point to its highest-responsibility component loses this information. The soft assignments are a feature, not a bug.
Summary
- GMM: each cluster is a Gaussian with its own mean $\mu_k$, covariance $\Sigma_k$, and weight $\pi_k$
- The log-likelihood has a sum inside the log, making direct optimization intractable
- E-step: compute responsibilities (posterior cluster probabilities) via Bayes' rule
- M-step: update using responsibilities as soft counts -- all updates are closed-form
- EM monotonically increases the log-likelihood but converges to local optima
- K-means is the limit of GMMs with isotropic covariance as $\sigma^2 \to 0$ (hard assignment = soft assignment in the zero-variance limit)
- Choose $K$ with BIC (conservative) or AIC (liberal), not with the log-likelihood alone
Exercises
Problem
Write the E-step update for a 2-component GMM in 1D with parameters $\pi_1, \pi_2, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2$. Then compute the responsibility of component 1 for a given data point $x$.
Problem
Prove that k-means is a special case of EM for GMMs. Specifically, show that with isotropic covariances $\Sigma_k = \sigma^2 I$ and equal weights, the E-step reduces to hard assignment and the M-step reduces to centroid updates as $\sigma^2 \to 0$.
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning, Chapter 9 -- the definitive textbook treatment
- McLachlan & Peel, Finite Mixture Models (2000)
- Dempster, Laird, Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977)
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 8
- Schwarz, "Estimating the Dimension of a Model" (1978) -- the BIC paper
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
Next Topics
Building on GMMs and EM:
- Variational autoencoders: replacing the E-step with neural network amortized inference and extending to continuous latent spaces
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- K-Means Clustering (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- The EM Algorithm (Layer 2)
- Maximum Likelihood Estimation (Layer 0B)
Builds on This
- Mixture Density Networks (Layer 3)