ML Methods
K-Means Clustering
Lloyd's algorithm for partitional clustering: the within-cluster sum of squares objective, convergence guarantees, k-means++ initialization, choosing k, and the connection to EM for Gaussians.
Why This Matters
K-means is the most widely used clustering algorithm in practice. Its simplicity (alternate between assigning points to nearest centroids and updating centroids) belies a rich mathematical structure: k-means is coordinate descent on a non-convex objective, it has clean convergence guarantees, and the k-means++ initialization provides a provable approximation ratio. Understanding k-means also provides the foundation for Gaussian mixture models: k-means is a hard-assignment, isotropic special case of EM.
Mental Model
You want to partition data points into clusters such that each point is close to its cluster center. K-means minimizes the total squared distance from each point to its assigned centroid. The algorithm alternates: (1) assign each point to the nearest centroid, (2) recompute each centroid as the mean of its assigned points. This monotonically decreases the objective and must converge.
The Objective
Within-Cluster Sum of Squares (WCSS)
Given data $x_1, \dots, x_n \in \mathbb{R}^d$, a set of centroids $\mu_1, \dots, \mu_k \in \mathbb{R}^d$, and an assignment function $c : \{1, \dots, n\} \to \{1, \dots, k\}$, the WCSS objective is:

$$J(c, \mu) = \sum_{i=1}^{n} \| x_i - \mu_{c(i)} \|^2$$

K-means seeks to minimize $J$ over both assignments $c$ and centroids $\mu$. This is NP-hard in general.
Lloyd's Algorithm
Lloyd's Algorithm
Input: Data $x_1, \dots, x_n$, number of clusters $k$, initial centroids $\mu_1, \dots, \mu_k$.
Repeat until convergence:
- Assign: For each $i$, set $c(i) = \arg\min_j \| x_i - \mu_j \|^2$ (assign to nearest centroid)
- Update: For each $j$, set $\mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$ where $C_j = \{ i : c(i) = j \}$ (centroid = cluster mean)
Output: Assignments $c$ and centroids $\mu$.
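The alternation above can be sketched in NumPy; this is a minimal illustrative implementation (the function name, tolerance, and empty-cluster handling are my own choices, not from the source):

```python
import numpy as np

def lloyd(X, mu, n_iter=100, tol=1e-8):
    """Sketch of Lloyd's algorithm: alternate assign/update until centroids fix.

    X: (n, d) data matrix; mu: (k, d) initial centroids.
    Returns final assignments c and centroids mu.
    """
    for _ in range(n_iter):
        # Assign: squared Euclidean distance of every point to every centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        c = d2.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid, one common convention).
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(len(mu))])
        if np.allclose(new_mu, mu, atol=tol):  # centroids stopped moving
            mu = new_mu
            break
        mu = new_mu
    return c, mu
```

Each iteration costs $O(nkd)$ for the distance computation, which dominates the update step.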
K-Means as Coordinate Descent
The WCSS objective $J(c, \mu)$ depends on two sets of variables: assignments $c$ and centroids $\mu$. Lloyd's algorithm alternates minimizing over each while holding the other fixed:
- Assignment step (fix $\mu$, minimize over $c$): For each point $x_i$, assigning it to the nearest centroid minimizes its contribution $\| x_i - \mu_{c(i)} \|^2$. Since points are independent, this minimizes $J$ over $c$.
- Update step (fix $c$, minimize over $\mu$): For each cluster $j$, the centroid that minimizes $\sum_{i \in C_j} \| x_i - \mu_j \|^2$ is the mean $\frac{1}{|C_j|} \sum_{i \in C_j} x_i$. This follows from setting the gradient to zero.
This is block coordinate descent on a non-convex objective.
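The update-step claim (the cluster mean minimizes the within-cluster sum of squares) can be sanity-checked numerically; the points below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))   # stand-in for one cluster's points

def wcss(mu):
    # sum of squared distances from each point to a candidate centroid mu
    return ((pts - mu) ** 2).sum()

mean = pts.mean(axis=0)

# Random perturbations of the mean never beat the mean itself.
for _ in range(200):
    assert wcss(mean) <= wcss(mean + rng.normal(scale=0.5, size=3))
```

This is exactly the strict convexity used in the convergence proof below: any candidate other than the mean has strictly larger cost.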
Convergence
K-Means Convergence
Statement
Lloyd's algorithm converges in a finite number of iterations. Specifically:
- The objective $J$ is non-increasing at every step
- Each step either strictly decreases $J$ or the algorithm has converged
- The number of distinct assignments is at most $k^n$, so the algorithm terminates in at most $k^n$ steps
Intuition
Each step (assign or update) can only decrease or maintain $J$, never increase it. Since the dataset is finite, there are finitely many possible assignments (at most $k^n$ partitions). The algorithm cannot cycle because $J$ strictly decreases whenever the assignment changes. Therefore it must terminate.
Proof Sketch
Assignment step decreases $J$: Reassigning point $x_i$ from cluster $j$ to cluster $j'$ with $\| x_i - \mu_{j'} \|^2 < \| x_i - \mu_j \|^2$ strictly reduces its contribution, hence $J$.
Update step decreases $J$: For fixed assignments, $\mu_j \mapsto \sum_{i \in C_j} \| x_i - \mu_j \|^2$ is strictly convex with unique minimum at the mean. So updating $\mu_j$ to the cluster mean either decreases $J$ or leaves it unchanged (if $\mu_j$ was already the mean).
Since $J$ is non-negative and non-increasing at each step, with finitely many possible states, the algorithm terminates.
Why It Matters
Convergence is guaranteed, but convergence to the global optimum is not. K-means converges to a local minimum that depends on the initialization. This is why initialization matters so much and why k-means is typically run multiple times with different random seeds.
Failure Mode
Worst-case convergence can require exponentially many iterations in the number of points, though this is almost never observed in practice. Typical convergence is fast (a few dozen iterations). However, the local minimum found may be arbitrarily worse than the global optimum without good initialization.
K-Means++ Initialization
K-Means++ Approximation Guarantee
Statement
The k-means++ initialization procedure:
- Choose the first centroid $\mu_1$ uniformly at random from the data
- For $j = 2, \dots, k$: choose $\mu_j = x$ with probability proportional to $D(x)^2$, where $D(x)$ is the distance from $x$ to the nearest already-chosen centroid
This yields initial centroids whose expected WCSS satisfies:

$$\mathbb{E}[J] \le 8 (\ln k + 2) \, J^*$$

where $J^*$ is the optimal WCSS. That is, k-means++ is $O(\log k)$-competitive in expectation.
Intuition
K-means++ chooses each new centroid with probability proportional to its squared distance from the nearest existing centroid. Points far from all current centroids are more likely to be chosen. This spreads the initial centroids across the data, avoiding the failure mode where random initialization places multiple centroids in the same cluster.
Proof Sketch
The key idea: after choosing the first centroid randomly, the expected cost of the points in an optimal cluster that receives a centroid is within a constant factor of the optimal cost of that cluster. The $D^2$-weighting ensures that optimal clusters not yet covered are likely to receive the next centroid. Summing over clusters and accounting for the $k$ choices gives the $O(\log k)$ factor via a coupon-collector-like argument.
Why It Matters
Before k-means++, common practice was random initialization (pick $k$ random data points), which has no approximation guarantee and frequently produces terrible clusterings. K-means++ gives a provable guarantee and is nearly free to compute. It is now the default initialization in scikit-learn and most implementations.
Failure Mode
The guarantee is in expectation. Individual runs can still produce bad initializations. The standard practice is to run k-means++ multiple times (e.g., 10 runs) and keep the best result.
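The $D^2$ sampling procedure described above can be sketched as follows (function name and seed handling are illustrative choices, not from the source):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X by D^2 sampling."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = [X[rng.integers(n)]]               # first: uniform at random
    for _ in range(1, k):
        # D(x)^2: squared distance from each point to its nearest centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                      # proportional to D(x)^2
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)
```

Note how points coinciding with an existing centroid get probability zero, so well-separated clusters each tend to receive one seed.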
Choosing $k$
There is no universal method for choosing the number of clusters. Common approaches:
Elbow method: Plot WCSS vs $k$ for $k = 1, 2, \dots$. Look for an "elbow" where the WCSS stops decreasing rapidly. Subjective and often ambiguous.
Silhouette score: For each point, compute $s = \frac{b - a}{\max(a, b)}$, where $a$ is the mean distance to points in the same cluster and $b$ is the mean distance to points in the nearest other cluster. Average over all points. Choose the $k$ maximizing the average silhouette (range $[-1, 1]$).
Gap statistic: Compare $\log J(k)$ to its expectation under a null reference distribution (e.g., uniform over the data's bounding box). Choose the smallest $k$ where the gap exceeds a threshold. More principled than the elbow method.
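As an example of the silhouette approach, one can scan candidate values of $k$ with scikit-learn; the synthetic blobs here are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic blobs in 2D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # silhouette peaks at the true k here
```

For genuinely ambiguous data the silhouette curve is flatter and the choice of $k$ remains a judgment call.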
Connection to EM for Gaussians
K-means is a special case of the EM algorithm for Gaussian mixture models where all clusters share the same isotropic covariance $\sigma^2 I$ and $\sigma \to 0$:
| K-Means | Gaussian EM |
|---|---|
| Hard assignment (each point to one cluster) | Soft assignment (probabilities over clusters) |
| Centroid = cluster mean | Mean = weighted average (soft assignments) |
| All clusters same "shape" | Each cluster has its own covariance |
| Minimizes WCSS | Maximizes log-likelihood |
As the covariance shrinks to zero, the soft assignments in EM become hard (all probability mass on the nearest centroid), and EM reduces to k-means.
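This limiting behavior can be checked numerically: for equal-weight isotropic Gaussians with shared variance $\sigma^2$, the E-step responsibilities harden as $\sigma$ shrinks (the centroids and point below are illustrative):

```python
import numpy as np

def responsibilities(x, mus, sigma):
    """Posterior p(cluster j | x) for equal-weight isotropic Gaussians."""
    log_p = -((x - mus) ** 2).sum(axis=1) / (2 * sigma ** 2)
    log_p -= log_p.max()          # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum()

mus = np.array([[0.0, 0.0], [4.0, 0.0]])
x = np.array([1.0, 0.0])          # nearer to the first centroid

soft = responsibilities(x, mus, sigma=2.0)    # mass spread over both clusters
hard = responsibilities(x, mus, sigma=0.05)   # essentially one-hot
```

With large $\sigma$ both clusters receive appreciable weight; with small $\sigma$ nearly all the mass sits on the nearest centroid, recovering the hard k-means assignment.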
Common Confusions
K-means does NOT find the global optimum
K-means is guaranteed to converge, but to a local minimum of the WCSS objective, which depends on initialization. Different initializations produce different clusterings with different WCSS values. The global minimum of WCSS is NP-hard to find. K-means++ provides a good starting point but does not guarantee global optimality.
K-means assumes spherical clusters
K-means uses Euclidean distance and assigns each point to the nearest centroid, which implicitly assumes clusters are roughly spherical and equally sized. It performs poorly on elongated, non-convex, or differently-sized clusters. For such data, use Gaussian mixture models (which model elliptical clusters) or spectral clustering (which handles non-convex shapes).
Summary
- K-means minimizes WCSS: $J(c, \mu) = \sum_{i=1}^{n} \| x_i - \mu_{c(i)} \|^2$
- Lloyd's algorithm: assign, update, repeat (coordinate descent)
- Convergence: objective is monotonically non-increasing, terminates in finite steps
- Converges to a local minimum, not the global one; initialization matters
- K-means++ initialization: $O(\log k)$-competitive in expectation; choose each new centroid with probability proportional to $D(x)^2$
- K-means is hard-assignment EM for isotropic Gaussians
Exercises
Problem
Prove that the centroid update step (setting $\mu_j$ to the mean of cluster $C_j$) minimizes $\sum_{i \in C_j} \| x_i - \mu_j \|^2$ over $\mu_j$.
Problem
Show that k-means with $k = n$ (one cluster per point) achieves $J = 0$, and k-means with $k = 1$ gives $J = \sum_{i=1}^{n} \| x_i - \bar{x} \|^2$ (the total variance). Explain why neither extreme is useful and why choosing $k$ requires a tradeoff.
References
Canonical:
- Lloyd, "Least Squares Quantization in PCM" (1982). The original algorithm (written in 1957)
- Arthur & Vassilvitskii, "K-Means++: The Advantages of Careful Seeding" (2007)
Current:
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 14
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 22
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 9
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 11
Next Topics
The natural next step from k-means:
- Gaussian Mixture Models and EM: soft-assignment clustering with full covariance structure
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Gaussian Mixture Models and EM (Layer 2)
- Spectral Clustering (Layer 2)