Beta: this content is under active construction and has not been peer-reviewed. Report errors on GitHub.

ML Methods

K-Means Clustering

Lloyd's algorithm for partitional clustering: the within-cluster sum of squares objective, convergence guarantees, k-means++ initialization, choosing k, and the connection to EM for Gaussians.

Core · Tier 1 · Stable · ~40 min

Why This Matters

K-means is the most widely used clustering algorithm in practice. Its simplicity (alternate between assigning points to the nearest centroids and updating the centroids) belies a rich mathematical structure: k-means is coordinate descent on a non-convex objective, it has clean convergence guarantees, and the k-means++ initialization provides a provable approximation ratio. Understanding k-means also provides the foundation for Gaussian mixture models: k-means is a hard-assignment, isotropic special case of EM.

[Figure: three iterations of Lloyd's algorithm, with centroids (+) and data points colored by cluster; the centroids converge as the assignments stabilize.]

Mental Model

You want to partition $n$ data points into $k$ clusters such that each point is close to its cluster center. K-means minimizes the total squared distance from each point to its assigned centroid. The algorithm alternates: (1) assign each point to the nearest centroid, (2) recompute each centroid as the mean of its assigned points. This monotonically decreases the objective and must converge.

The Objective

Definition

Within-Cluster Sum of Squares (WCSS)

Given data $\{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, a set of $k$ centroids $\{\mu_1, \ldots, \mu_k\}$, and an assignment function $c: \{1,\ldots,n\} \to \{1,\ldots,k\}$, the WCSS objective is:

$$J(c, \mu) = \sum_{i=1}^n \|x_i - \mu_{c(i)}\|^2$$

K-means seeks to minimize $J$ over both assignments $c$ and centroids $\mu$. This is NP-hard in general.
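The objective is a one-liner in code. A minimal NumPy sketch (the function name `wcss` and the toy data are illustrative, not from any library):

```python
import numpy as np

def wcss(X, centroids, assign):
    """Within-cluster sum of squares: sum_i ||x_i - mu_{c(i)}||^2."""
    return float(np.sum((X - centroids[assign]) ** 2))

# Toy example: two 1-D clusters with centroids at their midpoints,
# so every point contributes 0.5^2 = 0.25 to the objective.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centroids = np.array([[0.5], [10.5]])
assign = np.array([0, 0, 1, 1])
print(wcss(X, centroids, assign))  # 4 * 0.25 = 1.0
```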

Lloyd's Algorithm

Definition

Lloyd's Algorithm

Input: Data $\{x_1, \ldots, x_n\}$, number of clusters $k$, initial centroids $\mu_1^{(0)}, \ldots, \mu_k^{(0)}$.

Repeat until convergence:

  1. Assign: For each $i$, set $c(i) = \arg\min_j \|x_i - \mu_j\|^2$ (assign to nearest centroid)

  2. Update: For each $j$, set $\mu_j = \frac{1}{|C_j|}\sum_{i \in C_j} x_i$ where $C_j = \{i : c(i) = j\}$ (centroid = cluster mean)

Output: Assignments $c$ and centroids $\mu$.
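The two steps above can be sketched in a few lines of NumPy. A minimal implementation under illustrative assumptions (the function name `lloyd`, the toy blobs, and the empty-cluster handling are mine, not from a library):

```python
import numpy as np

def lloyd(X, centroids, max_iter=100):
    """Alternate assign / update steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assign: index of the nearest centroid for each point.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points
        # (an empty cluster simply keeps its old centroid here).
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign, centroids

rng = np.random.default_rng(0)
# Two well-separated blobs; seed one centroid in each.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
assign, mu = lloyd(X, X[[0, -1]].astype(float))
```

With this much separation the algorithm recovers the two blobs exactly and the final centroids sit near $(0,0)$ and $(5,5)$.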

K-Means as Coordinate Descent

The WCSS objective $J(c, \mu)$ depends on two sets of variables: assignments $c$ and centroids $\mu$. Lloyd's algorithm alternates minimizing over each while holding the other fixed:

  1. Assignment step (fix $\mu$, minimize over $c$): For each point $x_i$, assigning it to the nearest centroid minimizes its contribution $\|x_i - \mu_{c(i)}\|^2$. Since points are independent, this minimizes $J$ over $c$.

  2. Update step (fix $c$, minimize over $\mu$): For each cluster $j$, the centroid $\mu_j$ that minimizes $\sum_{i \in C_j}\|x_i - \mu_j\|^2$ is the mean $\bar{x}_j = \frac{1}{|C_j|}\sum_{i \in C_j} x_i$. This follows from setting the gradient to zero.

This is block coordinate descent on a non-convex objective.
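The optimality of the update step is easy to check numerically. A sketch (the setup is illustrative): by strict convexity, placing the centroid at the cluster mean beats every perturbed candidate, since $\text{cost}(\bar{x} + \delta) = \text{cost}(\bar{x}) + |C|\,\|\delta\|^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(50, 3))   # points of a single cluster
mean = C.mean(axis=0)

def cost(mu):
    """Cost of placing this cluster's centroid at mu."""
    return float(((C - mu) ** 2).sum())

# Every perturbation of the mean costs strictly more:
# cost(mean + delta) - cost(mean) = n * ||delta||^2 > 0.
worst_gap = min(cost(mean + rng.normal(scale=0.1, size=3)) - cost(mean)
                for _ in range(200))
print(worst_gap > 0)  # True: the mean is the unique minimizer
```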

Convergence

Theorem

K-Means Convergence

Statement

Lloyd's algorithm converges in a finite number of iterations. Specifically:

  1. The objective $J$ is non-increasing at every step
  2. Each step either strictly decreases $J$ or the algorithm has converged
  3. The number of distinct assignments is at most $k^n$, so the algorithm terminates in at most $k^n$ steps

Intuition

Each step (assign or update) can only decrease or maintain $J$, never increase it. Since the dataset is finite, there are finitely many possible assignments (at most $k^n$ partitions). The algorithm cannot cycle because $J$ strictly decreases whenever the assignment changes. Therefore it must terminate.

Proof Sketch

Assignment step decreases $J$: Reassigning point $i$ from cluster $j$ to cluster $j'$ with $\|x_i - \mu_{j'}\| < \|x_i - \mu_j\|$ strictly reduces $\|x_i - \mu_{c(i)}\|^2$, hence $J$.

Update step decreases $J$: For fixed assignments, the function $\mu_j \mapsto \sum_{i \in C_j}\|x_i - \mu_j\|^2$ is strictly convex with unique minimum at the mean. So updating $\mu_j$ to the cluster mean either decreases $J$ or leaves it unchanged (if $\mu_j$ was already the mean).

Since $J$ is bounded below by zero, strictly decreases whenever the state changes, and there are only finitely many possible states, the algorithm must terminate.
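The monotonicity is easy to verify numerically. A sketch (random data and all names here are illustrative) that records $J$ immediately after each assignment step:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
mu = X[rng.choice(60, size=3, replace=False)].copy()

history = []
for _ in range(15):
    # Assign step, recording the objective value it achieves.
    d2 = ((X[:, None] - mu[None]) ** 2).sum(-1)
    c = d2.argmin(1)
    history.append(float(d2[np.arange(60), c].sum()))
    # Update step: move each non-empty cluster's centroid to its mean.
    for j in range(3):
        if np.any(c == j):
            mu[j] = X[c == j].mean(0)

# Each sweep can only decrease (or preserve) the objective.
print(all(a >= b - 1e-9 for a, b in zip(history, history[1:])))  # True
```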

Why It Matters

Convergence is guaranteed, but convergence to the global optimum is not. K-means converges to a local minimum that depends on the initialization. This is why initialization matters so much and why k-means is typically run multiple times with different random seeds.

Failure Mode

Worst-case convergence can require exponentially many iterations ($k^n$), though this is almost never observed in practice. Typical convergence is fast (a few dozen iterations). However, the local minimum found may be arbitrarily worse than the global optimum without good initialization.

K-Means++ Initialization

Theorem

K-Means++ Approximation Guarantee

Statement

The k-means++ initialization procedure:

  1. Choose the first centroid $\mu_1$ uniformly at random from the data
  2. For $j = 2, \ldots, k$: choose $\mu_j = x_i$ with probability proportional to $\min_{l < j}\|x_i - \mu_l\|^2$

This yields initial centroids whose expected WCSS satisfies:

$$\mathbb{E}[J_{\text{init}}] \leq 8(\ln k + 2) \cdot J^*$$

where $J^*$ is the optimal WCSS. That is, k-means++ is $O(\log k)$-competitive in expectation.

Intuition

K-means++ chooses each new centroid with probability proportional to its squared distance from the nearest existing centroid. Points far from all current centroids are more likely to be chosen. This spreads the initial centroids across the data, avoiding the failure mode where random initialization places multiple centroids in the same cluster.
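The $D^2$-weighted sampling described above can be sketched directly (the function name `kmeanspp_init` and the blob data are illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """D^2 seeding: first centroid uniform, each later one drawn with
    probability proportional to squared distance from the nearest seed."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(3)
# Three tight, well-separated blobs: D^2 seeding should hit all three,
# whereas uniform seeding often puts two seeds in the same blob.
X = np.vstack([rng.normal(c, 0.1, (30, 2)) for c in [(0, 0), (10, 0), (0, 10)]])
seeds = kmeanspp_init(X, 3, rng)
```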

Proof Sketch

The key idea: after choosing the first centroid randomly, the expected cost of the points assigned to a single optimal cluster is at most $8 J^*_j$ (the optimal cost of that cluster). This uses the fact that the $D^2$-weighting ensures good coverage. Summing over clusters and accounting for the $k$ choices gives the $O(\ln k)$ factor via a coupon-collector-like argument.

Why It Matters

Before k-means++, common practice was random initialization (pick $k$ random data points), which has no approximation guarantee and frequently produces terrible clusterings. K-means++ gives a provable guarantee and is nearly free to compute. It is now the default initialization in scikit-learn and most implementations.

Failure Mode

The $O(\log k)$ guarantee is in expectation. Individual runs can still produce bad initializations. The standard practice is to run k-means++ multiple times (e.g., 10 runs) and keep the best result.
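That restart loop is a few lines of code. A sketch combining $D^2$ seeding with Lloyd iterations and keeping the lowest-WCSS run (everything here, including the blob data and function name, is illustrative):

```python
import numpy as np

def kmeans_pp_run(X, k, rng, iters=50):
    """One k-means run from a D^2-weighted seeding; returns (J, c, mu)."""
    mu = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - m) ** 2).sum(1) for m in mu], axis=0)
        mu.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    mu = np.array(mu)
    for _ in range(iters):
        c = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(0)
    return float(((X - mu[c]) ** 2).sum()), c, mu

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, (25, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

# Ten restarts; keep the run with the lowest WCSS.
runs = [kmeans_pp_run(X, 3, rng) for _ in range(10)]
best_J, best_c, best_mu = min(runs, key=lambda t: t[0])
```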

Choosing $k$

There is no universal method for choosing the number of clusters. Common approaches:

Elbow method: Plot WCSS vs. $k$ for $k = 1, 2, \ldots$. Look for an "elbow" where the WCSS stops decreasing rapidly. Subjective and often ambiguous.

Silhouette score: For each point, compute $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$ where $a(i)$ is the mean distance to points in the same cluster and $b(i)$ is the mean distance to points in the nearest other cluster. Average over all points. Choose $k$ maximizing the average silhouette (range $[-1, 1]$).
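The silhouette can be computed straight from the definition. A pure-NumPy sketch ($O(n^2)$ pairwise distances, fine for small $n$; all names and the toy data are illustrative):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # pairwise distances
    ks = np.unique(labels)
    s = []
    for i in range(n):
        own = labels[i]
        a = D[i, (labels == own) & (np.arange(n) != i)].mean()   # within-cluster
        b = min(D[i, labels == k].mean() for k in ks if k != own)  # nearest other
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
good = np.array([0] * 20 + [1] * 20)   # labels matching the two blobs
bad = np.arange(40) % 2                # labels that ignore the geometry
print(mean_silhouette(X, good) > mean_silhouette(X, bad))  # True
```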

Gap statistic: Compare $\log(\text{WCSS}_k)$ to its expectation under a null reference distribution (uniform). Choose the smallest $k$ where the gap exceeds a threshold. More principled than the elbow method.

Connection to EM for Gaussians

K-means is a special case of the EM algorithm for Gaussian mixture models where all clusters have the same isotropic covariance $\sigma^2 I$ and $\sigma^2 \to 0$:

| K-Means | Gaussian EM |
| --- | --- |
| Hard assignment (each point to one cluster) | Soft assignment (probabilities over clusters) |
| Centroid = cluster mean | Mean = weighted average (soft assignments) |
| All clusters same "shape" | Each cluster has its own covariance |
| Minimizes WCSS | Maximizes log-likelihood |

As the covariance shrinks to zero, the soft assignments in EM become hard (all probability mass on the nearest centroid), and EM reduces to k-means.
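This limit is easy to see numerically. A sketch of the E-step responsibilities for equal-weight isotropic Gaussians (the data and the name `responsibilities` are illustrative): shrinking $\sigma^2$ turns soft assignments into the one-hot assignments of k-means.

```python
import numpy as np

def responsibilities(X, mu, sigma2):
    """E-step for equal-weight isotropic Gaussians N(mu_j, sigma2 * I)."""
    d2 = ((X[:, None] - mu[None]) ** 2).sum(-1)
    logits = -d2 / (2 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.2, 0.0], [1.8, 0.0]])
mu = np.array([[0.0, 0.0], [2.0, 0.0]])

print(responsibilities(X, mu, 1.0))    # genuinely soft assignments
print(responsibilities(X, mu, 1e-4))   # essentially one-hot: k-means
```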

Common Confusions

Watch Out

K-means does NOT find the global optimum

K-means is guaranteed to converge, but to a local minimum of the WCSS objective, which depends on initialization. Different initializations produce different clusterings with different WCSS values. The global minimum of WCSS is NP-hard to find. K-means++ provides a good starting point but does not guarantee global optimality.

Watch Out

K-means assumes spherical clusters

K-means uses Euclidean distance and assigns each point to the nearest centroid, which implicitly assumes clusters are roughly spherical and equally sized. It performs poorly on elongated, non-convex, or differently-sized clusters. For such data, use Gaussian mixture models (which model elliptical clusters) or spectral clustering (which handles non-convex shapes).

Summary

  • K-means minimizes WCSS: $J = \sum_i \|x_i - \mu_{c(i)}\|^2$
  • Lloyd's algorithm: assign, update, repeat (coordinate descent)
  • Convergence: objective is monotonically non-increasing, terminates in finite steps
  • Converges to a local minimum, not the global one; initialization matters
  • K-means++ initialization: $O(\log k)$-competitive, seeds chosen proportional to $D^2$
  • K-means is hard-assignment EM for isotropic Gaussians

Exercises

ExerciseCore

Problem

Prove that the centroid update step (setting $\mu_j$ to the mean of cluster $C_j$) minimizes $\sum_{i \in C_j} \|x_i - \mu_j\|^2$ over $\mu_j$.

ExerciseAdvanced

Problem

Show that k-means with $k = n$ (one cluster per point) achieves $J = 0$, and k-means with $k = 1$ gives $\mu = \bar{x}$ with $J = \sum_i \|x_i - \bar{x}\|^2$ (the total variance). Explain why neither extreme is useful and why choosing $k$ requires a tradeoff.

References

Canonical:

  • Lloyd, "Least Squares Quantization in PCM" (1982). The original algorithm (written in 1957)
  • Arthur & Vassilvitskii, "K-Means++: The Advantages of Careful Seeding" (2007)

Current:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 14

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 22

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 9

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 11

Last reviewed: April 2026
