Beta: this content is under active construction and has not been peer-reviewed. Report errors on GitHub.

ML Methods

K-Means Clustering

Lloyd's algorithm for partitional clustering: the within-cluster sum of squares objective, convergence guarantees, k-means++ initialization, choosing k, and the connection to EM for Gaussians.

Core · Tier 1 · Stable · ~40 min

Why This Matters

K-means is the most widely used clustering algorithm in practice. Its simplicity (alternate between assigning points to the nearest centroids and updating the centroids) belies a rich mathematical structure: k-means is coordinate descent on a non-convex objective, it has clean convergence guarantees, and the k-means++ initialization provides a provable approximation ratio. Understanding k-means also provides the foundation for Gaussian mixture models: k-means is a hard-assignment, isotropic special case of EM.

[Figure: three iterations of Lloyd's algorithm, with centroids (+) and data points colored by cluster; the centroids converge as the assignments stabilize.]

Mental Model

You want to partition $n$ data points into $k$ clusters such that each point is close to its cluster center. K-means minimizes the total squared distance from each point to its assigned centroid. The algorithm alternates: (1) assign each point to the nearest centroid, (2) recompute each centroid as the mean of its assigned points. This monotonically decreases the objective and must converge.

The Objective

Definition

Within-Cluster Sum of Squares (WCSS)

Given data $\{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, a set of $k$ centroids $\{\mu_1, \ldots, \mu_k\}$, and an assignment function $c: \{1,\ldots,n\} \to \{1,\ldots,k\}$, the WCSS objective is:

$$J(c, \mu) = \sum_{i=1}^n \|x_i - \mu_{c(i)}\|^2$$

K-means seeks to minimize $J$ over both assignments $c$ and centroids $\mu$. This is NP-hard in general.
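The objective is a one-liner in code. A minimal NumPy sketch (the function name `wcss` and the toy data are illustrative, not from any library):

```python
import numpy as np

def wcss(X, centroids, assign):
    """Within-cluster sum of squares: sum_i ||x_i - mu_{c(i)}||^2."""
    return float(np.sum((X - centroids[assign]) ** 2))

# Toy example: two 1-D clusters with centroids at their midpoints,
# so every point contributes 0.5^2 = 0.25 to the objective.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centroids = np.array([[0.5], [10.5]])
assign = np.array([0, 0, 1, 1])
print(wcss(X, centroids, assign))  # 4 * 0.25 = 1.0
```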

Lloyd's Algorithm

Definition

Lloyd's Algorithm

Input: Data $\{x_1, \ldots, x_n\}$, number of clusters $k$, initial centroids $\mu_1^{(0)}, \ldots, \mu_k^{(0)}$.

Repeat until convergence:

  1. Assign: For each $i$, set $c(i) = \arg\min_j \|x_i - \mu_j\|^2$ (assign to nearest centroid)

  2. Update: For each $j$, set $\mu_j = \frac{1}{|C_j|}\sum_{i \in C_j} x_i$ where $C_j = \{i : c(i) = j\}$ (centroid = cluster mean)

Output: Assignments $c$ and centroids $\mu$.
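The two steps above can be sketched in a few lines of NumPy. A minimal implementation under illustrative assumptions (the function name `lloyd`, the toy blobs, and the empty-cluster handling are mine, not from a library):

```python
import numpy as np

def lloyd(X, centroids, max_iter=100):
    """Alternate assign / update steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assign: index of the nearest centroid for each point.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points
        # (an empty cluster simply keeps its old centroid here).
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign, centroids

rng = np.random.default_rng(0)
# Two well-separated blobs; seed one centroid in each.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
assign, mu = lloyd(X, X[[0, -1]].astype(float))
```

With this much separation the algorithm recovers the two blobs exactly and the final centroids sit near $(0,0)$ and $(5,5)$.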

K-Means as Coordinate Descent

The WCSS objective $J(c, \mu)$ depends on two sets of variables: assignments $c$ and centroids $\mu$. Lloyd's algorithm alternates minimizing over each while holding the other fixed:

  1. Assignment step (fix $\mu$, minimize over $c$): For each point $x_i$, assigning it to the nearest centroid minimizes its contribution $\|x_i - \mu_{c(i)}\|^2$. Since points are independent, this minimizes $J$ over $c$.

  2. Update step (fix $c$, minimize over $\mu$): For each cluster $j$, the centroid $\mu_j$ that minimizes $\sum_{i \in C_j}\|x_i - \mu_j\|^2$ is the mean $\bar{x}_j = \frac{1}{|C_j|}\sum_{i \in C_j} x_i$. This follows from setting the gradient to zero.

This is block coordinate descent on a non-convex objective.
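The optimality of the update step is easy to check numerically. A sketch (the setup is illustrative): by strict convexity, placing the centroid at the cluster mean beats every perturbed candidate, since $\text{cost}(\bar{x} + \delta) = \text{cost}(\bar{x}) + |C|\,\|\delta\|^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(50, 3))   # points of a single cluster
mean = C.mean(axis=0)

def cost(mu):
    """Cost of placing this cluster's centroid at mu."""
    return float(((C - mu) ** 2).sum())

# Every perturbation of the mean costs strictly more:
# cost(mean + delta) - cost(mean) = n * ||delta||^2 > 0.
worst_gap = min(cost(mean + rng.normal(scale=0.1, size=3)) - cost(mean)
                for _ in range(200))
print(worst_gap > 0)  # True: the mean is the unique minimizer
```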

Convergence

Theorem

K-Means Convergence

Statement

Lloyd's algorithm converges in a finite number of iterations. Specifically:

  1. The objective $J$ is non-increasing at every step
  2. Each step either strictly decreases $J$ or the algorithm has converged
  3. The number of distinct assignments is at most $k^n$, so the algorithm terminates in at most $k^n$ steps

Intuition

Each step (assign or update) can only decrease or maintain $J$, never increase it. Since the dataset is finite, there are finitely many possible assignments (at most $k^n$ partitions). The algorithm cannot cycle because $J$ strictly decreases whenever the assignment changes. Therefore it must terminate.

Proof Sketch

Assignment step decreases $J$: Reassigning point $i$ from cluster $j$ to cluster $j'$ with $\|x_i - \mu_{j'}\| < \|x_i - \mu_j\|$ strictly reduces $\|x_i - \mu_{c(i)}\|^2$, hence $J$.

Update step decreases $J$: For fixed assignments, the function $\mu_j \mapsto \sum_{i \in C_j}\|x_i - \mu_j\|^2$ is strictly convex with unique minimum at the mean. So updating $\mu_j$ to the cluster mean either decreases $J$ or leaves it unchanged (if $\mu_j$ was already the mean).

Since $J$ is bounded below by zero, strictly decreases whenever the state changes, and there are only finitely many possible states, the algorithm must terminate.
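The monotonicity is easy to verify numerically. A sketch (random data and all names here are illustrative) that records $J$ immediately after each assignment step:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
mu = X[rng.choice(60, size=3, replace=False)].copy()

history = []
for _ in range(15):
    # Assign step, recording the objective value it achieves.
    d2 = ((X[:, None] - mu[None]) ** 2).sum(-1)
    c = d2.argmin(1)
    history.append(float(d2[np.arange(60), c].sum()))
    # Update step: move each non-empty cluster's centroid to its mean.
    for j in range(3):
        if np.any(c == j):
            mu[j] = X[c == j].mean(0)

# Each sweep can only decrease (or preserve) the objective.
print(all(a >= b - 1e-9 for a, b in zip(history, history[1:])))  # True
```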

Why It Matters

Convergence is guaranteed, but convergence to the global optimum is not. K-means converges to a local minimum that depends on the initialization. This is why initialization matters so much and why k-means is typically run multiple times with different random seeds.

Failure Mode

Worst-case convergence can require exponentially many iterations ($k^n$), though this is almost never observed in practice. Typical convergence is fast (a few dozen iterations). However, the local minimum found may be arbitrarily worse than the global optimum without good initialization.

K-Means++ Initialization

Theorem

K-Means++ Approximation Guarantee

Statement

The k-means++ initialization procedure:

  1. Choose the first centroid $\mu_1$ uniformly at random from the data
  2. For $j = 2, \ldots, k$: choose $\mu_j = x_i$ with probability proportional to $\min_{l < j}\|x_i - \mu_l\|^2$

This yields initial centroids whose expected WCSS satisfies:

$$\mathbb{E}[J_{\text{init}}] \leq 8(\ln k + 2) \cdot J^*$$

where $J^*$ is the optimal WCSS. That is, k-means++ is $O(\log k)$-competitive in expectation.

Intuition

K-means++ chooses each new centroid with probability proportional to its squared distance from the nearest existing centroid. Points far from all current centroids are more likely to be chosen. This spreads the initial centroids across the data, avoiding the failure mode where random initialization places multiple centroids in the same cluster.
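The $D^2$-weighted sampling described above can be sketched directly (the function name `kmeanspp_init` and the blob data are illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """D^2 seeding: first centroid uniform, each later one drawn with
    probability proportional to squared distance from the nearest seed."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(3)
# Three tight, well-separated blobs: D^2 seeding should hit all three,
# whereas uniform seeding often puts two seeds in the same blob.
X = np.vstack([rng.normal(c, 0.1, (30, 2)) for c in [(0, 0), (10, 0), (0, 10)]])
seeds = kmeanspp_init(X, 3, rng)
```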

Proof Sketch

The key idea: after choosing the first centroid randomly, the expected cost of the points assigned to a single optimal cluster is at most $8 J^*_j$ (the optimal cost of that cluster). This uses the fact that the $D^2$-weighting ensures good coverage. Summing over clusters and accounting for the $k$ choices gives the $O(\ln k)$ factor via a coupon-collector-like argument.

Why It Matters

Before k-means++, common practice was random initialization (pick $k$ random data points), which has no approximation guarantee and frequently produces terrible clusterings. K-means++ gives a provable guarantee and is nearly free to compute. It is now the default initialization in scikit-learn and most implementations.

Failure Mode

The $O(\log k)$ guarantee is in expectation. Individual runs can still produce bad initializations. The standard practice is to run k-means++ multiple times (e.g., 10 runs) and keep the best result.
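That restart loop is a few lines of code. A sketch combining $D^2$ seeding with Lloyd iterations and keeping the lowest-WCSS run (everything here, including the blob data and function name, is illustrative):

```python
import numpy as np

def kmeans_pp_run(X, k, rng, iters=50):
    """One k-means run from a D^2-weighted seeding; returns (J, c, mu)."""
    mu = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - m) ** 2).sum(1) for m in mu], axis=0)
        mu.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    mu = np.array(mu)
    for _ in range(iters):
        c = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(0)
    return float(((X - mu[c]) ** 2).sum()), c, mu

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, (25, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

# Ten restarts; keep the run with the lowest WCSS.
runs = [kmeans_pp_run(X, 3, rng) for _ in range(10)]
best_J, best_c, best_mu = min(runs, key=lambda t: t[0])
```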

Choosing $k$

There is no universal method for choosing the number of clusters. Common approaches:

Elbow method: Plot WCSS vs. $k$ for $k = 1, 2, \ldots$. Look for an "elbow" where the WCSS stops decreasing rapidly. Subjective and often ambiguous.

Silhouette score: For each point, compute $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$ where $a(i)$ is the mean distance to points in the same cluster and $b(i)$ is the mean distance to points in the nearest other cluster. Average over all points. Choose $k$ maximizing the average silhouette (range $[-1, 1]$).
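The silhouette can be computed straight from the definition. A pure-NumPy sketch ($O(n^2)$ pairwise distances, fine for small $n$; all names and the toy data are illustrative):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # pairwise distances
    ks = np.unique(labels)
    s = []
    for i in range(n):
        own = labels[i]
        a = D[i, (labels == own) & (np.arange(n) != i)].mean()   # within-cluster
        b = min(D[i, labels == k].mean() for k in ks if k != own)  # nearest other
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
good = np.array([0] * 20 + [1] * 20)   # labels matching the two blobs
bad = np.arange(40) % 2                # labels that ignore the geometry
print(mean_silhouette(X, good) > mean_silhouette(X, bad))  # True
```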

Gap statistic: Compare $\log(\text{WCSS}_k)$ to its expectation under a null reference distribution (uniform). Choose the smallest $k$ where the gap exceeds a threshold. More principled than the elbow method.

Connection to EM for Gaussians

K-means is a special case of the EM algorithm for Gaussian mixture models where all clusters have the same isotropic covariance $\sigma^2 I$ and $\sigma^2 \to 0$:

| K-Means | Gaussian EM |
| --- | --- |
| Hard assignment (each point to one cluster) | Soft assignment (probabilities over clusters) |
| Centroid = cluster mean | Mean = weighted average (soft assignments) |
| All clusters same "shape" | Each cluster has its own covariance |
| Minimizes WCSS | Maximizes log-likelihood |

As the covariance shrinks to zero, the soft assignments in EM become hard (all probability mass on the nearest centroid), and EM reduces to k-means.
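This limit is easy to see numerically. A sketch of the E-step responsibilities for equal-weight isotropic Gaussians (the data and the name `responsibilities` are illustrative): shrinking $\sigma^2$ turns soft assignments into the one-hot assignments of k-means.

```python
import numpy as np

def responsibilities(X, mu, sigma2):
    """E-step for equal-weight isotropic Gaussians N(mu_j, sigma2 * I)."""
    d2 = ((X[:, None] - mu[None]) ** 2).sum(-1)
    logits = -d2 / (2 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.2, 0.0], [1.8, 0.0]])
mu = np.array([[0.0, 0.0], [2.0, 0.0]])

print(responsibilities(X, mu, 1.0))    # genuinely soft assignments
print(responsibilities(X, mu, 1e-4))   # essentially one-hot: k-means
```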

Common Confusions

Watch Out

K-means does NOT find the global optimum

K-means is guaranteed to converge, but to a local minimum of the WCSS objective, which depends on initialization. Different initializations produce different clusterings with different WCSS values. The global minimum of WCSS is NP-hard to find. K-means++ provides a good starting point but does not guarantee global optimality.

Watch Out

K-means assumes spherical clusters

K-means uses Euclidean distance and assigns each point to the nearest centroid, which implicitly assumes clusters are roughly spherical and equally sized. It performs poorly on elongated, non-convex, or differently-sized clusters. For such data, use Gaussian mixture models (which model elliptical clusters) or spectral clustering (which handles non-convex shapes).

Summary

  • K-means minimizes WCSS: $J = \sum_i \|x_i - \mu_{c(i)}\|^2$
  • Lloyd's algorithm: assign, update, repeat (coordinate descent)
  • Convergence: objective is monotonically non-increasing, terminates in finite steps
  • Converges to a local minimum, not the global one; initialization matters
  • K-means++ initialization: $O(\log k)$-competitive, seeds chosen proportional to $D^2$
  • K-means is hard-assignment EM for isotropic Gaussians

Exercises

ExerciseCore

Problem

Prove that the centroid update step (setting $\mu_j$ to the mean of cluster $C_j$) minimizes $\sum_{i \in C_j} \|x_i - \mu_j\|^2$ over $\mu_j$.

ExerciseAdvanced

Problem

Show that k-means with $k = n$ (one cluster per point) achieves $J = 0$, and k-means with $k = 1$ gives $\mu = \bar{x}$ with $J = \sum_i \|x_i - \bar{x}\|^2$ (the total variance). Explain why neither extreme is useful and why choosing $k$ requires a tradeoff.

References

Canonical:

  • Lloyd, "Least Squares Quantization in PCM" (1982). The original algorithm (written in 1957)
  • Arthur & Vassilvitskii, "K-Means++: The Advantages of Careful Seeding" (2007)

Current:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 14

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 22

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 9

  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapter 11

Last reviewed: April 2026
