ML Methods
Spectral Clustering
Clustering via the eigenvectors of a graph Laplacian: embed data using the bottom eigenvectors, then run k-means in the embedding space. Finds non-convex clusters that k-means alone cannot.
Why This Matters
K-means clustering assigns points to the nearest centroid, which implicitly assumes clusters are convex and roughly spherical. Many real datasets have clusters that wrap around each other (two concentric rings, spiral shapes). K-means fails completely on these.
Spectral clustering solves this by first transforming the data using the eigenvectors of a graph Laplacian, then running k-means in the transformed space. The spectral embedding "unrolls" non-convex clusters into linearly separable ones.
Formal Setup
Given data points $x_1, \dots, x_n \in \mathbb{R}^d$, construct a similarity graph with adjacency matrix $W \in \mathbb{R}^{n \times n}$, where $W_{ij} \ge 0$ measures the similarity between $x_i$ and $x_j$. Common choices:
- Gaussian kernel: $W_{ij} = \exp\!\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$
- k-nearest neighbors: $W_{ij} = 1$ if $x_j$ is among the $k$ nearest neighbors of $x_i$ (symmetrized)
- epsilon-neighborhood: $W_{ij} = 1$ if $\|x_i - x_j\| \le \varepsilon$
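These constructions are a few lines of NumPy. A minimal dense sketch (the function names `gaussian_affinity` and `knn_affinity` are ours, not from a library; $\sigma$ and $k$ are left as parameters):

```python
import numpy as np

def gaussian_affinity(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with no self-loops."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def knn_affinity(X, k):
    """Binary k-NN graph, symmetrized with max so W stays symmetric."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dists, np.inf)        # never count a point as its own neighbor
    nn = np.argsort(sq_dists, axis=1)[:, :k]  # indices of each point's k nearest neighbors
    W = np.zeros_like(sq_dists)
    W[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                 # symmetrize: keep edge if either endpoint chose it
```

The dense pairwise-distance tensor is $O(n^2 d)$ memory, so this sketch is only for small $n$.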
Graph Laplacian
Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$. The unnormalized graph Laplacian is:
$$L = D - W$$
The normalized graph Laplacian (symmetric version) is:
$$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$$
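Both Laplacians follow directly from the definitions. A sketch (the clamp on the degrees is just a guard against isolated nodes):

```python
import numpy as np

def laplacians(W):
    """Return (L, L_sym) for a symmetric affinity matrix W.

    L     = D - W
    L_sym = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}
    """
    d = W.sum(axis=1)                                   # degrees D_ii
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))    # guard against isolated nodes
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return L, L_sym
```

Note $L_{\mathrm{sym}}$ is computed by scaling the rows and columns of $W$ elementwise rather than by forming $D^{-1/2}$ explicitly, which is cheaper and numerically cleaner.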
Main Theorems
Properties of the Graph Laplacian
Statement
The unnormalized Laplacian $L = D - W$ satisfies:
- $L$ is symmetric positive semidefinite.
- The smallest eigenvalue is 0, with eigenvector $\mathbf{1}$ (the all-ones vector).
- The multiplicity of eigenvalue 0 equals the number of connected components of the graph.
- For any $f \in \mathbb{R}^n$: $f^\top L f = \frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2$.
Intuition
Property 4 is the key. $f^\top L f$ is small when $f$ assigns similar values to connected nodes. The eigenvectors of $L$ with small eigenvalues are therefore smooth functions on the graph: points in the same cluster get similar eigenvector coordinates. This is why the eigenvector embedding works for clustering.
Proof Sketch
For property 4: expand $f^\top L f = f^\top D f - f^\top W f = \sum_i D_{ii} f_i^2 - \sum_{i,j} W_{ij} f_i f_j$. Substituting $D_{ii} = \sum_j W_{ij}$ and regrouping gives $\frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2$. Positive semidefiniteness follows because this is a sum of nonnegative terms. For property 3: if the graph has $k$ connected components, construct the $k$ indicator vectors (1 on one component, 0 elsewhere); these are orthogonal eigenvectors with eigenvalue 0.
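Both the quadratic-form identity and the component-counting property are easy to check numerically. A sanity-check sketch on a random weighted graph and a two-component version of it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Property 4: f^T L f = (1/2) sum_ij W_ij (f_i - f_j)^2 for a random symmetric W.
A = rng.random((6, 6))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
f = rng.standard_normal(6)
quad = f @ L @ f
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(6) for j in range(6))

# Property 3: a block-diagonal W (two connected components) gives eigenvalue 0
# with multiplicity exactly 2.
W2 = np.zeros((6, 6))
W2[:3, :3] = W[:3, :3]
W2[3:, 3:] = W[3:, 3:]
L2 = np.diag(W2.sum(axis=1)) - W2
eigvals = np.sort(np.linalg.eigvalsh(L2))
```

Here `quad` and `pairwise` agree to machine precision, and `eigvals` has exactly two (numerically) zero entries.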
Why It Matters
The quadratic form $f^\top L f$ turns graph clustering into a continuous optimization problem: find orthonormal vectors $f_1, \dots, f_k$ minimizing $\sum_m f_m^\top L f_m$. By the variational characterization of eigenvalues, the minimizer is the set of bottom $k$ eigenvectors, so the relaxed problem is an eigenvalue problem with a known solution.
Failure Mode
The Laplacian depends heavily on the similarity graph construction. A bad choice of $\sigma$ in the Gaussian kernel or of $k$ in the nearest-neighbor graph can merge distinct clusters or fragment a single cluster. There is no universal rule for choosing these hyperparameters.
Cheeger Inequality
Statement
Let $\lambda_2$ be the second smallest eigenvalue of $L_{\mathrm{sym}}$ (the Fiedler value) and $h_G$ the Cheeger constant (minimum normalized cut):
$$h_G = \min_{\emptyset \ne S \subset V} \frac{\operatorname{cut}(S, \bar{S})}{\min(\operatorname{vol}(S), \operatorname{vol}(\bar{S}))}$$
where $\operatorname{cut}(S, \bar{S}) = \sum_{i \in S, j \in \bar{S}} W_{ij}$ and $\operatorname{vol}(S) = \sum_{i \in S} D_{ii}$. Then:
$$\frac{h_G^2}{2} \;\le\; \lambda_2 \;\le\; 2\,h_G$$
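On a graph small enough to enumerate every cut, both sides of the Cheeger inequality can be verified by brute force. A sketch using two triangles joined by a single bridge edge (an illustrative graph, not from the text):

```python
import numpy as np
from itertools import combinations

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2,3).
n = 6
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
d = W.sum(axis=1)

# Second smallest eigenvalue of L_sym = I - D^{-1/2} W D^{-1/2}.
L_sym = np.eye(n) - W / np.sqrt(np.outer(d, d))
lam2 = np.sort(np.linalg.eigvalsh(L_sym))[1]

# Brute-force Cheeger constant: minimize conductance over all nontrivial subsets S.
def conductance(S):
    T = [v for v in range(n) if v not in S]
    cut = W[np.ix_(list(S), T)].sum()
    return cut / min(d[list(S)].sum(), d[T].sum())

h = min(conductance(set(S)) for r in range(1, n) for S in combinations(range(n), r))
```

The best cut removes the bridge: $\operatorname{cut} = 1$ and each side has volume 7, so $h_G = 1/7$, and $\lambda_2$ lands inside $[h_G^2/2,\ 2 h_G]$ as the theorem requires.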
Intuition
$\lambda_2$ approximates the minimum normalized cut. When $\lambda_2$ is small, there exists a near-partition that cuts few edges. The Fiedler vector (the eigenvector for $\lambda_2$) approximately encodes this partition. Thresholding the Fiedler vector gives a near-optimal 2-way cut.
Proof Sketch
The upper bound $\lambda_2 \le 2 h_G$ follows from the variational characterization of $\lambda_2$ as $\min_{f \perp D^{1/2}\mathbf{1}} \frac{f^\top L_{\mathrm{sym}} f}{f^\top f}$, using the (recentered) indicator vector of the optimal Cheeger cut as a test function in the Rayleigh quotient. The lower bound $h_G^2/2 \le \lambda_2$ is the harder direction: sweeping a threshold across the Fiedler vector produces a cut of conductance at most $\sqrt{2\lambda_2}$.
Why It Matters
This theorem gives spectral clustering its theoretical justification. Finding the minimum normalized cut is NP-hard, but the spectral relaxation (computing the Fiedler vector and thresholding it) gives a polynomial-time approximation with a guaranteed quality bound.
Failure Mode
The Cheeger inequality is loose. The quadratic gap between the two bounds ($h_G$ can be anywhere between $\lambda_2/2$ and $\sqrt{2\lambda_2}$) means the spectral approximation can be far from the true minimum cut. In high-dimensional data with noise, the spectral gap may be too small for reliable clustering.
The Algorithm
Spectral clustering, stated here with the unnormalized Laplacian (the row-normalization step comes from the Ng, Jordan & Weiss variant):
- Construct the similarity matrix $W$ and compute $L = D - W$.
- Compute the eigenvectors $v_1, \dots, v_k$ corresponding to the $k$ smallest eigenvalues of $L$.
- Form the matrix $V \in \mathbb{R}^{n \times k}$ with columns $v_1, \dots, v_k$.
- Row-normalize $V$ so each row has unit norm.
- Run k-means on the rows of $V$.
The normalized variant, due to Ng, Jordan & Weiss, uses $L_{\mathrm{sym}} = I - D^{-1/2} W D^{-1/2}$ in place of $L$. In practice, the normalized version is almost always preferred because it accounts for varying node degrees.
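The whole pipeline fits in a short NumPy sketch. This uses the normalized variant with $L_{\mathrm{sym}}$ and a bare-bones Lloyd's k-means; the function name `spectral_clustering` and the tolerances are ours, and this is illustrative rather than production code:

```python
import numpy as np

def spectral_clustering(W, k, n_iter=50, seed=0):
    """Embed with the bottom-k eigenvectors of L_sym, row-normalize, run k-means."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)      # eigh returns eigenvalues in ascending order
    V = vecs[:, :k].copy()               # bottom-k eigenvectors as columns
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)  # row-normalize

    # Plain Lloyd's k-means on the embedded rows.
    rng = np.random.default_rng(seed)
    centers = V[rng.choice(len(V), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = V[labels == c].mean(axis=0)
    return labels
```

`np.linalg.eigh` is fine at this scale; for large sparse graphs you would compute only the few smallest eigenpairs with `scipy.sparse.linalg.eigsh` instead.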
Common Confusions
Spectral clustering is not PCA on a kernel matrix
PCA finds the top eigenvectors of the data covariance (or of a centered kernel matrix). Spectral clustering finds the bottom eigenvectors of the graph Laplacian. The Laplacian depends on the degree matrix $D$, which PCA ignores. The two methods coincide only in special cases.
Number of clusters must still be chosen
Spectral clustering does not automatically determine $k$. You still need to choose the number of clusters. The eigengap heuristic (choose $k$ so that the gap $\lambda_{k+1} - \lambda_k$ is largest) is widely used but has no formal guarantee for noisy finite-sample data.
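The eigengap heuristic is a one-liner once the spectrum is in hand. A sketch (the name `eigengap_k` is ours; `k_max` just bounds the search):

```python
import numpy as np

def eigengap_k(W, k_max=8):
    """Pick k where the gap lambda_{k+1} - lambda_k of L_sym is largest."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam = np.sort(np.linalg.eigvalsh(L_sym))[:k_max + 1]
    gaps = np.diff(lam)               # gaps[i] = lambda_{i+2} - lambda_{i+1} (0-indexed)
    return int(np.argmax(gaps)) + 1   # +1 converts the 0-based gap index to k
```

On a graph of $k$ well-separated components the bottom $k$ eigenvalues are near 0 and $\lambda_{k+1}$ jumps, so the largest gap sits exactly at $k$.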
Canonical Examples
Two concentric rings
Generate 500 points on two concentric circles with radii 1 and 3, plus small Gaussian noise. K-means with $k = 2$ splits the data along a line, misclassifying roughly half the points. Spectral clustering with a Gaussian kernel (suitably small $\sigma$) and the 2 smallest Laplacian eigenvectors separates the rings perfectly. The spectral embedding maps the inner and outer rings to two linearly separable point clouds.
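The embedding step alone already shows the effect: the bottom two Laplacian eigenvectors collapse each ring to (nearly) a single point. A sketch assuming $\sigma = 0.2$ and noise level 0.05 (illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 250  # points per ring

def noisy_ring(radius):
    theta = rng.uniform(0, 2 * np.pi, n)
    pts = radius * np.c_[np.cos(theta), np.sin(theta)]
    return pts + 0.05 * rng.standard_normal(pts.shape)   # small Gaussian noise

X = np.vstack([noisy_ring(1.0), noisy_ring(3.0)])        # inner ring, then outer ring

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / (2 * 0.2 ** 2))                   # Gaussian kernel, sigma = 0.2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

_, vecs = np.linalg.eigh(L)
V = vecs[:, :2]                                          # bottom two eigenvectors
R = V / np.linalg.norm(V, axis=1, keepdims=True)         # row-normalized embedding

inner_mean, outer_mean = R[:n].mean(axis=0), R[n:].mean(axis=0)
```

With this $\sigma$ the inter-ring similarities are essentially zero, so each ring's rows in `R` cluster tightly around one of two well-separated unit vectors, and any k-means run on `R` separates the rings.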
Key Takeaways
- The graph Laplacian quadratic form $f^\top L f = \frac{1}{2}\sum_{i,j} W_{ij}(f_i - f_j)^2$ penalizes assigning dissimilar values to strongly connected nodes
- The bottom eigenvectors of $L$ encode cluster structure
- The Cheeger inequality links $\lambda_2$ to the minimum normalized cut
- The normalized Laplacian is preferred in practice to handle degree variation
- Spectral clustering excels on non-convex clusters but requires choosing $k$ and the similarity hyperparameters ($\sigma$, $\varepsilon$, or the number of neighbors)
Exercises
Problem
Show that if a graph has exactly 2 connected components, the second eigenvector of $L$ (the Fiedler vector) is piecewise constant: it assigns a constant value $a$ to nodes in one component and a constant value $b \ne a$ to nodes in the other (for some constants $a, b$).
Problem
Consider points sampled from a mixture of two well-separated Gaussians in $\mathbb{R}^d$. The Gaussian kernel similarity matrix has entries $W_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$. Explain qualitatively why a poorly chosen $\sigma$ (too large or too small) causes spectral clustering to fail. What happens in each case?
References
Canonical:
- von Luxburg, "A Tutorial on Spectral Clustering" (2007), Sections 1-7
- Ng, Jordan, Weiss, "On Spectral Clustering: Analysis and an Algorithm" (NIPS 2001)
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 22
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009)
- Bishop, Pattern Recognition and Machine Learning (2006)
- Murphy, Machine Learning: A Probabilistic Perspective (2012)
Next Topics
- Kernels and RKHS: the function space view of similarity
- Principal component analysis: another eigenvector-based method, for different purposes
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- K-Means Clustering (Layer 1)
- Common Probability Distributions (Layer 0A)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)