
ML Methods

Spectral Clustering

Clustering via the eigenvectors of a graph Laplacian: embed data using the bottom eigenvectors, then run k-means in the embedding space. Finds non-convex clusters that k-means alone cannot.


Why This Matters

K-means clustering assigns points to the nearest centroid, which implicitly assumes clusters are convex and roughly spherical. Many real datasets have clusters that wrap around each other (two concentric rings, spiral shapes). K-means fails completely on these.

Spectral clustering solves this by first transforming the data using the eigenvectors of a graph Laplacian, then running k-means in the transformed space. The spectral embedding "unrolls" non-convex clusters into linearly separable ones.

Formal Setup

Given $n$ data points $x_1, \ldots, x_n$, construct a similarity graph with adjacency matrix $W$, where $W_{ij} \geq 0$ measures the similarity between $x_i$ and $x_j$. Common choices:

  • Gaussian kernel: $W_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$
  • k-nearest neighbors: $W_{ij} = 1$ if $x_j$ is among the $k$ nearest neighbors of $x_i$ (symmetrized)
  • epsilon-neighborhood: $W_{ij} = 1$ if $\|x_i - x_j\| < \epsilon$
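The first two constructions can be sketched in a few lines of NumPy. This is a dense-matrix teaching sketch, not library code; the function names are ours:

```python
import numpy as np

def gaussian_kernel_graph(X, sigma):
    """Dense similarity matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)            # no self-loops
    return W

def knn_graph(X, k):
    """Symmetrized k-nearest-neighbor graph: W_ij = 1 if x_j is among
    the k nearest neighbors of x_i, then W = max(W, W^T)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)  # exclude self from neighbor search
    W = np.zeros_like(sq_dists)
    nn = np.argsort(sq_dists, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nn.ravel()] = 1.0
    return np.maximum(W, W.T)           # symmetrize
```

For large $n$, one would use sparse matrices and a neighbor index instead of the $O(n^2)$ dense distance matrix.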
Definition

Graph Laplacian

Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$. The unnormalized graph Laplacian is:

$$L = D - W$$

The normalized graph Laplacian (symmetric version) is:

$$L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$$
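Both Laplacians are direct translations of the definitions above; a minimal sketch (the small epsilon guards against isolated nodes with zero degree):

```python
import numpy as np

def laplacians(W):
    """Unnormalized and symmetric normalized graph Laplacians of W."""
    d = W.sum(axis=1)                     # degrees D_ii = sum_j W_ij
    L = np.diag(d) - W                    # L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return L, L_sym
```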

Main Theorems

Proposition

Properties of the Graph Laplacian

Statement

The unnormalized Laplacian $L = D - W$ satisfies:

  1. $L$ is symmetric positive semidefinite.
  2. The smallest eigenvalue is 0, with eigenvector $\mathbf{1}$ (the all-ones vector).
  3. The multiplicity of eigenvalue 0 equals the number of connected components of the graph.
  4. For any $f \in \mathbb{R}^n$: $f^\top L f = \frac{1}{2} \sum_{i,j} W_{ij}(f_i - f_j)^2$.
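
Properties 1, 2, and 4 are easy to verify numerically on a random weighted graph; a quick sanity-check sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
W = (A + A.T) / 2                  # symmetric, nonnegative weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Property 2: the all-ones vector is in the null space of L
assert np.allclose(L @ np.ones(n), 0.0)

# Property 1: L is positive semidefinite (eigenvalues >= 0)
eigvals = np.linalg.eigvalsh(L)
assert eigvals.min() > -1e-10

# Property 4: f^T L f = 0.5 * sum_ij W_ij (f_i - f_j)^2
f = rng.standard_normal(n)
assert np.isclose(f @ L @ f,
                  0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2))
```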

Intuition

Property 4 is the key. $f^\top L f$ is small when $f$ assigns similar values to connected nodes. The eigenvectors of $L$ with small eigenvalues are therefore smooth functions on the graph: points in the same cluster get similar eigenvector coordinates. This is why the eigenvector embedding works for clustering.

Proof Sketch

For property 4: expand $f^\top L f = f^\top D f - f^\top W f = \sum_i d_i f_i^2 - \sum_{i,j} W_{ij} f_i f_j$. Using $d_i = \sum_j W_{ij}$, rewrite this as $\frac{1}{2}\sum_{i,j} W_{ij}(f_i^2 + f_j^2 - 2 f_i f_j)$. Positive semidefiniteness follows because this is a sum of nonnegative terms. For property 3: if the graph has $k$ connected components, construct $k$ indicator vectors (1 on one component, 0 elsewhere); these are orthogonal eigenvectors with eigenvalue 0.
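Property 3 can be seen concretely on a disconnected graph; a sketch using two disjoint triangles (our own toy example):

```python
import numpy as np

# Two disconnected triangles: block-diagonal adjacency matrix
tri = np.ones((3, 3)) - np.eye(3)
W = np.block([[tri, np.zeros((3, 3))],
              [np.zeros((3, 3)), tri]])
L = np.diag(W.sum(axis=1)) - W

# Eigenvalue 0 has multiplicity 2 = number of connected components
eigvals = np.sort(np.linalg.eigvalsh(L))
assert np.allclose(eigvals[:2], 0.0, atol=1e-10)  # two zero eigenvalues
assert eigvals[2] > 1e-8                           # the rest are positive
```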

Why It Matters

The quadratic form $f^\top L f$ turns graph clustering into a continuous optimization problem: find vectors $f$ that minimize the Laplacian quadratic form subject to orthogonality constraints. This is an eigenvalue problem with a known solution.

Failure Mode

The Laplacian depends heavily on the similarity graph construction. A bad choice of $\sigma$ in the Gaussian kernel or $k$ in the nearest neighbor graph can merge distinct clusters or fragment a single cluster. There is no universal rule for choosing these hyperparameters.

Theorem

Cheeger Inequality

Statement

Let $\lambda_2$ be the second-smallest eigenvalue of $L_{\text{sym}}$ (the Fiedler value) and $h(G)$ the Cheeger constant (minimum normalized cut):

$$h(G) = \min_{\emptyset \neq S \subset V} \frac{\text{cut}(S, \bar{S})}{\min(\text{vol}(S), \text{vol}(\bar{S}))}$$

Then:

$$\frac{\lambda_2}{2} \leq h(G) \leq \sqrt{2 \lambda_2}$$
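
Both sides of the inequality can be checked numerically by brute-forcing the Cheeger constant on a small graph; a sketch on a toy graph of our own (two triangles joined by one bridge edge):

```python
import itertools
import numpy as np

# Toy graph: two triangles joined by a single bridge edge
tri = np.ones((3, 3)) - np.eye(3)
W = np.block([[tri, np.zeros((3, 3))], [np.zeros((3, 3)), tri]])
W[2, 3] = W[3, 2] = 1.0
n = len(W)
d = W.sum(axis=1)

# Brute-force the Cheeger constant over all nontrivial subsets S
h = np.inf
for r in range(1, n):
    for S in itertools.combinations(range(n), r):
        mask = np.zeros(n, dtype=bool)
        mask[list(S)] = True
        cut = W[mask][:, ~mask].sum()                  # cut(S, S-bar)
        vol = min(d[mask].sum(), d[~mask].sum())       # min(vol(S), vol(S-bar))
        h = min(h, cut / vol)

# lambda_2 of the normalized Laplacian
d_is = 1.0 / np.sqrt(d)
L_sym = np.eye(n) - d_is[:, None] * W * d_is[None, :]
lam2 = np.sort(np.linalg.eigvalsh(L_sym))[1]

# Cheeger inequality: lambda_2 / 2 <= h(G) <= sqrt(2 * lambda_2)
assert lam2 / 2 <= h <= np.sqrt(2 * lam2)
```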

Intuition

$\lambda_2$ approximates the minimum normalized cut. When $\lambda_2$ is small, there exists a near-partition that cuts few edges. The Fiedler vector (the eigenvector for $\lambda_2$) approximately encodes this partition. Thresholding the Fiedler vector gives a near-optimal 2-way cut.
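Thresholding the Fiedler vector can be demonstrated on a "barbell" graph of our own design (two 4-cliques joined by one edge); here we use the unnormalized Laplacian for simplicity:

```python
import numpy as np

# Two 4-node cliques joined by a single bridge edge
n = 8
W = np.zeros((n, n))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 1.0                # the bridge

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                # eigenvector of second-smallest eigenvalue

# Thresholding the Fiedler vector at 0 recovers the two cliques
labels = (fiedler > 0).astype(int)
assert len(set(labels[:4])) == 1 and len(set(labels[4:])) == 1
assert labels[0] != labels[7]
```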

Proof Sketch

The lower bound follows from the variational characterization of $\lambda_2$ as $\min_{f \perp D^{1/2}\mathbf{1}} f^\top L_{\text{sym}} f / f^\top f$. The upper bound uses the indicator vector of the optimal Cheeger cut as a test function in the Rayleigh quotient.

Why It Matters

This theorem gives spectral clustering its theoretical justification. Finding the minimum normalized cut is NP-hard, but the spectral relaxation (computing $\lambda_2$ and thresholding the Fiedler vector) gives a polynomial-time approximation with a guaranteed quality bound.

Failure Mode

The Cheeger inequality is loose. The quadratic gap between the lower bound $\lambda_2/2$ and the upper bound $\sqrt{2\lambda_2}$ means the spectral approximation can be far from the true minimum cut. In high-dimensional data with noise, the spectral gap may be too small for reliable clustering.

The Algorithm

Normalized spectral clustering (Ng, Jordan, Weiss variant):

  1. Construct the similarity matrix $W$ and compute $L_{\text{sym}} = I - D^{-1/2} W D^{-1/2}$.
  2. Compute the $k$ eigenvectors $u_1, \ldots, u_k$ corresponding to the $k$ smallest eigenvalues of $L_{\text{sym}}$.
  3. Form the matrix $U \in \mathbb{R}^{n \times k}$ with columns $u_1, \ldots, u_k$.
  4. Row-normalize $U$ so each row has unit norm.
  5. Run k-means on the rows of $U$.

The unnormalized variant uses $L = D - W$ and skips the row normalization. In practice, the normalized version is almost always preferred because it accounts for varying node degrees.
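The five steps can be sketched end-to-end in NumPy. This is a teaching sketch, not production code: the function name is ours, and a tiny k-means (farthest-first initialization plus Lloyd updates) stands in for a library call:

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, n_iter=100, seed=0):
    """Normalized spectral clustering, Ng-Jordan-Weiss style (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Step 1: Gaussian similarity graph and normalized Laplacian L_sym
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_is = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L_sym = np.eye(n) - d_is[:, None] * W * d_is[None, :]

    # Steps 2-3: embedding from the k smallest eigenvectors
    _, vecs = np.linalg.eigh(L_sym)            # eigenvalues in ascending order
    U = vecs[:, :k].copy()

    # Step 4: row-normalize the embedding
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # Step 5: k-means on the rows of U (farthest-first init, Lloyd updates)
    centers = [U[rng.integers(n)]]
    for _ in range(1, k):
        dists = np.min(((U[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(U[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

In practice one would use a sparse eigensolver and a robust k-means (e.g. multiple restarts) rather than this minimal version.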

Common Confusions

Watch Out

Spectral clustering is not PCA on a kernel matrix

PCA finds the top eigenvectors of the data covariance (or centered kernel matrix). Spectral clustering finds the bottom eigenvectors of the graph Laplacian. The Laplacian depends on the degree matrix $D$, which PCA ignores. The two methods coincide only in special cases.

Watch Out

Number of clusters must still be chosen

Spectral clustering does not automatically determine $k$. You still need to choose the number of clusters. The eigengap heuristic (choose $k$ where $\lambda_{k+1} - \lambda_k$ is largest) is widely used but has no formal guarantee for noisy finite-sample data.
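The eigengap heuristic is a few lines of NumPy; a sketch (the function name is ours, and this is a heuristic, not a guarantee):

```python
import numpy as np

def eigengap_k(L_sym, k_max=10):
    """Pick k where the gap lambda_{k+1} - lambda_k among the smallest
    eigenvalues of L_sym is largest."""
    eigvals = np.sort(np.linalg.eigvalsh(L_sym))[:k_max]
    gaps = np.diff(eigvals)           # gaps[i] = lambda_{i+2} - lambda_{i+1}
    return int(np.argmax(gaps)) + 1   # gap index i corresponds to k = i + 1
```

On a graph with $k$ well-separated clusters, the first $k$ eigenvalues are near zero and the $(k{+}1)$-th jumps up, so the largest gap sits at the right place.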

Canonical Examples

Example

Two concentric rings

Generate 500 points on two concentric circles with radii 1 and 3, plus small Gaussian noise ($\sigma = 0.1$). K-means with $k = 2$ splits the data along a line, misclassifying roughly half the points. Spectral clustering with a Gaussian kernel ($\sigma = 0.5$) and the 2 smallest Laplacian eigenvectors perfectly separates the rings. The spectral embedding maps the inner and outer rings to two linearly separable point clouds.
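This experiment can be reproduced with NumPy alone; a sketch that, instead of running full k-means, separates the rings by thresholding the Fiedler vector of the unnormalized Laplacian:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250                                        # points per ring
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.concatenate([np.full(n, 1.0), np.full(n, 3.0)])
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
X += rng.normal(scale=0.1, size=X.shape)       # Gaussian noise, sigma = 0.1
truth = np.concatenate([np.zeros(n), np.ones(n)])

# Gaussian kernel graph with sigma = 0.5; unnormalized Laplacian
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
W = np.exp(-sq / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# The Fiedler vector is nearly constant on each ring; its sign
# recovers the two clusters (accuracy is typically ~1.0 here)
_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(float)
acc = max(np.mean(labels == truth), np.mean(labels != truth))
```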

Key Takeaways

  • The graph Laplacian quadratic form $f^\top L f = \frac{1}{2}\sum_{i,j} W_{ij}(f_i - f_j)^2$ penalizes dissimilar assignments across edges
  • Bottom eigenvectors of $L$ encode cluster structure
  • The Cheeger inequality links $\lambda_2$ to the minimum normalized cut
  • The normalized Laplacian is preferred in practice to handle degree variation
  • Spectral clustering excels on non-convex clusters but requires choosing $\sigma$ and $k$

Exercises

ExerciseCore

Problem

Show that if a graph has exactly 2 connected components, the second eigenvector of $L$ can be chosen orthogonal to $\mathbf{1}$ and constant on each component: it assigns a positive constant to nodes in one component and a negative constant to nodes in the other (the two constants have equal magnitude only when the components have equal size).

ExerciseAdvanced

Problem

Consider $n$ points sampled from a mixture of two well-separated Gaussians in $\mathbb{R}^d$. The Gaussian kernel similarity matrix $W$ has entries $W_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$. Explain qualitatively why a poorly chosen $\sigma$ (too large or too small) causes spectral clustering to fail. What happens in each case?

References

Canonical:

  • von Luxburg, "A Tutorial on Spectral Clustering" (2007), Sections 1-7
  • Ng, Jordan, Weiss, "On Spectral Clustering: Analysis and an Algorithm" (NIPS 2001)

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapter 22
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)
  • Bishop, Pattern Recognition and Machine Learning (2006)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)


Last reviewed: April 2026
