
ML Methods

Spectral Clustering

Clustering via the eigenvectors of a graph Laplacian: embed data using the bottom eigenvectors, then run k-means in the embedding space. Finds non-convex clusters that k-means alone cannot.


Why This Matters

K-means clustering assigns points to the nearest centroid, which implicitly assumes clusters are convex and roughly spherical. Many real datasets have clusters that wrap around each other (two concentric rings, spiral shapes). K-means fails completely on these.

Spectral clustering solves this by first transforming the data using the eigenvectors of a graph Laplacian, then running k-means in the transformed space. The spectral embedding "unrolls" non-convex clusters into linearly separable ones.

Formal Setup

Given $n$ data points $x_1, \ldots, x_n$, construct a similarity graph with adjacency matrix $W$, where $W_{ij} \geq 0$ measures the similarity between $x_i$ and $x_j$. Common choices:

  • Gaussian kernel: $W_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$
  • k-nearest neighbors: $W_{ij} = 1$ if $x_j$ is among the $k$ nearest neighbors of $x_i$ (symmetrized)
  • epsilon-neighborhood: $W_{ij} = 1$ if $\|x_i - x_j\| < \epsilon$
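The first two constructions can be sketched in a few lines of NumPy. This is a dense-matrix teaching sketch, not library code; the function names are ours:

```python
import numpy as np

def gaussian_kernel_graph(X, sigma):
    """Dense similarity matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)            # no self-loops
    return W

def knn_graph(X, k):
    """Symmetrized k-nearest-neighbor graph: W_ij = 1 if x_j is among
    the k nearest neighbors of x_i, then W = max(W, W^T)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)  # exclude self from neighbor search
    W = np.zeros_like(sq_dists)
    nn = np.argsort(sq_dists, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nn.ravel()] = 1.0
    return np.maximum(W, W.T)           # symmetrize
```

For large $n$, one would use sparse matrices and a neighbor index instead of the $O(n^2)$ dense distance matrix.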
Definition

Graph Laplacian

Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$. The unnormalized graph Laplacian is:

$$L = D - W$$

The normalized graph Laplacian (symmetric version) is:

$$L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$$
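Both Laplacians are direct translations of the definitions above; a minimal sketch (the small epsilon guards against isolated nodes with zero degree):

```python
import numpy as np

def laplacians(W):
    """Unnormalized and symmetric normalized graph Laplacians of W."""
    d = W.sum(axis=1)                     # degrees D_ii = sum_j W_ij
    L = np.diag(d) - W                    # L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return L, L_sym
```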

Main Theorems

Proposition

Properties of the Graph Laplacian

Statement

The unnormalized Laplacian $L = D - W$ satisfies:

  1. $L$ is symmetric positive semidefinite.
  2. The smallest eigenvalue is 0, with eigenvector $\mathbf{1}$ (the all-ones vector).
  3. The multiplicity of eigenvalue 0 equals the number of connected components of the graph.
  4. For any $f \in \mathbb{R}^n$: $f^\top L f = \frac{1}{2} \sum_{i,j} W_{ij}(f_i - f_j)^2$.
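
Properties 1, 2, and 4 are easy to verify numerically on a random weighted graph; a quick sanity-check sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
W = (A + A.T) / 2                  # symmetric, nonnegative weights
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Property 2: the all-ones vector is in the null space of L
assert np.allclose(L @ np.ones(n), 0.0)

# Property 1: L is positive semidefinite (eigenvalues >= 0)
eigvals = np.linalg.eigvalsh(L)
assert eigvals.min() > -1e-10

# Property 4: f^T L f = 0.5 * sum_ij W_ij (f_i - f_j)^2
f = rng.standard_normal(n)
assert np.isclose(f @ L @ f,
                  0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2))
```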

Intuition

Property 4 is the key. $f^\top L f$ is small when $f$ assigns similar values to connected nodes. The eigenvectors of $L$ with small eigenvalues are therefore smooth functions on the graph: points in the same cluster get similar eigenvector coordinates. This is why the eigenvector embedding works for clustering.

Proof Sketch

For property 4: expand $f^\top L f = f^\top D f - f^\top W f = \sum_i d_i f_i^2 - \sum_{i,j} W_{ij} f_i f_j$. Using $d_i = \sum_j W_{ij}$, rewrite this as $\frac{1}{2}\sum_{i,j} W_{ij}(f_i^2 + f_j^2 - 2 f_i f_j)$. Positive semidefiniteness follows because this is a sum of nonnegative terms. For property 3: if the graph has $k$ connected components, construct $k$ indicator vectors (1 on one component, 0 elsewhere); these are orthogonal eigenvectors with eigenvalue 0.
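Property 3 can be seen concretely on a disconnected graph; a sketch using two disjoint triangles (our own toy example):

```python
import numpy as np

# Two disconnected triangles: block-diagonal adjacency matrix
tri = np.ones((3, 3)) - np.eye(3)
W = np.block([[tri, np.zeros((3, 3))],
              [np.zeros((3, 3)), tri]])
L = np.diag(W.sum(axis=1)) - W

# Eigenvalue 0 has multiplicity 2 = number of connected components
eigvals = np.sort(np.linalg.eigvalsh(L))
assert np.allclose(eigvals[:2], 0.0, atol=1e-10)  # two zero eigenvalues
assert eigvals[2] > 1e-8                           # the rest are positive
```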

Why It Matters

The quadratic form $f^\top L f$ turns graph clustering into a continuous optimization problem: find vectors $f$ that minimize the Laplacian quadratic form subject to orthogonality constraints. This is an eigenvalue problem with a known solution.

Failure Mode

The Laplacian depends heavily on the similarity graph construction. A bad choice of $\sigma$ in the Gaussian kernel or $k$ in the nearest neighbor graph can merge distinct clusters or fragment a single cluster. There is no universal rule for choosing these hyperparameters.

Theorem

Cheeger Inequality

Statement

Let $\lambda_2$ be the second-smallest eigenvalue of $L_{\text{sym}}$ (the Fiedler value) and $h(G)$ the Cheeger constant (minimum normalized cut):

$$h(G) = \min_{\emptyset \neq S \subset V} \frac{\text{cut}(S, \bar{S})}{\min(\text{vol}(S), \text{vol}(\bar{S}))}$$

Then:

$$\frac{\lambda_2}{2} \leq h(G) \leq \sqrt{2 \lambda_2}$$
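
Both sides of the inequality can be checked numerically by brute-forcing the Cheeger constant on a small graph; a sketch on a toy graph of our own (two triangles joined by one bridge edge):

```python
import itertools
import numpy as np

# Toy graph: two triangles joined by a single bridge edge
tri = np.ones((3, 3)) - np.eye(3)
W = np.block([[tri, np.zeros((3, 3))], [np.zeros((3, 3)), tri]])
W[2, 3] = W[3, 2] = 1.0
n = len(W)
d = W.sum(axis=1)

# Brute-force the Cheeger constant over all nontrivial subsets S
h = np.inf
for r in range(1, n):
    for S in itertools.combinations(range(n), r):
        mask = np.zeros(n, dtype=bool)
        mask[list(S)] = True
        cut = W[mask][:, ~mask].sum()                  # cut(S, S-bar)
        vol = min(d[mask].sum(), d[~mask].sum())       # min(vol(S), vol(S-bar))
        h = min(h, cut / vol)

# lambda_2 of the normalized Laplacian
d_is = 1.0 / np.sqrt(d)
L_sym = np.eye(n) - d_is[:, None] * W * d_is[None, :]
lam2 = np.sort(np.linalg.eigvalsh(L_sym))[1]

# Cheeger inequality: lambda_2 / 2 <= h(G) <= sqrt(2 * lambda_2)
assert lam2 / 2 <= h <= np.sqrt(2 * lam2)
```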

Intuition

$\lambda_2$ approximates the minimum normalized cut. When $\lambda_2$ is small, there exists a near-partition that cuts few edges. The Fiedler vector (the eigenvector for $\lambda_2$) approximately encodes this partition. Thresholding the Fiedler vector gives a near-optimal 2-way cut.
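Thresholding the Fiedler vector can be demonstrated on a "barbell" graph of our own design (two 4-cliques joined by one edge); here we use the unnormalized Laplacian for simplicity:

```python
import numpy as np

# Two 4-node cliques joined by a single bridge edge
n = 8
W = np.zeros((n, n))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 1.0                # the bridge

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                # eigenvector of second-smallest eigenvalue

# Thresholding the Fiedler vector at 0 recovers the two cliques
labels = (fiedler > 0).astype(int)
assert len(set(labels[:4])) == 1 and len(set(labels[4:])) == 1
assert labels[0] != labels[7]
```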

Proof Sketch

The lower bound follows from the variational characterization of $\lambda_2$ as $\min_{f \perp D^{1/2}\mathbf{1}} f^\top L_{\text{sym}} f / f^\top f$. The upper bound uses the indicator vector of the optimal Cheeger cut as a test function in the Rayleigh quotient.

Why It Matters

This theorem gives spectral clustering its theoretical justification. Finding the minimum normalized cut is NP-hard, but the spectral relaxation (computing $\lambda_2$ and thresholding the Fiedler vector) gives a polynomial-time approximation with a guaranteed quality bound.

Failure Mode

The Cheeger inequality is loose. The quadratic gap between the lower bound $\lambda_2/2$ and the upper bound $\sqrt{2\lambda_2}$ means the spectral approximation can be far from the true minimum cut. In high-dimensional data with noise, the spectral gap may be too small for reliable clustering.

The Algorithm

Normalized spectral clustering (Ng, Jordan, Weiss variant):

  1. Construct the similarity matrix $W$ and compute $L_{\text{sym}} = I - D^{-1/2} W D^{-1/2}$.
  2. Compute the $k$ eigenvectors $u_1, \ldots, u_k$ corresponding to the $k$ smallest eigenvalues of $L_{\text{sym}}$.
  3. Form the matrix $U \in \mathbb{R}^{n \times k}$ with columns $u_1, \ldots, u_k$.
  4. Row-normalize $U$ so each row has unit norm.
  5. Run k-means on the rows of $U$.

The unnormalized variant uses $L = D - W$ and skips the row normalization. In practice, the normalized version is almost always preferred because it accounts for varying node degrees.
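The five steps can be sketched end-to-end in NumPy. This is a teaching sketch, not production code: the function name is ours, and a tiny k-means (farthest-first initialization plus Lloyd updates) stands in for a library call:

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, n_iter=100, seed=0):
    """Normalized spectral clustering, Ng-Jordan-Weiss style (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Step 1: Gaussian similarity graph and normalized Laplacian L_sym
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_is = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L_sym = np.eye(n) - d_is[:, None] * W * d_is[None, :]

    # Steps 2-3: embedding from the k smallest eigenvectors
    _, vecs = np.linalg.eigh(L_sym)            # eigenvalues in ascending order
    U = vecs[:, :k].copy()

    # Step 4: row-normalize the embedding
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # Step 5: k-means on the rows of U (farthest-first init, Lloyd updates)
    centers = [U[rng.integers(n)]]
    for _ in range(1, k):
        dists = np.min(((U[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(U[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

In practice one would use a sparse eigensolver and a robust k-means (e.g. multiple restarts) rather than this minimal version.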

Common Confusions

Watch Out

Spectral clustering is not PCA on a kernel matrix

PCA finds the top eigenvectors of the data covariance (or centered kernel matrix). Spectral clustering finds the bottom eigenvectors of the graph Laplacian. The Laplacian depends on the degree matrix $D$, which PCA ignores. The two methods coincide only in special cases.

Watch Out

Number of clusters must still be chosen

Spectral clustering does not automatically determine $k$. You still need to choose the number of clusters. The eigengap heuristic (choose $k$ where $\lambda_{k+1} - \lambda_k$ is largest) is widely used but has no formal guarantee for noisy finite-sample data.
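The eigengap heuristic is a few lines of NumPy; a sketch (the function name is ours, and this is a heuristic, not a guarantee):

```python
import numpy as np

def eigengap_k(L_sym, k_max=10):
    """Pick k where the gap lambda_{k+1} - lambda_k among the smallest
    eigenvalues of L_sym is largest."""
    eigvals = np.sort(np.linalg.eigvalsh(L_sym))[:k_max]
    gaps = np.diff(eigvals)           # gaps[i] = lambda_{i+2} - lambda_{i+1}
    return int(np.argmax(gaps)) + 1   # gap index i corresponds to k = i + 1
```

On a graph with $k$ well-separated clusters, the first $k$ eigenvalues are near zero and the $(k{+}1)$-th jumps up, so the largest gap sits at the right place.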

Canonical Examples

Example

Two concentric rings

Generate 500 points on two concentric circles with radii 1 and 3, plus small Gaussian noise ($\sigma = 0.1$). K-means with $k = 2$ splits the data along a line, misclassifying roughly half the points. Spectral clustering with a Gaussian kernel ($\sigma = 0.5$) and the 2 smallest Laplacian eigenvectors perfectly separates the rings. The spectral embedding maps the inner and outer rings to two linearly separable point clouds.
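This experiment can be reproduced with NumPy alone; a sketch that, instead of running full k-means, separates the rings by thresholding the Fiedler vector of the unnormalized Laplacian:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250                                        # points per ring
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.concatenate([np.full(n, 1.0), np.full(n, 3.0)])
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
X += rng.normal(scale=0.1, size=X.shape)       # Gaussian noise, sigma = 0.1
truth = np.concatenate([np.zeros(n), np.ones(n)])

# Gaussian kernel graph with sigma = 0.5; unnormalized Laplacian
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
W = np.exp(-sq / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# The Fiedler vector is nearly constant on each ring; its sign
# recovers the two clusters (accuracy is typically ~1.0 here)
_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(float)
acc = max(np.mean(labels == truth), np.mean(labels != truth))
```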

Key Takeaways

  • The graph Laplacian quadratic form $f^\top L f = \frac{1}{2}\sum_{i,j} W_{ij}(f_i - f_j)^2$ penalizes dissimilar assignments across edges
  • Bottom eigenvectors of $L$ encode cluster structure
  • The Cheeger inequality links $\lambda_2$ to the minimum normalized cut
  • The normalized Laplacian is preferred in practice to handle degree variation
  • Spectral clustering excels on non-convex clusters but requires choosing $\sigma$ and $k$

Exercises

ExerciseCore

Problem

Show that if a graph has exactly 2 connected components, the second eigenvector of $L$ can be chosen orthogonal to $\mathbf{1}$ and constant on each component: it assigns a positive constant to nodes in one component and a negative constant to nodes in the other (the two constants have equal magnitude only when the components have equal size).

ExerciseAdvanced

Problem

Consider $n$ points sampled from a mixture of two well-separated Gaussians in $\mathbb{R}^d$. The Gaussian kernel similarity matrix $W$ has entries $W_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$. Explain qualitatively why a poorly chosen $\sigma$ (too large or too small) causes spectral clustering to fail. What happens in each case?

References

Canonical:

  • von Luxburg, "A Tutorial on Spectral Clustering" (2007), Sections 1-7
  • Ng, Jordan, Weiss, "On Spectral Clustering: Analysis and an Algorithm" (NIPS 2001)

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapter 22
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009)
  • Bishop, Pattern Recognition and Machine Learning (2006)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)


Last reviewed: April 2026
