
ML Methods

Dimensionality Reduction Theory

Why and how to reduce dimensions: the curse of dimensionality, PCA, random projections (JL lemma), t-SNE, UMAP, and when each method preserves the structure you care about.

Core · Tier 2 · Stable · ~50 min

Why This Matters

High-dimensional data creates three problems. First, distance metrics become meaningless: in high dimensions, the ratio of nearest-neighbor to farthest-neighbor distance converges to 1. Second, the volume of the space grows exponentially, so data becomes sparse and density estimation requires exponentially more samples. Third, computation scales with dimension. Dimensionality reduction addresses all three by mapping data to a lower-dimensional space while preserving relevant structure.

The Curse of Dimensionality

Definition

Curse of Dimensionality

As the dimension $d$ increases while the number of samples $n$ stays fixed, data becomes increasingly sparse. For a unit hypercube in $\mathbb{R}^d$, the fraction of volume within distance $\epsilon$ of the boundary is $1 - (1 - 2\epsilon)^d$, which approaches 1 as $d \to \infty$. Nearly all data points are near the boundary. Neighborhoods that are "local" in low dimensions become global in high dimensions.

Concretely: to maintain the same sample density in a $d$-dimensional space as in a 1-dimensional space with $n$ points, you need $n^d$ points. With 100 samples, a 10-dimensional space already feels empty.
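The concentration of distances is easy to check numerically. A minimal sketch with NumPy (sample size and dimensions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # fixed sample size
ratios = {}
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))
    # distances from the first point to all the others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:5d}  nearest/farthest distance ratio: {ratios[d]:.3f}")
# The ratio climbs toward 1 as d grows: "nearest" and "farthest"
# neighbors become nearly indistinguishable.
```

The same experiment with Gaussian or real data shows the same trend; only the rate changes.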

Linear Methods

PCA (Review)

PCA projects onto the top $k$ eigenvectors of the covariance matrix, maximizing preserved variance. It is optimal among linear projections for reconstruction error (Eckart-Young theorem). PCA preserves global structure (large distances, principal directions of variance) but can miss nonlinear structure (curved manifolds).
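The projection can be computed directly from the covariance eigendecomposition. A minimal NumPy sketch on synthetic low-rank data (all sizes and the noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: most variance lies along two latent directions
Z = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
X = Z + 0.1 * rng.normal(size=(500, 10))

Xc = X - X.mean(axis=0)                 # center
cov = Xc.T @ Xc / len(Xc)               # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2]             # top-2 eigenvectors
X_proj = Xc @ W                         # projection to k = 2 dimensions

explained = eigvals[::-1][:2].sum() / eigvals.sum()
print(f"variance explained by top 2 components: {explained:.3f}")
```

For real use, `sklearn.decomposition.PCA` does the same computation (via SVD) with centering and component ordering handled for you.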

Factor Analysis

Factor analysis models $x = Wz + \mu + \epsilon$ where $z \sim \mathcal{N}(0, I)$ is a low-dimensional latent variable and $\epsilon \sim \mathcal{N}(0, \Psi)$ with diagonal $\Psi$. Unlike PCA, factor analysis separates signal variance ($W$) from noise variance ($\Psi$), making it more appropriate when features have different noise levels.
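scikit-learn's `FactorAnalysis` estimates both the loadings $W$ and the diagonal noise $\Psi$. A small sketch on synthetic heteroscedastic data (the loadings, noise levels, and sample size are illustrative):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
W_true = rng.normal(size=(6, 2))  # loadings: 6 features, 2 latent factors
# Per-feature noise levels: deliberately unequal
noise_sd = np.array([0.1, 0.1, 0.5, 0.5, 1.0, 1.0])
Z = rng.normal(size=(2000, 2))
X = Z @ W_true.T + rng.normal(size=(2000, 6)) * noise_sd

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
# noise_variance_ is the estimated diagonal of Psi;
# it should roughly track noise_sd**2
print(np.round(fa.noise_variance_, 2))
```

PCA applied to the same data would fold the large per-feature noise into its components; factor analysis assigns it to $\Psi$ instead.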

Random Projections

Lemma

Johnson-Lindenstrauss Lemma

Statement

For any set of $n$ points in $\mathbb{R}^d$ and any $\varepsilon \in (0, 1)$, there exists a linear map $A: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log n)$ such that for all pairs $i, j$:

$$(1 - \varepsilon)\|x_i - x_j\|^2 \leq \|Ax_i - Ax_j\|^2 \leq (1 + \varepsilon)\|x_i - x_j\|^2$$

Moreover, a random matrix $A$ with entries drawn i.i.d. from $\mathcal{N}(0, 1/k)$ satisfies this with high probability.

Intuition

You can compress $n$ points to $O(\log n)$ dimensions while preserving all pairwise distances up to a multiplicative $(1 \pm \varepsilon)$ factor. The target dimension depends on $n$ and $\varepsilon$, not on the original dimension $d$. A random Gaussian matrix works: you do not need to look at the data.
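The lemma is easy to verify empirically. A sketch with a random Gaussian projection (the choice of $n$, $d$, $\varepsilon$, and the constant 8 are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, eps = 100, 10_000, 0.2
k = int(np.ceil(8 * np.log(n) / eps**2))  # one common constant choice

X = rng.normal(size=(n, d))
A = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))  # entries ~ N(0, 1/k)
Y = X @ A  # project all points at once

# Ratio of projected to original squared distances, over all pairs
ratios = pdist(Y, 'sqeuclidean') / pdist(X, 'sqeuclidean')
# With high probability these lie within (1 - eps, 1 + eps)
print(ratios.min(), ratios.max())
```

Note that $k$ here exceeds what you would use for visualization; JL is a tool for preserving geometry, not for making pictures.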

Proof Sketch

For a fixed pair $(x_i, x_j)$, the projected squared distance $\|A(x_i - x_j)\|^2$ is (after normalizing) a sum of $k$ independent chi-squared random variables. By concentration (sub-exponential tail bounds), this sum stays within $(1 \pm \varepsilon)$ of its mean with probability at least $1 - 2\exp(-c\varepsilon^2 k)$. Set $k = O(\varepsilon^{-2} \log n)$ and take a union bound over all $\binom{n}{2}$ pairs.

Why It Matters

The JL lemma justifies random projections as a dimensionality reduction technique. It is data-independent (no training), fast (an $O(ndk)$ matrix multiply, or $O(nd \log k)$ with sparse/structured matrices), and provides provable guarantees. It is the theoretical backbone of locality-sensitive hashing and compressed sensing.

Failure Mode

The JL lemma preserves pairwise distances, not cluster structure, manifold geometry, or class boundaries. The reduced dimension $k \sim \log n$ can still be large for big datasets. Also, the constant in $O(\varepsilon^{-2} \log n)$ matters: for $n = 10^6$ and $\varepsilon = 0.1$, you need $k \sim 1400$ dimensions, which may not be low enough for visualization.

Nonlinear Methods

t-SNE

Proposition

t-SNE Cost Function

Statement

t-SNE minimizes the KL divergence between the high-dimensional similarity distribution $P$ and the low-dimensional similarity distribution $Q$:

$$C = D_{\text{KL}}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$ (symmetrized) in high dimensions and $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$ in low dimensions, using a Student-t kernel with one degree of freedom.

Intuition

t-SNE converts distances to probabilities. Nearby points in high-D get high $p_{ij}$; the algorithm finds low-D positions $y_i$ where the $q_{ij}$ match. The heavy-tailed Student-t kernel in low-D addresses the crowding problem: it allows moderately distant points in high-D to be placed far apart in low-D without penalty, making room for clusters to separate.

Proof Sketch

The KL divergence $D_{\text{KL}}(P \| Q)$ penalizes cases where $p_{ij}$ is large but $q_{ij}$ is small (nearby points in high-D placed far apart in low-D). It does not penalize the reverse as strongly (far points placed nearby). This asymmetry preserves local structure but not global distances.

Why It Matters

t-SNE produces visually striking 2D/3D embeddings that reveal cluster structure. It is the standard visualization tool for high-dimensional data in biology (single-cell RNA-seq), NLP (word embeddings), and computer vision.
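A minimal usage sketch with scikit-learn (the blob data, perplexity, and sizes are illustrative; a small $n$ keeps the run fast):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs in 50 dimensions
centers = rng.normal(scale=10.0, size=(3, 50))
X = np.vstack([c + rng.normal(size=(30, 50)) for c in centers])

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (90, 2)
```

The embedding will show three tight clusters, but (per the failure modes below) the gaps between them carry no quantitative meaning.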

Failure Mode

t-SNE does not preserve global structure: distances between clusters are meaningless. Different random initializations produce different layouts. Cluster sizes in the visualization do not reflect true cluster sizes. Perplexity (effectively, the neighborhood size) must be tuned, and different values can produce qualitatively different visualizations. t-SNE is for visualization, not for downstream ML tasks.

UMAP

UMAP (Uniform Manifold Approximation and Projection) builds a weighted $k$-nearest-neighbor graph in high-D, then optimizes low-D positions to preserve the graph structure using a cross-entropy-like loss. UMAP tends to preserve global structure better than t-SNE (the graph-based approach maintains some inter-cluster relationships), runs faster on large datasets, and can embed into dimensions higher than 2 or 3 for downstream use.

The theoretical justification via Riemannian geometry and fuzzy simplicial sets is less straightforward than t-SNE's probabilistic framing, and recent work (Chari & Pachter, 2023) argues UMAP's theoretical claims are overstated. In practice, UMAP often produces useful embeddings, but treat it as a heuristic rather than a principled algorithm.

Kernel PCA

Apply PCA in a reproducing kernel Hilbert space (RKHS) defined by a kernel $k(x_i, x_j)$. This captures nonlinear structure while retaining the eigenvalue framework of PCA. The kernel matrix $K_{ij} = k(x_i, x_j)$ serves as the Gram matrix; the top eigenvectors of the centered kernel matrix give the embedding. Common kernels: RBF (Gaussian), polynomial.

Autoencoders

Learn an encoder $f_{\text{enc}}: \mathbb{R}^d \to \mathbb{R}^k$ and a decoder $f_{\text{dec}}: \mathbb{R}^k \to \mathbb{R}^d$ by minimizing the reconstruction error $\|x - f_{\text{dec}}(f_{\text{enc}}(x))\|^2$. With linear activations and $k < d$, this recovers PCA. With nonlinear activations, it learns nonlinear manifolds. Variational autoencoders add probabilistic structure to the latent space.

When to Use Which

| Method | Preserves | Speed | Downstream use | Deterministic |
|---|---|---|---|---|
| PCA | Global variance | Fast | Yes | Yes |
| Random projection | Pairwise distances | Fast | Yes | No |
| t-SNE | Local neighborhoods | Slow | No (visualization only) | No |
| UMAP | Local + some global | Moderate | Yes (with care) | No |
| Kernel PCA | Nonlinear kernel structure | Moderate | Yes | Yes |
| Autoencoder | Learned reconstruction | Slow (training) | Yes | Yes (given weights) |

Common Confusions

Watch Out

t-SNE cluster distances are meaningless

Two clusters that are far apart in a t-SNE plot may or may not be far apart in the original space. The KL divergence objective prioritizes preserving local structure; global distances are sacrificed. Never interpret inter-cluster distance or relative cluster size from a t-SNE plot.

Watch Out

PCA components are not features

The first principal component is a linear combination of all original features. It is not a single feature. Interpreting PC1 as "the most important feature" is incorrect. You must examine the loadings (coefficients) to understand what each component represents.

Watch Out

Explained variance does not mean explained signal

PCA maximizes explained variance, but variance is not the same as task-relevant information. A high-variance direction could be noise. LDA (Linear Discriminant Analysis) maximizes class separability instead, which is more relevant for classification tasks.

Key Takeaways

  • The curse of dimensionality makes high-dimensional data sparse and distances meaningless; reduction is often necessary
  • PCA: linear, global, fast, preserves variance; optimal for reconstruction
  • JL lemma: random projections preserve pairwise distances in $O(\varepsilon^{-2} \log n)$ dimensions with no data dependence
  • t-SNE: nonlinear, local, for visualization only; do not interpret global structure
  • UMAP: faster than t-SNE, preserves more global structure, usable for downstream tasks
  • Choose the method based on what structure you need to preserve and whether the output is for visualization or for further computation

Exercises

Exercise · Core

Problem

You have $n = 10000$ points in $\mathbb{R}^{500}$. Using the Johnson-Lindenstrauss lemma with $\varepsilon = 0.1$, what is the minimum target dimension $k$ for random projection? Use $k \geq 8 \varepsilon^{-2} \ln n$.

Exercise · Advanced

Problem

Explain why t-SNE with high perplexity (e.g., perplexity $= n/3$) produces embeddings that look more like PCA, while low perplexity (e.g., perplexity $= 5$) produces tighter, more separated clusters. Connect this to the bandwidth $\sigma_i$ in the Gaussian kernel.

Classical Manifold Learning Methods

These methods predate t-SNE and UMAP. They assume data lies on or near a smooth low-dimensional manifold embedded in high-dimensional space.

Isomap (Tenenbaum et al., 2000) approximates geodesic distances on the manifold using shortest paths in a k-nearest-neighbor graph, then applies classical multidimensional scaling (MDS) to the geodesic distance matrix. It preserves global geometry well when the manifold is convex (no "holes"). Failure mode: short-circuit edges in the graph (connecting points that are close in ambient space but far along the manifold) destroy the distance estimates.
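A usage sketch on the classic swiss roll with scikit-learn (`n_neighbors` is the knob that creates or avoids short-circuit edges; the values here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
# Small n_neighbors keeps geodesics faithful; too large a value
# short-circuits across the roll and corrupts the distance matrix
emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(emb.shape)  # (1000, 2)

# The dominant embedding coordinate should track the roll parameter t
corr = abs(np.corrcoef(emb[:, 0], t)[0, 1])
print(f"|corr| between coordinate 1 and roll parameter: {corr:.2f}")
```

Re-running with a large `n_neighbors` (say 100) is a quick way to see the short-circuit failure mode: the correlation with the roll parameter degrades.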

Locally Linear Embedding (LLE) (Roweis & Saul, 2000) assumes each point can be reconstructed as a linear combination of its neighbors. It finds reconstruction weights $W_{ij}$ by solving $\min \sum_i \|x_i - \sum_j W_{ij} x_j\|^2$ subject to $\sum_j W_{ij} = 1$. Then it finds low-dimensional coordinates $y_i$ that preserve these weights: $\min \sum_i \|y_i - \sum_j W_{ij} y_j\|^2$. The second step is an eigenvalue problem. LLE captures local geometry but does not preserve global distances.
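scikit-learn wraps both steps in one estimator. A minimal sketch (neighborhood size and data are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
emb = lle.fit_transform(X)
print(emb.shape)                   # (1000, 2)
# Residual of the weight-preserving (second) optimization step
print(lle.reconstruction_error_)
```

LLE is sensitive to `n_neighbors`: too few and the graph fragments, too many and the locally-linear assumption breaks.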

Laplacian Eigenmaps (Belkin & Niyogi, 2003) builds a weighted graph from the data (weights from a Gaussian kernel), computes the graph Laplacian $L = D - W$, and embeds using the smallest non-trivial eigenvectors of the generalized eigenvalue problem $Ly = \lambda Dy$. This is equivalent to finding coordinates that minimize $\sum_{ij} W_{ij} \|y_i - y_j\|^2$, placing connected points close together. The connection to spectral clustering is direct: the same eigenvectors that give good embeddings also give good cluster assignments.
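scikit-learn's `SpectralEmbedding` implements Laplacian eigenmaps. A sketch on two Gaussian blobs (the blob placement and the RBF bandwidth `gamma` are illustrative choices):

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
# Two Gaussian blobs in 10 dimensions
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)),
               rng.normal(3.0, 1.0, (50, 10))])

# affinity='rbf' builds the Gaussian-kernel weight matrix W
se = SpectralEmbedding(n_components=2, affinity='rbf',
                       gamma=0.05, random_state=0)
emb = se.fit_transform(X)
print(emb.shape)  # (100, 2)
# Strongly connected points land close together, so each blob
# collapses to a tight region of the embedding.
```

The scale-dependence weakness in the table above corresponds to the choice of `gamma` here: it sets which points count as "connected."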

Diffusion Maps (Coifman & Lafon, 2006) use a random walk on the data graph to define a diffusion distance that captures the intrinsic geometry of the manifold. The diffusion distance between two points measures connectivity through many paths, not just the shortest path (unlike Isomap). This makes diffusion maps robust to noise and sampling density variation. The embedding coordinates are the top eigenvectors of the diffusion operator $P^t$, where $t$ controls the scale of the geometry captured.
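The construction is short enough to sketch directly in NumPy (the bandwidth `sigma` and diffusion time `t` are illustrative; there is no single standard implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, n_components=2, sigma=1.0, t=2):
    # Gaussian kernel on pairwise squared distances
    K = np.exp(-squareform(pdist(X, 'sqeuclidean')) / (2 * sigma**2))
    # Row-normalize: P is the random-walk transition matrix
    P = K / K.sum(axis=1, keepdims=True)
    # P is similar to a symmetric matrix, so its spectrum is real
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    # Skip the trivial eigenvalue 1; scale by lambda^t (diffusion time)
    idx = order[1:n_components + 1]
    return eigvecs.real[:, idx] * (eigvals.real[idx] ** t)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
emb = diffusion_map(X)
print(emb.shape)  # (100, 2)
```

Increasing `t` damps the small-eigenvalue directions, so the embedding emphasizes coarser, larger-scale geometry.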

| Method | Preserves | Assumes | Cost | Weakness |
|---|---|---|---|---|
| PCA | Global variance | Linear subspace | $O(nd^2)$ | Misses nonlinear structure |
| Isomap | Geodesic distances | Convex manifold | $O(n^2 \log n)$ | Short-circuit edges |
| LLE | Local linear structure | Locally linear manifold | $O(n^2)$ | No global guarantees |
| Laplacian Eigenmaps | Graph connectivity | Smooth manifold | $O(n^2)$ | Scale-dependent |
| Diffusion Maps | Multi-scale geometry | Smooth manifold | $O(n^2)$ | Bandwidth selection |
| t-SNE | Local neighborhoods | None (heuristic) | $O(n \log n)$ | Distances unreliable |
| UMAP | Local + some global | Manifold (contested) | $O(n \log n)$ | Theory incomplete |

Kernel PCA

Kernel PCA extends PCA to nonlinear feature spaces using the kernel trick. Instead of computing eigenvectors of the covariance matrix $X^\top X / n$, it computes eigenvectors of the kernel matrix $K_{ij} = k(x_i, x_j)$.

The kernel matrix $K$ is the Gram matrix in the feature space $\phi(x)$: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. The top eigenvectors of $K$ give the principal components in feature space, expressed through kernel evaluations. For a Gaussian kernel, this captures nonlinear structure that linear PCA misses.

Kernel PCA is exact (not heuristic) and connects directly to RKHS theory. Its limitation is the $O(n^2)$ cost of computing and storing the kernel matrix, and the $O(n^3)$ cost of the eigendecomposition. For $n > 10{,}000$, approximate methods (Nyström, random features) are needed.
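A sketch with scikit-learn on concentric circles, the standard example where linear PCA fails but an RBF kernel succeeds (the `gamma` value and dataset parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric rings: no linear projection can separate them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

lin = PCA(n_components=2).fit_transform(X)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)

# In the RBF embedding the two rings become linearly separable;
# the linear PCA embedding just rotates the original concentric picture.
print(lin.shape, kpca.shape)
```

Plotting `kpca` colored by `y` makes the separation visible; for larger datasets, swap in `sklearn.kernel_approximation.Nystroem` followed by linear PCA.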

References

Canonical:

  • Johnson & Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space" (1984)
  • van der Maaten & Hinton, "Visualizing Data using t-SNE" (JMLR 2008), Sections 2-3

Manifold learning:

  • Tenenbaum, de Silva, Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction" (Science, 2000). Isomap.
  • Roweis & Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding" (Science, 2000). LLE.
  • Belkin & Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation" (Neural Computation, 2003)
  • Coifman & Lafon, "Diffusion Maps" (Applied and Computational Harmonic Analysis, 2006)
  • Schölkopf, Smola, Müller, "Kernel Principal Component Analysis" (Neural Computation, 1998)

Current:

  • McInnes et al., "UMAP: Uniform Manifold Approximation and Projection" (2018)
  • Vershynin, High-Dimensional Probability (2018), Chapter 5 (JL lemma)
  • Chari & Pachter, "The specious art of single-cell genomics" (2023)

Next Topics

  • Kernels and RKHS: the theoretical framework behind kernel PCA and nonlinear embeddings
  • Autoencoders: learning nonlinear dimensionality reduction with neural networks
  • Riemannian optimization: when the embedding space itself has manifold structure

Last reviewed: April 2026
