
ML Methods

Dimensionality Reduction Theory

Why and how to reduce dimensions: the curse of dimensionality, PCA, random projections (JL lemma), t-SNE, UMAP, and when each method preserves the structure you care about.

Core · Tier 2 · Stable · ~50 min

Why This Matters

High-dimensional data creates three problems. First, distance metrics become meaningless: in high dimensions, the ratio of nearest-neighbor to farthest-neighbor distance converges to 1. Second, the volume of the space grows exponentially, so data becomes sparse and density estimation requires exponentially more samples. Third, computation scales with dimension. Dimensionality reduction addresses all three by mapping data to a lower-dimensional space while preserving relevant structure.

The Curse of Dimensionality

Definition

Curse of Dimensionality

As the dimension $d$ increases while the number of samples $n$ stays fixed, data becomes increasingly sparse. For a unit hypercube in $\mathbb{R}^d$, the fraction of volume within distance $\epsilon$ of the boundary is $1 - (1 - 2\epsilon)^d$, which approaches 1 as $d \to \infty$. Nearly all data points are near the boundary. Neighborhoods that are "local" in low dimensions become global in high dimensions.

Concretely: to maintain the same sample density in a $d$-dimensional space as in a 1-dimensional space with $n$ points, you need $n^d$ points. With 100 samples, a 10-dimensional space already feels empty.
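The concentration of distances is easy to check numerically. A minimal sketch with NumPy (sample size and dimensions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # fixed sample size
ratios = {}
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))
    # distances from the first point to all the others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:5d}  nearest/farthest distance ratio: {ratios[d]:.3f}")
# The ratio climbs toward 1 as d grows: "nearest" and "farthest"
# neighbors become nearly indistinguishable.
```

The same experiment with Gaussian or real data shows the same trend; only the rate changes.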

Linear Methods

PCA (Review)

PCA projects onto the top $k$ eigenvectors of the covariance matrix, maximizing preserved variance. It is optimal among linear projections for reconstruction error (Eckart-Young theorem). PCA preserves global structure (large distances, principal directions of variance) but can miss nonlinear structure (curved manifolds).
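The projection can be computed directly from the covariance eigendecomposition. A minimal NumPy sketch on synthetic low-rank data (all sizes and the noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: most variance lies along two latent directions
Z = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
X = Z + 0.1 * rng.normal(size=(500, 10))

Xc = X - X.mean(axis=0)                 # center
cov = Xc.T @ Xc / len(Xc)               # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2]             # top-2 eigenvectors
X_proj = Xc @ W                         # projection to k = 2 dimensions

explained = eigvals[::-1][:2].sum() / eigvals.sum()
print(f"variance explained by top 2 components: {explained:.3f}")
```

For real use, `sklearn.decomposition.PCA` does the same computation (via SVD) with centering and component ordering handled for you.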

Factor Analysis

Factor analysis models $x = Wz + \mu + \epsilon$ where $z \sim \mathcal{N}(0, I)$ is a low-dimensional latent variable and $\epsilon \sim \mathcal{N}(0, \Psi)$ with diagonal $\Psi$. Unlike PCA, factor analysis separates signal variance ($W$) from noise variance ($\Psi$), making it more appropriate when features have different noise levels.
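scikit-learn's `FactorAnalysis` estimates both the loadings $W$ and the diagonal noise $\Psi$. A small sketch on synthetic heteroscedastic data (the loadings, noise levels, and sample size are illustrative):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
W_true = rng.normal(size=(6, 2))  # loadings: 6 features, 2 latent factors
# Per-feature noise levels: deliberately unequal
noise_sd = np.array([0.1, 0.1, 0.5, 0.5, 1.0, 1.0])
Z = rng.normal(size=(2000, 2))
X = Z @ W_true.T + rng.normal(size=(2000, 6)) * noise_sd

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
# noise_variance_ is the estimated diagonal of Psi;
# it should roughly track noise_sd**2
print(np.round(fa.noise_variance_, 2))
```

PCA applied to the same data would fold the large per-feature noise into its components; factor analysis assigns it to $\Psi$ instead.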

Random Projections

Lemma

Johnson-Lindenstrauss Lemma

Statement

For any set of $n$ points in $\mathbb{R}^d$ and any $\varepsilon \in (0, 1)$, there exists a linear map $A: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log n)$ such that for all pairs $i, j$:

$$(1 - \varepsilon)\|x_i - x_j\|^2 \leq \|Ax_i - Ax_j\|^2 \leq (1 + \varepsilon)\|x_i - x_j\|^2$$

Moreover, a random matrix $A$ with entries drawn i.i.d. from $\mathcal{N}(0, 1/k)$ satisfies this with high probability.

Intuition

You can compress $n$ points to $O(\log n)$ dimensions while preserving all pairwise distances up to a multiplicative $(1 \pm \varepsilon)$ factor. The target dimension depends on $n$ and $\varepsilon$, not on the original dimension $d$. A random Gaussian matrix works: you do not need to look at the data.
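The lemma is easy to verify empirically. A sketch with a random Gaussian projection (the choice of $n$, $d$, $\varepsilon$, and the constant 8 are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, eps = 100, 10_000, 0.2
k = int(np.ceil(8 * np.log(n) / eps**2))  # one common constant choice

X = rng.normal(size=(n, d))
A = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))  # entries ~ N(0, 1/k)
Y = X @ A  # project all points at once

# Ratio of projected to original squared distances, over all pairs
ratios = pdist(Y, 'sqeuclidean') / pdist(X, 'sqeuclidean')
# With high probability these lie within (1 - eps, 1 + eps)
print(ratios.min(), ratios.max())
```

Note that $k$ here exceeds what you would use for visualization; JL is a tool for preserving geometry, not for making pictures.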

Proof Sketch

For a fixed pair $(x_i, x_j)$, the projected squared distance $\|A(x_i - x_j)\|^2$ is (after normalizing) a sum of $k$ independent chi-squared random variables. By concentration (sub-exponential tail bounds), this sum stays within $(1 \pm \varepsilon)$ of its mean with probability at least $1 - 2\exp(-c\varepsilon^2 k)$. Set $k = O(\varepsilon^{-2} \log n)$ and take a union bound over all $\binom{n}{2}$ pairs.

Why It Matters

The JL lemma justifies random projections as a dimensionality reduction technique. It is data-independent (no training), fast (an $O(ndk)$ matrix multiply, or $O(nd \log k)$ with sparse/structured matrices), and provides provable guarantees. It is the theoretical backbone of locality-sensitive hashing and compressed sensing.

Failure Mode

The JL lemma preserves pairwise distances, not cluster structure, manifold geometry, or class boundaries. The reduced dimension $k \sim \log n$ can still be large for big datasets. Also, the constant in $O(\varepsilon^{-2} \log n)$ matters: for $n = 10^6$ and $\varepsilon = 0.1$, you need $k \sim 1400$ dimensions, which may not be low enough for visualization.

Nonlinear Methods

t-SNE

Proposition

t-SNE Cost Function

Statement

t-SNE minimizes the KL divergence between the high-dimensional similarity distribution $P$ and the low-dimensional similarity distribution $Q$:

$$C = D_{\text{KL}}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$ (symmetrized) in high dimensions and $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$ in low dimensions, using a Student-t kernel with one degree of freedom.

Intuition

t-SNE converts distances to probabilities. Nearby points in high-D get high $p_{ij}$; the algorithm finds low-D positions $y_i$ where the $q_{ij}$ match. The heavy-tailed Student-t kernel in low-D addresses the crowding problem: it allows moderately distant points in high-D to be placed far apart in low-D without penalty, making room for clusters to separate.

Proof Sketch

The KL divergence $D_{\text{KL}}(P \| Q)$ penalizes cases where $p_{ij}$ is large but $q_{ij}$ is small (nearby points in high-D placed far apart in low-D). It does not penalize the reverse as strongly (far points placed nearby). This asymmetry preserves local structure but not global distances.

Why It Matters

t-SNE produces visually striking 2D/3D embeddings that reveal cluster structure. It is the standard visualization tool for high-dimensional data in biology (single-cell RNA-seq), NLP (word embeddings), and computer vision.
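A minimal usage sketch with scikit-learn (the blob data, perplexity, and sizes are illustrative; a small $n$ keeps the run fast):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs in 50 dimensions
centers = rng.normal(scale=10.0, size=(3, 50))
X = np.vstack([c + rng.normal(size=(30, 50)) for c in centers])

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (90, 2)
```

The embedding will show three tight clusters, but (per the failure modes below) the gaps between them carry no quantitative meaning.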

Failure Mode

t-SNE does not preserve global structure: distances between clusters are meaningless. Different random initializations produce different layouts. Cluster sizes in the visualization do not reflect true cluster sizes. Perplexity (effectively, the neighborhood size) must be tuned, and different values can produce qualitatively different visualizations. t-SNE is for visualization, not for downstream ML tasks.

UMAP

UMAP (Uniform Manifold Approximation and Projection) builds a weighted $k$-nearest-neighbor graph in high-D, then optimizes low-D positions to preserve the graph structure using a cross-entropy-like loss. UMAP tends to preserve global structure better than t-SNE (the graph-based approach maintains some inter-cluster relationships), runs faster on large datasets, and can embed into dimensions higher than 2 or 3 for downstream use.

The theoretical justification via Riemannian geometry and fuzzy simplicial sets is less straightforward than t-SNE's probabilistic framing, and recent work (Chari & Pachter, 2023) argues UMAP's theoretical claims are overstated. In practice, UMAP often produces useful embeddings, but treat it as a heuristic rather than a principled algorithm.

Kernel PCA

Apply PCA in a reproducing kernel Hilbert space (RKHS) defined by a kernel $k(x_i, x_j)$. This captures nonlinear structure while retaining the eigenvalue framework of PCA. The kernel matrix $K_{ij} = k(x_i, x_j)$ serves as the Gram matrix; the top eigenvectors of the centered kernel matrix give the embedding. Common kernels: RBF (Gaussian), polynomial.

Autoencoders

Learn an encoder $f_{\text{enc}}: \mathbb{R}^d \to \mathbb{R}^k$ and a decoder $f_{\text{dec}}: \mathbb{R}^k \to \mathbb{R}^d$ by minimizing the reconstruction error $\|x - f_{\text{dec}}(f_{\text{enc}}(x))\|^2$. With linear activations and $k < d$, this recovers PCA. With nonlinear activations, it learns nonlinear manifolds. Variational autoencoders add probabilistic structure to the latent space.

When to Use Which

| Method | Preserves | Speed | Downstream use | Deterministic |
|---|---|---|---|---|
| PCA | Global variance | Fast | Yes | Yes |
| Random projection | Pairwise distances | Fast | Yes | No |
| t-SNE | Local neighborhoods | Slow | No (visualization only) | No |
| UMAP | Local + some global | Moderate | Yes (with care) | No |
| Kernel PCA | Nonlinear kernel structure | Moderate | Yes | Yes |
| Autoencoder | Learned reconstruction | Slow (training) | Yes | Yes (given weights) |

Common Confusions

Watch Out

t-SNE cluster distances are meaningless

Two clusters that are far apart in a t-SNE plot may or may not be far apart in the original space. The KL divergence objective prioritizes preserving local structure; global distances are sacrificed. Never interpret inter-cluster distance or relative cluster size from a t-SNE plot.

Watch Out

PCA components are not features

The first principal component is a linear combination of all original features. It is not a single feature. Interpreting PC1 as "the most important feature" is incorrect. You must examine the loadings (coefficients) to understand what each component represents.

Watch Out

Explained variance does not mean explained signal

PCA maximizes explained variance, but variance is not the same as task-relevant information. A high-variance direction could be noise. LDA (Linear Discriminant Analysis) maximizes class separability instead, which is more relevant for classification tasks.

Key Takeaways

  • The curse of dimensionality makes high-dimensional data sparse and distances meaningless; reduction is often necessary
  • PCA: linear, global, fast, preserves variance; optimal for reconstruction
  • JL lemma: random projections preserve pairwise distances in $O(\varepsilon^{-2} \log n)$ dimensions with no data dependence
  • t-SNE: nonlinear, local, for visualization only; do not interpret global structure
  • UMAP: faster than t-SNE, preserves more global structure, usable for downstream tasks
  • Choose the method based on what structure you need to preserve and whether the output is for visualization or for further computation

Exercises

Exercise · Core

Problem

You have $n = 10000$ points in $\mathbb{R}^{500}$. Using the Johnson-Lindenstrauss lemma with $\varepsilon = 0.1$, what is the minimum target dimension $k$ for random projection? Use $k \geq 8 \varepsilon^{-2} \ln n$.

Exercise · Advanced

Problem

Explain why t-SNE with high perplexity (e.g., perplexity $= n/3$) produces embeddings that look more like PCA, while low perplexity (e.g., perplexity $= 5$) produces tighter, more separated clusters. Connect this to the bandwidth $\sigma_i$ in the Gaussian kernel.

Classical Manifold Learning Methods

These methods predate t-SNE and UMAP. They assume data lies on or near a smooth low-dimensional manifold embedded in high-dimensional space.

Isomap (Tenenbaum et al., 2000) approximates geodesic distances on the manifold using shortest paths in a k-nearest-neighbor graph, then applies classical multidimensional scaling (MDS) to the geodesic distance matrix. It preserves global geometry well when the manifold is convex (no "holes"). Failure mode: short-circuit edges in the graph (connecting points that are close in ambient space but far along the manifold) destroy the distance estimates.
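A usage sketch on the classic swiss roll with scikit-learn (`n_neighbors` is the knob that creates or avoids short-circuit edges; the values here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
# Small n_neighbors keeps geodesics faithful; too large a value
# short-circuits across the roll and corrupts the distance matrix
emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(emb.shape)  # (1000, 2)

# The dominant embedding coordinate should track the roll parameter t
corr = abs(np.corrcoef(emb[:, 0], t)[0, 1])
print(f"|corr| between coordinate 1 and roll parameter: {corr:.2f}")
```

Re-running with a large `n_neighbors` (say 100) is a quick way to see the short-circuit failure mode: the correlation with the roll parameter degrades.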

Locally Linear Embedding (LLE) (Roweis & Saul, 2000) assumes each point can be reconstructed as a linear combination of its neighbors. It finds reconstruction weights $W_{ij}$ by solving $\min \sum_i \|x_i - \sum_j W_{ij} x_j\|^2$ subject to $\sum_j W_{ij} = 1$. Then it finds low-dimensional coordinates $y_i$ that preserve these weights: $\min \sum_i \|y_i - \sum_j W_{ij} y_j\|^2$. The second step is an eigenvalue problem. LLE captures local geometry but does not preserve global distances.
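scikit-learn wraps both steps in one estimator. A minimal sketch (neighborhood size and data are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
emb = lle.fit_transform(X)
print(emb.shape)                   # (1000, 2)
# Residual of the weight-preserving (second) optimization step
print(lle.reconstruction_error_)
```

LLE is sensitive to `n_neighbors`: too few and the graph fragments, too many and the locally-linear assumption breaks.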

Laplacian Eigenmaps (Belkin & Niyogi, 2003) builds a weighted graph from the data (weights from a Gaussian kernel), computes the graph Laplacian $L = D - W$, and embeds using the smallest non-trivial eigenvectors of the generalized eigenvalue problem $Ly = \lambda Dy$. This is equivalent to finding coordinates that minimize $\sum_{ij} W_{ij} \|y_i - y_j\|^2$, placing connected points close together. The connection to spectral clustering is direct: the same eigenvectors that give good embeddings also give good cluster assignments.
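scikit-learn's `SpectralEmbedding` implements Laplacian eigenmaps. A sketch on two Gaussian blobs (the blob placement and the RBF bandwidth `gamma` are illustrative choices):

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
# Two Gaussian blobs in 10 dimensions
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)),
               rng.normal(3.0, 1.0, (50, 10))])

# affinity='rbf' builds the Gaussian-kernel weight matrix W
se = SpectralEmbedding(n_components=2, affinity='rbf',
                       gamma=0.05, random_state=0)
emb = se.fit_transform(X)
print(emb.shape)  # (100, 2)
# Strongly connected points land close together, so each blob
# collapses to a tight region of the embedding.
```

The scale-dependence weakness in the table above corresponds to the choice of `gamma` here: it sets which points count as "connected."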

Diffusion Maps (Coifman & Lafon, 2006) use a random walk on the data graph to define a diffusion distance that captures the intrinsic geometry of the manifold. The diffusion distance between two points measures connectivity through many paths, not just the shortest path (unlike Isomap). This makes diffusion maps robust to noise and sampling density variation. The embedding coordinates are the top eigenvectors of the diffusion operator $P^t$, where $t$ controls the scale of the geometry captured.
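The construction is short enough to sketch directly in NumPy (the bandwidth `sigma` and diffusion time `t` are illustrative; there is no single standard implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, n_components=2, sigma=1.0, t=2):
    # Gaussian kernel on pairwise squared distances
    K = np.exp(-squareform(pdist(X, 'sqeuclidean')) / (2 * sigma**2))
    # Row-normalize: P is the random-walk transition matrix
    P = K / K.sum(axis=1, keepdims=True)
    # P is similar to a symmetric matrix, so its spectrum is real
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    # Skip the trivial eigenvalue 1; scale by lambda^t (diffusion time)
    idx = order[1:n_components + 1]
    return eigvecs.real[:, idx] * (eigvals.real[idx] ** t)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
emb = diffusion_map(X)
print(emb.shape)  # (100, 2)
```

Increasing `t` damps the small-eigenvalue directions, so the embedding emphasizes coarser, larger-scale geometry.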

| Method | Preserves | Assumes | Cost | Weakness |
|---|---|---|---|---|
| PCA | Global variance | Linear subspace | $O(nd^2)$ | Misses nonlinear structure |
| Isomap | Geodesic distances | Convex manifold | $O(n^2 \log n)$ | Short-circuit edges |
| LLE | Local linear structure | Locally linear manifold | $O(n^2)$ | No global guarantees |
| Laplacian Eigenmaps | Graph connectivity | Smooth manifold | $O(n^2)$ | Scale-dependent |
| Diffusion Maps | Multi-scale geometry | Smooth manifold | $O(n^2)$ | Bandwidth selection |
| t-SNE | Local neighborhoods | None (heuristic) | $O(n \log n)$ | Distances unreliable |
| UMAP | Local + some global | Manifold (contested) | $O(n \log n)$ | Theory incomplete |

Kernel PCA

Kernel PCA extends PCA to nonlinear feature spaces using the kernel trick. Instead of computing eigenvectors of the covariance matrix $X^\top X / n$, it computes eigenvectors of the kernel matrix $K_{ij} = k(x_i, x_j)$.

The kernel matrix $K$ is the Gram matrix in the feature space $\phi(x)$: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. The top eigenvectors of $K$ give the principal components in feature space, expressed through kernel evaluations. For a Gaussian kernel, this captures nonlinear structure that linear PCA misses.

Kernel PCA is exact (not heuristic) and connects directly to RKHS theory. Its limitation is the $O(n^2)$ cost of computing and storing the kernel matrix, and the $O(n^3)$ cost of the eigendecomposition. For $n > 10{,}000$, approximate methods (Nyström, random features) are needed.
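A sketch with scikit-learn on concentric circles, the standard example where linear PCA fails but an RBF kernel succeeds (the `gamma` value and dataset parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric rings: no linear projection can separate them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

lin = PCA(n_components=2).fit_transform(X)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)

# In the RBF embedding the two rings become linearly separable;
# the linear PCA embedding just rotates the original concentric picture.
print(lin.shape, kpca.shape)
```

Plotting `kpca` colored by `y` makes the separation visible; for larger datasets, swap in `sklearn.kernel_approximation.Nystroem` followed by linear PCA.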

References

Canonical:

  • Johnson & Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space" (1984)
  • van der Maaten & Hinton, "Visualizing Data using t-SNE" (JMLR 2008), Sections 2-3

Manifold learning:

  • Tenenbaum, de Silva, Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction" (Science, 2000). Isomap.
  • Roweis & Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding" (Science, 2000). LLE.
  • Belkin & Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation" (Neural Computation, 2003)
  • Coifman & Lafon, "Diffusion Maps" (Applied and Computational Harmonic Analysis, 2006)
  • Schölkopf, Smola, Müller, "Kernel Principal Component Analysis" (Neural Computation, 1998)

Current:

  • McInnes et al., "UMAP: Uniform Manifold Approximation and Projection" (2018)
  • Vershynin, High-Dimensional Probability (2018), Chapter 5 (JL lemma)
  • Chari & Pachter, "The specious art of single-cell genomics" (2023)

Next Topics

  • Kernels and RKHS: the theoretical framework behind kernel PCA and nonlinear embeddings
  • Autoencoders: learning nonlinear dimensionality reduction with neural networks
  • Riemannian optimization: when the embedding space itself has manifold structure

Last reviewed: April 2026
