ML Methods
t-SNE and UMAP
Two dominant nonlinear dimensionality reduction methods: t-SNE preserves local neighborhoods via KL divergence with a Student-t kernel; UMAP uses fuzzy simplicial sets and cross-entropy. Both excel at visualization but have important limitations.
Why This Matters
When you have high-dimensional data --- gene expression profiles, word embeddings, neural network activations --- you need to visualize it. PCA projects linearly and misses curved structure. t-SNE and UMAP are the standard tools for nonlinear dimensionality reduction that reveals cluster structure invisible to linear methods.
But these methods are widely misused. People read global structure from plots that only preserve local structure, interpret cluster distances that have no meaning, and draw conclusions from artifacts of hyperparameter choices. Understanding the theory prevents these mistakes.
Mental Model
Both t-SNE and UMAP work by the same high-level strategy:
- In high-dimensional space, define a probability distribution over pairs of points that captures "which points are neighbors"
- In low-dimensional space (2D or 3D), define a similar distribution
- Optimize the low-dimensional embedding to make the two distributions match
The methods differ in how they define these distributions and what divergence they minimize between them.
Formal Setup and Notation
Let $x_1, \dots, x_n \in \mathbb{R}^D$ be high-dimensional data points. We seek a low-dimensional embedding $y_1, \dots, y_n \in \mathbb{R}^d$ (typically $d = 2$ or $d = 3$) that preserves neighborhood structure.
t-SNE High-Dimensional Similarities
For each pair $(i, j)$, define the conditional probability that $x_i$ would pick $x_j$ as a neighbor:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

where $\sigma_i$ is chosen so that the effective number of neighbors (perplexity) matches a user-specified value. The joint probability is symmetrized:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
t-SNE Low-Dimensional Similarities
In the low-dimensional space, similarities use a Student-t distribution with one degree of freedom (a Cauchy distribution):

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
The heavy tails of the Student-t kernel are the key innovation of t-SNE over its predecessor SNE, which used Gaussians in both spaces.
t-SNE Objective
Minimize the KL divergence from the high-dimensional distribution $P$ to the low-dimensional distribution $Q$:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

This is optimized via gradient descent on the positions $y_i$.
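The gradient of this objective has a simple closed form: $\partial C / \partial y_i = 4 \sum_j (p_{ij} - q_{ij})\left(1 + \|y_i - y_j\|^2\right)^{-1} (y_i - y_j)$. A minimal NumPy sketch of one descent step follows; it uses plain gradient descent, whereas real implementations add momentum and early exaggeration, and the function name is illustrative:

```python
import numpy as np

def tsne_step(Y, P, lr=0.5):
    """One plain gradient-descent step on the t-SNE KL objective.

    Y : (n, d) current embedding
    P : (n, n) symmetrized joint probabilities (zero diagonal, sums to 1)
    Returns the updated embedding and the KL divergence before the step.
    """
    diff = Y[:, None, :] - Y[None, :, :]        # pairwise differences, shape (n, n, d)
    inv = 1.0 / (1.0 + (diff ** 2).sum(-1))     # unnormalized Student-t kernel
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                         # low-D joint probabilities q_ij
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^-1 * (y_i - y_j)
    grad = 4.0 * (((P - Q) * inv)[:, :, None] * diff).sum(axis=1)
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    return Y - lr * grad, kl
```

Iterating this step on affinities computed from two well-separated point clouds drives the KL divergence down as the embedding pulls each cluster together.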
Core Definitions
The perplexity parameter controls the effective number of neighbors. It is related to the bandwidth $\sigma_i$ via:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$

where $H(P_i)$ is the Shannon entropy of the conditional distribution centered at $x_i$. Typical values are 5 to 50. Larger perplexity considers more neighbors, producing smoother embeddings.
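In practice each $\sigma_i$ is found by a per-point binary search on the precision $\beta_i = 1/(2\sigma_i^2)$ until the entropy matches the target. A sketch in NumPy, assuming a precomputed squared-distance matrix (function name and tolerances are illustrative):

```python
import numpy as np

def conditional_probs(sq_dists, perplexity=30.0, tol=1e-5):
    """Binary-search each precision beta_i = 1/(2 sigma_i^2) so that the
    conditional distribution p_{j|i} attains the target perplexity."""
    n = sq_dists.shape[0]
    P = np.zeros((n, n))
    target = np.log2(perplexity)            # perplexity = 2^entropy
    for i in range(n):
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(64):
            p = np.exp(-beta * sq_dists[i])
            p[i] = 0.0                      # a point is never its own neighbor
            p /= p.sum()
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            if abs(entropy - target) < tol:
                break
            if entropy > target:            # distribution too flat: raise beta
                lo = beta
                beta = beta * 2 if hi == np.inf else (beta + hi) / 2
            else:                           # too peaked: lower beta
                hi = beta
                beta = (beta + lo) / 2
        P[i] = p
    return P
```

The symmetrized joint distribution is then `(P + P.T) / (2 * n)`, matching the formula above.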
The crowding problem is the fundamental challenge that t-SNE solves: in high dimensions, a point can have many equidistant neighbors, but in 2D there is not enough room to arrange them all at the same distance. Moderate-distance points in high-D get "crowded" into a small region in 2D, crushing the structure.
Main Theorems
Heavy Tails Solve the Crowding Problem
Statement
Using a Gaussian kernel in the low-dimensional space causes moderate-distance pairs to collapse onto nearby pairs (crowding). Using a Student-t kernel solves this by allowing moderate-distance pairs to be placed much farther apart in the embedding without paying a large penalty.
Specifically, for a pair at moderate distance in high-D, the Student-t kernel assigns non-negligible probability even when the embedding distance $\|y_i - y_j\|$ is large, because the heavy tail decays polynomially, as $\left(1 + \|y_i - y_j\|^2\right)^{-1}$, rather than exponentially.
Intuition
The Gaussian kernel in high-D says "these two points are somewhat similar." To match this probability with a Gaussian in low-D, you must place the points close together. But with a Student-t kernel in low-D, you can place them farther apart and still match the probability, because the heavy tail provides probability mass at larger distances. This gives the embedding room to "spread out" and separate clusters, while keeping genuinely close neighbors together.
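A quick numerical check makes the asymmetry between the two low-D kernels concrete:

```python
import numpy as np

# Unnormalized similarity assigned to a pair at embedding distance d
d = np.array([0.5, 1.0, 3.0, 5.0])
gaussian = np.exp(-d ** 2)          # SNE's low-D kernel: exponential decay
student_t = 1.0 / (1.0 + d ** 2)    # t-SNE's low-D kernel: polynomial decay
ratio = student_t / gaussian        # grows rapidly with distance
# At d = 5 the Gaussian similarity is ~1.4e-11 while the Student-t
# similarity is still ~0.038, so a moderate-similarity pair can be
# placed far apart without incurring a huge penalty in the objective.
```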
UMAP: A Different Foundation
UMAP (Uniform Manifold Approximation and Projection) achieves similar results through different mathematics.
UMAP as Fuzzy Simplicial Set Matching
Statement
UMAP constructs a weighted graph (fuzzy simplicial set) in high dimensions by assigning edge weights:

$$w_{j|i} = \exp\left(-\frac{\max\left(0,\, d(x_i, x_j) - \rho_i\right)}{\sigma_i}\right)$$

where $\rho_i$ is the distance to the nearest neighbor of $x_i$ and $\sigma_i$ is chosen to achieve a target number of neighbors. The graph is symmetrized via $w_{ij} = w_{j|i} + w_{i|j} - w_{j|i}\, w_{i|j}$.
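A sketch of this weight construction in NumPy, assuming the sorted k-nearest-neighbor distances have already been computed and calibrating each $\sigma_i$ so the weights sum to $\log_2 k$ (the target UMAP uses; function name is illustrative):

```python
import numpy as np

def umap_high_dim_weights(knn_dists, k):
    """Fuzzy edge weights w_{j|i} for each point's k nearest neighbors.

    knn_dists : (n, k) distances to each point's k nearest neighbors,
                sorted ascending, self excluded.
    """
    rho = knn_dists[:, 0]                    # distance to the nearest neighbor
    target = np.log2(k)                      # calibration target for sum_j w_{j|i}
    W = np.zeros_like(knn_dists)
    for i in range(knn_dists.shape[0]):
        lo, hi, sigma = 0.0, np.inf, 1.0
        for _ in range(64):
            w = np.exp(-np.maximum(0.0, knn_dists[i] - rho[i]) / sigma)
            s = w.sum()
            if abs(s - target) < 1e-5:
                break
            if s > target:                   # total membership too high: shrink sigma
                hi = sigma
                sigma = (lo + sigma) / 2
            else:                            # too low: grow sigma
                lo = sigma
                sigma = sigma * 2 if hi == np.inf else (sigma + hi) / 2
        W[i] = w
    return W
```

Note that the nearest neighbor of every point always receives weight exactly 1, since its distance equals $\rho_i$; this local rescaling is what makes UMAP adapt to varying density.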
In the low-dimensional space, edge weights are:

$$v_{ij} = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$

where $a, b$ are parameters fit to approximate a target smooth distribution.
UMAP minimizes the fuzzy set cross-entropy between the two graphs:

$$C = \sum_{i \neq j} \left[ w_{ij} \log \frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - v_{ij}} \right]$$
Intuition
UMAP's cross-entropy objective has two terms. The first (attractive) pulls together points that are neighbors in high-D. The second (repulsive) pushes apart points that are not neighbors. t-SNE's KL divergence only has the attractive term explicitly --- the repulsive force comes implicitly from the global normalization of $Q$. UMAP's explicit repulsion makes optimization more efficient and avoids the global normalization that makes t-SNE slow.
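The two terms can be made concrete by keeping only the parts of the cross-entropy that depend on the embedding (the $w \log w$-style entropy terms are constant in $y$ and drop out of the gradient); a sketch with illustrative names:

```python
import numpy as np

def umap_loss_terms(w, v, eps=1e-12):
    """Attractive and repulsive parts of the fuzzy cross-entropy.

    w : high-D edge memberships in [0, 1] (fixed during optimization)
    v : low-D edge memberships in (0, 1) (functions of the embedding)
    """
    v = np.clip(v, eps, 1.0 - eps)
    attractive = -np.sum(w * np.log(v))               # big when neighbors (w ~ 1) sit far apart (v small)
    repulsive = -np.sum((1.0 - w) * np.log(1.0 - v))  # big when non-neighbors (w ~ 0) sit close (v large)
    return attractive, repulsive
```

Edges with $w = 1$ exert no repulsion at all, and edges with $w = 0$ exert no attraction, which is why UMAP can optimize with cheap per-edge stochastic updates instead of a global normalization.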
t-SNE vs. UMAP: Practical Differences
| | t-SNE | UMAP |
|---|---|---|
| Objective | KL divergence | Cross-entropy of fuzzy sets |
| Low-D kernel | Cauchy: $(1 + d^2)^{-1}$ | $(1 + a d^{2b})^{-1}$ |
| Normalization | Global (over all pairs) | Local (no global normalization) |
| Speed | $O(n^2)$, or $O(n \log n)$ with Barnes-Hut | Near-linear empirically |
| Scalability | Slow for large $n$ | Handles millions of points |
| Global structure | Poorly preserved | Slightly better preserved |
| Hyperparameters | Perplexity, learning rate, iterations | n_neighbors, min_dist, n_components |
| Determinism | Stochastic (random initialization) | Stochastic (spectral initialization helps) |
Canonical Examples
Visualizing MNIST digits
Both t-SNE and UMAP applied to the 784-dimensional MNIST pixel vectors reveal 10 well-separated clusters corresponding to the 10 digit classes. Points within each cluster group by writing style. This is the canonical demo for both methods, showing that they successfully uncover the manifold structure that PCA misses.
Single-cell RNA-seq analysis
In computational biology, UMAP is the standard tool for visualizing single-cell transcriptomics data (20,000+ gene dimensions). Cells cluster by type, and differentiation trajectories appear as continuous paths. UMAP is preferred over t-SNE here because of its speed and ability to embed millions of cells.
Common Confusions
Distances between clusters are meaningless
Both t-SNE and UMAP preserve local structure (which points are neighbors) but not global structure (how far apart clusters are). Two clusters that appear far apart in a t-SNE plot may actually be close in the original space, and vice versa. Never interpret inter-cluster distances from these visualizations.
Cluster sizes are not meaningful
The apparent size of clusters in t-SNE and UMAP plots depends on the density of points in the original space and the hyperparameters. A large cluster in the plot does not necessarily correspond to a large or dispersed cluster in the data. Comparing cluster sizes across a single plot is unreliable.
Different runs give different results
Both methods involve randomness in initialization and optimization. Different random seeds can produce visually different plots of the same data. The overall cluster structure is usually preserved, but the relative positions and orientations of clusters can change. Always run the algorithm multiple times to verify that observed patterns are consistent.
Perplexity and n_neighbors change the story
Low perplexity (or few neighbors) emphasizes very local structure, potentially fragmenting real clusters. High perplexity (or many neighbors) washes out fine-grained structure. There is no "correct" value, and the plot can look dramatically different for different settings. Always explore a range of values before drawing conclusions.
UMAP's theoretical foundations are contested
UMAP was originally justified using fuzzy simplicial set theory and Riemannian geometry. Chari and Pachter (2023) showed that the claimed connection between the algorithm and the topological theory is weaker than presented. Specifically, the "fuzzy simplicial set" functor does not behave as described in the original paper, and the cross-entropy loss UMAP actually optimizes does not correspond to a principled topological objective. UMAP works well as a visualization tool, but treat the theoretical justification with skepticism. t-SNE is more honest about what it is: a heuristic that optimizes a specific KL-divergence objective. Use both as visual aids, not as evidence about data structure.
Summary
- Both methods define similarities in high-D and low-D, then minimize a divergence between them
- t-SNE: Gaussian in high-D, Student-t in low-D, KL divergence
- Heavy-tailed low-D kernel solves the crowding problem
- UMAP: fuzzy simplicial sets, cross-entropy, faster and more scalable
- Local structure is preserved; global structure is not
- Hyperparameters (perplexity, n_neighbors) strongly affect the output
- Never interpret inter-cluster distances or cluster sizes literally
Exercises
Problem
Explain why using a Gaussian kernel in both high and low dimensions (as in the original SNE) leads to the crowding problem. What property of the Student-t kernel fixes this?
Problem
Why is UMAP faster than t-SNE? Identify the computational bottleneck in t-SNE that UMAP avoids.
Problem
A colleague claims that two clusters being far apart in a UMAP plot means the corresponding data groups are very different. Construct a counterexample and explain why this interpretation is invalid.
References
Canonical:
- van der Maaten & Hinton, Visualizing Data using t-SNE (JMLR 2008)
- McInnes, Healy, Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2018)
Current:
- Kobak & Berens, The Art of Using t-SNE for Single-Cell Transcriptomics (Nature Communications 2019)
- Böhm et al., Attraction-Repulsion Spectrum in Neighbor Embeddings (JMLR 2022)
- Chari & Pachter, The Specious Art of Single-Cell Genomics (PLoS Computational Biology 2023)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
The natural next steps from t-SNE and UMAP:
- Autoencoders: learned nonlinear dimensionality reduction
- K-means clustering: what to do after you see clusters in the embedding
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Principal Component Analysis (Layer 1)
- Eigenvalues and Eigenvectors (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Singular Value Decomposition (Layer 0A)