

t-SNE and UMAP

Two dominant nonlinear dimensionality reduction methods: t-SNE preserves local neighborhoods by minimizing a KL divergence with a Student-t output kernel, while UMAP optimizes a cross-entropy between fuzzy simplicial sets. Both excel at visualization but have important limitations.


Why This Matters

When you have high-dimensional data --- gene expression profiles, word embeddings, neural network activations --- you need to visualize it. PCA projects linearly and misses curved structure. t-SNE and UMAP are the standard tools for nonlinear dimensionality reduction that reveals cluster structure invisible to linear methods.

But these methods are widely misused. People read global structure from plots that only preserve local structure, interpret cluster distances that have no meaning, and draw conclusions from artifacts of hyperparameter choices. Understanding the theory prevents these mistakes.

Mental Model

Both t-SNE and UMAP work by the same high-level strategy:

  1. In high-dimensional space, define a probability distribution over pairs of points that captures "which points are neighbors"
  2. In low-dimensional space (2D or 3D), define a similar distribution
  3. Optimize the low-dimensional embedding to make the two distributions match

The methods differ in how they define these distributions and what divergence they minimize between them.

Formal Setup and Notation

Let \{x_1, \ldots, x_n\} \subset \mathbb{R}^D be high-dimensional data points. We seek a low-dimensional embedding \{y_1, \ldots, y_n\} \subset \mathbb{R}^d (typically d = 2) that preserves neighborhood structure.

Definition

t-SNE High-Dimensional Similarities

For each pair (i, j), define the conditional probability that x_i would pick x_j as a neighbor:

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

where \sigma_i is chosen so that the effective number of neighbors (perplexity) matches a user-specified value. The joint probability is symmetrized:

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
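As a sketch of these definitions, the conditional and joint distributions can be computed directly with NumPy. This is illustrative only: the bandwidths are fixed here, whereas real t-SNE tunes each \sigma_i to a target perplexity.

```python
import numpy as np

def conditional_probs(X, i, sigma_i):
    """p_{j|i}: probability that point i picks point j as a neighbor,
    for a fixed bandwidth sigma_i (illustrative; real t-SNE tunes
    sigma_i per point to hit a target perplexity)."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)       # squared distances to x_i
    w = np.exp(-d2 / (2.0 * sigma_i ** 2))
    w[i] = 0.0                                 # a point is not its own neighbor
    return w / w.sum()

def joint_probs(X, sigmas):
    """Symmetrized joint distribution p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = len(X)
    P_cond = np.stack([conditional_probs(X, i, sigmas[i]) for i in range(n)])
    return (P_cond + P_cond.T) / (2.0 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
P = joint_probs(X, sigmas=np.ones(10))
# P is symmetric, has zero diagonal, and sums to 1 over all pairs
```

The division by 2n makes P a single joint distribution over ordered pairs, which is what the KL objective below expects.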

Definition

t-SNE Low-Dimensional Similarities

In the low-dimensional space, similarities use a Student-t distribution with one degree of freedom (a Cauchy distribution):

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|y_k - y_l\|^2)^{-1}}

The heavy tails of the Student-t kernel are the key innovation of t-SNE over its predecessor SNE, which used Gaussians in both spaces.
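The low-dimensional similarities are simpler to compute than the high-dimensional ones, since there is a single global normalization and no per-point bandwidth. A minimal NumPy sketch:

```python
import numpy as np

def tsne_q(Y):
    """Low-dimensional similarities q_ij with the Student-t (Cauchy)
    kernel, normalized over all pairs k != l."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + d2)
    np.fill_diagonal(inv, 0.0)     # exclude self-pairs from the normalization
    return inv / inv.sum()

rng = np.random.default_rng(1)
Y = rng.normal(size=(10, 2))
Q = tsne_q(Y)
# Q is symmetric, has zero diagonal, and sums to 1 over all pairs
```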

Definition

t-SNE Objective

Minimize the KL divergence from the high-dimensional distribution P to the low-dimensional distribution Q:

C = \text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

This is optimized via gradient descent on the positions \{y_i\}.
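The gradient of this objective has a well-known closed form, \partial C/\partial y_i = 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}, which a plain gradient-descent loop can use directly. A minimal sketch, omitting the momentum and early-exaggeration tricks that production implementations add:

```python
import numpy as np

def tsne_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    diff = Y[:, None, :] - Y[None, :, :]              # (n, n, d) pairwise differences
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # Cauchy kernel values
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                               # low-D similarities
    coeff = 4.0 * (P - Q) * inv                       # (n, n) pairwise coefficients
    return np.einsum('ij,ijd->id', coeff, diff)

def tsne_step(P, Y, learning_rate=100.0):
    """One plain gradient-descent update of the embedding."""
    return Y - learning_rate * tsne_gradient(P, Y)
```

The attractive part (p_ij large, points pulled together) and repulsive part (q_ij large, points pushed apart) are both contained in the single (p_ij - q_ij) factor.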

Core Definitions

The perplexity parameter controls the effective number of neighbors. It is related to the bandwidth \sigma_i via:

\text{Perp}(P_i) = 2^{H(P_i)}

where H(P_i) = -\sum_j p_{j|i}\log_2 p_{j|i} is the Shannon entropy of the conditional distribution centered at x_i. Typical values are 5 to 50. Larger perplexity considers more neighbors, producing smoother embeddings.
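Because entropy increases monotonically with \sigma_i (a wider kernel makes the distribution flatter), the bandwidth can be found by bisection. A hedged sketch using dense distances (real implementations restrict to nearest neighbors):

```python
import numpy as np

def find_sigma(d2, target_perplexity, tol=1e-5, max_iter=100):
    """Bisect the bandwidth sigma so that the conditional distribution
    built from squared distances d2 (point i to all others) has
    perplexity 2^H equal to the target."""
    lo, hi = 1e-10, 1e10
    target_H = np.log2(target_perplexity)
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-(d2 - d2.min()) / (2.0 * sigma ** 2))  # shift for stability
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))                # entropy in bits
        if abs(H - target_H) < tol:
            break
        if H > target_H:
            hi = sigma       # distribution too flat: too many effective neighbors
        else:
            lo = sigma
    return sigma
```

Running this once per point i gives the per-point bandwidths \sigma_i used in the high-dimensional similarities.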

The crowding problem is the fundamental challenge that t-SNE solves: in high dimensions, a point can have many equidistant neighbors, but in 2D there is not enough room to arrange them all at the same distance. Moderate-distance points in high-D get "crowded" into a small region in 2D, crushing the structure.

Main Theorems

Proposition

Heavy Tails Solve the Crowding Problem

Statement

Using a Gaussian kernel q_{ij} \propto \exp(-\|y_i - y_j\|^2) in the low-dimensional space causes moderate-distance pairs to collapse onto nearby pairs (crowding). Using a Student-t kernel q_{ij} \propto (1 + \|y_i - y_j\|^2)^{-1} solves this by allowing moderate-distance pairs to be placed much farther apart in the embedding without paying a large penalty.

Specifically, for a pair at moderate distance in high-D, the Student-t kernel assigns non-negligible probability q_{ij} even when \|y_i - y_j\| is large, because the heavy tail decays as \|y_i - y_j\|^{-2} rather than exponentially.

Intuition

The Gaussian kernel in high-D says "these two points are somewhat similar." To match this probability with a Gaussian in low-D, you must place the points close together. But with a Student-t kernel in low-D, you can place them farther apart and still match the probability, because the heavy tail provides probability mass at larger distances. This gives the embedding room to "spread out" and separate clusters, while keeping genuinely close neighbors together.
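A quick numeric comparison makes the tail difference concrete:

```python
import numpy as np

# Unnormalized similarity that a pair contributes at embedding distance d,
# under the two candidate low-D kernels.
gauss = lambda d: np.exp(-d ** 2)
student_t = lambda d: 1.0 / (1.0 + d ** 2)

for d in (1.0, 3.0, 5.0):
    print(f"d={d}: gaussian={gauss(d):.2e}  student-t={student_t(d):.2e}")

# At d=5 the Gaussian has decayed to ~1.4e-11 while the Student-t kernel
# still assigns ~3.8e-2, so a moderately similar pair can sit far apart
# in the embedding without a large mismatch penalty.
```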

UMAP: A Different Foundation

UMAP (Uniform Manifold Approximation and Projection) achieves similar results through different mathematics.

Proposition

UMAP as Fuzzy Simplicial Set Matching

Statement

UMAP constructs a weighted graph (fuzzy simplicial set) in high dimensions by assigning edge weights:

w_{ij} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)

where \rho_i is the distance to the nearest neighbor of x_i and \sigma_i is chosen to achieve a target number of neighbors. The graph is symmetrized via w_{ij}^{\text{sym}} = w_{ij} + w_{ji} - w_{ij} w_{ji}.

In the low-dimensional space, edge weights are:

v_{ij} = (1 + a\|y_i - y_j\|^{2b})^{-1}

where a, b are parameters fit to approximate a target smooth distribution.

UMAP minimizes the fuzzy set cross-entropy between the two graphs:

C = \sum_{i \neq j}\left[w_{ij}\log\frac{w_{ij}}{v_{ij}} + (1 - w_{ij})\log\frac{1 - w_{ij}}{1 - v_{ij}}\right]
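These pieces translate directly into NumPy. A sketch under simplifying assumptions: dense pairwise distances rather than a k-nearest-neighbor graph, user-supplied bandwidths \sigma_i, and the max(·, 0) clamp on the exponent so that each point's nearest neighbor gets weight exactly 1.

```python
import numpy as np

def umap_weights(D, sigmas):
    """Directed weights w_ij = exp(-max(d_ij - rho_i, 0) / sigma_i),
    where rho_i is the distance from i to its nearest neighbor."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)                  # no self-edges (exp(-inf) = 0)
    rho = D.min(axis=1)                          # nearest-neighbor distances
    return np.exp(-np.maximum(D - rho[:, None], 0.0) / sigmas[:, None])

def fuzzy_union(W):
    """Symmetrization w + w' - w*w' (probabilistic t-conorm)."""
    return W + W.T - W * W.T

def low_dim_weights(Y, a=1.0, b=1.0):
    """v_ij = (1 + a * ||y_i - y_j||^(2b))^(-1)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * d2 ** b)

def fuzzy_cross_entropy(W, V, eps=1e-9):
    """Attractive term (w log w/v) plus repulsive term
    ((1-w) log (1-w)/(1-v)), summed over distinct pairs."""
    W = np.clip(W, eps, 1 - eps)
    V = np.clip(V, eps, 1 - eps)
    T = W * np.log(W / V) + (1 - W) * np.log((1 - W) / (1 - V))
    np.fill_diagonal(T, 0.0)                     # sum over i != j only
    return T.sum()
```

Note that fuzzy_union keeps all weights in [0, 1], since w + w' - ww' = 1 - (1 - w)(1 - w'), and the cross-entropy of a graph with itself is zero.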

Intuition

UMAP's cross-entropy objective has two terms. The first (attractive) pulls together points that are neighbors in high-D. The second (repulsive) pushes apart points that are not neighbors. t-SNE's KL divergence only has the attractive term explicitly --- the repulsive force comes implicitly from the normalization of q_{ij}. UMAP's explicit repulsion makes optimization more efficient and avoids the global normalization that makes t-SNE slow.

t-SNE vs. UMAP: Practical Differences

  • Objective: t-SNE minimizes the KL divergence \text{KL}(P \| Q); UMAP minimizes a fuzzy set cross-entropy
  • Low-D kernel: t-SNE uses the Cauchy kernel (1 + d^2)^{-1}; UMAP uses (1 + ad^{2b})^{-1}
  • Normalization: t-SNE normalizes globally over all pairs; UMAP has no global normalization
  • Speed: t-SNE is O(n^2), or O(n \log n) with Barnes-Hut; UMAP is O(n^{1.14}) empirically
  • Scalability: t-SNE is slow for n > 50{,}000; UMAP handles millions of points
  • Global structure: poorly preserved by t-SNE; slightly better preserved by UMAP
  • Hyperparameters: t-SNE has perplexity, learning rate, and iteration count; UMAP has n_neighbors, min_dist, and n_components
  • Determinism: both are stochastic under random initialization; UMAP's spectral initialization helps reproducibility

Canonical Examples

Example

Visualizing MNIST digits

Both t-SNE and UMAP applied to the 784-dimensional MNIST pixel vectors reveal 10 well-separated clusters corresponding to the 10 digit classes. Points within each cluster group by writing style. This is the canonical demo for both methods, showing that they successfully uncover the manifold structure that PCA misses.
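A runnable miniature of this demo, using scikit-learn's built-in 8x8 digits as a small stand-in for full MNIST:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # 1797 samples, 64 dimensions
X, y = X[:500], y[:500]                    # subsample to keep the run fast

emb = TSNE(n_components=2, perplexity=30, init='pca',
           random_state=0).fit_transform(X)
print(emb.shape)                           # (500, 2)

# For the real 784-dimensional MNIST, a common recipe is PCA down to
# ~50 dimensions first, then t-SNE (or UMAP via the umap-learn package).
```

Plotting emb colored by y shows the digit clusters; the same call with umap.UMAP (from umap-learn) in place of TSNE produces a comparable picture.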

Example

Single-cell RNA-seq analysis

In computational biology, UMAP is the standard tool for visualizing single-cell transcriptomics data (20,000+ gene dimensions). Cells cluster by type, and differentiation trajectories appear as continuous paths. UMAP is preferred over t-SNE here because of its speed and ability to embed millions of cells.

Common Confusions

Watch Out

Distances between clusters are meaningless

Both t-SNE and UMAP preserve local structure (which points are neighbors) but not global structure (how far apart clusters are). Two clusters that appear far apart in a t-SNE plot may actually be close in the original space, and vice versa. Never interpret inter-cluster distances from these visualizations.

Watch Out

Cluster sizes are not meaningful

The apparent size of clusters in t-SNE and UMAP plots depends on the density of points in the original space and the hyperparameters. A large cluster in the plot does not necessarily correspond to a large or dispersed cluster in the data. Comparing cluster sizes across a single plot is unreliable.

Watch Out

Different runs give different results

Both methods depend on random initialization. Different random seeds can produce visually different plots of the same data. The overall cluster structure is usually preserved, but the relative positions and orientations of clusters can change. Always run the algorithm multiple times to verify that observed patterns are consistent.
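The instability is easy to demonstrate: two runs on identical data with different seeds produce numerically different embeddings (sketch assuming scikit-learn is available):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:100]                                 # small subset to keep runs fast

runs = [TSNE(n_components=2, perplexity=15, init='random',
             random_state=seed).fit_transform(X) for seed in (0, 1)]

# Same data, same hyperparameters, different seeds: the coordinates
# differ, even though the cluster-level structure is typically similar.
print(np.allclose(runs[0], runs[1]))        # False
```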

Watch Out

Perplexity and n_neighbors change the story

Low perplexity (or few neighbors) emphasizes very local structure, potentially fragmenting real clusters. High perplexity (or many neighbors) washes out fine-grained structure. There is no "correct" value, and the plot can look dramatically different for different settings. Always explore a range of values before drawing conclusions.
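In practice this means sweeping the hyperparameter and comparing the resulting plots side by side; a minimal sweep (assuming scikit-learn):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:100]

# Embed the same data at several perplexities; inspect all of them
# before drawing conclusions from any single plot.
embeddings = {p: TSNE(n_components=2, perplexity=p,
                      random_state=0).fit_transform(X)
              for p in (5, 15, 30)}
```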

Watch Out

UMAP's theoretical foundations are contested

UMAP was originally justified using fuzzy simplicial set theory and Riemannian geometry. Chari and Pachter (2023) showed that the claimed connection between the algorithm and the topological theory is weaker than presented. Specifically, the "fuzzy simplicial set" functor does not behave as described in the original paper, and the cross-entropy loss UMAP actually optimizes does not correspond to a principled topological objective. UMAP works well as a visualization tool, but treat the theoretical justification with skepticism. t-SNE is more honest about what it is: a heuristic that optimizes a specific KL-divergence objective. Use both as visual aids, not as evidence about data structure.

Summary

  • Both methods define similarities in high-D and low-D, then minimize a divergence between them
  • t-SNE: Gaussian in high-D, Student-t in low-D, KL divergence
  • Heavy-tailed low-D kernel solves the crowding problem
  • UMAP: fuzzy simplicial sets, cross-entropy, faster and more scalable
  • Local structure is preserved; global structure is not
  • Hyperparameters (perplexity, n_neighbors) strongly affect the output
  • Never interpret inter-cluster distances or cluster sizes literally

Exercises

Exercise (Core)

Problem

Explain why using a Gaussian kernel in both high and low dimensions (as in the original SNE) leads to the crowding problem. What property of the Student-t kernel fixes this?

Exercise (Advanced)

Problem

Why is UMAP faster than t-SNE? Identify the computational bottleneck in t-SNE that UMAP avoids.

Exercise (Research)

Problem

A colleague claims that two clusters being far apart in a UMAP plot means the corresponding data groups are very different. Construct a counterexample and explain why this interpretation is invalid.

References

Canonical:

  • van der Maaten & Hinton, Visualizing Data using t-SNE (JMLR 2008)
  • McInnes, Healy, Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2018)

Current:

  • Kobak & Berens, The Art of Using t-SNE for Single-Cell Transcriptomics (Nature Communications 2019)

  • Böhm, Berens & Kobak, Attraction-Repulsion Spectrum in Neighbor Embeddings (JMLR 2022)

  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 14

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 12


Last reviewed: April 2026
