

t-SNE and UMAP

Two dominant nonlinear dimensionality reduction methods: t-SNE preserves local neighborhoods by minimizing a KL divergence with a Student-t output kernel, while UMAP optimizes a cross-entropy between fuzzy simplicial sets. Both excel at visualization but have important limitations.


Why This Matters

When you have high-dimensional data --- gene expression profiles, word embeddings, neural network activations --- you need to visualize it. PCA projects linearly and misses curved structure. t-SNE and UMAP are the standard tools for nonlinear dimensionality reduction that reveals cluster structure invisible to linear methods.

But these methods are widely misused. People read global structure from plots that only preserve local structure, interpret cluster distances that have no meaning, and draw conclusions from artifacts of hyperparameter choices. Understanding the theory prevents these mistakes.

Mental Model

Both t-SNE and UMAP work by the same high-level strategy:

  1. In high-dimensional space, define a probability distribution over pairs of points that captures "which points are neighbors"
  2. In low-dimensional space (2D or 3D), define a similar distribution
  3. Optimize the low-dimensional embedding to make the two distributions match

The methods differ in how they define these distributions and what divergence they minimize between them.

Formal Setup and Notation

Let \{x_1, \ldots, x_n\} \subset \mathbb{R}^D be high-dimensional data points. We seek a low-dimensional embedding \{y_1, \ldots, y_n\} \subset \mathbb{R}^d (typically d = 2) that preserves neighborhood structure.

Definition

t-SNE High-Dimensional Similarities

For each pair (i, j), define the conditional probability that x_i would pick x_j as a neighbor:

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

where \sigma_i is chosen so that the effective number of neighbors (perplexity) matches a user-specified value. The joint probability is symmetrized:

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
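As a sketch of these definitions, the conditional and joint distributions can be computed directly with NumPy. This is illustrative only: the bandwidths are fixed here, whereas real t-SNE tunes each \sigma_i to a target perplexity.

```python
import numpy as np

def conditional_probs(X, i, sigma_i):
    """p_{j|i}: probability that point i picks point j as a neighbor,
    for a fixed bandwidth sigma_i (illustrative; real t-SNE tunes
    sigma_i per point to hit a target perplexity)."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)       # squared distances to x_i
    w = np.exp(-d2 / (2.0 * sigma_i ** 2))
    w[i] = 0.0                                 # a point is not its own neighbor
    return w / w.sum()

def joint_probs(X, sigmas):
    """Symmetrized joint distribution p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = len(X)
    P_cond = np.stack([conditional_probs(X, i, sigmas[i]) for i in range(n)])
    return (P_cond + P_cond.T) / (2.0 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
P = joint_probs(X, sigmas=np.ones(10))
# P is symmetric, has zero diagonal, and sums to 1 over all pairs
```

The division by 2n makes P a single joint distribution over ordered pairs, which is what the KL objective below expects.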

Definition

t-SNE Low-Dimensional Similarities

In the low-dimensional space, similarities use a Student-t distribution with one degree of freedom (a Cauchy distribution):

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|y_k - y_l\|^2)^{-1}}

The heavy tails of the Student-t kernel are the key innovation of t-SNE over its predecessor SNE, which used Gaussians in both spaces.
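The low-dimensional similarities are simpler to compute than the high-dimensional ones, since there is a single global normalization and no per-point bandwidth. A minimal NumPy sketch:

```python
import numpy as np

def tsne_q(Y):
    """Low-dimensional similarities q_ij with the Student-t (Cauchy)
    kernel, normalized over all pairs k != l."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + d2)
    np.fill_diagonal(inv, 0.0)     # exclude self-pairs from the normalization
    return inv / inv.sum()

rng = np.random.default_rng(1)
Y = rng.normal(size=(10, 2))
Q = tsne_q(Y)
# Q is symmetric, has zero diagonal, and sums to 1 over all pairs
```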

Definition

t-SNE Objective

Minimize the KL divergence from the high-dimensional distribution P to the low-dimensional distribution Q:

C = \text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

This is optimized via gradient descent on the positions \{y_i\}.
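The gradient of this objective has a well-known closed form, \partial C/\partial y_i = 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}, which a plain gradient-descent loop can use directly. A minimal sketch, omitting the momentum and early-exaggeration tricks that production implementations add:

```python
import numpy as np

def tsne_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    diff = Y[:, None, :] - Y[None, :, :]              # (n, n, d) pairwise differences
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # Cauchy kernel values
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                               # low-D similarities
    coeff = 4.0 * (P - Q) * inv                       # (n, n) pairwise coefficients
    return np.einsum('ij,ijd->id', coeff, diff)

def tsne_step(P, Y, learning_rate=100.0):
    """One plain gradient-descent update of the embedding."""
    return Y - learning_rate * tsne_gradient(P, Y)
```

The attractive part (p_ij large, points pulled together) and repulsive part (q_ij large, points pushed apart) are both contained in the single (p_ij - q_ij) factor.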

Core Definitions

The perplexity parameter controls the effective number of neighbors. It is related to the bandwidth \sigma_i via:

\text{Perp}(P_i) = 2^{H(P_i)}

where H(P_i) = -\sum_j p_{j|i}\log_2 p_{j|i} is the Shannon entropy of the conditional distribution centered at x_i. Typical values are 5 to 50. Larger perplexity considers more neighbors, producing smoother embeddings.
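Because entropy increases monotonically with \sigma_i (a wider kernel makes the distribution flatter), the bandwidth can be found by bisection. A hedged sketch using dense distances (real implementations restrict to nearest neighbors):

```python
import numpy as np

def find_sigma(d2, target_perplexity, tol=1e-5, max_iter=100):
    """Bisect the bandwidth sigma so that the conditional distribution
    built from squared distances d2 (point i to all others) has
    perplexity 2^H equal to the target."""
    lo, hi = 1e-10, 1e10
    target_H = np.log2(target_perplexity)
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-(d2 - d2.min()) / (2.0 * sigma ** 2))  # shift for stability
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))                # entropy in bits
        if abs(H - target_H) < tol:
            break
        if H > target_H:
            hi = sigma       # distribution too flat: too many effective neighbors
        else:
            lo = sigma
    return sigma
```

Running this once per point i gives the per-point bandwidths \sigma_i used in the high-dimensional similarities.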

The crowding problem is the fundamental challenge that t-SNE solves: in high dimensions, a point can have many equidistant neighbors, but in 2D there is not enough room to arrange them all at the same distance. Moderate-distance points in high-D get "crowded" into a small region in 2D, crushing the structure.

Main Theorems

Proposition

Heavy Tails Solve the Crowding Problem

Statement

Using a Gaussian kernel q_{ij} \propto \exp(-\|y_i - y_j\|^2) in the low-dimensional space causes moderate-distance pairs to collapse onto nearby pairs (crowding). Using a Student-t kernel q_{ij} \propto (1 + \|y_i - y_j\|^2)^{-1} solves this by allowing moderate-distance pairs to be placed much farther apart in the embedding without paying a large penalty.

Specifically, for a pair at moderate distance in high-D, the Student-t kernel assigns non-negligible probability q_{ij} even when \|y_i - y_j\| is large, because the heavy tail decays as \|y_i - y_j\|^{-2} rather than exponentially.

Intuition

The Gaussian kernel in high-D says "these two points are somewhat similar." To match this probability with a Gaussian in low-D, you must place the points close together. But with a Student-t kernel in low-D, you can place them farther apart and still match the probability, because the heavy tail provides probability mass at larger distances. This gives the embedding room to "spread out" and separate clusters, while keeping genuinely close neighbors together.
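A quick numeric comparison makes the tail difference concrete:

```python
import numpy as np

# Unnormalized similarity that a pair contributes at embedding distance d,
# under the two candidate low-D kernels.
gauss = lambda d: np.exp(-d ** 2)
student_t = lambda d: 1.0 / (1.0 + d ** 2)

for d in (1.0, 3.0, 5.0):
    print(f"d={d}: gaussian={gauss(d):.2e}  student-t={student_t(d):.2e}")

# At d=5 the Gaussian has decayed to ~1.4e-11 while the Student-t kernel
# still assigns ~3.8e-2, so a moderately similar pair can sit far apart
# in the embedding without a large mismatch penalty.
```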

UMAP: A Different Foundation

UMAP (Uniform Manifold Approximation and Projection) achieves similar results through different mathematics.

Proposition

UMAP as Fuzzy Simplicial Set Matching

Statement

UMAP constructs a weighted graph (fuzzy simplicial set) in high dimensions by assigning edge weights:

w_{ij} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)

where \rho_i is the distance to the nearest neighbor of x_i and \sigma_i is chosen to achieve a target number of neighbors. The graph is symmetrized via w_{ij}^{\text{sym}} = w_{ij} + w_{ji} - w_{ij} w_{ji}.

In the low-dimensional space, edge weights are:

v_{ij} = (1 + a\|y_i - y_j\|^{2b})^{-1}

where a, b are parameters fit to approximate a target smooth distribution.

UMAP minimizes the fuzzy set cross-entropy between the two graphs:

C = \sum_{i \neq j}\left[w_{ij}\log\frac{w_{ij}}{v_{ij}} + (1 - w_{ij})\log\frac{1 - w_{ij}}{1 - v_{ij}}\right]
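These pieces translate directly into NumPy. A sketch under simplifying assumptions: dense pairwise distances rather than a k-nearest-neighbor graph, user-supplied bandwidths \sigma_i, and the max(·, 0) clamp on the exponent so that each point's nearest neighbor gets weight exactly 1.

```python
import numpy as np

def umap_weights(D, sigmas):
    """Directed weights w_ij = exp(-max(d_ij - rho_i, 0) / sigma_i),
    where rho_i is the distance from i to its nearest neighbor."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)                  # no self-edges (exp(-inf) = 0)
    rho = D.min(axis=1)                          # nearest-neighbor distances
    return np.exp(-np.maximum(D - rho[:, None], 0.0) / sigmas[:, None])

def fuzzy_union(W):
    """Symmetrization w + w' - w*w' (probabilistic t-conorm)."""
    return W + W.T - W * W.T

def low_dim_weights(Y, a=1.0, b=1.0):
    """v_ij = (1 + a * ||y_i - y_j||^(2b))^(-1)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * d2 ** b)

def fuzzy_cross_entropy(W, V, eps=1e-9):
    """Attractive term (w log w/v) plus repulsive term
    ((1-w) log (1-w)/(1-v)), summed over distinct pairs."""
    W = np.clip(W, eps, 1 - eps)
    V = np.clip(V, eps, 1 - eps)
    T = W * np.log(W / V) + (1 - W) * np.log((1 - W) / (1 - V))
    np.fill_diagonal(T, 0.0)                     # sum over i != j only
    return T.sum()
```

Note that fuzzy_union keeps all weights in [0, 1], since w + w' - ww' = 1 - (1 - w)(1 - w'), and the cross-entropy of a graph with itself is zero.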

Intuition

UMAP's cross-entropy objective has two terms. The first (attractive) pulls together points that are neighbors in high-D. The second (repulsive) pushes apart points that are not neighbors. t-SNE's KL divergence only has the attractive term explicitly --- the repulsive force comes implicitly from the normalization of q_{ij}. UMAP's explicit repulsion makes optimization more efficient and avoids the global normalization that makes t-SNE slow.

t-SNE vs. UMAP: Practical Differences

  • Objective: t-SNE minimizes the KL divergence \text{KL}(P \| Q); UMAP minimizes a fuzzy set cross-entropy
  • Low-D kernel: t-SNE uses the Cauchy kernel (1 + d^2)^{-1}; UMAP uses (1 + ad^{2b})^{-1}
  • Normalization: t-SNE normalizes globally over all pairs; UMAP has no global normalization
  • Speed: t-SNE is O(n^2), or O(n \log n) with Barnes-Hut; UMAP is O(n^{1.14}) empirically
  • Scalability: t-SNE is slow for n > 50{,}000; UMAP handles millions of points
  • Global structure: poorly preserved by t-SNE; slightly better preserved by UMAP
  • Hyperparameters: t-SNE has perplexity, learning rate, and iteration count; UMAP has n_neighbors, min_dist, and n_components
  • Determinism: both are stochastic under random initialization; UMAP's spectral initialization helps reproducibility

Canonical Examples

Example

Visualizing MNIST digits

Both t-SNE and UMAP applied to the 784-dimensional MNIST pixel vectors reveal 10 well-separated clusters corresponding to the 10 digit classes. Points within each cluster group by writing style. This is the canonical demo for both methods, showing that they successfully uncover the manifold structure that PCA misses.
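A runnable miniature of this demo, using scikit-learn's built-in 8x8 digits as a small stand-in for full MNIST:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # 1797 samples, 64 dimensions
X, y = X[:500], y[:500]                    # subsample to keep the run fast

emb = TSNE(n_components=2, perplexity=30, init='pca',
           random_state=0).fit_transform(X)
print(emb.shape)                           # (500, 2)

# For the real 784-dimensional MNIST, a common recipe is PCA down to
# ~50 dimensions first, then t-SNE (or UMAP via the umap-learn package).
```

Plotting emb colored by y shows the digit clusters; the same call with umap.UMAP (from umap-learn) in place of TSNE produces a comparable picture.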

Example

Single-cell RNA-seq analysis

In computational biology, UMAP is the standard tool for visualizing single-cell transcriptomics data (20,000+ gene dimensions). Cells cluster by type, and differentiation trajectories appear as continuous paths. UMAP is preferred over t-SNE here because of its speed and ability to embed millions of cells.

Common Confusions

Watch Out

Distances between clusters are meaningless

Both t-SNE and UMAP preserve local structure (which points are neighbors) but not global structure (how far apart clusters are). Two clusters that appear far apart in a t-SNE plot may actually be close in the original space, and vice versa. Never interpret inter-cluster distances from these visualizations.

Watch Out

Cluster sizes are not meaningful

The apparent size of clusters in t-SNE and UMAP plots depends on the density of points in the original space and the hyperparameters. A large cluster in the plot does not necessarily correspond to a large or dispersed cluster in the data. Comparing cluster sizes across a single plot is unreliable.

Watch Out

Different runs give different results

Both methods depend on random initialization. Different random seeds can produce visually different plots of the same data. The overall cluster structure is usually preserved, but the relative positions and orientations of clusters can change. Always run the algorithm multiple times to verify that observed patterns are consistent.
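The instability is easy to demonstrate: two runs on identical data with different seeds produce numerically different embeddings (sketch assuming scikit-learn is available):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:100]                                 # small subset to keep runs fast

runs = [TSNE(n_components=2, perplexity=15, init='random',
             random_state=seed).fit_transform(X) for seed in (0, 1)]

# Same data, same hyperparameters, different seeds: the coordinates
# differ, even though the cluster-level structure is typically similar.
print(np.allclose(runs[0], runs[1]))        # False
```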

Watch Out

Perplexity and n_neighbors change the story

Low perplexity (or few neighbors) emphasizes very local structure, potentially fragmenting real clusters. High perplexity (or many neighbors) washes out fine-grained structure. There is no "correct" value, and the plot can look dramatically different for different settings. Always explore a range of values before drawing conclusions.
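In practice this means sweeping the hyperparameter and comparing the resulting plots side by side; a minimal sweep (assuming scikit-learn):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:100]

# Embed the same data at several perplexities; inspect all of them
# before drawing conclusions from any single plot.
embeddings = {p: TSNE(n_components=2, perplexity=p,
                      random_state=0).fit_transform(X)
              for p in (5, 15, 30)}
```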

Watch Out

UMAP's theoretical foundations are contested

UMAP was originally justified using fuzzy simplicial set theory and Riemannian geometry. Chari and Pachter (2023) showed that the claimed connection between the algorithm and the topological theory is weaker than presented. Specifically, the "fuzzy simplicial set" functor does not behave as described in the original paper, and the cross-entropy loss UMAP actually optimizes does not correspond to a principled topological objective. UMAP works well as a visualization tool, but treat the theoretical justification with skepticism. t-SNE is more honest about what it is: a heuristic that optimizes a specific KL-divergence objective. Use both as visual aids, not as evidence about data structure.

Summary

  • Both methods define similarities in high-D and low-D, then minimize a divergence between them
  • t-SNE: Gaussian in high-D, Student-t in low-D, KL divergence
  • Heavy-tailed low-D kernel solves the crowding problem
  • UMAP: fuzzy simplicial sets, cross-entropy, faster and more scalable
  • Local structure is preserved; global structure is not
  • Hyperparameters (perplexity, n_neighbors) strongly affect the output
  • Never interpret inter-cluster distances or cluster sizes literally

Exercises

Exercise (Core)

Problem

Explain why using a Gaussian kernel in both high and low dimensions (as in the original SNE) leads to the crowding problem. What property of the Student-t kernel fixes this?

Exercise (Advanced)

Problem

Why is UMAP faster than t-SNE? Identify the computational bottleneck in t-SNE that UMAP avoids.

Exercise (Research)

Problem

A colleague claims that two clusters being far apart in a UMAP plot means the corresponding data groups are very different. Construct a counterexample and explain why this interpretation is invalid.

References

Canonical:

  • van der Maaten & Hinton, Visualizing Data using t-SNE (JMLR 2008)
  • McInnes, Healy, Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2018)

Current:

  • Kobak & Berens, The Art of Using t-SNE for Single-Cell Transcriptomics (Nature Communications 2019)

  • Böhm, Berens & Kobak, Attraction-Repulsion Spectrum in Neighbor Embeddings (JMLR 2022)

  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 14

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 12


Last reviewed: April 2026
