Skip to main content

Applied ML

Clustering for Gene Expression

Graph-based clustering on KNN graphs (Louvain, Leiden) is the default for single-cell expression. UMAP is for visualization, not clustering. Batch correction belongs upstream of the graph.

AdvancedTier 3Current~15 min
0

Why This Matters

The first analytical question on a single-cell or bulk gene-expression dataset is: which samples group together. The answer drives cell-type discovery, marker-gene selection, trajectory inference, and almost every downstream comparison. Yet the clustering itself is shockingly sensitive to upstream choices: normalization, feature selection, batch correction, and the resolution parameter. Two analysts running "the same pipeline" often disagree on cluster count by a factor of two.

The dominant pattern is graph-based clustering on a K-nearest-neighbor graph in a low-dimensional embedding (PCA components, scVI latents). Modularity-maximizing algorithms (Louvain, Leiden) partition the graph; UMAP visualizes it. This pipeline scales to millions of cells and respects the manifold structure of cell-state space better than centroid-based methods like k-means clustering.

The traps are mostly methodological. Resolution is a free parameter; batch effects survive PCA and contaminate clusters; UMAP coordinates are routinely treated as a faithful embedding when they are not. Reading a clustering plot well is half the discipline.

Core Ideas

KNN graph plus modularity. Compute pairwise distances in latent space, build the directed KNN graph (typically K[10,50]K \in [10, 50]), symmetrize. Define modularity for partition CC as Q=12mij(Aijγkikj2m)δ(ci,cj)Q = \frac{1}{2m} \sum_{ij} (A_{ij} - \gamma \frac{k_i k_j}{2m}) \delta(c_i, c_j) where AA is the adjacency, kik_i the degree, mm the edge count, and γ\gamma the resolution parameter. Louvain (Blondel et al. 2008) greedily maximizes QQ by moving nodes between communities. Leiden (Traag, Waltman, van Eck 2019, Sci. Rep. 9) fixes a known Louvain failure where communities can become disconnected during refinement; Leiden guarantees well-connected partitions and converges to higher modularity.

Resolution parameter governs cluster count. Higher γ\gamma yields more, smaller clusters. There is no principled choice; common practice is to scan γ[0.1,2.0]\gamma \in [0.1, 2.0], examine cluster stability across runs, and pick a level matching the biological grain of interest. Tools like clustree visualize how cells split as γ\gamma increases.

UMAP for visualization vs. embedding. UMAP (McInnes, Healy, Melville 2018, arXiv:1802.03426) optimizes a cross-entropy objective between fuzzy simplicial sets in high and low dimensions. It produces compelling 2D plots but distorts global distances and density. Cluster on the high-dimensional latent (PCA, scVI), not on UMAP coordinates. Boundaries that look crisp in 2D often correspond to gradients in the underlying space; cluster splits that look small can hide biologically distinct subpopulations. See t-SNE and UMAP for the dimensionality-reduction side.

Batch correction before the graph. If samples come from different patients, technologies, or processing batches, naive PCA mixes biology with batch. Harmony (Korsunsky et al. 2019, Nature Methods 16) iterates between soft clustering and per-batch linear correction in PCA space, removing batch-specific cluster centroids while preserving cell-type structure. Run Harmony on the PCA embedding, build the KNN graph in the corrected space, then cluster. Other options: scVI/scANVI (latent-space integration), BBKNN (batch-balanced KNN), MNN (mutual nearest neighbors). The Luecken et al. 2022 benchmark in Nature Methods compares them on cell-type preservation vs. batch removal.

Comparison to k-means and hierarchical. K-means assumes spherical clusters of similar variance, which fails for cell types of vastly different abundance and shape in latent space. Hierarchical clustering scales as O(n2)O(n^2) in memory and is impractical for n>105n > 10^5. Graph-based methods scale well and impose no centroid assumption. K-means and hierarchical remain useful for small datasets, gene clustering (where g104g \sim 10^4), and bulk RNA-seq sample clustering.

Common Confusions

Watch Out

Cluster count is not data-determined

There is no algorithm that returns the correct number of cell types from expression data alone. Resolution, KNN KK, and integration choices all change cluster count. Treat clusters as an analytical lens, not ground truth. Validate with known marker genes and biological consistency across replicates.

Watch Out

Running Harmony after clustering does not undo batch effects

Batch correction must happen before the KNN graph is built. If clusters are constructed in uncorrected space and then re-labeled with Harmony embeddings, the batch-driven structure persists in the cluster membership. The order is: normalize, select features, embed, integrate, build graph, cluster.

References

Related Topics

Last reviewed: April 18, 2026

Prerequisites

Foundations this topic depends on.

Next Topics