Autoencoders for Single-Cell RNA-seq

Variational autoencoders model single-cell counts with negative-binomial likelihoods, regularize batch effects in latent space, and impute dropout zeros. scVI and scANVI are the current defaults.


Why This Matters

A single-cell RNA-seq experiment produces a sparse count matrix $X \in \mathbb{N}^{n \times g}$, where $n$ is the number of cells (often $10^5$ to $10^7$) and $g$ the number of genes (about $2 \times 10^4$). Most entries are zero, partly because the gene is not expressed and partly because shallow sequencing missed it. The biological questions are: which cells are similar, what cell types exist, which genes drive transitions between states, and how do these answers change across patients or treatments.

Linear methods (PCA followed by graph clustering) remain the workhorse. But they assume Gaussian residuals after log transformation, which mishandles count noise, dropout zeros, and depth differences across cells. Autoencoders give a principled count likelihood and a smooth latent space; variational autoencoders add a generative model that supports posterior inference, batch correction, and likelihood-based comparison across experiments.

The dominant model is scVI (Lopez et al. 2018, Nature Methods 15). It treats observed counts as zero-inflated negative binomial samples whose parameters depend on a low-dimensional latent state and a per-cell library size, and learns the encoder and decoder by amortized variational inference.

Core Ideas

Negative binomial likelihood for counts. Gene expression counts are overdispersed; Poisson assumes mean equals variance, which fails. The negative binomial parameterizes the mean $\mu$ and dispersion $\theta$ separately, so $\mathrm{Var}[X] = \mu + \mu^2 / \theta$. scVI models gene $j$ in cell $i$ as $X_{ij} \sim \mathrm{NB}(\mu_{ij}, \theta_j)$, where $\mu_{ij} = \ell_i \rho_{ij}$, $\ell_i$ is the library size, and $\rho_{ij}$ comes from a neural decoder applied to the latent $z_i$.
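The overdispersed likelihood can be sketched in plain Python. `nb_logpmf` is an illustrative helper (not scVI's API) using the $(\mu, \theta)$ parameterization above:

```python
from math import exp, lgamma, log

def nb_logpmf(x: int, mu: float, theta: float) -> float:
    """Log-pmf of the negative binomial in the (mean, dispersion) form.

    Var[X] = mu + mu**2 / theta, so a small theta means heavy overdispersion.
    """
    return (
        lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
        + theta * (log(theta) - log(theta + mu))
        + x * (log(mu) - log(theta + mu))
    )

# With mu = 5 and theta = 2 the variance is 5 + 25/2 = 17.5,
# versus 5 for a Poisson with the same mean.
p_at_mean = exp(nb_logpmf(5, 5.0, 2.0))
```

As $\theta \to \infty$ the second moment collapses to the mean and the distribution approaches a Poisson, which is why $\theta$ is often called the inverse-dispersion.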

Variational inference with amortized encoders. The encoder $q_\phi(z_i, \ell_i \mid x_i, b_i)$ outputs a Gaussian over the latent state and a log-normal over the library size; $b_i$ is the batch indicator. The decoder is conditioned on $b_i$ as well, which lets the latent $z_i$ stay batch-free while batch-specific shifts are absorbed into the decoder. Training maximizes the ELBO, whose KL term pulls the posterior toward a standard normal prior. See variational autoencoders for the general setup.
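A minimal sketch of the two pieces amortized VAE training relies on: the closed-form Gaussian KL and the reparameterization trick. Function names are illustrative, not scVI's, and the likelihood is passed in abstractly:

```python
import random
from math import exp

def kl_to_standard_normal(mean, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims:
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * sum(exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var))

def reparameterize(mean, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, 1), so the sample is a
    # differentiable function of the encoder outputs.
    return [m + exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

def elbo_estimate(log_likelihood, mean, log_var, rng):
    # Single-sample Monte Carlo ELBO: E_q[log p(x | z)] - KL(q || prior).
    z = reparameterize(mean, log_var, rng)
    return log_likelihood(z) - kl_to_standard_normal(mean, log_var)
```

In scVI the `log_likelihood` term is the negative binomial log-pmf of the observed counts, with an analogous treatment of the log-normal library-size factor.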

Batch correction by latent regularization. Conditioning the decoder on batch is the simplest approach. scVI extensions add adversarial discriminators that penalize batch predictability from zz, or use mixture priors. scANVI (Xu et al. 2021) extends scVI to a semi-supervised setting where partial cell-type labels guide the latent space, improving label transfer to unlabeled query datasets.
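The decoder-conditioning idea reduces to concatenating a batch one-hot onto the latent before decoding. A toy sketch with hypothetical helper names:

```python
def one_hot(index: int, size: int) -> list:
    v = [0.0] * size
    v[index] = 1.0
    return v

def decoder_input(z: list, batch: int, n_batches: int) -> list:
    # The decoder sees [z, one_hot(batch)], so batch-specific expression
    # shifts can be explained by the decoder instead of leaking into z.
    # An adversarial variant adds a discriminator that tries to predict
    # the batch from z and penalizes its success.
    return z + one_hot(batch, n_batches)
```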

Dropout imputation. Many observed zeros are technical, not biological. The decoder mean $\mu_{ij}$ is a denoised expression estimate even where the input was zero. Eraslan et al. (2019) introduced DCA (deep count autoencoder) using a zero-inflated negative binomial autoencoder explicitly for imputation, before scVI generalized this with variational inference.
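Imputation falls out of the generative model: the decoder produces a normalized profile $\rho_i$ via a softmax, and scaling it by the library size gives the NB mean $\mu_{ij} = \ell_i \rho_{ij}$. A sketch with illustrative names:

```python
from math import exp

def softmax(logits):
    m = max(logits)
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def denoised_expression(decoder_logits, library_size):
    # rho sums to 1 across genes; mu = library_size * rho is the denoised
    # expression estimate, defined even for genes observed as zero.
    rho = softmax(decoder_logits)
    return [library_size * r for r in rho]
```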

Comparison to PCA plus UMAP. The standard pipeline log-normalizes counts, runs PCA to a few dozen components, builds a KNN graph, and embeds with UMAP for visualization. Linear PCA on log-counts works remarkably well for visualization and clustering when batch effects are mild. VAE-based latents win when batches are strong, when count noise dominates, or when downstream tasks require a generative model (differential expression with calibrated uncertainty, perturbation prediction). See t-SNE and UMAP for the visualization side.
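For contrast, the linear baseline fits in a few lines of NumPy. This mirrors the log-normalize-then-PCA steps of the standard pipeline (scanpy's `normalize_total`, `log1p`, `pca`), though the function itself is a sketch, not scanpy's API:

```python
import numpy as np

def lognorm_pca(counts: np.ndarray, n_comps: int = 30, target_sum: float = 1e4):
    """Depth-normalize, log1p, center, and project onto top PCs via thin SVD."""
    lib = counts.sum(axis=1, keepdims=True)          # per-cell library size
    x = np.log1p(counts / lib * target_sum)          # log-normalized expression
    x = x - x.mean(axis=0, keepdims=True)            # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)  # thin SVD
    return u[:, :n_comps] * s[:n_comps]              # PC scores
```

The KNN graph and clustering then run on these scores; UMAP is only the 2D view built on top of the same graph.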

Common Confusions

Watch Out

UMAP coordinates are for visualization, not for computation

The 2D UMAP layout is not a faithful embedding for distance, density, or clustering. Build the KNN graph and cluster on the higher-dimensional latent (PCA components or VAE $z$). Interpreting cluster boundaries from a UMAP plot regularly produces artifacts; the same data with different UMAP seeds can look qualitatively different.

Watch Out

Dropout zeros are not all technical

Imputation methods replace zeros with predicted expression based on neighbors. If a gene is genuinely off in a cell type, imputation can hallucinate expression and create false co-expression patterns. Use imputed values for visualization and downstream models that need smooth inputs, not for hypothesis testing about gene regulation.

Last reviewed: April 18, 2026
