Autoencoders for Single-Cell RNA-seq
Variational autoencoders model single-cell counts with negative-binomial likelihoods, regularize batch effects in latent space, and impute dropout zeros. scVI and scANVI are the current defaults.
Why This Matters
A single-cell RNA-seq experiment produces a sparse count matrix $X \in \mathbb{N}^{N \times G}$, where $N$ is the number of cells (often $10^4$ to $10^6$) and $G$ is the number of genes (about $20{,}000$). Most entries are zero, partly because the gene is not expressed and partly because shallow sequencing missed it. The biological questions are: which cells are similar, what cell types exist, which genes drive transitions between states, and how do these answers change across patients or treatments.
Linear methods (PCA followed by graph clustering) remain the workhorse. But they assume Gaussian residuals after log transformation, which mishandles count noise, dropout zeros, and depth differences across cells. Autoencoders give a principled count likelihood and a smooth latent space; variational autoencoders add a generative model that supports posterior inference, batch correction, and likelihood-based comparison across experiments.
The dominant model is scVI (Lopez et al. 2018, Nature Methods 15). It treats observed counts as zero-inflated negative binomial samples whose parameters depend on a low-dimensional latent state and a per-cell library size, and learns the encoder and decoder by amortized variational inference.
Core Ideas
Negative binomial likelihood for counts. Gene expression counts are overdispersed; Poisson assumes the mean equals the variance, which fails. The negative binomial parameterizes mean and dispersion separately, so $\mathrm{Var}(x) = \mu + \mu^2/\theta$. scVI models gene $g$ in cell $n$ as $x_{ng} \sim \mathrm{ZINB}(\ell_n \rho_{ng}, \theta_g)$, where $\theta_g$ is the gene-wise dispersion, $\ell_n$ is the library size, and $\rho_{ng}$ comes from a neural decoder applied to latent $z_n$.
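The overdispersion claim is easy to check numerically. A minimal NumPy sketch (illustrative, not scVI code) samples negative-binomial counts via the Gamma-Poisson mixture and confirms that the variance exceeds the mean by $\mu^2/\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_nb(mu, theta, size, rng):
    """Draw negative-binomial counts via the Gamma-Poisson mixture:
    lam ~ Gamma(shape=theta, scale=mu/theta), x ~ Poisson(lam).
    Marginally, mean(x) = mu and var(x) = mu + mu**2 / theta."""
    lam = rng.gamma(shape=theta, scale=mu / theta, size=size)
    return rng.poisson(lam)

mu, theta = 5.0, 2.0
x = sample_nb(mu, theta, size=200_000, rng=rng)

print(x.mean())  # close to mu = 5.0
print(x.var())   # close to mu + mu**2/theta = 17.5, far above the Poisson value of 5
```

A Poisson fit would force the variance to 5 here; the dispersion parameter $\theta$ is what lets the model match the extra spread that real count data shows.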
Variational inference with amortized encoders. The encoder outputs a Gaussian $q(z_n \mid x_n, s_n)$ over the latent state and a log-normal $q(\ell_n \mid x_n)$ over the library size; $s_n$ is the batch indicator. The decoder is conditioned on $s_n$ as well, which lets the latent $z_n$ stay batch-free while batch-specific shifts are absorbed into the decoder. Training maximizes the ELBO with the KL to a standard normal prior $\mathcal{N}(0, I)$ on $z_n$. See variational autoencoders for the general setup.
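The two ELBO ingredients the encoder supplies, a reparameterized latent sample and a closed-form Gaussian KL, fit in a few lines. A NumPy sketch, assuming a diagonal Gaussian encoder (all names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps keeps the sample differentiable w.r.t. (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# toy encoder outputs for a batch of 4 cells, 10 latent dims
mu = rng.standard_normal((4, 10)) * 0.1
log_var = np.full((4, 10), -1.0)
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
# ELBO = E_q[ log p(x | z, s) ] - KL; the first term is the count log-likelihood
```

In a real implementation the same computation runs in an autodiff framework so gradients flow through `mu` and `log_var`; the closed-form KL term is exact, only the likelihood term is estimated by sampling.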
Batch correction by latent regularization. Conditioning the decoder on batch is the simplest approach. scVI extensions add adversarial discriminators that penalize batch predictability from $z$, or use mixture priors. scANVI (Xu et al. 2021) extends scVI to a semi-supervised setting where partial cell-type labels guide the latent space, improving label transfer to unlabeled query datasets.
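Decoder conditioning on batch amounts to concatenating a one-hot batch indicator to the latent before decoding. A toy linear sketch (hypothetical names; real decoders are deeper networks):

```python
import numpy as np

def decode(z, batch_idx, n_batches, W, b):
    """Toy linear 'decoder' conditioned on batch: the batch one-hot is
    concatenated to the latent, so batch-specific shifts can be absorbed
    here rather than forced into z. Softmax yields per-gene proportions rho."""
    one_hot = np.eye(n_batches)[batch_idx]             # (cells, n_batches)
    h = np.concatenate([z, one_hot], axis=1) @ W + b   # (cells, genes)
    e = np.exp(h - h.max(axis=1, keepdims=True))       # stable softmax
    return e / e.sum(axis=1, keepdims=True)            # rows sum to 1

rng = np.random.default_rng(2)
n_cells, n_latent, n_batches, n_genes = 5, 10, 2, 8
z = rng.standard_normal((n_cells, n_latent))
batch = np.array([0, 1, 0, 1, 1])
W = rng.standard_normal((n_latent + n_batches, n_genes)) * 0.1
b = np.zeros(n_genes)
rho = decode(z, batch, n_batches, W, b)
```

Because the batch columns of `W` can explain batch-specific expression shifts, the encoder is free to place cells of the same type near each other in $z$ regardless of batch.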
Dropout imputation. Many observed zeros are technical, not biological. The decoder mean is a denoised expression estimate even where the input was zero. Eraslan et al. (2019) introduced DCA (deep count autoencoder), a zero-inflated negative binomial autoencoder built explicitly for imputation; scVI generalizes the same idea with variational inference.
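The key point, that the decoder mean is defined at every entry including observed zeros, is visible in a toy NumPy example (all numbers invented for illustration):

```python
import numpy as np

# toy decoded proportions rho and library sizes for 3 cells x 4 genes
rho = np.array([[0.40, 0.30, 0.20, 0.10],
                [0.25, 0.25, 0.25, 0.25],
                [0.10, 0.20, 0.30, 0.40]])
library = np.array([1000.0, 500.0, 2000.0])

# observed counts: note the zero at cell 0, gene 3
x = np.array([[380, 310, 205,   0],
              [120, 130, 125, 125],
              [210, 390, 610, 790]])

# denoised expression is the decoder mean library * rho, which is
# defined for every entry, including entries observed as zero
denoised = library[:, None] * rho
print(denoised[0, 3])  # about 100: nonzero despite the observed zero count
```

Whether that nonzero estimate should be trusted depends on whether the zero was technical or biological, which is exactly the caution raised under Common Confusions below.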
Comparison to PCA plus UMAP. The standard pipeline log-normalizes counts, runs PCA to a few dozen components, builds a KNN graph, and embeds with UMAP for visualization. Linear PCA on log-counts works remarkably well for visualization and clustering when batch effects are mild. VAE-based latents win when batches are strong, when count noise dominates, or when downstream tasks require a generative model (differential expression with calibrated uncertainty, perturbation prediction). See t-SNE and UMAP for the visualization side.
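For reference, the linear baseline is a few lines of NumPy: depth-normalize, log1p, then PCA via SVD (the $10^4$ scaling factor and 20 components are common defaults, not requirements):

```python
import numpy as np

rng = np.random.default_rng(4)

# toy counts: 100 cells x 50 genes
counts = rng.poisson(2.0, size=(100, 50)).astype(float)

# depth-normalize each cell to a common total, then log1p
lib = counts.sum(axis=1, keepdims=True)
logn = np.log1p(counts / lib * 1e4)

# PCA via SVD of the centered matrix; keep 20 components
centered = logn - logn.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = U[:, :20] * S[:20]   # (cells, 20) embedding used for the KNN graph and clustering
```

The `pcs` matrix plays the same role as the VAE latent $z$: it is the representation on which the KNN graph, clustering, and the UMAP visualization are built.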
Common Confusions
UMAP coordinates are for visualization, not for computation
The 2D UMAP layout is not a faithful embedding for distance, density, or clustering. Build the KNN graph and cluster on the higher-dimensional latent (PCA components or the VAE latent $z$). Interpreting cluster boundaries from a UMAP plot regularly produces artifacts; the same data with different UMAP seeds can look qualitatively different.
Dropout zeros are not all technical
Imputation methods replace zeros with predicted expression based on neighbors. If a gene is genuinely off in a cell type, imputation can hallucinate expression and create false co-expression patterns. Use imputed values for visualization and downstream models that need smooth inputs, not for hypothesis testing about gene regulation.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Variational Autoencoders (Layer 3)
- Autoencoders (Layer 2)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Eigenvalues and Eigenvectors (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)