Autoencoders for Single-Cell RNA-seq

Variational autoencoders model single-cell counts with negative-binomial likelihoods, regularize batch effects in latent space, and impute dropout zeros. scVI and scANVI are the current defaults.


Why This Matters

A single-cell RNA-seq experiment produces a sparse count matrix $X \in \mathbb{N}^{n \times g}$, where $n$ is the number of cells (often $10^5$ to $10^7$) and $g$ the number of genes (about $2 \times 10^4$). Most entries are zero, partly because the gene is not expressed and partly because shallow sequencing missed it. The biological questions are: which cells are similar, what cell types exist, which genes drive transitions between states, and how do these answers change across patients or treatments.

Linear methods (PCA followed by graph clustering) remain the workhorse. But they assume Gaussian residuals after log transformation, which mishandles count noise, dropout zeros, and depth differences across cells. Autoencoders give a principled count likelihood and a smooth latent space; variational autoencoders add a generative model that supports posterior inference, batch correction, and likelihood-based comparison across experiments.

The dominant model is scVI (Lopez et al. 2018, Nature Methods 15). It treats observed counts as zero-inflated negative binomial samples whose parameters depend on a low-dimensional latent state and a per-cell library size, and learns the encoder and decoder by amortized variational inference.

Core Ideas

Negative binomial likelihood for counts. Gene expression counts are overdispersed; Poisson assumes mean equals variance, which fails. The negative binomial parameterizes the mean $\mu$ and dispersion $\theta$ separately, so $\mathrm{Var}[X] = \mu + \mu^2 / \theta$. scVI models gene $j$ in cell $i$ as $X_{ij} \sim \mathrm{NB}(\mu_{ij}, \theta_j)$, where $\mu_{ij} = \ell_i \rho_{ij}$, $\ell_i$ is the library size, and $\rho_{ij}$ comes from a neural decoder applied to the latent $z_i$.
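The overdispersed likelihood can be sketched in plain Python. `nb_logpmf` is an illustrative helper (not scVI's API) using the $(\mu, \theta)$ parameterization above:

```python
from math import exp, lgamma, log

def nb_logpmf(x: int, mu: float, theta: float) -> float:
    """Log-pmf of the negative binomial in the (mean, dispersion) form.

    Var[X] = mu + mu**2 / theta, so a small theta means heavy overdispersion.
    """
    return (
        lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
        + theta * (log(theta) - log(theta + mu))
        + x * (log(mu) - log(theta + mu))
    )

# With mu = 5 and theta = 2 the variance is 5 + 25/2 = 17.5,
# versus 5 for a Poisson with the same mean.
p_at_mean = exp(nb_logpmf(5, 5.0, 2.0))
```

As $\theta \to \infty$ the second moment collapses to the mean and the distribution approaches a Poisson, which is why $\theta$ is often called the inverse-dispersion.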

Variational inference with amortized encoders. The encoder $q_\phi(z_i, \ell_i \mid x_i, b_i)$ outputs a Gaussian over the latent state and a log-normal over the library size; $b_i$ is the batch indicator. The decoder is conditioned on $b_i$ as well, which lets the latent $z_i$ stay batch-free while batch-specific shifts are absorbed into the decoder. Training maximizes the ELBO, whose KL term pulls the posterior toward a standard normal prior. See variational autoencoders for the general setup.
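A minimal sketch of the two pieces amortized VAE training relies on: the closed-form Gaussian KL and the reparameterization trick. Function names are illustrative, not scVI's, and the likelihood is passed in abstractly:

```python
import random
from math import exp

def kl_to_standard_normal(mean, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims:
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * sum(exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var))

def reparameterize(mean, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, 1), so the sample is a
    # differentiable function of the encoder outputs.
    return [m + exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

def elbo_estimate(log_likelihood, mean, log_var, rng):
    # Single-sample Monte Carlo ELBO: E_q[log p(x | z)] - KL(q || prior).
    z = reparameterize(mean, log_var, rng)
    return log_likelihood(z) - kl_to_standard_normal(mean, log_var)
```

In scVI the `log_likelihood` term is the negative binomial log-pmf of the observed counts, with an analogous treatment of the log-normal library-size factor.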

Batch correction by latent regularization. Conditioning the decoder on batch is the simplest approach. scVI extensions add adversarial discriminators that penalize batch predictability from zz, or use mixture priors. scANVI (Xu et al. 2021) extends scVI to a semi-supervised setting where partial cell-type labels guide the latent space, improving label transfer to unlabeled query datasets.
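The decoder-conditioning idea reduces to concatenating a batch one-hot onto the latent before decoding. A toy sketch with hypothetical helper names:

```python
def one_hot(index: int, size: int) -> list:
    v = [0.0] * size
    v[index] = 1.0
    return v

def decoder_input(z: list, batch: int, n_batches: int) -> list:
    # The decoder sees [z, one_hot(batch)], so batch-specific expression
    # shifts can be explained by the decoder instead of leaking into z.
    # An adversarial variant adds a discriminator that tries to predict
    # the batch from z and penalizes its success.
    return z + one_hot(batch, n_batches)
```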

Dropout imputation. Many observed zeros are technical, not biological. The decoder mean $\mu_{ij}$ is a denoised expression estimate even where the input was zero. Eraslan et al. (2019) introduced DCA (deep count autoencoder) using a zero-inflated negative binomial autoencoder explicitly for imputation, before scVI generalized this with variational inference.
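Imputation falls out of the generative model: the decoder produces a normalized profile $\rho_i$ via a softmax, and scaling it by the library size gives the NB mean $\mu_{ij} = \ell_i \rho_{ij}$. A sketch with illustrative names:

```python
from math import exp

def softmax(logits):
    m = max(logits)
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def denoised_expression(decoder_logits, library_size):
    # rho sums to 1 across genes; mu = library_size * rho is the denoised
    # expression estimate, defined even for genes observed as zero.
    rho = softmax(decoder_logits)
    return [library_size * r for r in rho]
```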

Comparison to PCA plus UMAP. The standard pipeline log-normalizes counts, runs PCA to a few dozen components, builds a KNN graph, and embeds with UMAP for visualization. Linear PCA on log-counts works remarkably well for visualization and clustering when batch effects are mild. VAE-based latents win when batches are strong, when count noise dominates, or when downstream tasks require a generative model (differential expression with calibrated uncertainty, perturbation prediction). See t-SNE and UMAP for the visualization side.
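For contrast, the linear baseline fits in a few lines of NumPy. This mirrors the log-normalize-then-PCA steps of the standard pipeline (scanpy's `normalize_total`, `log1p`, `pca`), though the function itself is a sketch, not scanpy's API:

```python
import numpy as np

def lognorm_pca(counts: np.ndarray, n_comps: int = 30, target_sum: float = 1e4):
    """Depth-normalize, log1p, center, and project onto top PCs via thin SVD."""
    lib = counts.sum(axis=1, keepdims=True)          # per-cell library size
    x = np.log1p(counts / lib * target_sum)          # log-normalized expression
    x = x - x.mean(axis=0, keepdims=True)            # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)  # thin SVD
    return u[:, :n_comps] * s[:n_comps]              # PC scores
```

The KNN graph and clustering then run on these scores; UMAP is only the 2D view built on top of the same graph.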

Common Confusions

Watch Out

UMAP coordinates are for visualization, not for computation

The 2D UMAP layout is not a faithful embedding for distance, density, or clustering. Build the KNN graph and cluster on the higher-dimensional latent (PCA components or VAE $z$). Interpreting cluster boundaries from a UMAP plot regularly produces artifacts; the same data with different UMAP seeds can look qualitatively different.

Watch Out

Dropout zeros are not all technical

Imputation methods replace zeros with predicted expression based on neighbors. If a gene is genuinely off in a cell type, imputation can hallucinate expression and create false co-expression patterns. Use imputed values for visualization and downstream models that need smooth inputs, not for hypothesis testing about gene regulation.

Last reviewed: April 18, 2026
