
Comparison

Diffusion Models vs. GANs vs. VAEs

Three generative model families compared: GANs use adversarial training for sharp samples but suffer mode collapse, VAEs optimize ELBO for smooth latent spaces but produce blurry outputs, and diffusion models iteratively denoise for high quality at the cost of slow sampling.

What Each Does

All three model families learn to generate samples from an unknown data distribution $p_{\text{data}}(x)$. They differ in how they parameterize and train the generative process.

GANs train a generator $G$ against a discriminator $D$ in a minimax game. The generator maps noise $z \sim p(z)$ to data space, and the discriminator tries to distinguish real from generated samples.

VAEs learn an encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ jointly by maximizing the evidence lower bound (ELBO) on $\log p(x)$.

Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps, then learn to reverse this process step by step.

Side-by-Side Objectives

Definition

GAN Objective

The generator and discriminator solve:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

At the Nash equilibrium (if it exists), $G$ produces samples indistinguishable from real data, and $D$ outputs $1/2$ everywhere.
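The equilibrium claim can be checked numerically. The sketch below (NumPy, with made-up discriminator outputs rather than a trained network) evaluates the minimax value function and confirms that when $D$ outputs $1/2$ everywhere the value is $\log(1/2) + \log(1/2) = -\log 4$:

```python
import numpy as np

# Toy sketch (not a training loop): evaluate the GAN value function
# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] for given discriminator
# outputs on a batch of real samples and a batch of generated samples.
def gan_value(d_real, d_fake):
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the idealized equilibrium, D outputs 1/2 on every input, so the
# value is log(1/2) + log(1/2) = -log 4 ≈ -1.386.
d_half = np.full(8, 0.5)
print(gan_value(d_half, d_half))  # ≈ -1.3863
```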

Definition

VAE Objective (ELBO)

The encoder and decoder maximize:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$

The first term is reconstruction quality. The second term regularizes the latent space toward the prior $p(z)$.
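A minimal sketch of the ELBO's pieces, assuming the standard diagonal-Gaussian encoder and $\mathcal{N}(0, I)$ prior. The encoder outputs here are made-up placeholder values and the decoder is omitted; only the closed-form KL term is computed:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Hypothetical encoder outputs q_phi(z|x) for a single input x:
mu = np.array([0.3, -0.1])
log_var = np.array([-0.2, 0.1])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients flow through the sampling step. The reconstruction term
# log p_theta(x|z) would come from the decoder, omitted in this sketch.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

elbo_kl = kl_to_standard_normal(mu, log_var)
print(elbo_kl)  # KL regularizer; subtracted from the reconstruction term
```

The KL term is zero exactly when the encoder matches the prior ($\mu = 0$, $\sigma = 1$), which is what pulls similar inputs toward a shared, smooth latent region.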

Definition

Diffusion Objective (Simplified)

The denoising model $\epsilon_\theta$ minimizes:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
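One training step follows directly from the objective. The sketch below uses the linear noise schedule from the original DDPM formulation; the zero predictor is a stand-in for the (untrained) network $\epsilon_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear DDPM noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # \bar{alpha}_t, decreasing in t

def forward_sample(x0, t):
    """Draw x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps; return (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = rng.standard_normal(4)           # toy "data" vector
t = rng.integers(T)                   # uniform random timestep
xt, eps = forward_sample(x0, t)

# eps_theta(x_t, t) would be a neural network; zeros stand in here.
eps_pred = np.zeros_like(eps)
loss_simple = np.mean((eps - eps_pred) ** 2)
```

Note this is a plain regression loss, which is why diffusion training avoids the adversarial instabilities discussed below.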

Where Each Is Stronger

GANs win on speed and sharpness

A trained GAN generates a sample in a single forward pass through $G$. No iterative sampling, no Markov chain. The adversarial loss directly penalizes blurry outputs because the discriminator can distinguish them from sharp real images. This makes GANs the fastest at inference among the three.

VAEs win on latent space structure

The KL regularization in the ELBO forces the encoder to map similar inputs to nearby latent codes. This gives VAEs smooth, interpretable latent spaces where interpolation and attribute manipulation are straightforward. Neither GANs nor diffusion models guarantee this property without additional architectural choices.
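A sketch of latent interpolation, assuming two hypothetical codes $z_0$, $z_1$ produced by a trained encoder. Each intermediate code would be passed through the decoder $p_\theta(x|z)$; in a smooth latent space, the decoded samples change gradually along the path:

```python
import numpy as np

# Linear interpolation between two (hypothetical) VAE latent codes.
# Latent-space smoothness is what makes decoding these intermediate
# codes yield plausible samples rather than garbage.
def interpolate(z0, z1, n=5):
    ts = np.linspace(0.0, 1.0, n)
    return np.stack([(1 - t) * z0 + t * z1 for t in ts])

z0 = np.array([0.0, 0.0])
z1 = np.array([1.0, -1.0])
path = interpolate(z0, z1)  # shape (5, 2); decode each row with p_theta(x|z)
```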

Diffusion models win on sample quality and mode coverage

Diffusion models achieve the best (lowest) FID scores on standard benchmarks (ImageNet, LSUN). The training objective is a simple regression loss with no adversarial dynamics, which makes optimization stable. The iterative denoising process covers the full data distribution without mode collapse.

Where Each Fails

GANs fail on training stability and mode coverage

GAN training is a saddle-point optimization, not a minimization. The generator and discriminator can oscillate rather than converge. Mode collapse occurs when $G$ maps all noise vectors to a small subset of the data distribution. Diagnosing this failure is difficult because the loss values do not reliably indicate sample quality.

VAEs fail on sample sharpness

The reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ with a Gaussian decoder amounts to minimizing MSE, which averages over possible outputs and produces blurry samples. The KL term further constrains the model, trading sample quality for latent space regularity. More expressive decoders (autoregressive, flow-based) help but add complexity.
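The averaging effect is easy to see numerically. In this toy illustration (not a VAE), targets are split between two modes; the MSE-minimizing constant prediction sits between them, which is the one-dimensional analogue of a blurry image:

```python
import numpy as np

# Targets drawn from two modes at -1 and +1. The constant prediction
# that minimizes MSE is their mean -- a point "between" the modes,
# i.e. the blurry compromise rather than a sharp sample from either mode.
targets = np.array([-1.0, -1.0, 1.0, 1.0])
candidates = np.linspace(-1.5, 1.5, 301)
mse = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]
print(best)  # approximately 0.0, the mean of the two modes
```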

Diffusion models fail on sampling speed

Generating one sample requires $T$ sequential denoising steps, typically $T = 1000$ in the original formulation. Even with accelerated samplers (DDIM, DPM-Solver), diffusion models need 20 to 50 steps, each requiring a full neural network forward pass. This is 20x to 50x slower than a single-pass GAN.
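The structure of an accelerated sampler can be sketched as below: a deterministic DDIM-style update run on a coarse 20-step subsequence of the full schedule. The denoiser here is a zero-output stand-in for a trained network, so this illustrates the step-count saving, not a faithful sampler implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(x, t):
    """Stand-in for the trained denoiser; a real one is a neural network."""
    return np.zeros_like(x)

# DDIM-style deterministic update on 20 of the 1000 timesteps: each
# iteration predicts x_0 from x_t, then re-noises to the previous step.
# Fewer network calls is exactly how DDIM trades T=1000 down to ~20-50.
steps = np.linspace(T - 1, 0, 20, dtype=int)
x = rng.standard_normal(4)  # start from pure noise
for t, t_prev in zip(steps[:-1], steps[1:]):
    eps = eps_theta(x, t)
    x0_pred = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    x = np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1 - alpha_bar[t_prev]) * eps
```

Even in this skeleton, each loop iteration costs one full `eps_theta` call, which is the 20x-50x overhead relative to a single GAN forward pass.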

Key Assumptions That Differ

| | GANs | VAEs | Diffusion |
|---|---|---|---|
| Training | Minimax game | ELBO maximization | Denoising score matching |
| Latent space | Implicit, unstructured | Explicit, regularized | No persistent latent code |
| Loss landscape | Non-convex saddle point | Single objective | Simple MSE regression |
| Mode coverage | Prone to collapse | Good coverage, blurry | Full coverage, high quality |
| Sampling cost | 1 forward pass | 1 forward pass | $T$ forward passes |

The Tradeoff Triangle

Theorem

Quality-Speed-Coverage Tradeoff

Statement

Among the three families, no single approach dominates on all three axes simultaneously:

  1. GANs: fast sampling, sharp outputs, but poor mode coverage.
  2. VAEs: fast sampling, good mode coverage, but blurry outputs.
  3. Diffusion: sharp outputs, good mode coverage, but slow sampling.

Recent work on consistency models and distillation attempts to break this tradeoff by distilling a diffusion model into a single-step generator.

Intuition

Sharp samples require the model to commit to specific details rather than hedging. Mode coverage requires exploring the full distribution. Fast sampling requires few sequential steps. Achieving all three simultaneously is hard because hedging (averaging over modes) is the cheapest way to cover modes, and sequential refinement is the cheapest way to sharpen.

Failure Mode

This tradeoff is empirical, not a proven impossibility. Consistency models and rectified flows suggest the tradeoff can be softened, though not eliminated entirely with current methods.

When a Researcher Would Use Each

Example

Real-time image synthesis (game engines, interactive tools)

Use GANs (or distilled diffusion models). The single-pass generation is necessary for real-time applications. StyleGAN and its variants remain competitive for face generation and similar domains with limited diversity requirements.

Example

Representation learning and disentanglement

Use VAEs. The structured latent space supports downstream tasks: clustering, interpolation, and controlled generation. Beta-VAE and its variants offer explicit control over the reconstruction-regularization tradeoff.

Example

Maximum quality image/video generation

Use diffusion models. When sampling speed is not the bottleneck (offline generation, batch processing), diffusion models produce the best results. Text-to-image systems (Stable Diffusion, DALL-E 3, Imagen) all use diffusion.

Common Confusions

Watch Out

VAEs are not just worse diffusion models

VAEs and diffusion models solve different problems. VAEs provide a structured latent space with explicit encoder and decoder. Diffusion models provide no persistent latent representation. If you need a latent code for downstream tasks, a VAE is the right tool. If you need maximum sample quality, use diffusion.

Watch Out

GANs are not obsolete

Despite diffusion models dominating benchmarks, GANs remain useful when inference latency matters. A trained GAN generates samples in milliseconds. Diffusion models, even with fast samplers, take seconds. For real-time applications, GANs or GAN-like distilled models are still the practical choice.

Watch Out

Mode collapse is a GAN-specific failure

VAEs do not suffer mode collapse because the ELBO objective covers all data points equally. Diffusion models avoid it because the denoising objective trains on all noise levels uniformly. Mode collapse is specific to the adversarial training dynamic where the generator can "fool" the discriminator by producing a narrow set of high-quality samples.
