
Comparison

Diffusion Models vs. GANs vs. VAEs

Three generative model families compared: GANs use adversarial training for sharp samples but suffer mode collapse, VAEs optimize ELBO for smooth latent spaces but produce blurry outputs, and diffusion models iteratively denoise for high quality at the cost of slow sampling.

What Each Does

All three model families learn to generate samples from an unknown data distribution $p_{\text{data}}(x)$. They differ in how they parameterize and train the generative process.

GANs train a generator $G$ against a discriminator $D$ in a minimax game. The generator maps noise $z \sim p(z)$ to data space, and the discriminator tries to distinguish real from generated samples.

VAEs learn an encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ jointly by maximizing the evidence lower bound (ELBO) on $\log p(x)$.

Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps, then learn to reverse this process step by step.

Side-by-Side Objectives

Definition

GAN Objective

The generator and discriminator solve:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

At the Nash equilibrium (if it exists), $G$ produces samples indistinguishable from real data, and $D$ outputs $1/2$ everywhere.
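The equilibrium claim can be checked numerically. The sketch below (NumPy, with made-up discriminator outputs rather than a trained network) evaluates the minimax value function and confirms that when $D$ outputs $1/2$ everywhere the value is $\log(1/2) + \log(1/2) = -\log 4$:

```python
import numpy as np

# Toy sketch (not a training loop): evaluate the GAN value function
# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] for given discriminator
# outputs on a batch of real samples and a batch of generated samples.
def gan_value(d_real, d_fake):
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the idealized equilibrium, D outputs 1/2 on every input, so the
# value is log(1/2) + log(1/2) = -log 4 ≈ -1.386.
d_half = np.full(8, 0.5)
print(gan_value(d_half, d_half))  # ≈ -1.3863
```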

Definition

VAE Objective (ELBO)

The encoder and decoder maximize:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$

The first term is reconstruction quality. The second term regularizes the latent space toward the prior $p(z)$.
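A minimal sketch of the ELBO's pieces, assuming the standard diagonal-Gaussian encoder and $\mathcal{N}(0, I)$ prior. The encoder outputs here are made-up placeholder values and the decoder is omitted; only the closed-form KL term is computed:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Hypothetical encoder outputs q_phi(z|x) for a single input x:
mu = np.array([0.3, -0.1])
log_var = np.array([-0.2, 0.1])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients flow through the sampling step. The reconstruction term
# log p_theta(x|z) would come from the decoder, omitted in this sketch.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

elbo_kl = kl_to_standard_normal(mu, log_var)
print(elbo_kl)  # KL regularizer; subtracted from the reconstruction term
```

The KL term is zero exactly when the encoder matches the prior ($\mu = 0$, $\sigma = 1$), which is what pulls similar inputs toward a shared, smooth latent region.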

Definition

Diffusion Objective (Simplified)

The denoising model $\epsilon_\theta$ minimizes:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
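One training step follows directly from the objective. The sketch below uses the linear noise schedule from the original DDPM formulation; the zero predictor is a stand-in for the (untrained) network $\epsilon_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear DDPM noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # \bar{alpha}_t, decreasing in t

def forward_sample(x0, t):
    """Draw x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps; return (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = rng.standard_normal(4)           # toy "data" vector
t = rng.integers(T)                   # uniform random timestep
xt, eps = forward_sample(x0, t)

# eps_theta(x_t, t) would be a neural network; zeros stand in here.
eps_pred = np.zeros_like(eps)
loss_simple = np.mean((eps - eps_pred) ** 2)
```

Note this is a plain regression loss, which is why diffusion training avoids the adversarial instabilities discussed below.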

Where Each Is Stronger

GANs win on speed and sharpness

A trained GAN generates a sample in a single forward pass through $G$. No iterative sampling, no Markov chain. The adversarial loss directly penalizes blurry outputs because the discriminator can distinguish them from sharp real images. This makes GANs the fastest at inference among the three.

VAEs win on latent space structure

The KL regularization in the ELBO forces the encoder to map similar inputs to nearby latent codes. This gives VAEs smooth, interpretable latent spaces where interpolation and attribute manipulation are straightforward. Neither GANs nor diffusion models guarantee this property without additional architectural choices.
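A sketch of latent interpolation, assuming two hypothetical codes $z_0$, $z_1$ produced by a trained encoder. Each intermediate code would be passed through the decoder $p_\theta(x|z)$; in a smooth latent space, the decoded samples change gradually along the path:

```python
import numpy as np

# Linear interpolation between two (hypothetical) VAE latent codes.
# Latent-space smoothness is what makes decoding these intermediate
# codes yield plausible samples rather than garbage.
def interpolate(z0, z1, n=5):
    ts = np.linspace(0.0, 1.0, n)
    return np.stack([(1 - t) * z0 + t * z1 for t in ts])

z0 = np.array([0.0, 0.0])
z1 = np.array([1.0, -1.0])
path = interpolate(z0, z1)  # shape (5, 2); decode each row with p_theta(x|z)
```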

Diffusion models win on sample quality and mode coverage

Diffusion models achieve the best (lowest) FID scores on standard benchmarks (ImageNet, LSUN). The training objective is a simple regression loss with no adversarial dynamics, which makes optimization stable. The iterative denoising process covers the full data distribution without mode collapse.

Where Each Fails

GANs fail on training stability and mode coverage

GAN training is a saddle-point optimization, not a minimization. The generator and discriminator can oscillate rather than converge. Mode collapse occurs when $G$ maps all noise vectors to a small subset of the data distribution. Diagnosing this failure is difficult because the loss values do not reliably indicate sample quality.

VAEs fail on sample sharpness

The reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ with a Gaussian decoder amounts to minimizing MSE, which averages over possible outputs and produces blurry samples. The KL term further constrains the model, trading sample quality for latent space regularity. More expressive decoders (autoregressive, flow-based) help but add complexity.
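The averaging effect is easy to see numerically. In this toy illustration (not a VAE), targets are split between two modes; the MSE-minimizing constant prediction sits between them, which is the one-dimensional analogue of a blurry image:

```python
import numpy as np

# Targets drawn from two modes at -1 and +1. The constant prediction
# that minimizes MSE is their mean -- a point "between" the modes,
# i.e. the blurry compromise rather than a sharp sample from either mode.
targets = np.array([-1.0, -1.0, 1.0, 1.0])
candidates = np.linspace(-1.5, 1.5, 301)
mse = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]
print(best)  # approximately 0.0, the mean of the two modes
```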

Diffusion models fail on sampling speed

Generating one sample requires $T$ sequential denoising steps, typically $T = 1000$ in the original formulation. Even with accelerated samplers (DDIM, DPM-Solver), diffusion models need 20 to 50 steps, each requiring a full neural network forward pass. This is 20x to 50x slower than a single-pass GAN.
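The structure of an accelerated sampler can be sketched as below: a deterministic DDIM-style update run on a coarse 20-step subsequence of the full schedule. The denoiser here is a zero-output stand-in for a trained network, so this illustrates the step-count saving, not a faithful sampler implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(x, t):
    """Stand-in for the trained denoiser; a real one is a neural network."""
    return np.zeros_like(x)

# DDIM-style deterministic update on 20 of the 1000 timesteps: each
# iteration predicts x_0 from x_t, then re-noises to the previous step.
# Fewer network calls is exactly how DDIM trades T=1000 down to ~20-50.
steps = np.linspace(T - 1, 0, 20, dtype=int)
x = rng.standard_normal(4)  # start from pure noise
for t, t_prev in zip(steps[:-1], steps[1:]):
    eps = eps_theta(x, t)
    x0_pred = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    x = np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1 - alpha_bar[t_prev]) * eps
```

Even in this skeleton, each loop iteration costs one full `eps_theta` call, which is the 20x-50x overhead relative to a single GAN forward pass.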

Key Assumptions That Differ

| | GANs | VAEs | Diffusion |
|---|---|---|---|
| Training | Minimax game | ELBO maximization | Denoising score matching |
| Latent space | Implicit, unstructured | Explicit, regularized | No persistent latent code |
| Loss landscape | Non-convex saddle point | Single objective | Simple MSE regression |
| Mode coverage | Prone to collapse | Good coverage, blurry | Full coverage, high quality |
| Sampling cost | 1 forward pass | 1 forward pass | $T$ forward passes |

The Tradeoff Triangle

Theorem

Quality-Speed-Coverage Tradeoff

Statement

Among the three families, no single approach dominates on all three axes simultaneously:

  1. GANs: fast sampling, sharp outputs, but poor mode coverage.
  2. VAEs: fast sampling, good mode coverage, but blurry outputs.
  3. Diffusion: sharp outputs, good mode coverage, but slow sampling.

Recent work on consistency models and distillation attempts to break this tradeoff by distilling a diffusion model into a single-step generator.

Intuition

Sharp samples require the model to commit to specific details rather than hedging. Mode coverage requires exploring the full distribution. Fast sampling requires few sequential steps. Achieving all three simultaneously is hard because hedging (averaging over modes) is the cheapest way to cover modes, and sequential refinement is the cheapest way to sharpen.

Failure Mode

This tradeoff is empirical, not a proven impossibility. Consistency models and rectified flows suggest the tradeoff can be softened, though not eliminated entirely with current methods.

When a Researcher Would Use Each

Example

Real-time image synthesis (game engines, interactive tools)

Use GANs (or distilled diffusion models). The single-pass generation is necessary for real-time applications. StyleGAN and its variants remain competitive for face generation and similar domains with limited diversity requirements.

Example

Representation learning and disentanglement

Use VAEs. The structured latent space supports downstream tasks: clustering, interpolation, and controlled generation. Beta-VAE and its variants offer explicit control over the reconstruction-regularization tradeoff.

Example

Maximum quality image/video generation

Use diffusion models. When sampling speed is not the bottleneck (offline generation, batch processing), diffusion models produce the best results. Text-to-image systems (Stable Diffusion, DALL-E 3, Imagen) all use diffusion.

Common Confusions

Watch Out

VAEs are not just worse diffusion models

VAEs and diffusion models solve different problems. VAEs provide a structured latent space with explicit encoder and decoder. Diffusion models provide no persistent latent representation. If you need a latent code for downstream tasks, a VAE is the right tool. If you need maximum sample quality, use diffusion.

Watch Out

GANs are not obsolete

Despite diffusion models dominating benchmarks, GANs remain useful when inference latency matters. A trained GAN generates samples in milliseconds. Diffusion models, even with fast samplers, take seconds. For real-time applications, GANs or GAN-like distilled models are still the practical choice.

Watch Out

Mode collapse is a GAN-specific failure

VAEs do not suffer mode collapse because the ELBO objective covers all data points equally. Diffusion models avoid it because the denoising objective trains on all noise levels uniformly. Mode collapse is specific to the adversarial training dynamic where the generator can "fool" the discriminator by producing a narrow set of high-quality samples.
