What Each Does
All three model families learn to generate samples from an unknown data distribution $p_{\text{data}}(x)$. They differ in how they parameterize and train the generative process.
GANs train a generator against a discriminator in a minimax game. The generator maps noise to data space, and the discriminator tries to distinguish real from generated samples.
VAEs learn an encoder and decoder jointly by maximizing the evidence lower bound (ELBO) on the log-likelihood $\log p_\theta(x)$.
Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps, then learn to reverse this process step by step.
Side-by-Side Objectives
GAN Objective
The generator $G$ and discriminator $D$ solve:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

At the Nash equilibrium (if it exists), $G$ produces samples indistinguishable from real data, and $D$ outputs $1/2$ everywhere.
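For known densities, the equilibrium claim can be checked directly: the optimal discriminator has the closed form $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$, which equals $1/2$ wherever the generator matches the data. A toy 1-D sketch (assuming Gaussian densities purely for illustration; real GANs see only samples, never densities):

```python
import numpy as np

# Toy 1-D setup: known densities for the data and the generator.
def p_data(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)          # N(0, 1)

def p_g(x, mu):
    return np.exp(-0.5 * (x - mu)**2) / np.sqrt(2 * np.pi)   # N(mu, 1)

def d_star(x, mu):
    # Optimal discriminator: D*(x) = p_data(x) / (p_data(x) + p_g(x)).
    return p_data(x) / (p_data(x) + p_g(x, mu))

x = np.linspace(-3, 3, 7)
print(d_star(x, mu=2.0))  # far from the generator's mode, D* approaches 1
print(d_star(x, mu=0.0))  # p_g == p_data, so D* = 1/2 everywhere
```

When the generator's distribution drifts from the data, $D^*$ moves away from $1/2$, which is exactly the signal the adversarial game exploits.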
VAE Objective (ELBO)
The encoder $q_\phi(z \mid x)$ and decoder $p_\theta(x \mid z)$ maximize:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

The first term is reconstruction quality. The second term regularizes the latent space toward the prior $p(z)$.
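With the usual diagonal-Gaussian encoder and standard normal prior, the KL term has a closed form, $\tfrac{1}{2}\sum_j (\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2)$. A minimal NumPy sketch (the function name `gaussian_kl` is illustrative, not from any library):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), the standard VAE
    # regularizer, summed over latent dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

# Encoder outputs for a batch of 2 inputs, latent dimension 3.
mu = np.array([[0.0, 0.0, 0.0],
               [1.0, -1.0, 0.5]])
logvar = np.array([[0.0, 0.0, 0.0],     # exactly the prior -> KL = 0
                   [-1.0, 0.5, 0.0]])

print(gaussian_kl(mu, logvar))  # first entry is 0.0
```

The KL is zero only when the encoder posterior equals the prior; every deviation is penalized, which is what pulls similar inputs toward nearby latent codes.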
Diffusion Objective (Simplified)
The denoising model $\epsilon_\theta$ minimizes:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
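The forward process can be sampled in closed form at any step $t$, without iterating through the chain. A sketch assuming DDPM's linear $\beta$ schedule (the specific endpoint values are the commonly used ones, stated here as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t):
    # Sample x_t ~ q(x_t | x_0) in closed form:
    #   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

x0 = np.ones(4)
xt, eps = q_sample(x0, t=999)
print(alpha_bar[999])  # near 0: x_T is almost pure noise
```

Training draws a random $t$, forms $x_t$ this way, and regresses $\epsilon_\theta(x_t, t)$ onto the sampled $\epsilon$; no sequential simulation is needed at training time.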
Where Each Is Stronger
GANs win on speed and sharpness
A trained GAN generates a sample in a single forward pass through $G$. No iterative sampling, no Markov chain. The adversarial loss directly penalizes blurry outputs because the discriminator can distinguish them from sharp real images. This makes GANs the fastest at inference among the three.
VAEs win on latent space structure
The KL regularization in the ELBO forces the encoder to map similar inputs to nearby latent codes. This gives VAEs smooth, interpretable latent spaces where interpolation and attribute manipulation are straightforward. Neither GANs nor diffusion models guarantee this property without additional architectural choices.
Diffusion models win on sample quality and mode coverage
Diffusion models achieve the best (lowest) FID scores on standard benchmarks (ImageNet, LSUN). The training objective is a simple regression loss with no adversarial dynamics, which makes optimization stable. The iterative denoising process covers the full data distribution without mode collapse.
Where Each Fails
GANs fail on training stability and mode coverage
GAN training is a saddle-point optimization, not a minimization. The generator and discriminator can oscillate rather than converge. Mode collapse occurs when $G$ maps all noise vectors to a small subset of the data distribution. Diagnosing this failure is difficult because the loss values do not reliably indicate sample quality.
VAEs fail on sample sharpness
The reconstruction term with a Gaussian decoder amounts to minimizing MSE, which averages over possible outputs and produces blurry samples. The KL term further constrains the model, trading sample quality for latent space regularity. More expressive decoders (autoregressive, flow-based) help but add complexity.
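The averaging effect is easy to see in one dimension: when the data is bimodal, the MSE-optimal single prediction is the mean, a value the data never actually takes. A minimal sketch:

```python
import numpy as np

# Bimodal targets: for the same input, the data shows values at -1 or +1.
targets = np.array([-1.0, -1.0, 1.0, 1.0])

# The MSE-optimal single prediction is the mean of the targets...
pred = targets.mean()
print(pred)  # 0.0 -- the "blurry" average, in neither mode

# ...and its squared error to every individual mode remains large.
print(np.mean((pred - targets) ** 2))  # 1.0
```

In image terms, the "0.0 between the modes" is the gray, blurred pixel averaged over plausible sharp alternatives.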
Diffusion models fail on sampling speed
Generating one sample requires $T$ sequential denoising steps, typically $T = 1000$ in the original DDPM formulation. Even with accelerated samplers (DDIM, DPM-Solver), diffusion models need 20 to 50 steps, each requiring a full neural network forward pass. This is 20x to 50x slower than a single-pass GAN.
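The cost difference comes down to network calls per sample. A toy sketch of a DDIM-style strided sampler that counts forward passes, with a stand-in zero denoiser where a trained $\epsilon$-prediction network would plug in (the update rule is the deterministic DDIM step; everything else here is illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

def eps_theta(x, t):
    # Stand-in for the trained denoising network; each call is one
    # full forward pass, the dominant inference cost.
    return np.zeros_like(x)

def ddim_sample(shape, n_steps=50):
    # Deterministic DDIM update over a strided schedule:
    # n_steps network calls instead of T.
    ts = list(range(T - 1, -1, -T // n_steps))
    x = np.random.default_rng(0).standard_normal(shape)
    calls = 0
    for i, t in enumerate(ts):
        eps = eps_theta(x, t)
        calls += 1
        # Predict x0, then jump to the previous timestep on the stride.
        x0 = (x - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
        a_prev = abar[ts[i + 1]] if i + 1 < len(ts) else 1.0
        x = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps
    return x, calls

x, calls = ddim_sample((4,), n_steps=50)
print(calls)  # 50 forward passes per sample, vs. 1 for a GAN
```

Even this accelerated schedule pays 50 network evaluations per sample; the single-pass GAN pays one.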
Key Assumptions That Differ
| | GANs | VAEs | Diffusion |
|---|---|---|---|
| Training | Minimax game | ELBO maximization | Denoising score matching |
| Latent space | Implicit, unstructured | Explicit, regularized | No persistent latent code |
| Loss landscape | Non-convex saddle point | Single objective | Simple MSE regression |
| Mode coverage | Prone to collapse | Good coverage, blurry | Full coverage, high quality |
| Sampling cost | 1 forward pass | 1 forward pass | $T$ forward passes |
The Tradeoff Triangle
Quality-Speed-Coverage Tradeoff
Statement
Among the three families, no single approach dominates on all three axes simultaneously:
- GANs: fast sampling, sharp outputs, but poor mode coverage.
- VAEs: fast sampling, good mode coverage, but blurry outputs.
- Diffusion: sharp outputs, good mode coverage, but slow sampling.
Recent work on consistency models and distillation attempts to break this tradeoff by distilling a diffusion model into a single-step generator.
Intuition
Sharp samples require the model to commit to specific details rather than hedging. Mode coverage requires exploring the full distribution. Fast sampling requires few sequential steps. Achieving all three simultaneously is hard because hedging (averaging over modes) is the cheapest way to cover modes, and sequential refinement is the cheapest way to sharpen.
Failure Mode
This tradeoff is empirical, not a proven impossibility. Consistency models and rectified flows suggest the tradeoff can be softened, though not eliminated entirely with current methods.
When a Researcher Would Use Each
Real-time image synthesis (game engines, interactive tools)
Use GANs (or distilled diffusion models). The single-pass generation is necessary for real-time applications. StyleGAN and its variants remain competitive for face generation and similar domains with limited diversity requirements.
Representation learning and disentanglement
Use VAEs. The structured latent space supports downstream tasks: clustering, interpolation, and controlled generation. Beta-VAE and its variants offer explicit control over the reconstruction-regularization tradeoff.
Maximum quality image/video generation
Use diffusion models. When sampling speed is not the bottleneck (offline generation, batch processing), diffusion models produce the best results. Text-to-image systems (Stable Diffusion, DALL-E 3, Imagen) all use diffusion.
Common Confusions
VAEs are not just worse diffusion models
VAEs and diffusion models solve different problems. VAEs provide a structured latent space with explicit encoder and decoder. Diffusion models provide no persistent latent representation. If you need a latent code for downstream tasks, a VAE is the right tool. If you need maximum sample quality, use diffusion.
GANs are not obsolete
Despite diffusion models dominating benchmarks, GANs remain useful when inference latency matters. A trained GAN generates samples in milliseconds. Diffusion models, even with fast samplers, take seconds. For real-time applications, GANs or GAN-like distilled models are still the practical choice.
Mode collapse is a GAN-specific failure
VAEs do not suffer mode collapse because the ELBO objective covers all data points equally. Diffusion models avoid it because the denoising objective trains on all noise levels uniformly. Mode collapse is specific to the adversarial training dynamic where the generator can "fool" the discriminator by producing a narrow set of high-quality samples.
References
Canonical:
- Goodfellow et al., Generative Adversarial Nets (NeurIPS 2014)
- Kingma & Welling, Auto-Encoding Variational Bayes (ICLR 2014)
- Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (NeurIPS 2020)
Current:
- Song et al., Consistency Models (ICML 2023)
- Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models (NeurIPS 2022)