ML Methods
Generative Adversarial Networks
The minimax game between generator and discriminator: Nash equilibrium at the data distribution, mode collapse, the Wasserstein distance fix, StyleGAN, and why diffusion models have largely replaced GANs for image generation.
Why This Matters
GANs introduced a new approach to generative modeling: instead of explicitly modeling a density, train a generator to fool a discriminator. Both networks are trained via backpropagation. This adversarial training produced the first photorealistic synthetic images and remains influential despite being largely superseded by diffusion models for image generation.
Understanding GANs matters because the adversarial framework appears throughout ML: domain adaptation, data augmentation, and robustness testing all use adversarially trained components. Evaluation metrics such as FID, developed for GANs, compare real and generated samples in the feature space of a pretrained Inception classifier.
Mental Model
Two networks play a game. The generator $G$ takes random noise $z \sim p(z)$ and produces a fake sample $G(z)$. The discriminator $D$ receives either a real sample $x \sim p_{\text{data}}$ or a fake sample $G(z)$ and tries to distinguish them. The generator wins when the discriminator cannot tell real from fake. At equilibrium, the generator produces samples indistinguishable from the real data distribution.
The Minimax Objective
GAN Objective
The GAN training objective is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

The discriminator $D(x)$ outputs the probability that $x$ is real. The discriminator maximizes $V$ (correctly classify real and fake). The generator minimizes $V$ (make the discriminator wrong).
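For concreteness, the value $V(D, G)$ can be computed exactly on a tiny discrete example (a NumPy sketch; the distributions and discriminator values are made up for illustration):

```python
import numpy as np

# Toy discrete support with illustrative data and generator distributions.
p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

def gan_value(D):
    """V(D, G) = E_{p_data}[log D(x)] + E_{p_g}[log(1 - D(x))]."""
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1 - D))

# A discriminator that always outputs 1/2 achieves V = -log 4,
# the equilibrium value of the game.
V_half = gan_value(np.full(3, 0.5))
print(V_half, -np.log(4))  # both ≈ -1.3863
```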
Optimal Discriminator and Nash Equilibrium
Optimal Discriminator
Statement
For a fixed generator $G$, the optimal discriminator is:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

where $p_g$ is the density induced by $G$. Substituting $D^*$ back into the objective gives:

$$V(D^*, G) = 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) - \log 4$$

where JSD is the Jensen-Shannon divergence. The global minimum is achieved when $p_g = p_{\text{data}}$, giving $V = -\log 4$ and $D^*(x) = \tfrac{1}{2}$ everywhere.
Intuition
The optimal discriminator is a likelihood ratio test. When the generator perfectly matches the data distribution, no discriminator can do better than random guessing ($D^*(x) = \tfrac{1}{2}$). The objective reduces to measuring the Jensen-Shannon divergence between real and generated distributions.
Proof Sketch
For fixed $G$, the integrand of $V$ at each point $x$ is $p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x))$. This is maximized over $D(x)$ by taking the derivative and setting it to zero: the function $a \log y + b \log(1 - y)$ attains its maximum at $y = \frac{a}{a+b}$, giving $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. Substituting $D^*$ and using the definition of JSD yields the result.
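Both claims of the proof can be checked numerically on discrete distributions (a sketch; the particular distributions are arbitrary):

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

# Optimal discriminator: the likelihood ratio p_data / (p_data + p_g).
D_star = p_data / (p_data + p_g)

def V(D):
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1 - D))

def kl(p, q):
    return np.sum(p * np.log(p / q))

m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

# Identity from the theorem: V(D*, G) = 2 * JSD(p_data || p_g) - log 4.
assert np.isclose(V(D_star), 2 * jsd - np.log(4))

# D* beats any perturbed discriminator.
rng = np.random.default_rng(0)
for _ in range(100):
    D = np.clip(D_star + rng.normal(0, 0.1, 3), 1e-6, 1 - 1e-6)
    assert V(D) <= V(D_star) + 1e-12
print("optimal discriminator and JSD identity verified")
```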
Why It Matters
This theorem shows that GAN training implicitly minimizes the Jensen-Shannon divergence (a symmetrized, bounded variant of KL divergence) between the generated and real distributions. This connects the adversarial game to a well-defined statistical divergence, giving the framework a theoretical foundation.
Failure Mode
The proof assumes unlimited discriminator capacity and continuous densities. In practice, the discriminator is a finite neural network that may not approximate $D^*$ well. When $p_{\text{data}}$ and $p_g$ have disjoint supports (common in high dimensions), the JSD saturates at $\log 2$, providing zero gradient to the generator. This is the vanishing gradient problem of GANs.
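The saturation is visible in a one-line calculation. With a sigmoid discriminator $D = \sigma(a)$, the gradient of the saturating generator loss $\log(1 - D)$ with respect to the logit $a$ is $-\sigma(a)$, which vanishes exactly when the discriminator confidently rejects fakes, while the non-saturating loss $-\log D$ has gradient $-(1 - \sigma(a))$, which stays large (an illustrative sketch):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Discriminator logit on a fake sample; very negative means D(G(z)) ≈ 0,
# i.e. the discriminator confidently rejects the fake.
a = -10.0

# Saturating loss log(1 - D): gradient w.r.t. the logit is -sigmoid(a) ≈ 0.
grad_saturating = -sigmoid(a)

# Non-saturating loss -log D: gradient w.r.t. the logit is -(1 - sigmoid(a)) ≈ -1.
grad_non_saturating = -(1.0 - sigmoid(a))

print(grad_saturating, grad_non_saturating)  # ≈ -4.5e-05 and ≈ -1.0
```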
Mode Collapse
Mode Collapse
The generator produces samples from only a few modes of the data distribution, ignoring the rest. For example, a GAN trained on digits might produce excellent 3s and 7s but never generate 0s or 5s.
Mode collapse occurs because the generator can reduce the discriminator's accuracy by perfecting a few modes rather than covering all modes. The minimax solution (where $G$ plays first) is $p_g = p_{\text{data}}$, but the maximin solution (where $D$ plays first) can concentrate $p_g$ on a single mode. In practice, alternating gradient descent does not guarantee convergence to the minimax solution.
No fully reliable solution to mode collapse exists. Techniques that help: minibatch discrimination (let the discriminator see batches, not individual samples), spectral normalization, and diverse training objectives.
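The intuition behind minibatch discrimination can be sketched with a hand-rolled batch statistic (the published method uses learned projections; this simplified diversity feature is illustrative only):

```python
import numpy as np

def minibatch_diversity(batch):
    """A minimal minibatch statistic: mean pairwise L2 distance.
    Exposing a batch-level diversity signal to the discriminator makes
    collapsed batches (many near-identical samples) easy to reject."""
    diffs = batch[:, None, :] - batch[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    n = len(batch)
    return dists.sum() / (n * (n - 1))  # mean over off-diagonal pairs

rng = np.random.default_rng(0)
diverse   = rng.normal(0, 1.0,  size=(32, 8))   # spread-out samples
collapsed = rng.normal(0, 0.01, size=(32, 8))   # near-identical samples

# A collapsed batch has far lower diversity than a healthy one.
print(minibatch_diversity(collapsed) < minibatch_diversity(diverse))  # True
```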
Wasserstein GAN
Wasserstein GAN Objective
Statement
The Wasserstein-1 (earth mover's) distance between two distributions is:

$$W_1(p_{\text{data}}, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$

where the supremum is over all 1-Lipschitz functions $f$. The WGAN objective replaces the discriminator with a critic $f$ (no sigmoid, no probability output) and maximizes this expression over $f$ while minimizing over $G$.
Intuition
The earth mover's distance measures the minimum cost of transporting probability mass from $p_g$ to $p_{\text{data}}$. Unlike JSD, it provides a meaningful gradient even when the distributions have disjoint supports. If $p_g$ is far from $p_{\text{data}}$, the Wasserstein distance is large and its gradient points toward $p_{\text{data}}$.
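The contrast with JSD is easy to see in one dimension, where $W_1$ between equal-size sample sets is the mean absolute difference of sorted samples (a sketch; the distributions are chosen for illustration):

```python
import numpy as np

def w1_empirical(x, y):
    """W1 between two equal-size 1D sample sets: in 1D the optimal
    transport plan matches sorted samples, so W1 is the mean absolute
    difference of the order statistics."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def jsd_hist(x, y, bins):
    """JSD estimated from shared histograms (0 log 0 taken as 0)."""
    p, _ = np.histogram(x, bins=bins)
    q, _ = np.histogram(y, bins=bins)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 0.1, 10_000)
bins = np.linspace(-1, 30, 200)
for shift in [5.0, 10.0, 20.0]:   # generator progressively farther from the data
    fake = rng.normal(shift, 0.1, 10_000)
    # JSD is stuck at log 2 ≈ 0.693 for every shift (disjoint supports),
    # while W1 tracks the actual distance and so retains a useful slope.
    print(shift, jsd_hist(real, fake, bins), w1_empirical(real, fake))
```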
Proof Sketch
This is the Kantorovich-Rubinstein dual representation of the Wasserstein distance. The primal formulation minimizes the expected transport cost over all couplings of $p_{\text{data}}$ and $p_g$. The dual formulation replaces the coupling with a Lipschitz function. The Lipschitz constraint ensures the dual is bounded.
Why It Matters
WGAN training is more stable than standard GAN training because the Wasserstein distance does not saturate when distributions are far apart. The critic loss correlates with sample quality, unlike the standard GAN discriminator loss, which can oscillate while generation quality improves.
Failure Mode
Enforcing the Lipschitz constraint is difficult. Weight clipping (the original WGAN approach) limits the critic's capacity and can cause training instabilities. Gradient penalty (WGAN-GP) penalizes the gradient norm deviating from 1, which works better but adds computational cost. Spectral normalization is another approach.
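For intuition, the gradient penalty has a closed form for a linear critic $f(x) = w^\top x + b$, whose input gradient is $w$ at every point (a toy sketch; real WGAN-GP evaluates the gradient at random interpolates between real and fake samples via autodiff):

```python
import numpy as np

def gp_linear_critic(w, lam=10.0):
    """WGAN-GP penalty lam * (||grad_x f|| - 1)^2 for a linear critic
    f(x) = w @ x + b, whose input gradient is w everywhere. The penalty
    is zero exactly when the critic is 1-Lipschitz (||w|| = 1)."""
    grad_norm = np.linalg.norm(w)
    return lam * (grad_norm - 1.0) ** 2

print(gp_linear_critic(np.array([1.0, 0.0])))   # 0.0: already 1-Lipschitz
print(gp_linear_critic(np.array([3.0, 4.0])))   # 10 * (5 - 1)^2 = 160.0
```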
StyleGAN
StyleGAN (Karras et al., 2019) produces high-resolution photorealistic images by separating style from content. The latent code $z$ is first mapped to an intermediate space $\mathcal{W}$ via a mapping network. The style vector $w$ is injected at each layer of the generator via adaptive instance normalization (AdaIN), controlling coarse features (pose, shape) at early layers and fine features (texture, color) at later layers.
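The AdaIN operation itself is small: normalize each channel of the content features, then rescale and shift with the style's per-channel statistics (a sketch of the operation alone; in StyleGAN the style statistics come from a learned affine map of $w$):

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization on a (C, H, W) feature map:
    normalize each channel of the content, then rescale and shift with
    the style's per-channel mean and standard deviation."""
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return style_std[:, None, None] * normalized + style_mean[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(4, 8, 8))   # content features, 4 channels
out = adain(x, style_mean=np.full(4, -1.0), style_std=np.full(4, 0.5))

# After AdaIN each channel carries the style's statistics, not the content's.
print(out.mean(axis=(1, 2)).round(2))   # ≈ [-1. -1. -1. -1.]
print(out.std(axis=(1, 2)).round(2))    # ≈ [0.5 0.5 0.5 0.5]
```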
Progressive growing (training at increasing resolutions) and skip connections enable stable training at resolutions up to 1024x1024. StyleGAN2 and StyleGAN3 further improved quality and removed artifacts.
Why Diffusion Models Replaced GANs
For image generation, diffusion models have largely superseded GANs since 2021. The reasons:
- Training stability: Diffusion models optimize a simple denoising objective with no adversarial dynamics. No mode collapse, no training oscillation.
- Mode coverage: Diffusion models naturally cover all modes of the distribution because the denoising objective is a weighted sum over all noise levels.
- Composability: Classifier-free guidance and conditioning are straightforward in diffusion models. GAN conditioning requires architectural modifications.
- Sample quality: Diffusion models now match or exceed GAN quality on standard benchmarks (FID on ImageNet).
GANs retain advantages in inference speed (one forward pass vs. many denoising steps) and in discriminative tasks where the adversarial framework is natural.
Common Confusions
GANs do not learn a density
Unlike VAEs or normalizing flows, GANs do not produce an explicit density $p_g(x)$. You can sample from the generator but cannot evaluate the probability of a given sample. This makes GANs unsuitable for tasks that require density evaluation (e.g., anomaly detection via likelihood).
Discriminator loss going to zero does not mean training succeeded
In standard GAN training, the discriminator loss approaching zero often means the discriminator is winning too easily: it perfectly distinguishes real from fake. This indicates the generator is failing. Good GAN training has the discriminator loss hovering near $\log 2$ (random guessing). In WGAN, the critic loss is more informative and correlates positively with sample quality.
FID measures distribution similarity, not individual sample quality
Frechet Inception Distance (FID) compares the statistics (mean and covariance of Inception features) of generated and real image sets. A single excellent image does not guarantee low FID. Conversely, low FID does not guarantee every generated image looks good. FID is a distributional metric.
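The Gaussian Frechet distance underlying FID is a short formula: $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(C_1 + C_2 - 2(C_1 C_2)^{1/2})$. With diagonal covariances (an illustrative simplification; real FID uses full covariances of Inception features and a matrix square root) it is computable in a few lines:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two diagonal-covariance Gaussians.
    With diagonal covariances the trace term Tr(C1 + C2 - 2 (C1 C2)^{1/2})
    reduces elementwise to sum (sqrt(var1) - sqrt(var2))^2."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return mean_term + cov_term

mu, var = np.zeros(3), np.ones(3)
print(fid_diagonal(mu, var, mu, var))         # 0.0: identical statistics
print(fid_diagonal(mu, var, mu + 1.0, var))   # 3.0: mean shifted by 1 in 3 dims
```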
Key Takeaways
- The GAN objective is a minimax game that implicitly minimizes Jensen-Shannon divergence
- The optimal discriminator is a likelihood ratio; at equilibrium, $D^*(x) = \tfrac{1}{2}$ everywhere
- Mode collapse occurs because the generator can exploit the discriminator by specializing
- Wasserstein distance provides gradients even when distributions have disjoint supports
- StyleGAN separates style and content for controllable high-resolution synthesis
- Diffusion models have replaced GANs for most image generation tasks due to training stability and mode coverage
Exercises
Problem
Show that at the Nash equilibrium of the GAN game ($p_g = p_{\text{data}}$), the optimal discriminator outputs $\tfrac{1}{2}$ for all $x$, and the value of the game is $-\log 4$.
Problem
Explain why the standard GAN generator gradient vanishes when the discriminator is too good. Specifically, if $D(G(z)) \approx 0$ for all $z$, what is the gradient of $\log(1 - D(G(z)))$ with respect to the generator's parameters? How does the non-saturating loss $-\log D(G(z))$ help?
References
Canonical:
- Goodfellow et al., "Generative Adversarial Nets" (NeurIPS 2014)
- Arjovsky, Chintala, Bottou, "Wasserstein Generative Adversarial Networks" (ICML 2017)
Current:
- Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks" (StyleGAN, CVPR 2019)
- Dhariwal and Nichol, "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)