
ML Methods

Generative Adversarial Networks

The minimax game between generator and discriminator: Nash equilibrium at the data distribution, mode collapse, the Wasserstein distance fix, StyleGAN, and why diffusion models have largely replaced GANs for image generation.


Why This Matters

GANs introduced a new approach to generative modeling: instead of explicitly modeling a density, train a generator to fool a discriminator. Both networks are trained via backpropagation. This adversarial training produced the first photorealistic synthetic images and remains influential despite being largely superseded by diffusion models for image generation.

Understanding GANs matters because the adversarial framework appears throughout ML: domain adaptation, data augmentation, robustness testing, and learned evaluation. (FID itself is not adversarial: it compares real and generated samples in the feature space of a pretrained Inception classifier, but the idea of judging samples with a learned network is closely related.)

Mental Model

Two networks play a game. The generator $G$ takes random noise $z$ and produces a fake sample $G(z)$. The discriminator $D$ receives either a real sample $x$ or a fake sample $G(z)$ and tries to distinguish them. The generator wins when the discriminator cannot tell real from fake. At equilibrium, the generator produces samples indistinguishable from the real data distribution.

The Minimax Objective

Definition

GAN Objective

The GAN training objective is:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The discriminator $D(x) \in [0, 1]$ outputs the probability that $x$ is real. The discriminator maximizes $V$ (correctly classify real and fake). The generator minimizes $V$ (make the discriminator wrong).
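As a concrete sketch (numpy only; the function names are illustrative, not from any library), the two players' losses can be written directly from $V(G, D)$. Note that a discriminator outputting $1/2$ on everything yields the value $-\log 4$ discussed below:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes V = E[log D(x)] + E[log(1 - D(G(z)))]; its loss is -V
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

def generator_loss_minimax(d_fake):
    # G minimizes E[log(1 - D(G(z)))] (the original minimax form)
    return np.mean(np.log(1 - d_fake))

def generator_loss_nonsaturating(d_fake):
    # the non-saturating alternative: maximize log D(G(z)) instead
    return -np.mean(np.log(d_fake))

# A discriminator stuck at 0.5 (cannot distinguish): -V = log 4 ~ 1.386
d = np.full(8, 0.5)
print(discriminator_loss(d, d))
```

In practice each player takes a gradient step on its own loss while the other's parameters are held fixed, alternating between the two.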

Optimal Discriminator and Nash Equilibrium

Theorem

Optimal Discriminator

Statement

For a fixed generator $G$, the optimal discriminator is:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$

where $p_G$ is the density induced by $G$. Substituting $D^*$ back into the objective gives:

$$V(G, D^*) = 2 \cdot \text{JSD}(p_{\text{data}} \| p_G) - \log 4$$

where JSD is the Jensen-Shannon divergence. The global minimum is achieved when $p_G = p_{\text{data}}$, giving $V = -\log 4$ and $D^*(x) = 1/2$ everywhere.
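The identity can be verified numerically on a small discrete example (a sketch; the probability vectors are illustrative, and sums replace expectations on a discrete support):

```python
import numpy as np

# Two discrete distributions over three outcomes (illustrative values)
p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

# Optimal discriminator: the likelihood ratio p_data / (p_data + p_g)
d_star = p_data / (p_data + p_g)

# Value of the game at D*
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# Jensen-Shannon divergence from its definition
m = 0.5 * (p_data + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q))
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(np.isclose(v, 2 * jsd - np.log(4)))  # True
```

Setting `p_g` equal to `p_data` makes `d_star` identically $0.5$ and `v` equal to $-\log 4$, matching the equilibrium claim.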

Intuition

The optimal discriminator is a likelihood ratio test. When the generator perfectly matches the data distribution, no discriminator can do better than random guessing ($D = 1/2$). The objective reduces to measuring the Jensen-Shannon divergence between real and generated distributions.

Proof Sketch

For fixed $G$, the integrand of $V$ at each point $x$ is $p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x))$. This is maximized over $D(x) \in [0,1]$ by taking the derivative and setting it to zero, giving $D^*(x) = p_{\text{data}}(x)/(p_{\text{data}}(x) + p_G(x))$. Substituting and using the definition of JSD yields the result.

Why It Matters

This theorem shows that GAN training implicitly minimizes the Jensen-Shannon divergence (a symmetrized, bounded variant of KL divergence) between the generated and real distributions. This connects the adversarial game to a well-defined statistical divergence, giving the framework a theoretical foundation.

Failure Mode

The proof assumes unlimited discriminator capacity and continuous densities. In practice, the discriminator is a finite neural network that may not approximate $D^*$ well. When $p_{\text{data}}$ and $p_G$ have disjoint supports (common in high dimensions), the JSD saturates at $\log 2$, providing zero gradient to the generator. This is the vanishing gradient problem of GANs.
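The saturation is easy to see on a toy discrete example (illustrative distributions): once the supports are disjoint, moving the generator's mass further away does not change the JSD at all, so the objective carries no information about how far off the generator is.

```python
import numpy as np

def jsd(p, q):
    # Jensen-Shannon divergence for discrete distributions (0 log 0 := 0)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # mass on the left
g_near = np.array([0.0, 0.0, 0.5, 0.5, 0.0])   # disjoint, nearby
g_far  = np.array([0.0, 0.0, 0.0, 0.5, 0.5])   # disjoint, further away

print(jsd(p_data, g_near), jsd(p_data, g_far), np.log(2))
# all three agree: the generator gets no signal to move its mass closer
```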

Mode Collapse

Definition

Mode Collapse

The generator produces samples from only a few modes of the data distribution, ignoring the rest. For example, a GAN trained on digits might produce excellent 3s and 7s but never generate 0s or 5s.

Mode collapse occurs because the generator can reduce the discriminator's accuracy by perfecting a few modes rather than covering all modes. The minimax solution (where $G$ plays first) is $p_G = p_{\text{data}}$, but the maximin solution (where $D$ plays first) can concentrate on a single mode. In practice, alternating gradient descent does not guarantee convergence to the minimax solution.

No fully reliable solution to mode collapse exists. Techniques that help include minibatch discrimination (letting the discriminator see whole batches rather than individual samples), spectral normalization, and alternative training objectives such as the Wasserstein loss.
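The convergence caveat shows up even in the simplest possible minimax game, $\min_x \max_y \, xy$, whose unique equilibrium is the origin. A toy sketch with simultaneous gradient steps (an assumption for clarity; GAN training alternates updates, but the same rotational dynamics appear):

```python
import numpy as np

# Toy minimax game V(x, y) = x * y; equilibrium at x = y = 0
x, y, lr = 1.0, 1.0, 0.1
dist = [np.hypot(x, y)]                # distance from the equilibrium
for _ in range(100):
    gx, gy = y, x                      # dV/dx = y, dV/dy = x
    x, y = x - lr * gx, y + lr * gy    # x descends, y ascends (simultaneous)
    dist.append(np.hypot(x, y))

# Each step multiplies the distance by sqrt(1 + lr^2):
# the iterates spiral outward instead of converging.
print(dist[0], dist[-1])
```

Gradient descent-ascent on this game provably diverges; GAN losses are far less benign than this bilinear toy, which is why tricks such as alternative objectives and regularization matter in practice.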

Wasserstein GAN

Proposition

Wasserstein GAN Objective

Statement

The Wasserstein-1 (earth mover's) distance between two distributions is:

$$W_1(p_{\text{data}}, p_G) = \sup_{\|f\|_L \leq 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_G}[f(x)]$$

where the supremum is over all 1-Lipschitz functions $f$. The WGAN objective replaces the discriminator with a critic $f$ (no sigmoid, no probability output) and maximizes this over $f$ while minimizing over $G$.

Intuition

The earth mover's distance measures the minimum cost of transporting mass from $p_G$ to $p_{\text{data}}$. Unlike JSD, it provides a meaningful gradient even when the distributions have disjoint supports. If $p_G$ is far from $p_{\text{data}}$, the Wasserstein distance is large and its gradient points toward $p_{\text{data}}$.
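In one dimension, $W_1$ has a closed form as the integral of the absolute difference of the two CDFs, which makes the contrast with the saturating JSD easy to check (a sketch with illustrative discrete distributions on the grid $0, 1, \dots, 4$):

```python
import numpy as np

xs     = np.arange(5.0)                         # support points 0, 1, ..., 4
p_data = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
g_near = np.array([0.0, 0.0, 0.5, 0.5, 0.0])   # disjoint from p_data
g_far  = np.array([0.0, 0.0, 0.0, 0.5, 0.5])   # disjoint, further away

def w1(p, q, xs):
    # 1-D closed form: W1 = integral of |CDF_p - CDF_q| dx
    return np.sum(np.abs(np.cumsum(p - q))[:-1] * np.diff(xs))

print(w1(p_data, g_near, xs))  # 2.0
print(w1(p_data, g_far, xs))   # 3.0
# W1 keeps growing as the generator's mass moves away, so its gradient
# still points toward the data, even though the supports stay disjoint
# (where JSD would sit at log 2 regardless).
```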

Proof Sketch

This is the Kantorovich-Rubinstein dual representation of the Wasserstein distance. The primal formulation minimizes the expected transport cost over all couplings of $p_{\text{data}}$ and $p_G$. The dual formulation replaces the coupling with a Lipschitz function. The Lipschitz constraint ensures the dual is bounded.

Why It Matters

WGAN training is more stable than standard GAN training because the Wasserstein distance does not saturate when distributions are far apart. The critic loss correlates with sample quality, unlike the standard GAN discriminator loss, which can oscillate while generation quality improves.

Failure Mode

Enforcing the Lipschitz constraint is difficult. Weight clipping (the original WGAN approach) limits the critic's capacity and can cause training instabilities. Gradient penalty (WGAN-GP) penalizes the gradient norm deviating from 1, which works better but adds computational cost. Spectral normalization is another approach.
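The two enforcement strategies can be contrasted on a toy *linear* critic $f(x) = w \cdot x$, whose input gradient is $w$ everywhere (an assumption that sidesteps autodiff; for a neural critic the gradient varies with $x$, so WGAN-GP evaluates it at random interpolates between real and fake samples):

```python
import numpy as np

w = np.array([3.0, -4.0])               # ||w|| = 5: violates 1-Lipschitz

def gradient_penalty(w, lam=10.0):
    grad_norm = np.linalg.norm(w)       # grad_x f(x) = w for a linear critic
    return lam * (grad_norm - 1.0) ** 2

print(gradient_penalty(w))              # 10 * (5 - 1)^2 = 160.0

# Original WGAN instead clips each weight to [-c, c], capping the norm at
# c * sqrt(dim) -- here ~0.014, far below 1, which wastes critic capacity
w_clipped = np.clip(w, -0.01, 0.01)
print(np.linalg.norm(w_clipped))
```

The penalty pushes the gradient norm toward 1 without hard-limiting the weights, which is why WGAN-GP trains more expressive critics than clipping.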

StyleGAN

StyleGAN (Karras et al., 2019) produces high-resolution photorealistic images by separating style from content. The latent code $z$ is first mapped to an intermediate space $w$ via a mapping network. The style vector $w$ is injected at each layer of the generator via adaptive instance normalization (AdaIN), controlling coarse features (pose, shape) at early layers and fine features (texture, color) at later layers.
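A minimal AdaIN sketch in numpy (shapes are assumptions for illustration; in StyleGAN, `style_scale` and `style_bias` come from a learned affine transform of $w$):

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-8):
    """Adaptive instance normalization.

    x:           (batch, channels, height, width) feature maps
    style_scale: (batch, channels) per-channel scale derived from w
    style_bias:  (batch, channels) per-channel shift derived from w
    """
    mu = x.mean(axis=(2, 3), keepdims=True)       # per-sample, per-channel
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)             # wipe incoming statistics
    return style_scale[:, :, None, None] * x_norm + style_bias[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
out = adain(x, np.full((2, 3), 2.0), np.full((2, 3), 5.0))
# each output feature map now has mean ~5 and std ~2, regardless of x
```

Because the normalization erases each feature map's own statistics before applying the style's scale and shift, the style at a given layer overrides whatever statistics arrived from earlier layers, which is what makes per-layer style control possible.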

Progressive growing (training at successively higher resolutions) enabled stable training at resolutions up to 1024x1024. StyleGAN2 replaced progressive growing with skip connections and a redesigned normalization, removing characteristic artifacts; StyleGAN3 further eliminated aliasing.

Why Diffusion Models Replaced GANs

For image generation, diffusion models have largely superseded GANs since 2021. The reasons:

  1. Training stability: Diffusion models optimize a simple denoising objective with no adversarial dynamics. No mode collapse, no training oscillation.
  2. Mode coverage: Diffusion models naturally cover all modes of the distribution because the denoising objective is a weighted sum over all noise levels.
  3. Composability: Classifier-free guidance and conditioning are straightforward in diffusion models. GAN conditioning requires architectural modifications.
  4. Sample quality: Diffusion models now match or exceed GAN quality on standard benchmarks (FID on ImageNet).

GANs retain advantages in inference speed (one forward pass vs. many denoising steps) and in discriminative tasks where the adversarial framework is natural.

Common Confusions

Watch Out

GANs do not learn a density

Unlike VAEs or normalizing flows, GANs do not produce an explicit density $p_G(x)$. You can sample from the generator but cannot evaluate the probability of a given sample. This makes GANs unsuitable for tasks that require density evaluation (e.g., anomaly detection via likelihood).

Watch Out

Discriminator loss going to zero does not mean training succeeded

In standard GAN training, the discriminator loss approaching zero often means the discriminator is winning too easily: it perfectly distinguishes real from fake. This indicates the generator is failing. Good GAN training has the discriminator loss hovering near $\log 2$ (random guessing). In WGAN, the critic loss is more informative and correlates positively with sample quality.
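A quick numeric sanity check of that baseline: the mean binary cross-entropy of a discriminator that outputs $0.5$ on every sample is exactly $\log 2 \approx 0.693$.

```python
import numpy as np

d_real = np.full(8, 0.5)   # D(x) on real samples: pure guessing
d_fake = np.full(8, 0.5)   # D(G(z)) on fakes: pure guessing
bce = -0.5 * (np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))
print(bce, np.log(2))      # the two values match
```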

Watch Out

FID measures distribution similarity, not individual sample quality

Fréchet Inception Distance (FID) compares the statistics (mean and covariance of Inception features) of generated and real image sets. A single excellent image does not guarantee low FID. Conversely, low FID does not guarantee every generated image looks good. FID is a distributional metric.
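The formula behind FID is the Fréchet distance between Gaussians fitted to the two feature sets, $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\big(C_1 + C_2 - 2(C_1 C_2)^{1/2}\big)$. A toy sketch assuming diagonal covariances (an assumption that turns the matrix square root into an elementwise one; real FID uses full covariances of Inception-v3 features):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}) with diagonal C_i
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

mu, var = np.array([0.0, 1.0]), np.array([1.0, 2.0])
print(fid_diagonal(mu, var, mu, var))        # 0.0: identical statistics
print(fid_diagonal(mu, var, mu + 3.0, var))  # 18.0: mean shift of 3 per dim
```

Note that the score depends only on the fitted means and covariances, which is exactly why one outstanding image (or one terrible one) barely moves it.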

Key Takeaways

  • The GAN objective is a minimax game that implicitly minimizes Jensen-Shannon divergence
  • The optimal discriminator is a likelihood ratio; at equilibrium, $D^*(x) = 1/2$ everywhere
  • Mode collapse occurs because the generator can exploit the discriminator by specializing
  • Wasserstein distance provides gradients even when distributions have disjoint supports
  • StyleGAN separates style and content for controllable high-resolution synthesis
  • Diffusion models have replaced GANs for most image generation tasks due to training stability and mode coverage

Exercises

ExerciseCore

Problem

Show that at the Nash equilibrium of the GAN game ($p_G = p_{\text{data}}$), the optimal discriminator outputs $D^*(x) = 1/2$ for all $x$, and the value of the game is $V = -\log 4$.

ExerciseAdvanced

Problem

Explain why the standard GAN generator gradient vanishes when the discriminator is too good. Specifically, if $D(G(z)) \approx 0$ for all $z$, what is the gradient $\nabla_\theta \mathbb{E}[\log(1 - D(G(z)))]$? How does the non-saturating loss $-\mathbb{E}[\log D(G(z))]$ help?

References

Canonical:

  • Goodfellow et al., "Generative Adversarial Nets" (NeurIPS 2014)
  • Arjovsky, Chintala, Bottou, "Wasserstein Generative Adversarial Networks" (ICML 2017)

Current:

  • Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks" (StyleGAN, CVPR 2019)

  • Dhariwal and Nichol, "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
