
ML Methods

Generative Adversarial Networks

The minimax game between generator and discriminator: Nash equilibrium at the data distribution, mode collapse, the Wasserstein distance fix, StyleGAN, and why diffusion models have largely replaced GANs for image generation.


Why This Matters

GANs introduced a new approach to generative modeling: instead of explicitly modeling a density, train a generator to fool a discriminator. Both networks are trained via backpropagation. This adversarial training produced the first photorealistic synthetic images and remains influential despite being largely superseded by diffusion models for image generation.

Understanding GANs matters because the adversarial framework appears throughout ML: domain adaptation, data augmentation, robustness testing, and learned evaluation. (FID itself is not adversarial: it compares real and generated samples in the feature space of a pretrained Inception classifier, but the idea of judging samples with a learned network is closely related.)

Mental Model

Two networks play a game. The generator $G$ takes random noise $z$ and produces a fake sample $G(z)$. The discriminator $D$ receives either a real sample $x$ or a fake sample $G(z)$ and tries to distinguish them. The generator wins when the discriminator cannot tell real from fake. At equilibrium, the generator produces samples indistinguishable from the real data distribution.

The Minimax Objective

Definition

GAN Objective

The GAN training objective is:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The discriminator $D(x) \in [0, 1]$ outputs the probability that $x$ is real. The discriminator maximizes $V$ (correctly classify real and fake). The generator minimizes $V$ (make the discriminator wrong).
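As a concrete sketch (numpy only; the function names are illustrative, not from any library), the two players' losses can be written directly from $V(G, D)$. Note that a discriminator outputting $1/2$ on everything yields the value $-\log 4$ discussed below:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes V = E[log D(x)] + E[log(1 - D(G(z)))]; its loss is -V
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

def generator_loss_minimax(d_fake):
    # G minimizes E[log(1 - D(G(z)))] (the original minimax form)
    return np.mean(np.log(1 - d_fake))

def generator_loss_nonsaturating(d_fake):
    # the non-saturating alternative: maximize log D(G(z)) instead
    return -np.mean(np.log(d_fake))

# A discriminator stuck at 0.5 (cannot distinguish): -V = log 4 ~ 1.386
d = np.full(8, 0.5)
print(discriminator_loss(d, d))
```

In practice each player takes a gradient step on its own loss while the other's parameters are held fixed, alternating between the two.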

Optimal Discriminator and Nash Equilibrium

Theorem

Optimal Discriminator

Statement

For a fixed generator $G$, the optimal discriminator is:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$

where $p_G$ is the density induced by $G$. Substituting $D^*$ back into the objective gives:

$$V(G, D^*) = 2 \cdot \text{JSD}(p_{\text{data}} \| p_G) - \log 4$$

where JSD is the Jensen-Shannon divergence. The global minimum is achieved when $p_G = p_{\text{data}}$, giving $V = -\log 4$ and $D^*(x) = 1/2$ everywhere.
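The identity can be verified numerically on a small discrete example (a sketch; the probability vectors are illustrative, and sums replace expectations on a discrete support):

```python
import numpy as np

# Two discrete distributions over three outcomes (illustrative values)
p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

# Optimal discriminator: the likelihood ratio p_data / (p_data + p_g)
d_star = p_data / (p_data + p_g)

# Value of the game at D*
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# Jensen-Shannon divergence from its definition
m = 0.5 * (p_data + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q))
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(np.isclose(v, 2 * jsd - np.log(4)))  # True
```

Setting `p_g` equal to `p_data` makes `d_star` identically $0.5$ and `v` equal to $-\log 4$, matching the equilibrium claim.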

Intuition

The optimal discriminator is a likelihood ratio test. When the generator perfectly matches the data distribution, no discriminator can do better than random guessing ($D = 1/2$). The objective reduces to measuring the Jensen-Shannon divergence between real and generated distributions.

Proof Sketch

For fixed $G$, the integrand of $V$ at each point $x$ is $p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x))$. This is maximized over $D(x) \in [0,1]$ by taking the derivative and setting it to zero, giving $D^*(x) = p_{\text{data}}(x)/(p_{\text{data}}(x) + p_G(x))$. Substituting and using the definition of JSD yields the result.

Why It Matters

This theorem shows that GAN training implicitly minimizes the Jensen-Shannon divergence (a symmetrized, bounded variant of KL divergence) between the generated and real distributions. This connects the adversarial game to a well-defined statistical divergence, giving the framework a theoretical foundation.

Failure Mode

The proof assumes unlimited discriminator capacity and continuous densities. In practice, the discriminator is a finite neural network that may not approximate $D^*$ well. When $p_{\text{data}}$ and $p_G$ have disjoint supports (common in high dimensions), the JSD saturates at $\log 2$, providing zero gradient to the generator. This is the vanishing gradient problem of GANs.
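The saturation is easy to see on a toy discrete example (illustrative distributions): once the supports are disjoint, moving the generator's mass further away does not change the JSD at all, so the objective carries no information about how far off the generator is.

```python
import numpy as np

def jsd(p, q):
    # Jensen-Shannon divergence for discrete distributions (0 log 0 := 0)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # mass on the left
g_near = np.array([0.0, 0.0, 0.5, 0.5, 0.0])   # disjoint, nearby
g_far  = np.array([0.0, 0.0, 0.0, 0.5, 0.5])   # disjoint, further away

print(jsd(p_data, g_near), jsd(p_data, g_far), np.log(2))
# all three agree: the generator gets no signal to move its mass closer
```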

Mode Collapse

Definition

Mode Collapse

The generator produces samples from only a few modes of the data distribution, ignoring the rest. For example, a GAN trained on digits might produce excellent 3s and 7s but never generate 0s or 5s.

Mode collapse occurs because the generator can reduce the discriminator's accuracy by perfecting a few modes rather than covering all modes. The minimax solution (where $G$ plays first) is $p_G = p_{\text{data}}$, but the maximin solution (where $D$ plays first) can concentrate on a single mode. In practice, alternating gradient descent does not guarantee convergence to the minimax solution.

No fully reliable solution to mode collapse exists. Techniques that help include minibatch discrimination (letting the discriminator see whole batches rather than individual samples), spectral normalization, and alternative training objectives such as the Wasserstein loss.
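The convergence caveat shows up even in the simplest possible minimax game, $\min_x \max_y \, xy$, whose unique equilibrium is the origin. A toy sketch with simultaneous gradient steps (an assumption for clarity; GAN training alternates updates, but the same rotational dynamics appear):

```python
import numpy as np

# Toy minimax game V(x, y) = x * y; equilibrium at x = y = 0
x, y, lr = 1.0, 1.0, 0.1
dist = [np.hypot(x, y)]                # distance from the equilibrium
for _ in range(100):
    gx, gy = y, x                      # dV/dx = y, dV/dy = x
    x, y = x - lr * gx, y + lr * gy    # x descends, y ascends (simultaneous)
    dist.append(np.hypot(x, y))

# Each step multiplies the distance by sqrt(1 + lr^2):
# the iterates spiral outward instead of converging.
print(dist[0], dist[-1])
```

Gradient descent-ascent on this game provably diverges; GAN losses are far less benign than this bilinear toy, which is why tricks such as alternative objectives and regularization matter in practice.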

Wasserstein GAN

Proposition

Wasserstein GAN Objective

Statement

The Wasserstein-1 (earth mover's) distance between two distributions is:

$$W_1(p_{\text{data}}, p_G) = \sup_{\|f\|_L \leq 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_G}[f(x)]$$

where the supremum is over all 1-Lipschitz functions $f$. The WGAN objective replaces the discriminator with a critic $f$ (no sigmoid, no probability output) and maximizes this over $f$ while minimizing over $G$.

Intuition

The earth mover's distance measures the minimum cost of transporting mass from $p_G$ to $p_{\text{data}}$. Unlike JSD, it provides a meaningful gradient even when the distributions have disjoint supports. If $p_G$ is far from $p_{\text{data}}$, the Wasserstein distance is large and its gradient points toward $p_{\text{data}}$.
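In one dimension, $W_1$ has a closed form as the integral of the absolute difference of the two CDFs, which makes the contrast with the saturating JSD easy to check (a sketch with illustrative discrete distributions on the grid $0, 1, \dots, 4$):

```python
import numpy as np

xs     = np.arange(5.0)                         # support points 0, 1, ..., 4
p_data = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
g_near = np.array([0.0, 0.0, 0.5, 0.5, 0.0])   # disjoint from p_data
g_far  = np.array([0.0, 0.0, 0.0, 0.5, 0.5])   # disjoint, further away

def w1(p, q, xs):
    # 1-D closed form: W1 = integral of |CDF_p - CDF_q| dx
    return np.sum(np.abs(np.cumsum(p - q))[:-1] * np.diff(xs))

print(w1(p_data, g_near, xs))  # 2.0
print(w1(p_data, g_far, xs))   # 3.0
# W1 keeps growing as the generator's mass moves away, so its gradient
# still points toward the data, even though the supports stay disjoint
# (where JSD would sit at log 2 regardless).
```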

Proof Sketch

This is the Kantorovich-Rubinstein dual representation of the Wasserstein distance. The primal formulation minimizes the expected transport cost over all couplings of $p_{\text{data}}$ and $p_G$. The dual formulation replaces the coupling with a Lipschitz function. The Lipschitz constraint ensures the dual is bounded.

Why It Matters

WGAN training is more stable than standard GAN training because the Wasserstein distance does not saturate when distributions are far apart. The critic loss correlates with sample quality, unlike the standard GAN discriminator loss, which can oscillate while generation quality improves.

Failure Mode

Enforcing the Lipschitz constraint is difficult. Weight clipping (the original WGAN approach) limits the critic's capacity and can cause training instabilities. Gradient penalty (WGAN-GP) penalizes the gradient norm deviating from 1, which works better but adds computational cost. Spectral normalization is another approach.
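The two enforcement strategies can be contrasted on a toy *linear* critic $f(x) = w \cdot x$, whose input gradient is $w$ everywhere (an assumption that sidesteps autodiff; for a neural critic the gradient varies with $x$, so WGAN-GP evaluates it at random interpolates between real and fake samples):

```python
import numpy as np

w = np.array([3.0, -4.0])               # ||w|| = 5: violates 1-Lipschitz

def gradient_penalty(w, lam=10.0):
    grad_norm = np.linalg.norm(w)       # grad_x f(x) = w for a linear critic
    return lam * (grad_norm - 1.0) ** 2

print(gradient_penalty(w))              # 10 * (5 - 1)^2 = 160.0

# Original WGAN instead clips each weight to [-c, c], capping the norm at
# c * sqrt(dim) -- here ~0.014, far below 1, which wastes critic capacity
w_clipped = np.clip(w, -0.01, 0.01)
print(np.linalg.norm(w_clipped))
```

The penalty pushes the gradient norm toward 1 without hard-limiting the weights, which is why WGAN-GP trains more expressive critics than clipping.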

StyleGAN

StyleGAN (Karras et al., 2019) produces high-resolution photorealistic images by separating style from content. The latent code $z$ is first mapped to an intermediate space $w$ via a mapping network. The style vector $w$ is injected at each layer of the generator via adaptive instance normalization (AdaIN), controlling coarse features (pose, shape) at early layers and fine features (texture, color) at later layers.
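A minimal AdaIN sketch in numpy (shapes are assumptions for illustration; in StyleGAN, `style_scale` and `style_bias` come from a learned affine transform of $w$):

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-8):
    """Adaptive instance normalization.

    x:           (batch, channels, height, width) feature maps
    style_scale: (batch, channels) per-channel scale derived from w
    style_bias:  (batch, channels) per-channel shift derived from w
    """
    mu = x.mean(axis=(2, 3), keepdims=True)       # per-sample, per-channel
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)             # wipe incoming statistics
    return style_scale[:, :, None, None] * x_norm + style_bias[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
out = adain(x, np.full((2, 3), 2.0), np.full((2, 3), 5.0))
# each output feature map now has mean ~5 and std ~2, regardless of x
```

Because the normalization erases each feature map's own statistics before applying the style's scale and shift, the style at a given layer overrides whatever statistics arrived from earlier layers, which is what makes per-layer style control possible.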

Progressive growing (training at successively higher resolutions) enabled stable training at resolutions up to 1024x1024. StyleGAN2 replaced progressive growing with skip connections and a redesigned normalization, removing characteristic artifacts; StyleGAN3 further eliminated aliasing.

Why Diffusion Models Replaced GANs

For image generation, diffusion models have largely superseded GANs since 2021. The reasons:

  1. Training stability: Diffusion models optimize a simple denoising objective with no adversarial dynamics. No mode collapse, no training oscillation.
  2. Mode coverage: Diffusion models naturally cover all modes of the distribution because the denoising objective is a weighted sum over all noise levels.
  3. Composability: Classifier-free guidance and conditioning are straightforward in diffusion models. GAN conditioning requires architectural modifications.
  4. Sample quality: Diffusion models now match or exceed GAN quality on standard benchmarks (FID on ImageNet).

GANs retain advantages in inference speed (one forward pass vs. many denoising steps) and in discriminative tasks where the adversarial framework is natural.

Common Confusions

Watch Out

GANs do not learn a density

Unlike VAEs or normalizing flows, GANs do not produce an explicit density $p_G(x)$. You can sample from the generator but cannot evaluate the probability of a given sample. This makes GANs unsuitable for tasks that require density evaluation (e.g., anomaly detection via likelihood).

Watch Out

Discriminator loss going to zero does not mean training succeeded

In standard GAN training, the discriminator loss approaching zero often means the discriminator is winning too easily: it perfectly distinguishes real from fake. This indicates the generator is failing. Good GAN training has the discriminator loss hovering near $\log 2$ (random guessing). In WGAN, the critic loss is more informative and correlates positively with sample quality.
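A quick numeric sanity check of that baseline: the mean binary cross-entropy of a discriminator that outputs $0.5$ on every sample is exactly $\log 2 \approx 0.693$.

```python
import numpy as np

d_real = np.full(8, 0.5)   # D(x) on real samples: pure guessing
d_fake = np.full(8, 0.5)   # D(G(z)) on fakes: pure guessing
bce = -0.5 * (np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))
print(bce, np.log(2))      # the two values match
```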

Watch Out

FID measures distribution similarity, not individual sample quality

Fréchet Inception Distance (FID) compares the statistics (mean and covariance of Inception features) of generated and real image sets. A single excellent image does not guarantee low FID. Conversely, low FID does not guarantee every generated image looks good. FID is a distributional metric.
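The formula behind FID is the Fréchet distance between Gaussians fitted to the two feature sets, $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\big(C_1 + C_2 - 2(C_1 C_2)^{1/2}\big)$. A toy sketch assuming diagonal covariances (an assumption that turns the matrix square root into an elementwise one; real FID uses full covariances of Inception-v3 features):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}) with diagonal C_i
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

mu, var = np.array([0.0, 1.0]), np.array([1.0, 2.0])
print(fid_diagonal(mu, var, mu, var))        # 0.0: identical statistics
print(fid_diagonal(mu, var, mu + 3.0, var))  # 18.0: mean shift of 3 per dim
```

Note that the score depends only on the fitted means and covariances, which is exactly why one outstanding image (or one terrible one) barely moves it.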

Key Takeaways

  • The GAN objective is a minimax game that implicitly minimizes Jensen-Shannon divergence
  • The optimal discriminator is a likelihood ratio; at equilibrium, $D^*(x) = 1/2$ everywhere
  • Mode collapse occurs because the generator can exploit the discriminator by specializing
  • Wasserstein distance provides gradients even when distributions have disjoint supports
  • StyleGAN separates style and content for controllable high-resolution synthesis
  • Diffusion models have replaced GANs for most image generation tasks due to training stability and mode coverage

Exercises

ExerciseCore

Problem

Show that at the Nash equilibrium of the GAN game ($p_G = p_{\text{data}}$), the optimal discriminator outputs $D^*(x) = 1/2$ for all $x$, and the value of the game is $V = -\log 4$.

ExerciseAdvanced

Problem

Explain why the standard GAN generator gradient vanishes when the discriminator is too good. Specifically, if $D(G(z)) \approx 0$ for all $z$, what is the gradient $\nabla_\theta \mathbb{E}[\log(1 - D(G(z)))]$? How does the non-saturating loss $-\mathbb{E}[\log D(G(z))]$ help?

References

Canonical:

  • Goodfellow et al., "Generative Adversarial Nets" (NeurIPS 2014)
  • Arjovsky, Chintala, Bottou, "Wasserstein Generative Adversarial Networks" (ICML 2017)

Current:

  • Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks" (StyleGAN, CVPR 2019)

  • Dhariwal and Nichol, "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
