What Each Measures
Both autoregressive (AR) models and diffusion models are generative models that learn to sample from a data distribution $p(x)$, but they decompose the generation problem in different ways.
Autoregressive models factor the joint distribution as a product of conditionals:

$$p(x) = \prod_{t=1}^{n} p(x_t \mid x_{<t})$$
Generation is sequential: each token (or pixel) is sampled conditioned on all previously generated tokens.
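The sequential factorization can be sketched with a toy model. The bigram transition matrix below is a hypothetical stand-in for a learned conditional $p(x_t \mid x_{<t})$; a real transformer would condition on the full prefix:

```python
import numpy as np

# Minimal sketch of autoregressive sampling with a toy bigram model.
# P is a hypothetical stand-in for a learned conditional p(x_t | x_{<t}).
rng = np.random.default_rng(0)
V = 4  # toy vocabulary size
P = rng.dirichlet(np.ones(V), size=V)  # P[i, j] = p(next = j | current = i)

def sample_sequence(start_token, length):
    seq = [start_token]
    for _ in range(length - 1):
        probs = P[seq[-1]]                  # condition on the previous token
        seq.append(int(rng.choice(V, p=probs)))  # sample the next token
    return seq

seq = sample_sequence(start_token=0, length=8)
assert len(seq) == 8 and all(0 <= t < V for t in seq)
```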
Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps until the data becomes pure noise, then learn to reverse this process:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$$
Generation starts from random noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoises to produce a clean sample $x_0$.
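The forward noising has a closed form, $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which can be sketched directly; the linear schedule below is a common default, used here as an assumption:

```python
import numpy as np

# Sketch of the DDPM forward process at an arbitrary step t.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (common default)
alpha_bars = np.cumprod(1.0 - betas)     # cumulative signal retention

def add_noise(x0, t):
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

x0 = rng.standard_normal((8, 8))         # toy "image"
xT, _ = add_noise(x0, T - 1)
# By t = T the signal coefficient sqrt(alpha_bar_T) is tiny,
# so x_T is close to pure Gaussian noise.
assert float(np.sqrt(alpha_bars[-1])) < 0.01
```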
Side-by-Side Statement
Autoregressive Model (Transformer-based)
A transformer decoder models $p(x_t \mid x_{<t})$ using causal self-attention. Training minimizes cross-entropy (next-token prediction):

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})$$
This is the exact negative log-likelihood of the data under the model. Generation is strictly left-to-right: sample $x_1$, then $x_2$, then $x_3$, and so on.
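A small numeric sketch of the exact-likelihood property: the sequence log-likelihood is the sum of per-token conditional log-probabilities, which is exactly what cross-entropy training minimizes. The per-token probabilities below are hypothetical stand-ins for a model's outputs:

```python
import numpy as np

# Hypothetical conditional probabilities p(x_t | x_{<t}) for a 3-token sequence.
token_probs = np.array([0.5, 0.25, 0.125])

# Chain rule: log p(x) = sum_t log p(x_t | x_{<t}).
log_likelihood = np.sum(np.log(token_probs))

# The training loss (NLL) is the negation of this; here p(x) = 1/64.
nll = -log_likelihood
assert abs(log_likelihood + np.log(64)) < 1e-9
```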
Diffusion Model (DDPM)
The forward process adds noise at each step $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right)$$
The model learns a denoiser $\epsilon_\theta(x_t, t)$ that predicts the noise added at step $t$. Training minimizes:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$
This is a reweighted variational lower bound on the log-likelihood. Generation runs the reverse process from $t = T$ down to $t = 0$.
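One training step of this objective can be sketched as follows. The `denoiser` here is a hypothetical placeholder standing in for the real neural network:

```python
import numpy as np

# Sketch of one DDPM training step: sample t, noise x_0, regress the noise.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def denoiser(xt, t):
    return xt  # placeholder: a real network would predict the added noise eps

def training_loss(x0):
    t = rng.integers(T)                  # uniform random timestep
    eps = rng.standard_normal(x0.shape)  # the noise to be predicted
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    pred = denoiser(xt, t)
    return float(np.mean((pred - eps) ** 2))  # simple MSE (reweighted ELBO)

loss = training_loss(rng.standard_normal((8, 8)))
assert loss >= 0.0
```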
Where Each Is Stronger
Autoregressive wins on text
Text is naturally sequential, discrete, and has strong left-to-right dependencies. The autoregressive factorization aligns perfectly with linguistic structure: the meaning of a word depends on the words before it.
Key advantages for text:
- Exact likelihood: AR models compute $\log p(x)$ exactly, enabling perplexity evaluation and principled model comparison.
- Discrete tokens: Text is a sequence of discrete tokens from a finite vocabulary. Categorical distributions are natural and efficient.
- Variable length: AR models handle variable-length sequences natively (generate until an end-of-sequence token).
- In-context learning: The sequential nature enables few-shot prompting because the model conditions on examples provided in the prefix.
Diffusion wins on images
Images are 2D, continuous, and have spatially local structure without a strong left-to-right ordering. Diffusion models exploit this:
- Parallel generation: At each denoising step, all pixels are updated simultaneously. This is naturally parallelizable on GPUs.
- Continuous data: Pixel values are continuous; the Gaussian noise process and denoising operate in continuous space without discretization artifacts.
- Global coherence: Diffusion generates coarse structure first (at high noise levels) and refines details later (at low noise levels), naturally producing globally coherent images.
- No ordering problem: Images have no natural serialization order. AR models must impose one (raster scan, spiral, etc.), creating artificial asymmetries.
Autoregressive wins on exact likelihood
AR models compute $\log p(x)$ exactly. This enables:
- Principled model comparison via perplexity
- Exact computation of conditional probabilities
- Lossless compression (arithmetic coding)
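The compression connection in the last bullet can be sketched numerically: an exact likelihood gives an ideal lossless code length of $-\log_2 p(x)$ bits, which arithmetic coding approaches. The probabilities below are hypothetical:

```python
import numpy as np

# Hypothetical per-token probabilities under an AR model.
token_probs = np.array([0.5, 0.25, 0.125, 0.5])

# Ideal code length in bits: -log2 p(x) = -sum_t log2 p(x_t | x_{<t}).
# Here: 1 + 2 + 3 + 1 = 7 bits for the whole sequence.
bits = -np.sum(np.log2(token_probs))
assert bits == 7.0
```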
Diffusion models optimize a variational lower bound (ELBO), not the exact log-likelihood. Computing the exact $\log p(x)$ requires expensive probability flow ODE integration.
Diffusion wins on sample quality for images
Diffusion models (DALL-E 3, Stable Diffusion, Imagen) produce the highest-quality images, surpassing GANs in both FID scores and human evaluation. The iterative refinement process and the Gaussian noise framework provide stable training without the mode collapse issues of GANs.
Where Each Fails
Autoregressive fails on generation speed
Generating a sequence of length $n$ requires $n$ sequential forward passes through the model. For long sequences, this is slow: a 4096-token response from an LLM requires 4096 serial steps. Speculative decoding and parallel drafting mitigate this but do not eliminate the fundamental bottleneck.
For images, AR generation is particularly painful: a 256×256 image has 65,536 pixels, requiring 65,536 sequential steps (though recent work operates on compressed token sequences).
Diffusion fails on discrete data
The core mechanism (adding and removing Gaussian noise) is designed for continuous data. Applying diffusion to discrete text requires either:
- Embedding tokens in continuous space (Diffusion-LM, Plaid)
- Absorbing-state diffusion that masks tokens (D3PM-style)
- Continuous relaxations of discrete variables
None of these approaches matches the quality of AR models on text benchmarks. The discrete nature of language makes the Gaussian noise process a poor fit.
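The absorbing-state (masking) option above can be sketched as a forward process that independently replaces each token with a MASK symbol; the linear schedule and toy vocabulary here are assumptions for illustration:

```python
import numpy as np

# Sketch of an absorbing-state forward process for discrete tokens:
# each token is independently absorbed into MASK with probability t / T.
rng = np.random.default_rng(0)
MASK = -1

def mask_tokens(tokens, t, T):
    keep = rng.random(tokens.shape) >= t / T  # survives with prob 1 - t/T
    return np.where(keep, tokens, MASK)

tokens = np.arange(10)
fully_masked = mask_tokens(tokens, t=10, T=10)  # at t = T everything is absorbed
assert np.all(fully_masked == MASK)
```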
Autoregressive fails on spatial coherence (for images)
AR models must serialize the 2D image into a 1D sequence. The chosen ordering (typically raster scan) means that nearby pixels in 2D can be far apart in the sequence. This makes it harder to maintain spatial coherence. Recent approaches (VQ-VAE tokenization + AR) mitigate this by operating on compressed discrete tokens, but diffusion models still produce more spatially coherent outputs.
Diffusion fails on speed of sampling
Standard diffusion requires hundreds to thousands of denoising steps. Even with acceleration (DDIM, DPM-Solver, consistency models), generation is generally slower than single-pass methods like GANs. A GAN generates an image in one forward pass; typical accelerated diffusion samplers still need roughly 20 to 50 steps for good quality.
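One deterministic DDIM update can be sketched as follows; jumping from $t$ to a much earlier $t'$ in one update is what lets samplers use far fewer steps than the training process. `eps_pred` is a hypothetical placeholder for the trained denoiser's output:

```python
import numpy as np

# Sketch of one deterministic DDIM update (skipping 50 steps at once).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_pred, t, t_prev):
    # Reconstruct the x_0 estimate implied by the noise prediction,
    # then re-noise it to the earlier step t_prev (no fresh randomness).
    x0_pred = (xt - np.sqrt(1 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_bars[t])
    return np.sqrt(alpha_bars[t_prev]) * x0_pred \
        + np.sqrt(1 - alpha_bars[t_prev]) * eps_pred

rng = np.random.default_rng(0)
xt = rng.standard_normal((8, 8))
# Placeholder eps_pred; a real sampler calls the trained denoiser here.
x_prev = ddim_step(xt, eps_pred=np.zeros_like(xt), t=999, t_prev=949)
assert x_prev.shape == (8, 8)
```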
Different Inductive Biases
The two paradigms encode different assumptions about data structure:
| | Autoregressive | Diffusion |
|---|---|---|
| Factorization | Sequential conditionals | Noise scale hierarchy |
| Data type | Discrete tokens (natural) | Continuous signals (natural) |
| Generation order | Fixed left-to-right | Coarse-to-fine (all positions) |
| Parallelism | Sequential at generation | Parallel at each step |
| Training signal | Exact log-likelihood | Denoising score matching |
| Architecture | Causal transformer | U-Net or DiT |
| Conditioning | Prefix (prompt) | Cross-attention / classifier-free guidance |
The Convergence: Hybrid Approaches
The boundary between AR and diffusion is blurring:
- Discrete diffusion for text: Models like MDLM apply masked diffusion to discrete tokens, enabling parallel text generation at the cost of some quality.
- AR for images via tokenization: VQ-VAE compresses images into discrete token sequences, then AR models generate those tokens. This powers DALL-E 1 and Parti.
- DiT (Diffusion Transformers): Replace the U-Net backbone of diffusion with a transformer, borrowing architectural ideas from AR models.
- Consistency models: Distill diffusion into single-step generators, approaching GAN-like speed with diffusion-like quality.
What to Memorize
- AR = next-token prediction, sequential, exact likelihood, discrete
- Diffusion = iterative denoising, parallel per step, variational bound, continuous
- AR dominates text because language is sequential and discrete
- Diffusion dominates images because images are 2D, continuous, and have no natural ordering
- AR is slow at generation ($n$ serial steps); diffusion is slow too (many denoising steps) but each step is parallel
- Training: AR minimizes cross-entropy; diffusion minimizes denoising MSE (a reweighted ELBO)
When a Researcher Would Use Each
Building a large language model
Use an autoregressive transformer (GPT-style). There is no serious competitor for text generation quality. The sequential factorization matches linguistic structure, cross-entropy training is stable, and the resulting model supports in-context learning, chain-of-thought, and instruction following. Every frontier LLM (GPT-4, Claude, LLaMA) is autoregressive.
Generating photorealistic images from text
Use a diffusion model (Stable Diffusion, DALL-E 3 style). Diffusion produces the highest-quality images with stable training. Use classifier-free guidance to control the text-image alignment. The iterative refinement naturally handles the multi-scale structure of images.
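Classifier-free guidance combines the model's conditional and unconditional noise predictions with a guidance scale $w$. A minimal sketch, with hypothetical prediction values in place of real model outputs:

```python
import numpy as np

# Sketch of classifier-free guidance at one denoising step:
# eps_guided = eps_uncond + w * (eps_cond - eps_uncond).
def cfg(eps_uncond, eps_cond, w=7.5):
    return eps_uncond + w * (eps_cond - eps_uncond)

# Hypothetical predictions; real values come from two denoiser passes
# (with and without the text conditioning).
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)
guided = cfg(eps_uncond, eps_cond, w=7.5)
# w > 1 pushes the prediction further toward the conditional direction.
assert np.allclose(guided, 7.5)
```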
Video generation
Use diffusion, potentially with temporal AR components. Video is continuous and spatial (favoring diffusion) but also has temporal ordering (favoring AR). Current state-of-the-art models (Sora-class) typically use diffusion in a latent space with some form of temporal conditioning.
Code generation and completion
Use an autoregressive transformer. Code is sequential, discrete, and has strong syntactic dependencies. The AR factorization handles these naturally, and in-context learning enables the model to follow instructions and examples in the prompt.
Common Confusions
Diffusion models are not GANs
Diffusion and GANs are both used for image generation, but they differ. GANs train a generator and discriminator in a minimax game; diffusion trains a single denoiser with a simple MSE loss. Diffusion training is much more stable (no mode collapse, no training instability), which is why diffusion has largely replaced GANs.
Autoregressive does not mean recurrent
Modern AR models (GPT, LLaMA) are transformer-based, not RNN-based. The autoregressive property refers to the factorization of the distribution (each token conditioned on previous tokens), not the architecture. The transformer processes all previous tokens in parallel during each step via causal self-attention.
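The causal self-attention point can be sketched with a plain mask over attention logits; the uniform logits here are hypothetical, standing in for real query-key scores:

```python
import numpy as np

# Sketch: the autoregressive property comes from a causal mask on
# attention scores, not from recurrence.
n = 4
scores = np.zeros((n, n))                         # hypothetical raw logits
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
scores[mask] = -np.inf                            # block attention to the future
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Row t attends only over positions 0..t; future positions get zero weight.
assert np.allclose(np.triu(weights, k=1), 0.0)
assert np.allclose(weights.sum(axis=-1), 1.0)
```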
More denoising steps does not always mean better quality
Beyond a threshold (typically 50 to 100 steps with a good solver), additional denoising steps give diminishing returns. The ODE/SDE formulation of diffusion shows that the sampling trajectory is determined by the score function; more steps merely approximate that trajectory more finely, and modern solvers like DPM-Solver++ already approximate it well with few steps.