
Comparison

Autoregressive Models vs. Diffusion Models

Autoregressive models generate tokens sequentially via next-token prediction; diffusion models generate data by iteratively denoising from Gaussian noise. Sequential discrete generation vs. parallel continuous denoising: why LLMs dominate text and diffusion dominates images.

What Each Measures

Both autoregressive (AR) models and diffusion models are generative models that learn to sample from a data distribution $p(x)$. They decompose the generation problem in different ways.

Autoregressive models factor the joint distribution as a product of conditionals:

$$p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

Generation is sequential: each token (or pixel) is sampled conditioned on all previously generated tokens.
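The chain-rule factorization maps directly onto a sampling loop. A minimal sketch in Python, where `toy_conditional` is a hypothetical stand-in (it ignores its prefix and returns an arbitrary distribution rather than a learned one):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_conditional(prefix, vocab_size=4):
    """Stand-in for p(x_t | x_<t): any function returning a
    probability vector over the vocabulary (hypothetical model)."""
    logits = rng.normal(size=vocab_size)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def ar_sample(T, vocab_size=4):
    """Sample x_1..x_T one token at a time: each draw conditions
    on everything generated so far."""
    seq = []
    for _ in range(T):
        probs = toy_conditional(seq, vocab_size)
        seq.append(int(rng.choice(vocab_size, p=probs)))
    return seq

sample = ar_sample(5)  # five strictly sequential draws
```

The loop structure is the point: step $t$ cannot begin until step $t-1$ has produced its token, which is exactly the serial bottleneck discussed later.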

Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps until the data becomes pure noise, then learn to reverse this process:

$$p_\theta(x_0) = \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \, dx_{1:T}$$

Generation starts from random noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoises to produce a clean sample $x_0$.
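The reverse process has the same loop shape regardless of the model: start from Gaussian noise and repeatedly sample $x_{t-1}$ given $x_t$. A schematic sketch, with a hypothetical placeholder where a learned reverse kernel would go:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_step(x_t, t):
    """Stand-in for one draw from p_theta(x_{t-1} | x_t); a real
    model would use a learned mean and variance (placeholder)."""
    return 0.99 * x_t + 0.01 * rng.normal(size=x_t.shape)

T, dim = 10, 3
x = rng.normal(size=dim)       # x_T ~ N(0, I): start from pure noise
for t in range(T, 0, -1):      # denoise from t = T down to t = 1
    x = reverse_step(x, t)
# x now plays the role of the model's sample x_0
```

Note the contrast with the AR loop: here every coordinate of `x` is updated simultaneously at each step; the serial axis is the noise level, not the position.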

Side-by-Side Statement

Definition

Autoregressive Model (Transformer-based)

A transformer decoder models $p(x_t \mid x_{<t})$ using causal self-attention. Training minimizes cross-entropy (next-token prediction):

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

This is the exact negative log-likelihood of the data under the model. Generation is strictly left-to-right: sample $x_1$, then $x_2 \mid x_1$, then $x_3 \mid x_1, x_2$, and so on.
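The cross-entropy objective can be computed in a few lines of numpy. This sketch assumes the model has already produced a matrix of next-token logits, one row per position, each conditioned on the true prefix (teacher forcing):

```python
import numpy as np

def ar_nll(logits, targets):
    """Exact negative log-likelihood under the AR factorization:
    sum over positions of -log p(x_t | x_<t).
    logits:  (T, V) next-token logits; row t is conditioned on x_<t
    targets: (T,)   the observed tokens x_1..x_T"""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
loss = ar_nll(rng.normal(size=(5, 10)), rng.integers(0, 10, size=5))
```

A sanity check: with all-zero logits the model is uniform over a vocabulary of size $V$, so the loss is $T \log V$.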

Definition

Diffusion Model (DDPM)

The forward process adds noise at each step $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

The model learns a denoiser $\epsilon_\theta(x_t, t)$ that predicts the noise added at step $t$. Training minimizes:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right)\right\|^2\right]$$

This is a reweighted variational lower bound on the log-likelihood. Generation runs the reverse process from $x_T$ to $x_0$.
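Because the forward process has a closed form at any step $t$ (via $\bar{\alpha}_t$), training never needs to simulate the noise chain step by step. A sketch of one Monte Carlo estimate of the loss, using a standard linear $\beta$ schedule and a hypothetical denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # common DDPM noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t, monotone decreasing

def diffusion_loss(denoiser, x0):
    """One Monte Carlo term of L_diff: jump to a random step t in
    closed form, then score the denoiser's noise prediction (MSE)."""
    t = rng.integers(0, T)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return np.mean((eps - denoiser(x_t, t)) ** 2)

# hypothetical untrained denoiser that always predicts zero noise
loss = diffusion_loss(lambda x_t, t: np.zeros_like(x_t),
                      x0=rng.normal(size=16))
```

Sampling a random $t$ per example is what makes each training step cheap: the cost is one denoiser evaluation, independent of $T$.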

Where Each Is Stronger

Autoregressive wins on text

Text is naturally sequential, discrete, and has strong left-to-right dependencies. The autoregressive factorization aligns closely with linguistic structure: the meaning of a word depends on the words before it.

Key advantages for text: the factorization matches the discrete, ordered nature of language; cross-entropy training is stable and optimizes the exact log-likelihood; and prefix conditioning directly supports prompting, in-context learning, and instruction following.

Diffusion wins on images

Images are 2D, continuous, and have spatially local structure without a strong left-to-right ordering. Diffusion models exploit this: the Gaussian noise process acts on continuous pixel values directly, every spatial position is updated in parallel at each denoising step, and generation proceeds coarse-to-fine, settling global structure before fine detail.

Autoregressive wins on exact likelihood

AR models compute $\log p(x) = \sum_t \log p(x_t \mid x_{<t})$ exactly. This enables principled perplexity evaluation, direct model comparison on held-out likelihood, and lossless compression via arithmetic coding.

Diffusion models optimize a variational lower bound (ELBO), not the exact log-likelihood. Computing the exact $\log p(x)$ requires expensive probability flow ODE integration.
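The practical payoff of the exact factorization: summing per-token log-probabilities gives both $\log p(x)$ and perplexity in closed form. A small worked example (the token probabilities are chosen purely for illustration):

```python
import numpy as np

def sequence_log_prob(token_log_probs):
    """AR models expose log p(x) exactly as a sum of per-token terms,
    so perplexity = exp(-mean log p) is directly computable."""
    total = float(np.sum(token_log_probs))
    ppl = float(np.exp(-np.mean(token_log_probs)))
    return total, ppl

# three tokens with probabilities 0.5, 0.25, 0.5 under the model
logp, ppl = sequence_log_prob(np.log([0.5, 0.25, 0.5]))
# log p(x) = log(0.5 * 0.25 * 0.5) = log(1/16)
```

For a diffusion model there is no analogue of this two-liner: the same quantity requires integrating the probability flow ODE per example.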

Diffusion wins on sample quality for images

Diffusion models (DALL-E 3, Stable Diffusion, Imagen) produce the highest-quality images, surpassing GANs in both FID scores and human evaluation. The iterative refinement process and the Gaussian noise framework provide stable training without the mode collapse issues of GANs.

Where Each Fails

Autoregressive fails on generation speed

Generating a sequence of length $T$ requires $T$ sequential forward passes through the model. For long sequences, this is slow: a 4096-token response from an LLM requires 4096 serial steps. Speculative decoding and parallel drafting mitigate this but do not eliminate the fundamental bottleneck.

For images, AR generation is particularly painful: a 256x256 image has 65,536 pixels, requiring 65,536 sequential steps (though recent work operates on compressed token sequences).

Diffusion fails on discrete data

The core mechanism (adding and removing Gaussian noise) is designed for continuous data. Applying diffusion to discrete text requires either embedding tokens in a continuous space and rounding the output back to tokens, replacing Gaussian noise with a discrete corruption process (such as masking or uniform token replacement), or diffusing in a continuous latent space and decoding afterward.

None of these approaches matches the quality of AR models on text benchmarks. The discrete nature of language makes the Gaussian noise process a poor fit.

Autoregressive fails on spatial coherence (for images)

AR models must serialize the 2D image into a 1D sequence. The chosen ordering (typically raster scan) means that nearby pixels in 2D can be far apart in the sequence. This makes it harder to maintain spatial coherence. Recent approaches (VQ-VAE tokenization + AR) mitigate this by operating on compressed discrete tokens, but diffusion models still produce more spatially coherent outputs.

Diffusion fails on speed of sampling

Standard diffusion requires hundreds to thousands of denoising steps. Even with acceleration (DDIM, DPM-Solver, consistency models), generation is slower than single-pass methods like GANs. A GAN generates an image in one forward pass; diffusion needs 20 to 50 steps at minimum for good quality.
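Accelerated samplers reduce wall-clock time chiefly by skipping timesteps: the denoiser is evaluated on a strided subset of the training schedule rather than all $T$ steps. A sketch of such a DDIM-style schedule (the step counts are illustrative):

```python
import numpy as np

T = 1000  # number of steps the model was trained with

def strided_timesteps(num_steps, T=T):
    """Evenly strided subset of the T training timesteps, visited
    from most noisy (T-1) down to clean (0)."""
    return np.linspace(T - 1, 0, num_steps).round().astype(int)

steps = strided_timesteps(50)  # 50 denoiser evaluations instead of 1000
```

Each entry in `steps` costs one full forward pass of the denoiser, which is why even "fast" diffusion at 20 to 50 steps remains slower than a single-pass GAN.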

Different Inductive Biases

The two paradigms encode different assumptions about data structure:

| | Autoregressive | Diffusion |
|---|---|---|
| Factorization | Sequential conditionals | Noise scale hierarchy |
| Data type | Discrete tokens (natural) | Continuous signals (natural) |
| Generation order | Fixed left-to-right | Coarse-to-fine (all positions) |
| Parallelism | Sequential at generation | Parallel at each step |
| Training signal | Exact log-likelihood | Denoising score matching |
| Architecture | Causal transformer | U-Net or DiT |
| Conditioning | Prefix (prompt) | Cross-attention / classifier-free guidance |

The Convergence: Hybrid Approaches

The boundary between AR and diffusion is blurring: discrete and masked diffusion models are being applied to text, autoregressive models generate images over compressed VQ token sequences, and video systems combine latent diffusion with temporal autoregressive conditioning.

What to Memorize

  1. AR = next-token prediction, sequential, exact likelihood, discrete
  2. Diffusion = iterative denoising, parallel per step, variational bound, continuous
  3. AR dominates text because language is sequential and discrete
  4. Diffusion dominates images because images are 2D, continuous, and have no natural ordering
  5. AR is slow at generation ($T$ serial steps); diffusion is slow too (many denoising steps) but each step is parallel
  6. Training: AR minimizes cross-entropy; diffusion minimizes denoising MSE (a reweighted ELBO)

When a Researcher Would Use Each

Example

Building a large language model

Use autoregressive transformer (GPT-style). There is no serious competitor for text generation quality. The sequential factorization matches linguistic structure, cross-entropy training is stable, and the resulting model supports in-context learning, chain-of-thought, and instruction following. Every frontier LLM (GPT-4, Claude, LLaMA) is autoregressive.

Example

Generating photorealistic images from text

Use diffusion model (Stable Diffusion, DALL-E 3 style). Diffusion produces the highest-quality images with stable training. Use classifier-free guidance to control the text-image alignment. The iterative refinement naturally handles the multi-scale structure of images.

Example

Video generation

Use diffusion, potentially with temporal AR components. Video is continuous and spatial (favoring diffusion) but also has temporal ordering (favoring AR). Current state-of-the-art models (Sora-class) typically use diffusion in a latent space with some form of temporal conditioning.

Example

Code generation and completion

Use autoregressive transformer. Code is sequential, discrete, and has strong syntactic dependencies. The AR factorization handles these naturally, and in-context learning enables the model to follow instructions and examples in the prompt.

Common Confusions

Watch Out

Diffusion models are not GANs

Diffusion and GANs are both used for image generation, but they differ. GANs train a generator and discriminator in a minimax game; diffusion trains a single denoiser with a simple MSE loss. Diffusion training is much more stable (no mode collapse, no training instability), which is why diffusion has largely replaced GANs.

Watch Out

Autoregressive does not mean recurrent

Modern AR models (GPT, LLaMA) are transformer-based, not RNN-based. The autoregressive property refers to the factorization of the distribution (each token conditioned on previous tokens), not the architecture. The transformer processes all previous tokens in parallel during each step via causal self-attention.
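The causal mask is what turns a parallel transformer into an autoregressive model: position $t$ may attend only to positions $\le t$. A minimal sketch:

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular boolean mask: entry (t, s) is True iff
    position t is allowed to attend to position s, i.e. s <= t.
    Applying it to attention scores enforces the AR factorization."""
    return np.tril(np.ones((T, T), dtype=bool))

m = causal_mask(4)
# row t is True only at columns 0..t
```

During training, this single mask lets one forward pass compute all $T$ conditionals $p(x_t \mid x_{<t})$ at once; only generation is forced to be sequential.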

Watch Out

More denoising steps does not always mean better quality

Beyond a threshold (typically 50 to 100 steps with a good solver), additional denoising steps give diminishing returns: the ODE/SDE formulation of diffusion shows that the sampling trajectory is determined by the score function, and more steps merely approximate that trajectory more finely. Modern solvers like DPM-Solver++ reach this regime with far fewer steps.