
Comparison

Autoregressive Models vs. Diffusion Models

Autoregressive models generate tokens sequentially via next-token prediction; diffusion models generate data by iteratively denoising from Gaussian noise. Sequential discrete generation vs. parallel continuous denoising: why LLMs dominate text and diffusion dominates images.

What Each Measures

Both autoregressive (AR) models and diffusion models are generative models that learn to sample from a data distribution $p(x)$. They decompose the generation problem in different ways.

Autoregressive models factor the joint distribution as a product of conditionals:

$$p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

Generation is sequential: each token (or pixel) is sampled conditioned on all previously generated tokens.
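The chain-rule factorization maps directly onto a sampling loop. A minimal sketch in Python, where `toy_conditional` is a hypothetical stand-in (it ignores its prefix and returns an arbitrary distribution rather than a learned one):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_conditional(prefix, vocab_size=4):
    """Stand-in for p(x_t | x_<t): any function returning a
    probability vector over the vocabulary (hypothetical model)."""
    logits = rng.normal(size=vocab_size)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def ar_sample(T, vocab_size=4):
    """Sample x_1..x_T one token at a time: each draw conditions
    on everything generated so far."""
    seq = []
    for _ in range(T):
        probs = toy_conditional(seq, vocab_size)
        seq.append(int(rng.choice(vocab_size, p=probs)))
    return seq

sample = ar_sample(5)  # five strictly sequential draws
```

The loop structure is the point: step $t$ cannot begin until step $t-1$ has produced its token, which is exactly the serial bottleneck discussed later.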

Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps until the data becomes pure noise, then learn to reverse this process:

$$p_\theta(x_0) = \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \, dx_{1:T}$$

Generation starts from random noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoises to produce a clean sample $x_0$.
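The reverse process has the same loop shape regardless of the model: start from Gaussian noise and repeatedly sample $x_{t-1}$ given $x_t$. A schematic sketch, with a hypothetical placeholder where a learned reverse kernel would go:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_step(x_t, t):
    """Stand-in for one draw from p_theta(x_{t-1} | x_t); a real
    model would use a learned mean and variance (placeholder)."""
    return 0.99 * x_t + 0.01 * rng.normal(size=x_t.shape)

T, dim = 10, 3
x = rng.normal(size=dim)       # x_T ~ N(0, I): start from pure noise
for t in range(T, 0, -1):      # denoise from t = T down to t = 1
    x = reverse_step(x, t)
# x now plays the role of the model's sample x_0
```

Note the contrast with the AR loop: here every coordinate of `x` is updated simultaneously at each step; the serial axis is the noise level, not the position.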

Side-by-Side Statement

Definition

Autoregressive Model (Transformer-based)

A transformer decoder models $p(x_t \mid x_{<t})$ using causal self-attention. Training minimizes cross-entropy (next-token prediction):

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

This is the exact negative log-likelihood of the data under the model. Generation is strictly left-to-right: sample $x_1$, then $x_2 \mid x_1$, then $x_3 \mid x_1, x_2$, and so on.
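The cross-entropy objective can be computed in a few lines of numpy. This sketch assumes the model has already produced a matrix of next-token logits, one row per position, each conditioned on the true prefix (teacher forcing):

```python
import numpy as np

def ar_nll(logits, targets):
    """Exact negative log-likelihood under the AR factorization:
    sum over positions of -log p(x_t | x_<t).
    logits:  (T, V) next-token logits; row t is conditioned on x_<t
    targets: (T,)   the observed tokens x_1..x_T"""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
loss = ar_nll(rng.normal(size=(5, 10)), rng.integers(0, 10, size=5))
```

A sanity check: with all-zero logits the model is uniform over a vocabulary of size $V$, so the loss is $T \log V$.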

Definition

Diffusion Model (DDPM)

The forward process adds noise at each step $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

The model learns a denoiser $\epsilon_\theta(x_t, t)$ that predicts the noise added at step $t$. Training minimizes:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right)\right\|^2\right]$$

This is a reweighted variational lower bound on the log-likelihood. Generation runs the reverse process from $x_T$ to $x_0$.
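Because the forward process has a closed form at any step $t$ (via $\bar{\alpha}_t$), training never needs to simulate the noise chain step by step. A sketch of one Monte Carlo estimate of the loss, using a standard linear $\beta$ schedule and a hypothetical denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # common DDPM noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t, monotone decreasing

def diffusion_loss(denoiser, x0):
    """One Monte Carlo term of L_diff: jump to a random step t in
    closed form, then score the denoiser's noise prediction (MSE)."""
    t = rng.integers(0, T)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return np.mean((eps - denoiser(x_t, t)) ** 2)

# hypothetical untrained denoiser that always predicts zero noise
loss = diffusion_loss(lambda x_t, t: np.zeros_like(x_t),
                      x0=rng.normal(size=16))
```

Sampling a random $t$ per example is what makes each training step cheap: the cost is one denoiser evaluation, independent of $T$.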

Where Each Is Stronger

Autoregressive wins on text

Text is naturally sequential, discrete, and has strong left-to-right dependencies. The autoregressive factorization aligns closely with linguistic structure: the meaning of a word depends on the words before it.

Key advantages for text: the factorization matches the discrete, ordered nature of language; cross-entropy training is stable and optimizes the exact log-likelihood; and prefix conditioning directly supports prompting, in-context learning, and instruction following.

Diffusion wins on images

Images are 2D, continuous, and have spatially local structure without a strong left-to-right ordering. Diffusion models exploit this: the Gaussian noise process acts on continuous pixel values directly, every spatial position is updated in parallel at each denoising step, and generation proceeds coarse-to-fine, settling global structure before fine detail.

Autoregressive wins on exact likelihood

AR models compute $\log p(x) = \sum_t \log p(x_t \mid x_{<t})$ exactly. This enables principled perplexity evaluation, direct model comparison on held-out likelihood, and lossless compression via arithmetic coding.

Diffusion models optimize a variational lower bound (ELBO), not the exact log-likelihood. Computing the exact $\log p(x)$ requires expensive probability flow ODE integration.
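The practical payoff of the exact factorization: summing per-token log-probabilities gives both $\log p(x)$ and perplexity in closed form. A small worked example (the token probabilities are chosen purely for illustration):

```python
import numpy as np

def sequence_log_prob(token_log_probs):
    """AR models expose log p(x) exactly as a sum of per-token terms,
    so perplexity = exp(-mean log p) is directly computable."""
    total = float(np.sum(token_log_probs))
    ppl = float(np.exp(-np.mean(token_log_probs)))
    return total, ppl

# three tokens with probabilities 0.5, 0.25, 0.5 under the model
logp, ppl = sequence_log_prob(np.log([0.5, 0.25, 0.5]))
# log p(x) = log(0.5 * 0.25 * 0.5) = log(1/16)
```

For a diffusion model there is no analogue of this two-liner: the same quantity requires integrating the probability flow ODE per example.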

Diffusion wins on sample quality for images

Diffusion models (DALL-E 3, Stable Diffusion, Imagen) produce the highest-quality images, surpassing GANs in both FID scores and human evaluation. The iterative refinement process and the Gaussian noise framework provide stable training without the mode collapse issues of GANs.

Where Each Fails

Autoregressive fails on generation speed

Generating a sequence of length $T$ requires $T$ sequential forward passes through the model. For long sequences, this is slow: a 4096-token response from an LLM requires 4096 serial steps. Speculative decoding and parallel drafting mitigate this but do not eliminate the fundamental bottleneck.

For images, AR generation is particularly painful: a 256x256 image has 65,536 pixels, requiring 65,536 sequential steps (though recent work operates on compressed token sequences).

Diffusion fails on discrete data

The core mechanism (adding and removing Gaussian noise) is designed for continuous data. Applying diffusion to discrete text requires either embedding tokens in a continuous space and rounding the output back to tokens, replacing Gaussian noise with a discrete corruption process (such as masking or uniform token replacement), or diffusing in a continuous latent space and decoding afterward.

None of these approaches matches the quality of AR models on text benchmarks. The discrete nature of language makes the Gaussian noise process a poor fit.

Autoregressive fails on spatial coherence (for images)

AR models must serialize the 2D image into a 1D sequence. The chosen ordering (typically raster scan) means that nearby pixels in 2D can be far apart in the sequence. This makes it harder to maintain spatial coherence. Recent approaches (VQ-VAE tokenization + AR) mitigate this by operating on compressed discrete tokens, but diffusion models still produce more spatially coherent outputs.

Diffusion fails on speed of sampling

Standard diffusion requires hundreds to thousands of denoising steps. Even with acceleration (DDIM, DPM-Solver, consistency models), generation is slower than single-pass methods like GANs. A GAN generates an image in one forward pass; diffusion needs 20 to 50 steps at minimum for good quality.
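Accelerated samplers reduce wall-clock time chiefly by skipping timesteps: the denoiser is evaluated on a strided subset of the training schedule rather than all $T$ steps. A sketch of such a DDIM-style schedule (the step counts are illustrative):

```python
import numpy as np

T = 1000  # number of steps the model was trained with

def strided_timesteps(num_steps, T=T):
    """Evenly strided subset of the T training timesteps, visited
    from most noisy (T-1) down to clean (0)."""
    return np.linspace(T - 1, 0, num_steps).round().astype(int)

steps = strided_timesteps(50)  # 50 denoiser evaluations instead of 1000
```

Each entry in `steps` costs one full forward pass of the denoiser, which is why even "fast" diffusion at 20 to 50 steps remains slower than a single-pass GAN.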

Different Inductive Biases

The two paradigms encode different assumptions about data structure:

| | Autoregressive | Diffusion |
|---|---|---|
| Factorization | Sequential conditionals | Noise scale hierarchy |
| Data type | Discrete tokens (natural) | Continuous signals (natural) |
| Generation order | Fixed left-to-right | Coarse-to-fine (all positions) |
| Parallelism | Sequential at generation | Parallel at each step |
| Training signal | Exact log-likelihood | Denoising score matching |
| Architecture | Causal transformer | U-Net or DiT |
| Conditioning | Prefix (prompt) | Cross-attention / classifier-free guidance |

The Convergence: Hybrid Approaches

The boundary between AR and diffusion is blurring: discrete and masked diffusion models are being applied to text, autoregressive models generate images over compressed VQ token sequences, and video systems combine latent diffusion with temporal autoregressive conditioning.

What to Memorize

  1. AR = next-token prediction, sequential, exact likelihood, discrete
  2. Diffusion = iterative denoising, parallel per step, variational bound, continuous
  3. AR dominates text because language is sequential and discrete
  4. Diffusion dominates images because images are 2D, continuous, and have no natural ordering
  5. AR is slow at generation ($T$ serial steps); diffusion is slow too (many denoising steps) but each step is parallel
  6. Training: AR minimizes cross-entropy; diffusion minimizes denoising MSE (a reweighted ELBO)

When a Researcher Would Use Each

Example

Building a large language model

Use autoregressive transformer (GPT-style). There is no serious competitor for text generation quality. The sequential factorization matches linguistic structure, cross-entropy training is stable, and the resulting model supports in-context learning, chain-of-thought, and instruction following. Every frontier LLM (GPT-4, Claude, LLaMA) is autoregressive.

Example

Generating photorealistic images from text

Use diffusion model (Stable Diffusion, DALL-E 3 style). Diffusion produces the highest-quality images with stable training. Use classifier-free guidance to control the text-image alignment. The iterative refinement naturally handles the multi-scale structure of images.

Example

Video generation

Use diffusion, potentially with temporal AR components. Video is continuous and spatial (favoring diffusion) but also has temporal ordering (favoring AR). Current state-of-the-art models (Sora-class) typically use diffusion in a latent space with some form of temporal conditioning.

Example

Code generation and completion

Use autoregressive transformer. Code is sequential, discrete, and has strong syntactic dependencies. The AR factorization handles these naturally, and in-context learning enables the model to follow instructions and examples in the prompt.

Common Confusions

Watch Out

Diffusion models are not GANs

Diffusion and GANs are both used for image generation, but they differ. GANs train a generator and discriminator in a minimax game; diffusion trains a single denoiser with a simple MSE loss. Diffusion training is much more stable (no mode collapse, no training instability), which is why diffusion has largely replaced GANs.

Watch Out

Autoregressive does not mean recurrent

Modern AR models (GPT, LLaMA) are transformer-based, not RNN-based. The autoregressive property refers to the factorization of the distribution (each token conditioned on previous tokens), not the architecture. The transformer processes all previous tokens in parallel during each step via causal self-attention.
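The causal mask is what turns a parallel transformer into an autoregressive model: position $t$ may attend only to positions $\le t$. A minimal sketch:

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular boolean mask: entry (t, s) is True iff
    position t is allowed to attend to position s, i.e. s <= t.
    Applying it to attention scores enforces the AR factorization."""
    return np.tril(np.ones((T, T), dtype=bool))

m = causal_mask(4)
# row t is True only at columns 0..t
```

During training, this single mask lets one forward pass compute all $T$ conditionals $p(x_t \mid x_{<t})$ at once; only generation is forced to be sequential.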

Watch Out

More denoising steps does not always mean better quality

Beyond a threshold (typically 50 to 100 steps with a good solver), additional denoising steps give diminishing returns: the ODE/SDE formulation of diffusion shows that the sampling trajectory is determined by the score function, and more steps merely approximate that trajectory more finely. Modern solvers like DPM-Solver++ reach this regime with far fewer steps.