
AI Safety

Model Collapse and Data Quality

When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.

Advanced · Tier 2 · Frontier · ~45 min

Why This Matters

As LLMs produce an increasing fraction of text on the internet, future pretraining datasets will inevitably contain AI-generated content. If models are trained on the outputs of previous models, the resulting distribution shifts in predictable and harmful ways: variance decreases, tails disappear, minority modes vanish. This is model collapse.

This is not a hypothetical concern. Web crawls from 2024 onward contain substantial AI-generated text. Any pretraining pipeline that ingests web data without careful filtering is at risk.

Mental Model

Imagine a game of telephone where each participant is a language model. The first model learns from real human data. The second model learns from the first model's outputs. The third learns from the second's outputs. At each step, the learned distribution becomes a noisy approximation of the previous one. Rare events (unusual phrasing, minority viewpoints, tail-distribution examples) get progressively smoothed out because the approximation concentrates around the mode.

Formal Setup

Let $p_0$ be the true data distribution (human-generated text). A model $M_0$ trained on samples from $p_0$ learns an approximation $\hat{p}_0$. Model $M_1$ is trained on samples from $\hat{p}_0$, learning $\hat{p}_1$. In general, $M_k$ is trained on samples from $\hat{p}_{k-1}$.

Definition

Model Collapse

Model collapse is the progressive degradation of the learned distribution $\hat{p}_k$ as $k$ increases in an iterative training loop where each generation of model trains on the outputs of the previous generation. The distribution narrows, variance decreases, and support shrinks.

Definition

Iterative Retraining

In iterative retraining, generation $k$ of a model is trained on data sampled from generation $k-1$:

$$\hat{p}_k = \text{Train}\left(\{x_i\}_{i=1}^{n_k}\right), \quad x_i \sim \hat{p}_{k-1}$$

This models the scenario where AI-generated text progressively replaces human-generated text in training corpora.
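
The iterative retraining loop can be sketched as a toy simulation (a minimal sketch under the Gaussian assumptions analyzed in the next section; the sample size, generation count, and seed are illustrative):

```python
# Toy iterative retraining: each generation fits a Gaussian by MLE
# to n samples drawn from the previous generation's fit.
import numpy as np

rng = np.random.default_rng(0)

def iterative_retraining(mu=0.0, sigma2=1.0, n=100, generations=20):
    """Return the fitted variance after each generation."""
    variances = [sigma2]
    for _ in range(generations):
        samples = rng.normal(mu, np.sqrt(sigma2), size=n)
        mu = samples.mean()            # MLE mean
        sigma2 = samples.var(ddof=0)   # MLE variance (biased low by (n-1)/n)
        variances.append(sigma2)
    return variances

vars_ = iterative_retraining()
print(f"variance: gen 0 = {vars_[0]:.3f}, gen 20 = {vars_[-1]:.3f}")
```

On average the fitted variance shrinks by a factor of $(n-1)/n$ per generation, though any single run is noisy.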

Main Theorems

Theorem

Variance Decay Under Iterative Retraining

Statement

Suppose $p_0 = \mathcal{N}(\mu, \sigma^2)$ and each generation fits a Gaussian via MLE on $n$ samples from the previous generation. After $k$ generations, the expected variance of $\hat{p}_k$ satisfies:

$$\mathbb{E}[\hat{\sigma}_k^2] = \sigma^2 \cdot \left(\frac{n-1}{n}\right)^k$$

As $k \to \infty$, $\hat{\sigma}_k^2 \to 0$. The distribution collapses to a point mass.

Intuition

Each generation estimates the variance from a finite sample, which systematically underestimates the true variance (by the factor $(n-1)/n$ from the MLE bias). This bias compounds across generations. With infinite samples ($n \to \infty$), each generation would recover the previous distribution exactly and no collapse would occur. Finite sampling is the root cause.

Proof Sketch

The MLE variance estimator from $n$ samples has expectation $\frac{n-1}{n} \sigma_{\text{true}}^2$. At generation $k$, the true variance being estimated is $\hat{\sigma}_{k-1}^2$, so in expectation $\hat{\sigma}_k^2 = \frac{n-1}{n} \hat{\sigma}_{k-1}^2$. Iterating gives the geometric decay $\left(\frac{n-1}{n}\right)^k$.
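
The bias factor in the proof sketch can be checked numerically (a quick Monte Carlo check; the values of $n$, $\sigma^2$, and the trial count are illustrative):

```python
# The mean MLE variance estimate should be close to (n-1)/n * sigma^2.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, trials = 10, 4.0, 200_000

# One MLE variance estimate per row of a (trials, n) sample matrix.
estimates = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n)).var(axis=1, ddof=0)
print(f"mean MLE estimate: {estimates.mean():.3f}")
print(f"(n-1)/n * sigma^2: {(n - 1) / n * sigma2:.3f}")
```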

Why It Matters

This is the simplest demonstration that iterative retraining on synthetic data causes systematic quality loss. The Gaussian case is exactly solvable, but the phenomenon (finite-sample estimation errors compounding across generations) applies far more broadly. Shumailov et al. (2024) confirm the same pattern empirically in language models and diffusion models.

Failure Mode

The Gaussian analysis understates the problem. In higher dimensions and with more complex distributions, the modes of the distribution can disappear entirely (not just shrink), because the model fails to generate enough samples from rare modes for the next generation to learn them.

Proposition

Tails Vanish Under Iterative Retraining

Statement

Let $p_0$ be a mixture of $m$ Gaussians with weights $w_1, \ldots, w_m$. Under iterative retraining with $n$ samples per generation, a component with weight $w_j$ is represented in the sample with probability $1 - (1 - w_j)^n$. After $k$ generations, the probability that a minority component (small $w_j$) survives in the training data decays exponentially in $k$. Components with $w_j \ll 1/n$ vanish within a few generations.

Intuition

If a mixture component has weight 1%, and you draw 100 samples, you expect only 1 sample from that component. In the next generation, the model trained on those samples may assign even less weight to that component. Within a few generations, the component produces zero samples and is permanently lost from the data distribution.
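
These extinction dynamics can be illustrated with a small simulation (a sketch that simplifies weight re-estimation to the empirical sample fraction; the component weight and sample size are illustrative):

```python
# A 1% mixture component under iterative retraining: once it draws
# zero samples in some generation, it is lost for good.
import numpy as np

rng = np.random.default_rng(2)
w, n = 0.01, 100

print(f"P(component appears at least once in n={n} samples): {1 - (1 - w)**n:.3f}")

def generations_until_lost(w, n, max_gens=50):
    """Re-estimate the component weight each generation as the sample fraction."""
    for k in range(1, max_gens + 1):
        w = rng.binomial(n, w) / n   # next generation's learned weight
        if w == 0.0:
            return k                 # zero samples drawn: component is extinct
    return None

losses = [generations_until_lost(0.01, 100) for _ in range(1000)]
died = sum(k is not None for k in losses)
print(f"runs where the 1% component died within 50 generations: {died / 1000:.0%}")
```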

Why It Matters

Tail disappearance means that rare but valid text (minority dialects, specialized technical content, unusual creative writing) is systematically removed from the training distribution across generations. The resulting models produce more homogeneous, more "average" text with less diversity.

Failure Mode

This analysis assumes independent sampling at each generation. In practice, deduplication and filtering steps may accelerate tail loss. Conversely, targeted oversampling of rare content can slow it.

Empirical Evidence

Shumailov et al. (2024) demonstrated model collapse empirically across multiple architectures:

  • Language models (OPT-125M): After 5 generations of iterative retraining, perplexity increased and text diversity (measured by self-BLEU) decreased. The models produced increasingly repetitive output.
  • Variational autoencoders: On MNIST, iterative retraining caused the model to lose minority digit classes within 5-10 generations.
  • Gaussian mixtures: The number of recovered modes decreased monotonically with generation count.

The pattern is consistent: iterative retraining degrades quality, reduces diversity, and eliminates tails.

Mitigation Strategies

Data provenance tracking. Label data as human-generated or AI-generated at the point of creation. Maintain metadata throughout the data pipeline. Prioritize human-generated data in pretraining mixtures.

Decontamination. Use classifiers (GPTZero, DetectGPT, watermark detectors) to identify and remove AI-generated content from training corpora. This is imperfect because detection accuracy degrades as models improve.

Maintaining human data sources. Preserve access to pre-LLM web crawls (Common Crawl snapshots from before 2022). Curate high-quality human-written datasets (books, peer-reviewed papers, pre-LLM Wikipedia). Weight these sources more heavily in the training mixture.

Mixing real and synthetic data. Rather than training purely on synthetic data, maintain a minimum fraction $\alpha$ of real data in every training batch. Shumailov et al. show that even $\alpha = 10\%$ real data substantially slows collapse.
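
This mitigation can be sketched in the same Gaussian toy model (the $\alpha$ values, sample size, and generation count below are illustrative choices, not the experimental setup of Shumailov et al.):

```python
# Keep a fraction alpha of fresh real data (from p_0) in every generation's
# training set; the rest is synthetic data sampled from the current model.
import numpy as np

rng = np.random.default_rng(3)

def retrain_with_anchor(alpha, n=200, generations=50):
    mu, sigma2 = 0.0, 1.0            # model starts at the true p_0 = N(0, 1)
    for _ in range(generations):
        n_real = int(alpha * n)
        real = rng.normal(0.0, 1.0, size=n_real)                      # from p_0
        synthetic = rng.normal(mu, np.sqrt(sigma2), size=n - n_real)  # from model
        data = np.concatenate([real, synthetic])
        mu, sigma2 = data.mean(), data.var(ddof=0)
    return sigma2

for alpha in (0.0, 0.1, 0.5):
    print(f"alpha={alpha:.1f}: variance after 50 generations = {retrain_with_anchor(alpha):.3f}")
```

The real-data anchor tends to hold the fitted variance near its true value, while the pure-synthetic run decays geometrically.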

Data diversity enforcement. During synthetic data generation, use temperature scaling, nucleus sampling, or explicit diversity objectives to ensure the generated data covers the full distribution, not just the mode.
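
One standard knob is temperature scaling, which flattens the next-token distribution and shifts probability mass toward tail tokens (a minimal sketch; the logits are made up for illustration):

```python
# Softmax with temperature T: T > 1 flattens the distribution,
# T < 1 sharpens it toward the mode.
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.0, 0.0, -2.0]         # hypothetical next-token logits
for T in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, T)
    entropy = -(p * np.log(p)).sum()
    print(f"T={T}: probs = {np.round(p, 3)}, entropy = {entropy:.3f}")
```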

Common Confusions

Watch Out

Model collapse is not catastrophic forgetting

Catastrophic forgetting occurs when a model trained on task A loses performance on task A after fine-tuning on task B. Model collapse is a different phenomenon: the model's training data distribution narrows across generations, even when the task stays the same. The cause is iterative retraining on synthetic data, not task switching.

Watch Out

Model collapse does not require the same model

The collapse occurs even when different architectures are used across generations. What matters is that each generation trains on the previous generation's outputs. The distribution narrowing is a property of the data pipeline, not any specific model architecture.

Watch Out

Small amounts of synthetic data do not cause collapse

Using synthetic data to augment a primarily human-generated training set is different from iterative retraining. A single generation of synthetic data mixed with real data does not produce collapse. The problem requires multiple generations where each generation's output becomes the next generation's input.

Summary

  • Iterative retraining on synthetic data causes variance decay at rate $(1 - 1/n)^k$
  • Tails and minority modes vanish within a few generations
  • The root cause is finite-sample estimation error compounding across generations
  • Mitigation: data provenance, decontamination, preserving human data sources
  • Even 10% real data in the training mix substantially slows collapse
  • This is a systemic risk as AI-generated text saturates the web

Exercises

ExerciseCore

Problem

A Gaussian distribution $\mathcal{N}(0, 100)$ undergoes iterative retraining with $n = 1000$ samples per generation. After $k = 100$ generations, what is the expected variance? After how many generations does the expected variance drop below 50?

ExerciseAdvanced

Problem

A training corpus is a mixture of 95% AI-generated text and 5% human-generated text. You train a model on this corpus, then use that model to generate the AI portion of the next corpus (keeping the 5% human data fixed). Model the AI-generated text as drawn from the model's learned distribution. After 10 generations, qualitatively describe what happens to the overall distribution. Does the 5% human data prevent collapse?

References

Canonical:

  • Shumailov et al., "AI Models Collapse When Trained on Recursively Generated Data" (2024, Nature); earlier arXiv version: "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2023)

Current:

  • Alemohammad et al., "Self-Consuming Generative Models Go MAD" (2024, ICLR)
  • Dohmatob et al., "A Tale of Tails: Model Collapse as a Change of Scaling Laws" (2024)
  • Briesch et al., "Large Language Models Suffer From Their Own Output" (2023)


Last reviewed: April 2026
