AI Safety
Model Collapse and Data Quality
When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. This topic explains why synthetic-data feedback loops threaten pretraining data quality and how to mitigate them.
Why This Matters
As LLMs produce an increasing fraction of text on the internet, future pretraining datasets will inevitably contain AI-generated content. If models are trained on the outputs of previous models, the resulting distribution shifts in predictable and harmful ways: variance decreases, tails disappear, minority modes vanish. This is model collapse.
This is not a hypothetical concern. Web crawls from 2024 onward contain substantial AI-generated text. Any pretraining pipeline that ingests web data without careful filtering is at risk.
Mental Model
Imagine a game of telephone where each participant is a language model. The first model learns from real human data. The second model learns from the first model's outputs. The third learns from the second's outputs. At each step, the learned distribution becomes a noisy approximation of the previous one. Rare events (unusual phrasing, minority viewpoints, tail-distribution examples) get progressively smoothed out because the approximation concentrates around the mode.
Formal Setup
Let $p_0$ be the true data distribution (human-generated text). A model trained on samples from $p_0$ learns an approximation $p_1$. Model 2 is trained on samples from $p_1$, learning $p_2$. In general, model $n$ is trained on samples from $p_{n-1}$, learning $p_n$.
Model Collapse
Model collapse is the progressive degradation of the learned distribution as $n$ increases in an iterative training loop where each generation of model trains on the outputs of the previous generation. The distribution narrows, variance decreases, and support shrinks.
Iterative Retraining
In iterative retraining, generation $n$ of a model is trained on data sampled from generation $n-1$:

$$p_n = \mathcal{F}(x_1, \dots, x_N), \qquad x_i \sim p_{n-1} \ \text{i.i.d.},$$

where $\mathcal{F}$ denotes the fitting procedure (e.g., MLE) applied to the $N$ samples.
This models the scenario where AI-generated text progressively replaces human-generated text in training corpora.
Main Theorems
Variance Decay Under Iterative Retraining
Statement
Suppose $p_0 = \mathcal{N}(\mu_0, \sigma_0^2)$ and each generation fits a Gaussian via MLE on $N$ samples from the previous generation. After $n$ generations, the expected variance of $p_n$ satisfies:

$$\mathbb{E}[\sigma_n^2] = \sigma_0^2 \left(1 - \frac{1}{N}\right)^n$$

As $n \to \infty$, $\mathbb{E}[\sigma_n^2] \to 0$. The distribution collapses to a point mass.
Intuition
Each generation estimates the variance from a finite sample, which systematically underestimates the true variance (by the factor $1 - \frac{1}{N}$ from the MLE bias). This bias compounds across generations. With infinite samples ($N \to \infty$), each generation would recover the previous distribution exactly and no collapse would occur. Finite sampling is the root cause.
Proof Sketch
The MLE variance estimator $\hat{\sigma}^2 = \frac{1}{N}\sum_{i}(x_i - \bar{x})^2$ from $N$ samples has expectation $\left(1 - \frac{1}{N}\right)\sigma^2$. At generation $n$, the true variance being estimated is $\sigma_{n-1}^2$, so in expectation $\mathbb{E}[\sigma_n^2] = \left(1 - \frac{1}{N}\right)\mathbb{E}[\sigma_{n-1}^2]$. Iterating gives the geometric decay $\mathbb{E}[\sigma_n^2] = \sigma_0^2\left(1 - \frac{1}{N}\right)^n$.
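The geometric decay can be checked with a small Monte Carlo simulation. The sketch below (function name and parameters are illustrative) refits a Gaussian by MLE on samples from the previous generation's fit, then averages the final variance over independent chains:

```python
import random
import statistics

def iterate_gaussian(sigma0, N, generations, seed=0):
    """One chain: refit a Gaussian by MLE on N samples from the previous fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, sigma0
    for _ in range(generations):
        xs = [rng.gauss(mu, sigma) for _ in range(N)]
        mu = statistics.fmean(xs)
        # MLE variance (divide by N, not N - 1): biased low by (1 - 1/N).
        sigma = (sum((x - mu) ** 2 for x in xs) / N) ** 0.5
    return sigma ** 2

N, n = 100, 50
# Average the final variance over independent chains to estimate E[sigma_n^2].
empirical = statistics.fmean(iterate_gaussian(1.0, N, n, seed=s) for s in range(200))
theory = (1 - 1 / N) ** n  # sigma_0^2 (1 - 1/N)^n with sigma_0^2 = 1
print(f"empirical {empirical:.3f}  vs  theory {theory:.3f}")
```

With $N = 100$ and $n = 50$, both numbers land near $0.99^{50} \approx 0.605$: half the variance is gone after fifty generations even though each individual fit looks almost unbiased.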
Why It Matters
This is the simplest demonstration that iterative retraining on synthetic data causes systematic quality loss. The Gaussian case is exactly solvable, but the underlying phenomenon (finite-sample estimation errors compounding across generations) applies far more broadly. Shumailov et al. (2024) confirm the same pattern empirically in language models and diffusion models.
Failure Mode
The Gaussian analysis understates the problem. In higher dimensions and with more complex distributions, the modes of the distribution can disappear entirely (not just shrink), because the model fails to generate enough samples from rare modes for the next generation to learn them.
Tails Vanish Under Iterative Retraining
Statement
Let $p_0$ be a mixture of Gaussians with weights $w_1, \dots, w_K$. Under iterative retraining with $N$ samples per generation, a component with weight $w_k$ is represented in the sample with probability $1 - (1 - w_k)^N$. After $n$ generations, the probability that a minority component (small $w_k$) survives in the training data decays exponentially in $n$. Components with $w_k \lesssim 1/N$ vanish within a few generations.
Intuition
If a mixture component has weight 1%, and you draw 100 samples, you expect only 1 sample from that component. In the next generation, the model trained on those samples may assign even less weight to that component. Within a few generations, the component produces zero samples and is permanently lost from the data distribution.
Why It Matters
Tail disappearance means that rare but valid text (minority dialects, specialized technical content, unusual creative writing) is systematically removed from the training distribution across generations. The resulting models produce more homogeneous, more "average" text with less diversity.
Failure Mode
This analysis assumes independent sampling at each generation. In practice, deduplication and filtering steps may accelerate tail loss. Conversely, targeted oversampling of rare content can slow it.
Empirical Evidence
Shumailov et al. (2024) demonstrated model collapse empirically across multiple architectures:
- Language models (OPT-125M): After 5 generations of iterative retraining, perplexity increased and text diversity (measured by self-BLEU) decreased. The models produced increasingly repetitive output.
- Variational autoencoders: On MNIST, iterative retraining caused the model to lose minority digit classes within 5-10 generations.
- Gaussian mixtures: The number of recovered modes decreased monotonically with generation count.
The pattern is consistent: iterative retraining degrades quality, reduces diversity, and eliminates tails.
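Shumailov et al. use perplexity and self-BLEU; a lighter-weight proxy in the same spirit is a distinct-n ratio, the fraction of n-grams in a set of samples that are unique. The sketch below is illustrative (the metric choice and function name are not from the paper); lower values mean more repetitive output, the signature of collapse:

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams / total n-grams across a set of samples.
    A simple diversity proxy; lower values indicate more repetitive text."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

diverse = ["the cat sat on the mat", "a dog ran in the park"]
repetitive = ["the cat sat on the mat", "the cat sat on the mat"]
print(distinct_n(diverse), distinct_n(repetitive))  # → 1.0 0.5
```

Tracking such a metric across generations of synthetic data gives an early warning that the distribution is narrowing before downstream quality visibly degrades.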
Mitigation Strategies
Data provenance tracking. Label data as human-generated or AI-generated at the point of creation. Maintain metadata throughout the data pipeline. Prioritize human-generated data in pretraining mixtures.
Decontamination. Use classifiers (GPTZero, DetectGPT, watermark detectors) to identify and remove AI-generated content from training corpora. This is imperfect because detection accuracy degrades as models improve.
Maintaining human data sources. Preserve access to pre-LLM web crawls (Common Crawl snapshots from before 2022). Curate high-quality human-written datasets (books, peer-reviewed papers, pre-LLM Wikipedia). Weight these sources more heavily in the training mixture.
Mixing real and synthetic data. Rather than training purely on synthetic data, maintain a minimum fraction of real data in every training batch. Shumailov et al. show that even 10% real data substantially slows collapse.
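The effect of mixing is easy to see in the Gaussian toy model: anchoring each generation's training set with a fixed fraction of fresh real samples turns the geometric decay into convergence to a nonzero fixed point. A sketch under the same assumptions as the variance-decay theorem (function and parameter names are illustrative):

```python
import random
import statistics

def mixed_retraining(sigma0, N, generations, real_frac, seed=0):
    """Gaussian iterative retraining where each generation's N training
    samples mix fresh real data (from p0) with synthetic data (from the
    previous generation's fit)."""
    rng = random.Random(seed)
    mu, sigma = 0.0, sigma0
    n_real = int(real_frac * N)
    for _ in range(generations):
        xs = [rng.gauss(0.0, sigma0) for _ in range(n_real)]      # real data
        xs += [rng.gauss(mu, sigma) for _ in range(N - n_real)]   # synthetic
        mu = statistics.fmean(xs)
        sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return sigma ** 2

def avg_final_variance(real_frac, trials=200):
    return statistics.fmean(
        mixed_retraining(1.0, 100, 50, real_frac, seed=s) for s in range(trials))

pure = avg_final_variance(0.0)
mixed = avg_final_variance(0.1)
print(f"variance after 50 generations: pure synthetic {pure:.3f}, 10% real {mixed:.3f}")
```

With 10% real data the expected variance settles near a fixed point close to the original variance instead of decaying toward zero, illustrating why even a small anchor of human data matters.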
Data diversity enforcement. During synthetic data generation, use temperature scaling, nucleus sampling, or explicit diversity objectives to ensure the generated data covers the full distribution, not just the mode.
Common Confusions
Model collapse is not catastrophic forgetting
Catastrophic forgetting occurs when a model trained on task A loses performance on task A after fine-tuning on task B. Model collapse is a different phenomenon: the model's training data distribution narrows across generations, even when the task stays the same. The cause is iterative retraining on synthetic data, not task switching.
Model collapse does not require the same model
The collapse occurs even when different architectures are used across generations. What matters is that each generation trains on the previous generation's outputs. The distribution narrowing is a property of the data pipeline, not any specific model architecture.
Small amounts of synthetic data do not cause collapse
Using synthetic data to augment a primarily human-generated training set is different from iterative retraining. A single generation of synthetic data mixed with real data does not produce collapse. The problem requires multiple generations where each generation's output becomes the next generation's input.
Summary
- Iterative retraining on synthetic data causes variance decay at rate $(1 - 1/N)^n$
- Tails and minority modes vanish within a few generations
- The root cause is finite-sample estimation error compounding across generations
- Mitigation: data provenance, decontamination, preserving human data sources
- Even 10% real data in the training mix substantially slows collapse
- This is a systemic risk as AI-generated text saturates the web
Exercises
Problem
A Gaussian distribution with initial variance $\sigma_0^2$ undergoes iterative retraining with $N$ samples per generation. After $n$ generations, what is the expected variance? After how many generations does the expected variance drop below 50% of its initial value?
Problem
A training corpus is a mixture of 95% AI-generated text and 5% human-generated text. You train a model on this corpus, then use that model to generate the AI portion of the next corpus (keeping the 5% human data fixed). Model the AI-generated text as drawn from the model's learned distribution. After 10 generations, qualitatively describe what happens to the overall distribution. Does the 5% human data prevent collapse?
References
Canonical:
- Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2024, Nature)
Current:
- Alemohammad et al., "Self-Consuming Generative Models Go MAD" (2023, ICML)
- Dohmatob et al., "A Tale of Tails: Model Collapse as a Change of Scaling Laws" (2024)
- Briesch et al., "Large Language Models Suffer From Their Own Output" (2023)
Next Topics
The natural next steps from model collapse:
- Data contamination and evaluation: detecting AI-generated content in benchmarks
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Synthetic Data Generation (Layer 3)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)