Methodology
Synthetic Data Generation
Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).
Why This Matters
Real training data is expensive, private, or scarce. Synthetic data generation uses existing models to produce new training examples. This approach has become central to modern ML: GPT-4 was partially trained on data involving model outputs, Llama models use self-instruct pipelines, and diffusion models generate training images for downstream classifiers.
But synthetic data is not a free lunch. Training on model-generated data can degrade model quality in predictable ways. Understanding when synthetic data helps and when it hurts requires understanding the distributional relationship between real and generated data.
Core Definitions
Synthetic Data
Synthetic data is training data generated by a model rather than collected from the real world. Let $p$ be the true data distribution and $q_\theta$ be a generative model trained to approximate it. Synthetic data consists of samples $x \sim q_\theta$. The quality of synthetic data depends on how well $q_\theta$ approximates $p$.
Self-Instruct Pipeline
A method for generating instruction-following training data using an LLM. Starting from a small seed set of human-written instructions, the LLM generates new instructions, inputs, and outputs. These are filtered (by heuristics or a separate model) and added to the training set. The process can iterate.
Text: LLM-Generated Instruction Data
The self-instruct method (Wang et al., 2023) and Alpaca (Taori et al., 2023) demonstrated that a strong LLM can generate instruction-following data that, when used for fine-tuning, transfers the instruction-following capability to weaker models.
The pipeline:
- Start with a seed set of human-written (instruction, input, output) triples
- Prompt the teacher LLM to generate new instructions similar to the seeds
- For each new instruction, generate input-output pairs
- Filter for quality, diversity, and correctness
- Fine-tune the student model on the combined dataset
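The loop above can be sketched in a few lines. This is a toy illustration, not the original pipeline: `teacher_generate` is a hypothetical stub standing in for a teacher-LLM API call, and the quality filter is a deliberately crude heuristic.

```python
import random

def teacher_generate(prompt, pool):
    # Stub standing in for a teacher-LLM API call (hypothetical).
    # A real pipeline would prompt the model with a few pool examples.
    base = random.choice(pool)
    return f"{base} (variant {random.randint(0, 999)})"

def self_instruct(seeds, n_rounds=3, per_round=4):
    """Grow an instruction pool from human-written seed instructions."""
    pool = list(seeds)
    for _ in range(n_rounds):
        candidates = [teacher_generate("Write a new instruction.", pool)
                      for _ in range(per_round)]
        # Heuristic filter: drop exact duplicates and very short outputs.
        for c in candidates:
            if c not in pool and len(c.split()) >= 3:
                pool.append(c)
    return pool

seeds = ["Summarize the following article.",
         "Translate this sentence into French."]
dataset = self_instruct(seeds)
print(len(dataset) >= len(seeds))  # True: the pool only ever grows
```

A real system would replace the duplicate check with ROUGE-based or embedding-based similarity filtering, and would also generate inputs and outputs for each new instruction.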
Key empirical finding: the quality of the teacher model bounds the quality of the student. A student fine-tuned on GPT-4 outputs outperforms one fine-tuned on GPT-3.5 outputs. The ceiling is the teacher's capability.
Images: Diffusion-Based Augmentation
Diffusion models generate photorealistic images conditioned on text prompts. For training data augmentation:
- Generate additional examples for rare classes (e.g., rare medical conditions)
- Create variations of existing images with controlled modifications
- Produce images for classes where real data collection is impractical
He et al. (2023) showed that classifiers trained on a mix of real and diffusion-generated images outperform those trained on real data alone, particularly when real data is scarce (fewer than 100 examples per class).
Code: Model-Generated Programming Problems
LLMs generate programming problems with solutions, test cases, and explanations. This is used to train code models (Code Llama, StarCoder). The generation process can include automated verification: run the generated code against the generated tests, and keep only examples that pass.
Verification is a key advantage of code synthesis: correctness is partially checkable, unlike natural language where quality is subjective.
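A minimal version of this keep-if-tests-pass filter can be sketched as follows. The candidate solutions and tests are toy stand-ins for model generations; note that a production pipeline would execute untrusted generated code in a sandbox, not with a bare `exec`.

```python
def verify_example(solution_src, test_src):
    """Run a generated solution against its generated tests in a
    throwaway namespace; keep the pair only if every test passes."""
    ns = {}
    try:
        exec(solution_src, ns)   # define the candidate function
        exec(test_src, ns)       # asserts raise on any failure
    except Exception:
        return False
    return True

candidates = [
    # (generated solution, generated tests) -- toy stand-ins
    ("def add(a, b):\n    return a + b",
     "assert add(2, 3) == 5\nassert add(-1, 1) == 0"),
    ("def add(a, b):\n    return a - b",   # buggy generation
     "assert add(2, 3) == 5"),
]
kept = [pair for pair in candidates if verify_example(*pair)]
print(len(kept))  # 1: only the correct solution survives
```

Note that this checks the solution against *generated* tests, so a wrong test suite can still admit wrong code; verification reduces, but does not eliminate, label noise.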
When Synthetic Data Helps
Low-resource domains. When real data has fewer than 1000 examples, synthetic augmentation can substantially improve performance. The synthetic data provides distributional coverage that the small real dataset lacks.
Privacy preservation. Instead of sharing real patient records across hospitals, generate synthetic records that preserve statistical properties without revealing individual data points. Differential privacy guarantees can be applied to the generation process.
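As a toy illustration of attaching a differential-privacy guarantee to generation (not a production mechanism), one can privatize a summary statistic with the Laplace mechanism and sample synthetic records from it. All names and parameters here are hypothetical, and only the mean is privatized in this sketch.

```python
import random

def dp_synthetic_gaussian(values, epsilon, lo, hi, m):
    """Toy DP synthesis: clip values to [lo, hi], privatize the mean
    with the Laplace mechanism (sensitivity (hi - lo)/n), then sample
    m synthetic values. A full pipeline would privatize the variance
    too; here it is released as-is for simplicity."""
    n = len(values)
    clipped = [min(max(v, lo), hi) for v in values]
    mean = sum(clipped) / n
    b = (hi - lo) / (n * epsilon)  # Laplace noise scale for the mean
    # Laplace(0, b) sampled as the difference of two Exponential(1/b) draws.
    noise = random.expovariate(1 / b) - random.expovariate(1 / b)
    sd = (sum((v - mean) ** 2 for v in clipped) / n) ** 0.5
    return [random.gauss(mean + noise, sd) for _ in range(m)]

records = [1.2, 0.8, 1.1, 0.9] * 10   # toy "patient" measurements
synthetic = dp_synthetic_gaussian(records, epsilon=1.0, lo=0.0, hi=2.0, m=25)
print(len(synthetic))  # 25 synthetic records, none copied from the originals
```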
Diversity expansion. A small seed dataset may lack coverage of edge cases. Targeted generation can fill gaps: generate examples for underrepresented subgroups, rare events, or boundary conditions.
When Synthetic Data Hurts: Model Collapse
Model Collapse from Recursive Training
Statement
Consider a sequence of generative models $q_0, q_1, q_2, \ldots$, where $q_0$ is trained on real data from $p$ and each subsequent model $q_{t+1}$ is trained on samples drawn from $q_t$. As $t \to \infty$, the distribution $q_t$ converges to a distribution with lower variance than $p$. Specifically, for Gaussian data with true distribution $\mathcal{N}(\mu, \sigma^2)$ and $n$ samples per generation, the expected variance after $t$ generations is
$$\mathbb{E}[\sigma_t^2] = \left(1 - \frac{1}{n}\right)^{t} \sigma^2.$$
The tails of the distribution are progressively lost, and the distribution collapses toward a point mass.
Intuition
Each generation of training introduces estimation error. A finite sample from $q_t$ underrepresents the tails of the distribution. The next model, trained on these samples, learns a distribution with thinner tails. Over iterations, the tails vanish entirely: the distribution "forgets" its variance.
Proof Sketch
For the Gaussian case: $q_0$ is fit to $n$ real samples, giving an estimated (maximum-likelihood) variance with expected value $(1 - 1/n)\sigma^2$. Each subsequent generation applies the same contraction. After $t$ generations, the expected variance is $(1 - 1/n)^t \sigma^2$, which converges to 0 as $t \to \infty$. For non-Gaussian distributions, the argument extends via the finite capacity of the model, which cannot represent the full complexity of the distribution from finite samples.
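The contraction is easy to observe empirically. The sketch below refits a Gaussian by maximum likelihood to samples drawn from the previous fit and tracks the fitted variance; the parameter values are illustrative.

```python
import random

def recursive_gaussian(n=50, generations=500, sigma2=1.0):
    """Refit a Gaussian by MLE to samples drawn from the previous
    fit, returning the fitted variance at every generation."""
    mu, var = 0.0, sigma2
    history = [var]
    for _ in range(generations):
        xs = [random.gauss(mu, var ** 0.5) for _ in range(n)]
        mu = sum(xs) / n
        # MLE variance: expected value (1 - 1/n) * var, the contraction.
        var = sum((x - mu) ** 2 for x in xs) / n
        history.append(var)
    return history

random.seed(0)
hist = recursive_gaussian()
print(hist[0], hist[-1])  # the fitted variance shrinks toward 0
```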
Why It Matters
Model collapse is the central risk of synthetic data. If the internet increasingly contains LLM-generated text, and LLMs are trained on internet data, this recursive loop will degrade model quality over time. Shumailov et al. (2023) call this "the curse of recursion."
Failure Mode
Model collapse is avoidable if real data is mixed into each generation. Even a small fraction of real data (10-20%) prevents the variance from collapsing. The theorem assumes pure recursive training with no real data, which is the worst case.
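A back-of-envelope version of this claim, assuming fresh real samples are mixed into every round, follows from iterating the expected-variance recursion from the proof sketch:

```python
def expected_variance(real_frac, n=50, generations=200, sigma2=1.0):
    """Iterate the expected-variance recursion when each generation
    trains on a fraction real_frac of real data (variance sigma2) and
    the rest from the previous model:
        E[v_{t+1}] = (1 - 1/n) * (real_frac * sigma2 + (1 - real_frac) * E[v_t])
    This is a simplified model, not a general theorem."""
    v = sigma2
    for _ in range(generations):
        v = (1 - 1 / n) * (real_frac * sigma2 + (1 - real_frac) * v)
    return v

pure = expected_variance(real_frac=0.0)    # pure recursive training
mixed = expected_variance(real_frac=0.2)   # 20% real data each round
print(pure, mixed)  # pure collapses toward 0; mixed stabilizes near 0.91
```

With no real data the variance decays geometrically, while a 20% real fraction pins it to a nonzero fixed point, matching the claim that even a small real-data admixture prevents collapse.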
Decontamination
When using synthetic data for training, benchmark contamination is a risk: the generating model may produce examples that overlap with evaluation benchmarks. If synthetic training data contains test set answers, reported performance is inflated.
Decontamination procedures:
- N-gram overlap filtering: remove synthetic examples with high n-gram overlap with known benchmarks
- Embedding similarity filtering: remove examples whose embeddings are close to benchmark items
- Provenance tracking: record which model generated each example and what prompt was used
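The first of these procedures, exact n-gram overlap filtering, can be sketched as follows; the n-gram length, whitespace tokenization, and toy benchmark are illustrative assumptions.

```python
def ngrams(text, n=8):
    """Set of word n-grams from whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(synthetic, benchmark, n=8):
    """Drop synthetic examples sharing any word n-gram with a known
    benchmark (exact overlap only; real pipelines add embedding
    similarity to catch paraphrases)."""
    bench_grams = set()
    for item in benchmark:
        bench_grams |= ngrams(item, n)
    return [ex for ex in synthetic if not (ngrams(ex, n) & bench_grams)]

benchmark = ["What is the capital of France? Answer: Paris."]
synthetic = [
    "Q: What is the capital of France? Answer: Paris.",  # leaked copy
    "Q: Name three prime numbers. A: 2, 3, 5.",
]
clean = decontaminate(synthetic, benchmark, n=5)
print(len(clean))  # 1: the leaked example is removed
```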
Common Confusions
Synthetic data is not the same as data augmentation
Data augmentation applies label-preserving transformations to real data (rotation, cropping, noise). Synthetic data generation creates entirely new examples, potentially for new classes or scenarios not present in the real data. Augmentation preserves the real distribution. Generation may deviate from it.
Model collapse requires recursive training, not one-time generation
Generating synthetic data once from a model trained on real data does not cause model collapse. The problem arises when model outputs are used to train the next model, whose outputs train the next, and so on. A single generation step introduces bounded error. Recursive application amplifies it.
Canonical Examples
Alpaca: self-instruct at scale
Stanford Alpaca fine-tuned Llama-7B on 52k instruction-following examples generated by GPT-3.5 (text-davinci-003). The cost of generating the training data was under $500. The resulting model showed instruction-following ability comparable to GPT-3.5 on simple tasks, demonstrating that synthetic data can cheaply transfer capabilities between models.
Summary
- Synthetic data is generated by models, not collected from the real world
- Self-instruct pipelines use a teacher LLM to generate training data for student models
- Synthetic data is most valuable in low-resource, privacy-sensitive, or diversity-limited settings
- Model collapse occurs when models recursively train on their own outputs
- Mixing even a small fraction of real data prevents collapse
- Decontamination is necessary to avoid benchmark inflation
Exercises
Problem
You have 50 labeled examples of a rare disease in medical images and access to a diffusion model. Describe a pipeline for generating synthetic training data and identify the main risk.
Problem
The model collapse theorem shows that variance contracts by a factor of $(1 - 1/n)$ per generation for Gaussian data. If the original variance is $\sigma^2$ and you use $n$ samples per generation, after how many generations does the expected variance drop below $\sigma^2/2$? Below $\sigma^2/10$?
References
Canonical:
- Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions" (2023)
- Taori et al., "Stanford Alpaca" (2023)
Current:
- Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2023)
- He et al., "Is Synthetic Data from Generative Models Ready for Image Recognition?" (2023)
- Alemohammad et al., "Self-Consuming Generative Models Go MAD" (2023)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
Next Topics
- Data augmentation theory: formal framework for augmentation as distribution transformation
- Data contamination and evaluation: detecting and preventing benchmark contamination
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)