
Methodology

Synthetic Data Generation

Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).


Why This Matters

Real training data is expensive, private, or scarce. Synthetic data generation uses existing models to produce new training examples. This approach has become central to modern ML: GPT-4 was partially trained on data involving model outputs, Llama models use self-instruct pipelines, and diffusion models generate training images for downstream classifiers.

But synthetic data is not a free lunch. Training on model-generated data can degrade model quality in predictable ways. Understanding when synthetic data helps and when it hurts requires understanding the distributional relationship between real and generated data.

Core Definitions

Definition

Synthetic Data

Synthetic data is training data generated by a model rather than collected from the real world. Let $p_{\text{real}}$ be the true data distribution and $p_\theta$ be a generative model. Synthetic data consists of samples $\tilde{x} \sim p_\theta$. The quality of synthetic data depends on how well $p_\theta$ approximates $p_{\text{real}}$.

Definition

Self-Instruct Pipeline

A method for generating instruction-following training data using an LLM. Starting from a small seed set of human-written instructions, the LLM generates new instructions, inputs, and outputs. These are filtered (by heuristics or a separate model) and added to the training set. The process can iterate.

Text: LLM-Generated Instruction Data

The self-instruct method (Wang et al., 2023) and Alpaca (Taori et al., 2023) demonstrated that a strong LLM can generate instruction-following data that, when used for fine-tuning, transfers the instruction-following capability to weaker models.

The pipeline:

  1. Start with a seed set of $S$ human-written (instruction, input, output) triples
  2. Prompt the teacher LLM to generate new instructions similar to the seeds
  3. For each new instruction, generate input-output pairs
  4. Filter for quality, diversity, and correctness
  5. Fine-tune the student model on the combined dataset
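The loop above can be sketched in a few lines of Python. The `teacher` callable and the de-duplication check here are toy stand-ins: a real pipeline would call an LLM API for generation and apply richer quality and diversity heuristics in the filter step.

```python
import random

def self_instruct(seed_tasks, teacher, n_rounds=3, per_round=4):
    """Minimal self-instruct loop: grow a task pool from seed triples.

    `teacher` is any callable mapping a few example tasks to a new
    (instruction, input, output) dict -- in practice an LLM API call.
    """
    pool = list(seed_tasks)
    for _ in range(n_rounds):
        for _ in range(per_round):
            # Step 2-3: prompt the teacher with a few in-context examples.
            examples = random.sample(pool, k=min(3, len(pool)))
            candidate = teacher(examples)
            # Step 4: filter -- trivial de-duplication stands in for real
            # quality/diversity/correctness heuristics.
            if candidate["instruction"] not in {t["instruction"] for t in pool}:
                pool.append(candidate)
    return pool

# Toy stand-in teacher (a real pipeline would prompt a strong LLM).
counter = [0]
def toy_teacher(examples):
    counter[0] += 1
    return {"instruction": f"Paraphrase task #{counter[0]}",
            "input": "", "output": "..."}

seeds = [{"instruction": "Summarize the text.", "input": "...", "output": "..."}]
data = self_instruct(seeds, toy_teacher, n_rounds=2, per_round=3)
print(len(data))  # seed tasks plus accepted synthetic tasks
```

Step 5 (fine-tuning the student on `data`) is the standard supervised fine-tuning step and is omitted here.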

Key empirical finding: the quality of the teacher model bounds the quality of the student. A student fine-tuned on GPT-4 outputs outperforms one fine-tuned on GPT-3.5 outputs. The ceiling is the teacher's capability.

Images: Diffusion-Based Augmentation

Diffusion models generate photorealistic images conditioned on text prompts. For training data augmentation:

  • Generate additional examples for rare classes (e.g., rare medical conditions)
  • Create variations of existing images with controlled modifications
  • Produce images for classes where real data collection is impractical

He et al. (2023) showed that classifiers trained on a mix of real and diffusion-generated images outperform those trained on real data alone, particularly when real data is scarce (fewer than 100 examples per class).
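A minimal sketch of this mixing strategy, assuming placeholder images: synthetic examples are added only for classes whose real count falls short of a target, so abundant classes stay purely real. In practice `synth` would come from a text-conditioned diffusion model prompted with each class name.

```python
from collections import Counter

def balance_with_synthetic(real, synth, target_per_class):
    """Top up each class to `target_per_class` examples, using synthetic
    images only where real data falls short."""
    counts = Counter(label for _, label in real)
    mixed = list(real)
    for item in synth:
        label = item[1]
        if counts[label] < target_per_class:
            mixed.append(item)
            counts[label] += 1
    return mixed

# Toy data: strings stand in for image tensors.
real = [("img", "common")] * 90 + [("img", "rare")] * 5
synth = [("gen", "common")] * 50 + [("gen", "rare")] * 50
mixed = balance_with_synthetic(real, synth, target_per_class=100)
print(Counter(label for _, label in mixed))
```

The rare class absorbs all available synthetic examples while the common class takes only the ten it needs, matching the finding that synthetic data helps most where real data is scarce.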

Code: Model-Generated Programming Problems

LLMs generate programming problems with solutions, test cases, and explanations. This is used to train code models (Code Llama, StarCoder). The generation process can include automated verification: run the generated code against the generated tests, and keep only examples that pass.

Verification is a key advantage of code synthesis: correctness is partially checkable, unlike natural language where quality is subjective.
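A minimal sketch of this verification filter, assuming solutions and tests arrive as source strings: execute the generated solution, then run the generated assertions against it, and keep the example only if everything passes. Running untrusted model output with `exec` like this would need a sandbox in any real pipeline.

```python
def verify_example(solution_code, test_code):
    """Keep a generated (solution, tests) pair only if the solution
    passes its own generated tests."""
    namespace = {}
    try:
        exec(solution_code, namespace)  # define the candidate solution
        exec(test_code, namespace)      # run the generated assertions
        return True
    except Exception:
        return False

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

print(verify_example(good, tests), verify_example(bad, tests))  # True False
```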

When Synthetic Data Helps

Low-resource domains. When real data has fewer than 1000 examples, synthetic augmentation can substantially improve performance. The synthetic data provides distributional coverage that the small real dataset lacks.

Privacy preservation. Instead of sharing real patient records across hospitals, generate synthetic records that preserve statistical properties without revealing individual data points. Differential privacy guarantees can be applied to the generation process.

Diversity expansion. A small seed dataset may lack coverage of edge cases. Targeted generation can fill gaps: generate examples for underrepresented subgroups, rare events, or boundary conditions.

When Synthetic Data Hurts: Model Collapse

Theorem

Model Collapse from Recursive Training

Statement

Consider a sequence of generative models $p_{\theta_0}, p_{\theta_1}, p_{\theta_2}, \ldots$ where $p_{\theta_0}$ is trained on real data from $p_{\text{real}}$, and each subsequent model $p_{\theta_{k+1}}$ is trained on samples from $p_{\theta_k}$. As $k \to \infty$, the distribution $p_{\theta_k}$ converges to a distribution with lower variance than $p_{\text{real}}$. Specifically, for Gaussian data with true distribution $\mathcal{N}(\mu, \sigma^2)$ and $n$ samples per generation:

$$\operatorname{Var}(p_{\theta_k}) \to 0 \text{ as } k \to \infty$$

The tails of the distribution are progressively lost, and the distribution collapses toward a point mass.

Intuition

Each generation of training introduces estimation error. Finite samples from $p_{\theta_k}$ underrepresent the tails of the distribution. The next model, trained on these samples, learns a distribution with thinner tails. Over iterations, the tails vanish entirely. The distribution "forgets" its variance.

Proof Sketch

For the Gaussian case: $p_{\theta_0}$ is fit to $n$ real samples, giving an estimated variance $\hat{\sigma}_0^2$ with expected value $\sigma^2 (n-1)/n$. Each subsequent generation applies this contraction. After $k$ generations, the expected variance is $\sigma^2 \cdot ((n-1)/n)^k$, which converges to 0 as $k \to \infty$. For non-Gaussian distributions, the argument extends via the finite capacity of the model, which cannot represent the full complexity of the distribution from finite samples.
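The contraction can be checked numerically with a short simulation: fit a Gaussian to samples, draw fresh samples from the fit, refit, and repeat. This is a toy illustration of the Gaussian argument, not the experimental setup of any particular paper.

```python
import random
import statistics

def recursive_fit(mu=0.0, sigma=1.0, n=100, generations=2000, seed=0):
    """Simulate recursive training for the Gaussian case: each generation
    fits (mu, sigma) to n samples drawn from the previous fitted model,
    so the (n-1)/n variance contraction compounds across generations."""
    rng = random.Random(seed)
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # MLE (divides by n): E[var] shrinks
    return sigma ** 2

print(recursive_fit(generations=1))     # close to the true variance of 1
print(recursive_fit(generations=2000))  # collapsed far below 1
```

With $n = 100$ the expected variance after $k$ generations is roughly $(0.99)^k$, so a few thousand generations drive it effectively to zero.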

Why It Matters

Model collapse is the central risk of synthetic data. If the internet increasingly contains LLM-generated text, and LLMs are trained on internet data, this recursive loop will degrade model quality over time. Shumailov et al. (2023) call this "the curse of recursion."

Mitigation

Model collapse is avoidable if real data is mixed into each generation. Even a small fraction of real data (10-20%) prevents the variance from collapsing. The theorem assumes pure recursive training with no real data, which is the worst case.
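This claim can be illustrated with a toy simulation of recursive Gaussian fitting in which each generation's training set mixes a fraction of fresh real samples with samples from the previous fitted model; the fraction and sample sizes here are illustrative, not from any paper.

```python
import random
import statistics

def recursive_fit_mixed(real_frac, n=100, generations=2000, seed=0):
    """Recursive Gaussian fitting where each generation trains on a mix of
    fresh real N(0, 1) samples (fraction `real_frac`) and samples from the
    previous generation's fitted model."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n_real = int(real_frac * n)
    for _ in range(generations):
        samples = [rng.gauss(0.0, 1.0) for _ in range(n_real)]        # real
        samples += [rng.gauss(mu, sigma) for _ in range(n - n_real)]  # synthetic
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
    return sigma ** 2

print(recursive_fit_mixed(0.0))  # pure recursion: variance collapses
print(recursive_fit_mixed(0.2))  # 20% real data anchors the variance
```

The real samples act as an anchor each generation, pulling the fitted variance back toward the true value and preventing the multiplicative contraction from compounding.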

Decontamination

When using synthetic data for training, benchmark contamination is a risk: the generating model may produce examples that overlap with evaluation benchmarks. If synthetic training data contains test set answers, reported performance is inflated.

Decontamination procedures:

  1. N-gram overlap filtering: remove synthetic examples with high n-gram overlap with known benchmarks
  2. Embedding similarity filtering: remove examples whose embeddings are close to benchmark items
  3. Provenance tracking: record which model generated each example and what prompt was used
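Procedure 1 can be sketched as a simple n-gram filter. The n-gram length and overlap threshold here are illustrative; production pipelines tune both (n-gram windows of roughly 8 to 13 tokens are common) and combine this with the other procedures.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams after naive lowercase tokenization."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(examples, benchmark_items, n=8, max_overlap=0.0):
    """Drop synthetic examples whose n-gram overlap with any benchmark
    item exceeds `max_overlap` (0.0 = drop on any shared n-gram)."""
    bench_grams = set()
    for item in benchmark_items:
        bench_grams |= ngrams(item, n)
    clean = []
    for ex in examples:
        grams = ngrams(ex, n)
        overlap = len(grams & bench_grams) / max(len(grams), 1)
        if overlap <= max_overlap:
            clean.append(ex)
    return clean

bench = ["what is the capital of france answer paris"]
examples = [
    "translate this sentence into german",
    "question what is the capital of france answer paris obviously",
]
print(decontaminate(examples, bench, n=5))  # contaminated example is dropped
```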

Common Confusions

Watch Out

Synthetic data is not the same as data augmentation

Data augmentation applies label-preserving transformations to real data (rotation, cropping, noise). Synthetic data generation creates entirely new examples, potentially for new classes or scenarios not present in the real data. Augmentation preserves the real distribution. Generation may deviate from it.

Watch Out

Model collapse requires recursive training, not one-time generation

Generating synthetic data once from a model trained on real data does not cause model collapse. The problem arises when model outputs are used to train the next model, whose outputs train the next, and so on. A single generation step introduces bounded error. Recursive application amplifies it.

Canonical Examples

Example

Alpaca: self-instruct at scale

Stanford Alpaca fine-tuned Llama-7B on 52k instruction-following examples generated by GPT-3.5 (text-davinci-003). The cost of generating the training data was under $500. The resulting model showed instruction-following ability comparable to GPT-3.5 on simple tasks, demonstrating that synthetic data can cheaply transfer capabilities between models.

Summary

  • Synthetic data is generated by models, not collected from the real world
  • Self-instruct pipelines use a teacher LLM to generate training data for student models
  • Synthetic data is most valuable in low-resource, privacy-sensitive, or diversity-limited settings
  • Model collapse occurs when models recursively train on their own outputs
  • Mixing even a small fraction of real data prevents collapse
  • Decontamination is necessary to avoid benchmark inflation

Exercises

ExerciseCore

Problem

You have 50 labeled examples of a rare disease in medical images and access to a diffusion model. Describe a pipeline for generating synthetic training data and identify the main risk.

ExerciseAdvanced

Problem

The model collapse theorem shows variance contracts by a factor of $(n-1)/n$ per generation for Gaussian data. If the original variance is $\sigma^2 = 1$ and you use $n = 1000$ samples per generation, after how many generations does the variance drop below $0.5$? Below $0.01$?

References

Canonical:

  • Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions" (2023)
  • Taori et al., "Stanford Alpaca" (2023)

Current:

  • Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2023)
  • He et al., "Is Synthetic Data from Generative Models Ready for Image Recognition?" (2023)
  • Alemohammad et al., "Self-Consuming Generative Models Go MAD" (2023)
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8

Last reviewed: April 2026
