Data Contamination and Evaluation
When training data overlaps test benchmarks, model scores become meaningless. Types of contamination, detection methods, dynamic benchmarks, and how to read evaluation claims skeptically.
Why This Matters
Every week someone announces a new model that "beats GPT-4 on MMLU" or "achieves state-of-the-art on HumanEval." How much should you believe these claims? The answer depends almost entirely on whether the benchmark data leaked into the training set.
Data contamination is the single biggest threat to meaningful evaluation in modern ML. When a model has seen the test questions during training, its benchmark score measures memorization, not capability. And because LLMs train on massive web crawls that include benchmark datasets, contamination is the default state unless actively prevented.
Mental Model
Think of contamination like a student who has seen the exact exam questions before the test. Their score tells you nothing about whether they understand the material. They might score 95% on the exam and fail completely on a slightly rephrased version of the same questions.
The same happens with LLMs. A model contaminated on MMLU can score high on the exact MMLU questions while performing poorly on equivalent questions drawn from the same distribution but not in the training set.
Core Definitions
Data Contamination
Data contamination occurs when data from a test or evaluation benchmark appears in the model's training data. The model may memorize or partially memorize test examples, leading to inflated performance estimates that do not reflect true generalization ability.
Direct Memorization
Direct memorization is the strongest form of contamination: the exact test example (input and correct output) appears verbatim in the training corpus. The model can achieve a correct answer by recalling the memorized pair rather than reasoning about the question.
Paraphrase Contamination
Paraphrase contamination occurs when the training data contains a rephrased version of a test example. The input or output is not identical, but similar enough that the model gains an unfair advantage. This is harder to detect than direct memorization because string matching fails.
Data Leakage
Data leakage is a broader concept: any information from the test distribution that enters the training pipeline in a way that inflates evaluation scores. This includes contamination but also includes cases where auxiliary data (feature engineering signals, label-correlated metadata) leaks target information into the training features.
Types of Contamination
Contamination exists on a spectrum:
- Verbatim inclusion: The exact benchmark example appears in the training data. Detectable by n-gram matching or substring search.
- Near-duplicate: Minor formatting differences (whitespace, punctuation) but semantically identical. Detectable by fuzzy matching or edit distance.
- Paraphrase: Same question reworded. Requires semantic similarity detection. Hard to catch automatically.
- Indirect leakage: The training data contains solutions manuals, homework answer keys, or forum posts that discuss the benchmark questions. The model never sees the benchmark directly but learns patterns that transfer to it.
- Benchmark-on-the-web: Many benchmarks are published on the open web (GitHub, HuggingFace, papers). Any web crawl large enough will include them unless explicitly filtered.
How Contamination Inflates Scores
Contamination Inflation Bound
Statement
If a model has memorized a fraction c of the test set, answering memorized examples correctly with probability p_mem and non-memorized examples with probability p_true, the observed test accuracy is:

acc_obs = c · p_mem + (1 − c) · p_true

The true generalization accuracy is p_true. The inflation is:

acc_obs − p_true = c · (p_mem − p_true)

Even modest contamination (c = 0.10) with strong memorization (p_mem = 1.0) and moderate true ability (p_true = 0.60) inflates the score by 4 percentage points: from 60% to 64%.
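The bound can be checked numerically. A minimal sketch (function and variable names are mine, not from the source):

```python
def observed_accuracy(c, p_mem, p_true):
    """Observed accuracy: weighted mix of memorized and clean performance."""
    return c * p_mem + (1 - c) * p_true

def inflation(c, p_mem, p_true):
    """Score inflation relative to true generalization accuracy."""
    return observed_accuracy(c, p_mem, p_true) - p_true

# 10% contamination, perfect memorization, 60% true ability:
# observed accuracy ~0.64, inflation ~0.04 (4 percentage points)
print(round(observed_accuracy(0.10, 1.0, 0.60), 4),
      round(inflation(0.10, 1.0, 0.60), 4))
```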
Intuition
Contamination acts as a mixture: the model's score is a weighted average of its memorization ability on leaked examples and its genuine ability on clean examples. The more contaminated the test set and the better the model memorizes, the more inflated the score becomes.
Why It Matters
A 4-point inflation on a benchmark where models are separated by 1-2 points can change the entire leaderboard ranking. This means that without contamination controls, benchmark rankings are largely meaningless for comparing models trained on different data.
Failure Mode
This simple model assumes uniform memorization probability across contaminated examples. In practice, memorization depends on how many times the example appeared in training and its similarity to other training data. Heavily duplicated examples are memorized more strongly.
Detection Methods
N-gram Overlap
The simplest detection method: for each test example, check whether long n-grams (typically n between 8 and 13) appear in the training data. If a 13-gram from a test question appears verbatim in training, the question is likely contaminated.
Limitations: fails for paraphrase contamination and misses near-duplicates with minor edits.
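A minimal whitespace-token sketch of n-gram overlap detection (real pipelines normalize and tokenize more carefully, and precompute the training-side n-grams into a hash set or Bloom filter rather than rescanning per example):

```python
def ngrams(text, n):
    """Set of word-level n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example, training_corpus, n=13):
    """Flag a test example whose n-grams appear verbatim in any training doc."""
    test_grams = ngrams(test_example, n)
    return any(not test_grams.isdisjoint(ngrams(doc, n)) for doc in training_corpus)
```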
Canary Strings
Canary String
A canary string is a unique, randomly generated string inserted into a dataset. If a model can reproduce the canary string when prompted, the dataset was included in its training data. Canaries provide a definitive yes/no answer to the question "was this data in the training set?"
Canary strings are used proactively: benchmark creators embed canaries in their datasets before public release. If a model trained later can complete the canary, its training data included the benchmark.
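A sketch of proactive canary insertion and later checking. The `complete` callable stands in for whatever text-completion interface the model exposes; it is an assumption for illustration, not a real API:

```python
import secrets

def make_canary(benchmark_name):
    """Unique random string embedded in the benchmark before public release."""
    return f"CANARY {benchmark_name} {secrets.token_hex(16)}"

def reproduces_canary(complete, canary):
    """Prompt with the canary prefix and check whether the model emits the
    random suffix, which is unguessable without training on the dataset."""
    prefix, suffix = canary.rsplit(" ", 1)
    return complete(prefix + " ").strip() == suffix
```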
Membership Inference
Given a test example, determine whether it was in the training set using the model's behavior. Common signals:
- Perplexity: contaminated examples tend to have lower perplexity (the model is more confident on memorized data)
- Calibration: contaminated examples show overconfident predictions
- Completion probability: contaminated examples have higher probability of exact completion
These methods are probabilistic and can produce false positives, but they work even without access to the training data.
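The perplexity signal can be sketched as follows. Here `token_logprobs` is a placeholder for a model-specific scoring function (an assumption, not a real API), and the threshold must be calibrated on known-clean data to control false positives:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def likely_members(examples, token_logprobs, threshold):
    """Flag examples with suspiciously low perplexity (possible memorization)."""
    return [ex for ex in examples if perplexity(token_logprobs(ex)) < threshold]
```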
Temporal Splits
Temporal Split
A temporal split uses publication date as a natural contamination barrier. Test examples created after the model's training data cutoff cannot be contaminated (assuming the training data cutoff is honest). This is the most robust contamination prevention method.
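Operationally, a temporal split reduces to a date filter. A minimal sketch, assuming each example carries a creation date:

```python
from datetime import date

def temporal_eval_set(examples, training_cutoff):
    """Keep only examples created after the model's training-data cutoff:
    these cannot appear in the training corpus (assuming an honest cutoff)."""
    return [ex for ex in examples if ex["created"] > training_cutoff]
```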
Dynamic vs Static Benchmarks
Static benchmarks (MMLU, HumanEval, GSM8K) are published once and never change. They are contaminated by default in any model trained on web data collected after the benchmark's publication date.
Dynamic benchmarks regenerate their test examples periodically or on demand. Examples:
- LiveCodeBench: Continuously sources new coding problems from competitive programming sites after a cutoff date
- Private holdout sets: Companies maintain private eval sets never published on the web
- Procedurally generated: Generate new instances from templates at evaluation time
Dynamic benchmarks resist contamination because the test examples do not exist in any training corpus at training time.
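The procedurally generated approach can be sketched as a template instantiated at evaluation time, so no fixed instance exists anywhere to leak (a toy arithmetic template, my own illustration):

```python
import random

def fresh_arithmetic_item(rng):
    """Instantiate a template into a brand-new test item at evaluation time."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return {"question": f"What is {a} + {b}?", "answer": str(a + b)}

# An unseeded generator yields new items each run; none can be in training data.
items = [fresh_arithmetic_item(random.Random()) for _ in range(3)]
```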
"My model beats GPT-4 on MMLU" often means nothing. MMLU has been on the internet since 2021. Every major web crawl since then includes it. Any model trained on Common Crawl, The Pile, or similar internet-scale corpora has likely seen MMLU questions unless explicit deduplication was performed. A high MMLU score on a new model, without a contamination analysis, is not evidence of superior capability. It might be evidence of more aggressive web crawling. The only credible benchmark comparisons control for contamination or use dynamic benchmarks with temporal splits.
How to Read Benchmark Claims Skeptically
When evaluating a model release or paper, ask:
- Was contamination measured? Does the paper report contamination rates for each benchmark? If not, assume contamination is present.
- How was the training data filtered? Was deduplication performed against the benchmark? What method was used?
- Are results consistent across clean and contaminated subsets? If performance drops sharply on a decontaminated subset, the headline number is inflated.
- Is the benchmark static or dynamic? Static benchmark results from models trained on internet data are suspect by default.
- Does the reported score match independent reproduction? Third-party evaluations on private holdout sets are more credible than self-reported numbers.
Prevention Strategies
Deduplication against benchmarks: Before training, remove all n-gram matches between the training corpus and known benchmark datasets. This catches verbatim and near-duplicate contamination but misses paraphrases.
Canary insertion: Embed canary strings in benchmark releases. Monitor for canary reproduction in new models.
Temporal splits: Design benchmarks with a strict cutoff date. Only evaluate models whose training data predates the benchmark.
Private holdout sets: Maintain evaluation sets that are never published. Only the evaluation API is exposed, not the questions. OpenAI and Anthropic both follow this pattern with their internal evals.
Renewal: Periodically retire contaminated benchmarks and replace them with fresh ones drawn from the same distribution.
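Benchmark-aware deduplication is detection run in reverse: instead of flagging test examples, drop any training document that shares a long n-gram with a known benchmark. A minimal sketch:

```python
def word_ngrams(text, n):
    """Set of word-level n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(training_docs, benchmark_texts, n=13):
    """Remove training documents sharing any n-gram with a benchmark example.
    Catches verbatim and near-verbatim overlap only, not paraphrases."""
    banned = set()
    for text in benchmark_texts:
        banned |= word_ngrams(text, n)
    return [doc for doc in training_docs if word_ngrams(doc, n).isdisjoint(banned)]
```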
Common Confusions
Contamination is not the same as overfitting
Overfitting means the model fits training noise and performs poorly on the test distribution. Contamination means the test data is training data. A contaminated model may generalize poorly to new examples from the same distribution while scoring well on the leaked examples. These are different failure modes with different solutions.
Deduplication does not eliminate all contamination
N-gram deduplication catches exact and near-exact matches. It misses paraphrase contamination, indirect leakage through answer keys, and contamination through similar examples that are not direct copies. Thorough contamination prevention requires multiple complementary approaches.
High benchmark scores are not automatically suspicious
Contamination is common, but not universal. A model can genuinely achieve high benchmark scores through superior architecture, training data quality, or training methodology. The point is not that high scores are fake, but that without contamination analysis you cannot distinguish genuine capability from memorization.
Summary
- Data contamination is when training data overlaps with evaluation benchmarks
- Types: verbatim, near-duplicate, paraphrase, indirect leakage
- Even 10% contamination can change leaderboard rankings
- Static benchmarks on the open web are contaminated by default
- Detection: n-gram matching, canary strings, membership inference, temporal splits
- Prevention: deduplication, temporal splits, dynamic benchmarks, private holdouts
- Always ask: was contamination measured? If not, the scores are unreliable.
Exercises
Problem
A benchmark has 1000 questions. You discover that 80 of them appear verbatim in the model's training data. On the 80 contaminated questions, the model scores 95%. On the remaining 920, it scores 68%. What is the reported benchmark accuracy, and what is the true generalization accuracy?
Problem
You are designing a coding benchmark that will resist contamination for at least 2 years. Describe your design, including: (a) how you source problems, (b) how you prevent leakage, (c) how you detect contamination in models evaluated on your benchmark.
Problem
Paraphrase contamination is the hardest form to detect because the test example never appears verbatim in training data. Propose a method to estimate the rate of paraphrase contamination in a model, given access to the model weights and a test set but not the full training corpus.
References
Canonical:
- Jacovi et al., "Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks" (2023)
- Sainz et al., "NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark" (2023)
Current:
- Jain et al., "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" (2024)
- Yang et al., "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" (2024)
Next Topics
The natural next steps from data contamination:
- Hallucination theory: when models produce confident wrong answers, sometimes due to contamination-induced overconfidence
- Scaling laws: understanding how benchmark performance scales with compute, independent of contamination effects
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Hypothesis Testing for ML (Layer 2)