
Methodology

Train-Test Split and Data Leakage

Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.

Core · Tier 1 · Stable · ~35 min

Why This Matters

The entire point of machine learning is to make predictions on unseen data. If your evaluation uses data that was available during training, your reported metrics are fiction. Data leakage is the most common source of unreproducible ML results, and it is often invisible until deployment.

[Figure: side-by-side comparison. Correct pipeline: split the full dataset into train and test, fit preprocessing on the train split, and apply the same transform to the test split; no information flows from test to train. Leaky pipeline: preprocessing is fit on all data before splitting, so test-set statistics contaminate the training transform.]

Mental Model

You have one dataset. You need to answer three questions:

  1. What hyperparameters work best? (validation set)
  2. How well will the final model perform? (test set)
  3. Where do the model's parameters come from? (training set)

Each question requires data that was not used to answer the other questions.

The Three-Way Split

Definition

Train-Validation-Test Split

Partition a dataset D into three disjoint subsets:

  • Training set D_train: used to fit model parameters
  • Validation set D_val: used to select hyperparameters and model architecture
  • Test set D_test: used exactly once to estimate final performance

Common ratios: 70/15/15 or 80/10/10. For large datasets (n > 100,000), even 98/1/1 can suffice because the validation and test sets are still large in absolute terms.
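As a concrete sketch, a three-way split can be done by shuffling indices once and slicing. This is a minimal standard-library version; in practice a helper such as scikit-learn's train_test_split, applied twice, does the same job.

```python
import random

def three_way_split(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices once, then carve out disjoint train/val/test index sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = three_way_split(1000)
# 700 / 150 / 150 disjoint indices covering all 1000 samples
```

Shuffling once and slicing guarantees the three subsets are disjoint by construction; shuffling separately for each subset would not.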

Why Two Held-Out Sets?

If you tune hyperparameters on the test set, you are fitting to the test set. The test error becomes an optimistic estimate of true performance. The validation set absorbs this optimization pressure, keeping the test set uncontaminated.

Stratified Splits

For classification with imbalanced classes, a random split may put few or no examples of a rare class in the validation or test set. A stratified split ensures each subset has approximately the same class distribution as the full dataset.
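A stratified split can be sketched by grouping indices by class and sampling proportionally from each group. This is a minimal version; scikit-learn's train_test_split(..., stratify=y) is the usual tool.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices so each class contributes test_frac of its examples to test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y, idx in by_class.items():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# A 5% rare class still gets its proportional share of test examples.
labels = ["common"] * 950 + ["rare"] * 50
train_idx, test_idx = stratified_split(labels)
```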

Temporal Splits for Time Series

Definition

Temporal Split

For data with a time dimension, split by time: train on data before time t1, validate on data from t1 to t2, test on data after t2. This prevents the model from seeing future information during training.

Random splits on time series data cause leakage because adjacent time points are correlated. A model trained on Monday and Wednesday data, tested on Tuesday data, has seen the future.
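The temporal split from the definition can be sketched as a pure function of timestamps: no shuffling, only the two time cutoffs t1 and t2.

```python
def temporal_split(timestamps, t1, t2):
    """Assign each index to train/val/test purely by time; never shuffle."""
    train = [i for i, t in enumerate(timestamps) if t < t1]
    val = [i for i, t in enumerate(timestamps) if t1 <= t < t2]
    test = [i for i, t in enumerate(timestamps) if t >= t2]
    return train, val, test

# Days 0-99: train on days before 70, validate on days 70-84, test on 85+.
train, val, test = temporal_split(list(range(100)), t1=70, t2=85)
```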

Data Leakage

Definition

Data Leakage

Data leakage occurs when information from outside the training set is used to create the model. This includes information from the validation set, test set, or real-world future data. Leakage produces overly optimistic performance estimates that do not generalize.

Proposition

Leakage Inflates Apparent Performance

Statement

Let R̂_leak be the empirical risk measured on a contaminated test set and R_true be the population risk on clean data. If the leakage provides ε bits of information about the label, then:

R̂_leak ≤ R_true − Δ(ε)

where Δ(ε) > 0 is a gap that grows with the information leaked. The apparent performance R̂_leak is strictly better than the true performance R_true.

Intuition

Leakage gives the model a cheat sheet. Performance on the cheat-sheet evaluation is better than performance on a fair evaluation. The more information leaked, the larger the gap between reported and actual performance.

Proof Sketch

The leaking feature provides mutual information I(X_leak; Y) = ε > 0 about the target. Any model that exploits this feature achieves lower empirical risk on the contaminated evaluation than on clean data. By Fano's inequality, the excess information reduces the Bayes error on the contaminated set relative to the clean set.

Why It Matters

Leakage can make a bad model look good. A Kaggle competition winner with leakage has zero value in production. Detecting leakage before deployment saves months of wasted engineering effort.

Failure Mode

If the leaking feature is uncorrelated with the target, or the model lacks capacity to exploit it, leakage may have no effect on measured performance. This does not mean the split is clean; it means the leakage happened to be harmless this time.

Common Leakage Sources

1. Preprocessing Before Splitting

Fitting a scaler (mean/standard deviation) or PCA on the full dataset before splitting leaks test set statistics into the training set. The correct procedure: split first, then fit preprocessing on the training set only, then apply the fitted transform to validation and test sets.
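The split-then-fit order looks like this for a simple standardizer. This is a minimal sketch; with scikit-learn, fitting a Pipeline on the train split enforces the same discipline automatically.

```python
from statistics import mean, stdev

def fit_scaler(train_col):
    """Learn standardization parameters from the training column ONLY."""
    return mean(train_col), stdev(train_col)

def apply_scaler(col, mu, sigma):
    """Apply the training statistics to any split; never refit on test data."""
    return [(x - mu) / sigma for x in col]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0]                     # deliberately different scale
mu, sigma = fit_scaler(train)           # statistics come from train only
train_z = apply_scaler(train, mu, sigma)
test_z = apply_scaler(test, mu, sigma)  # test values land far from 0: that is correct
```

If the test values look "badly scaled" after this transform, that is the honest picture; refitting the scaler on test data to make them look nicer is exactly the leakage this section warns about.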

2. Target Encoding with Test Data

Target encoding (replacing a categorical feature with the mean of the target for that category) computed on the full dataset leaks target information from the test set into the features.
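A leak-free version fits the encoding on training rows only and falls back to the global training mean for categories unseen in training. This is a minimal sketch; the category names and the fallback rule are illustrative.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Compute per-category target means from TRAINING rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {c: sums[c] / counts[c] for c in sums}
    return encoding, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    """Map categories using training statistics; unseen ones get the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

train_cats, train_y = ["a", "a", "b"], [1, 0, 1]
enc, gm = fit_target_encoding(train_cats, train_y)
test_encoded = apply_target_encoding(["a", "c"], enc, gm)  # "c" is unseen
```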

3. Temporal Leakage

Using features computed from future data. Example: predicting hospital readmission using the patient's next diagnosis code. The next diagnosis is recorded after the readmission event.

4. Duplicate or Near-Duplicate Rows

If the same data point appears in both training and test sets, the model memorizes it and test performance is inflated. This is common with image datasets (augmented versions of the same image) and web-scraped text.
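Exact duplicates can be caught by fingerprinting rows and intersecting the train and test fingerprint sets. This sketch handles exact matches only; near-duplicates such as augmented images require perceptual or fuzzy hashing instead.

```python
import hashlib

def row_fingerprints(rows):
    """Hash each row's canonical string form to detect exact duplicates."""
    return {hashlib.sha256(repr(tuple(r)).encode()).hexdigest() for r in rows}

train_rows = [(1.0, "x"), (2.0, "y")]
test_rows = [(2.0, "y"), (3.0, "z")]  # (2.0, "y") appears in both splits
overlap = row_fingerprints(train_rows) & row_fingerprints(test_rows)
# A non-empty overlap signals train/test contamination by exact duplicates.
```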

5. Group Leakage

Multiple data points from the same entity (patient, user, device) in both training and test sets. The model learns entity-specific patterns instead of generalizable patterns. Use group-aware splits to keep all data from one entity in the same subset.
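A group-aware split assigns whole entities to one side before collecting their rows, so no entity ever spans the boundary. This is a minimal sketch; scikit-learn's GroupShuffleSplit and GroupKFold implement the same idea.

```python
import random

def group_split(groups, test_frac=0.2, seed=0):
    """Split at the entity level so no group appears in both train and test."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test_groups = max(1, round(len(unique) * test_frac))
    test_groups = set(unique[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Three rows per patient; every patient lands entirely in one split.
groups = [f"patient_{i // 3}" for i in range(30)]
train_idx, test_idx = group_split(groups)
```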

Concrete Leakage Examples from Practice

Example

Medical prediction with future diagnosis codes

A hospital wants to predict 30-day readmission at the time of discharge. The feature set includes all diagnosis codes associated with the patient. However, some codes are added to the patient record after readmission (e.g., the admitting diagnosis of the second visit). Including these codes as features gives the model access to information that would not be available at the actual prediction time (discharge). The model achieves 92% AUC in offline evaluation but only 64% AUC in production. The 28-point gap is entirely due to temporal leakage.

Fix: filter all features to include only information available at the prediction time. This requires a careful audit of every feature's timestamp relative to the prediction event.

Example

Kaggle competition leakage via row IDs

In multiple Kaggle competitions, participants discovered that the row ID or file name contained information about the target. For instance, if images were stored as class_label_0001.jpg, the file name leaks the label. Even if the labels are not directly in the file name, sorted row IDs can correlate with time of collection, which correlates with the target through temporal patterns.

A competition to predict earthquake damage had row IDs assigned chronologically. A model that simply learned "later IDs have higher damage" achieved top-10 performance. This is leakage: the ID is a proxy for temporal information that would not be available in a real prediction setting.

Example

Target leakage in churn prediction

Predicting customer churn using features that include "number of calls to customer support in the last 30 days" seems reasonable. But if "churn" is defined as cancellation within 30 days, and the support calls happened after the customer decided to leave (calling to cancel), the feature is a consequence of the target, not a predictor of it. Including it gives the model a near-perfect signal that is unavailable at the time you would make the prediction.

The diagnostic: if a feature's importance is unreasonably high (e.g., a single feature gives 95% of the model's predictive power), suspect leakage. Investigate the causal relationship between the feature and the target.

How to Detect Leakage

  1. Suspiciously high performance: if your model achieves near-perfect accuracy on a non-trivial task, suspect leakage first.
  2. Feature importance analysis: if a feature has unreasonably high importance, check whether it could contain target information.
  3. Adversarial validation: train a classifier to distinguish training from test examples. If it succeeds, the distributions differ in ways that may indicate leakage.
  4. Temporal consistency check: verify that no feature uses information from after the prediction time.
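Step 3 can be approximated cheaply one feature at a time: label train rows 0 and test rows 1, then compute each feature's AUC as a train-vs-test discriminator. This is a toy sketch; full adversarial validation trains a real classifier on all features jointly.

```python
def rank_auc(scores, labels):
    """AUC via the rank statistic: probability a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One feature's values from the train and test splits, labeled by split membership.
train_feature = [0.1, 0.2, 0.3, 0.4]
test_feature = [0.9, 1.0, 1.1, 1.2]  # clearly shifted distribution
scores = train_feature + test_feature
labels = [0] * len(train_feature) + [1] * len(test_feature)
auc = rank_auc(scores, labels)
# AUC near 0.5 means the splits are indistinguishable on this feature;
# AUC near 1.0 means this feature alone separates them: investigate for shift or leakage.
```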

Common Confusions

Watch Out

Cross-validation does not eliminate the need for a test set

Cross-validation reuses data for training and validation efficiently, but the CV estimate is used for model selection. You still need a held-out test set that was never involved in any model selection decision. Otherwise, the CV estimate becomes optimistic.

Watch Out

Larger test sets are not always better

Making the test set larger shrinks the training set, potentially degrading model quality. The test set needs to be large enough for reliable performance estimates (a few thousand examples for classification with moderate class counts) but not so large that it starves training.

Exercises

ExerciseCore

Problem

You have 10,000 samples. You compute PCA on all 10,000 samples to reduce dimensionality, then split into 8,000 train and 2,000 test, then train a classifier. Is there leakage? If so, how do you fix it?

ExerciseAdvanced

Problem

You are predicting next-day stock returns. You randomly shuffle the dataset and split 80/20 into train and test. Your model achieves 65% directional accuracy on the test set. In production, it achieves 51% (barely above chance). Diagnose the problem.

References

Canonical:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 7.10
  • Kaufman et al., "Leakage in Data Mining" (2012), ACM TKDD

Current:

  • Kapoor & Narayanan, "Leakage and the Reproducibility Crisis in ML-based Science" (2023), Patterns
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7

Next Topics

Last reviewed: April 2026
