Methodology
Train-Test Split and Data Leakage
Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.
Why This Matters
The entire point of machine learning is to make predictions on unseen data. If your evaluation uses data that was available during training, your reported metrics are fiction. Data leakage is the most common source of unreproducible ML results, and it is often invisible until deployment.
Mental Model
You have one dataset. You need to answer three questions:
- What hyperparameters work best? (validation set)
- How well will the final model perform? (test set)
- Where do the model's parameters come from? (training set)
Each question requires data that was not used to answer the other questions.
The Three-Way Split
Train-Validation-Test Split
Partition a dataset into three disjoint subsets:
- Training set : used to fit model parameters
- Validation set : used to select hyperparameters and model architecture
- Test set : used exactly once to estimate final performance
Common ratios: 70/15/15 or 80/10/10. For large datasets (e.g., $n$ in the millions), even 98/1/1 can suffice because the validation and test sets are still large in absolute terms.
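The ratios above can be sketched as a shuffle-and-slice helper. This is a minimal illustration in plain Python (no ML library assumed; `three_way_split` is a hypothetical name, not a standard API):

```python
import random

def three_way_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices once, then partition into disjoint train/val/test lists."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)  # fixed seed makes the split reproducible
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]
    val = [data[i] for i in idx[n_test:n_test + n_val]]
    train = [data[i] for i in idx[n_test + n_val:]]
    return train, val, test

# 70/15/15 split of 1,000 samples: 700 train, 150 validation, 150 test
train, val, test = three_way_split(list(range(1000)))
```

Shuffling before slicing matters: slicing an unshuffled dataset silently produces a temporal or collection-order split.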
Why Two Held-Out Sets?
If you tune hyperparameters on the test set, you are fitting to the test set. The test error becomes an optimistic estimate of true performance. The validation set absorbs this optimization pressure, keeping the test set uncontaminated.
Stratified Splits
For classification with imbalanced classes, a random split may put few or no examples of a rare class in the validation or test set. A stratified split ensures each subset has approximately the same class distribution as the full dataset.
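A stratified split can be sketched by splitting each class separately and concatenating. This is an illustrative implementation under the assumption that labels fit in memory (in practice a library routine such as scikit-learn's `train_test_split(..., stratify=y)` does the same job):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_frac))  # at least one example per class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# 95 "common" vs 5 "rare": the rare class is guaranteed to reach the test set
labels = ["common"] * 95 + ["rare"] * 5
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
```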
Temporal Splits for Time Series
Temporal Split
For data with a time dimension, split by time: train on data before time $t_1$, validate on data from $t_1$ to $t_2$, test on data after $t_2$. This prevents the model from seeing future information during training.
Random splits on time series data cause leakage because adjacent time points are correlated. A model trained on Monday and Wednesday data, tested on Tuesday data, has seen the future.
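A temporal split is just index filtering by timestamp against the two cutoffs. A minimal sketch (the cutoff values 70 and 85 are arbitrary for illustration):

```python
def temporal_split(timestamps, t1, t2):
    """Split indices by time: train < t1, t1 <= val < t2, test >= t2."""
    train = [i for i, t in enumerate(timestamps) if t < t1]
    val = [i for i, t in enumerate(timestamps) if t1 <= t < t2]
    test = [i for i, t in enumerate(timestamps) if t >= t2]
    return train, val, test

days = list(range(100))  # e.g., day index in collection order
train, val, test = temporal_split(days, t1=70, t2=85)
```

Every training timestamp strictly precedes every validation timestamp, which in turn precedes every test timestamp, so the model never trains on the future.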
Data Leakage
Data Leakage
Data leakage occurs when information from outside the training set is used to create the model. This includes information from the validation set, test set, or real-world future data. Leakage produces overly optimistic performance estimates that do not generalize.
Leakage Inflates Apparent Performance
Statement
Let $\hat{R}$ be the empirical risk measured on a contaminated test set and $R$ be the population risk on clean data. If the leakage provides $b$ bits of information about the label, then:

$$\hat{R} \le R - \epsilon(b),$$

where $\epsilon(b) > 0$ is a gap that grows with the information leaked. The apparent performance $\hat{R}$ is strictly better than the true performance $R$.
Intuition
Leakage gives the model a cheat sheet. Performance on the cheat-sheet evaluation is better than performance on a fair evaluation. The more information leaked, the larger the gap between reported and actual performance.
Proof Sketch
The leaking feature provides mutual information about the target. Any model that exploits this feature achieves lower empirical risk on the contaminated evaluation than on clean data. By Fano's inequality, the excess information reduces the Bayes error on the contaminated set relative to the clean set.
Why It Matters
Leakage can make a bad model look good. A Kaggle competition winner with leakage has zero value in production. Detecting leakage before deployment saves months of wasted engineering effort.
Failure Mode
If the leaking feature is uncorrelated with the target, or the model lacks capacity to exploit it, leakage may have no effect on measured performance. This does not mean the split is clean; it means the leakage happened to be harmless this time.
Common Leakage Sources
1. Preprocessing Before Splitting
Fitting a scaler (mean/standard deviation) or PCA on the full dataset before splitting leaks test set statistics into the training set. The correct procedure: split first, then fit preprocessing on the training set only, then apply the fitted transform to validation and test sets.
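The correct procedure can be sketched as a fit/apply pair, where fitting only ever sees training rows. A plain-Python illustration (in practice, scikit-learn's `Pipeline` enforces this discipline automatically):

```python
def fit_scaler(train_rows):
    """Compute per-feature mean and std from the training rows ONLY."""
    n = len(train_rows)
    dims = len(train_rows[0])
    means = [sum(r[d] for r in train_rows) / n for d in range(dims)]
    stds = [(sum((r[d] - means[d]) ** 2 for r in train_rows) / n) ** 0.5 or 1.0
            for d in range(dims)]  # 'or 1.0' guards against zero variance
    return means, stds

def apply_scaler(rows, means, stds):
    """Apply the fitted training-set statistics to any split unchanged."""
    return [[(r[d] - means[d]) / stds[d] for d in range(len(r))] for r in rows]

train = [[1.0], [3.0]]
test = [[100.0]]                  # test statistics never touch the scaler
means, stds = fit_scaler(train)   # mean 2.0, std 1.0, from train alone
scaled_test = apply_scaler(test, means, stds)
```

The outlier in the test set does not shift the scaler's mean or standard deviation, which is exactly the point: the transform is frozen before test data is seen.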
2. Target Encoding with Test Data
Target encoding (replacing a categorical feature with the mean of the target for that category) computed on the full dataset leaks target information from the test set into the features.
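A leakage-free version computes the per-category means from training rows only and falls back to the global training mean for categories unseen at training time. A minimal sketch (`target_encode` is an illustrative name):

```python
def target_encode(train_cats, train_targets):
    """Build a category -> mean-target lookup from TRAINING rows only."""
    global_mean = sum(train_targets) / len(train_targets)
    sums, counts = {}, {}
    for c, y in zip(train_cats, train_targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    enc = {c: sums[c] / counts[c] for c in sums}
    # categories never seen in training fall back to the global training mean
    return lambda c: enc.get(c, global_mean)

encode = target_encode(["a", "a", "b"], [1.0, 0.0, 1.0])
```

Applying `encode` to validation and test features then uses only statistics that were available at training time.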
3. Temporal Leakage
Using features computed from future data. Example: predicting hospital readmission using the patient's next diagnosis code. The next diagnosis is recorded after the readmission event.
4. Duplicate or Near-Duplicate Rows
If the same data point appears in both training and test sets, the model memorizes it and test performance is inflated. This is common with image datasets (augmented versions of the same image) and web-scraped text.
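Exact duplicates across splits can be found by hashing row contents, as in this sketch (near-duplicates, such as augmented images, need a fuzzier similarity check than shown here):

```python
import hashlib

def cross_split_duplicates(train_rows, test_rows):
    """Return indices of test rows whose content also appears in training."""
    def key(row):
        # hash the serialized row so the lookup scales to large datasets
        return hashlib.sha256(repr(row).encode()).hexdigest()
    train_keys = {key(r) for r in train_rows}
    return [i for i, r in enumerate(test_rows) if key(r) in train_keys]

train = [("cat", 1), ("dog", 0)]
test = [("cat", 1), ("fish", 1)]
dupes = cross_split_duplicates(train, test)  # test row 0 also sits in train
```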
5. Group Leakage
Multiple data points from the same entity (patient, user, device) in both training and test sets. The model learns entity-specific patterns instead of generalizable patterns. Use group-aware splits to keep all data from one entity in the same subset.
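A group-aware split assigns whole entities, not individual rows, to subsets. A minimal sketch (scikit-learn's `GroupShuffleSplit` provides the same behavior):

```python
import random

def group_split(groups, test_frac=0.25, seed=0):
    """Assign whole groups (patients, users, devices) to train or test, never both."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_frac))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# two rows per patient; each patient lands entirely in one subset
groups = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]
train_idx, test_idx = group_split(groups)
```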
Concrete Leakage Examples from Practice
Medical prediction with future diagnosis codes
A hospital wants to predict 30-day readmission at the time of discharge. The feature set includes all diagnosis codes associated with the patient. However, some codes are added to the patient record after readmission (e.g., the admitting diagnosis of the second visit). Including these codes as features gives the model access to information that would not be available at the actual prediction time (discharge). The model achieves 92% AUC in offline evaluation but only 64% AUC in production. The 28-point gap is entirely due to temporal leakage.
Fix: filter all features to include only information available at the prediction time. This requires a careful audit of every feature's timestamp relative to the prediction event.
Kaggle competition leakage via row IDs
In multiple Kaggle competitions, participants discovered that the row ID or file name contained information about the target. For instance, if images were stored as class_label_0001.jpg, the file name leaks the label. Even if the labels are not directly in the file name, sorted row IDs can correlate with time of collection, which correlates with the target through temporal patterns.
A competition to predict earthquake damage had row IDs assigned chronologically. A model that simply learned "later IDs have higher damage" achieved top-10 performance. This is leakage: the ID is a proxy for temporal information that would not be available in a real prediction setting.
Target leakage in churn prediction
Predicting customer churn using features that include "number of calls to customer support in the last 30 days" seems reasonable. But if "churn" is defined as cancellation within 30 days, and the support calls happened after the customer decided to leave (calling to cancel), the feature is a consequence of the target, not a predictor of it. Including it gives the model a near-perfect signal that is unavailable at the time you would make the prediction.
The diagnostic: if a feature's importance is unreasonably high (e.g., a single feature gives 95% of the model's predictive power), suspect leakage. Investigate the causal relationship between the feature and the target.
How to Detect Leakage
- Suspiciously high performance: if your model achieves near-perfect accuracy on a non-trivial task, suspect leakage first.
- Feature importance analysis: if a feature has unreasonably high importance, check whether it could contain target information.
- Adversarial validation: train a classifier to distinguish training from test examples. If it succeeds, the distributions differ in ways that may indicate leakage.
- Temporal consistency check: verify that no feature uses information from after the prediction time.
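Adversarial validation, the third check above, can be sketched with even a trivial classifier. Here a one-feature decision stump tries to tell train rows from test rows; accuracy near 0.5 means the splits look alike, accuracy near 1.0 flags a systematic difference worth investigating (the chronological row-ID feature below is a contrived example of such a difference):

```python
def adversarial_validation(train_feats, test_feats):
    """Best accuracy of a one-feature threshold rule at separating train from test."""
    rows = [(x, 0) for x in train_feats] + [(x, 1) for x in test_feats]
    best = 0.5
    for d in range(len(rows[0][0])):           # try each feature dimension
        for thresh in sorted({x[d] for x, _ in rows}):
            # rule: predict "test" when feature d exceeds the threshold
            correct = sum((x[d] > thresh) == bool(y) for x, y in rows)
            # allow the flipped rule too, then keep the best accuracy seen
            best = max(best, max(correct, len(rows) - correct) / len(rows))
    return best

# feature 0 is a chronological row ID: it separates train from test perfectly
train = [(i, 0.0) for i in range(50)]
test = [(i, 0.0) for i in range(50, 60)]
separability = adversarial_validation(train, test)
```

A real audit would use a stronger classifier (e.g., gradient-boosted trees) and inspect which features drive the separation; those features are the leakage suspects.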
Common Confusions
Cross-validation does not eliminate the need for a test set
Cross-validation reuses data for training and validation efficiently, but the CV estimate is used for model selection. You still need a held-out test set that was never involved in any model selection decision. Otherwise, the CV estimate becomes optimistic.
Larger test sets are not always better
Making the test set larger shrinks the training set, potentially degrading model quality. The test set needs to be large enough for reliable performance estimates (a few thousand examples for classification with moderate class counts) but not so large that it starves training.
Exercises
Problem
You have 10,000 samples. You compute PCA on all 10,000 samples to reduce dimensionality, then split into 8,000 train and 2,000 test, then train a classifier. Is there leakage? If so, how do you fix it?
Problem
You are predicting next-day stock returns. You randomly shuffle the dataset and split 80/20 into train and test. Your model achieves 65% directional accuracy on the test set. In production, it achieves 51% (barely above chance). Diagnose the problem.
References
Canonical:
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 7.10
- Kaufman et al., "Leakage in Data Mining" (2012), ACM TKDD
Current:
- Kapoor & Narayanan, "Leakage and the Reproducibility Crisis in ML-based Science" (2023), Patterns
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7
Next Topics
- Cross-validation theory: resampling methods for more efficient data use
- Exploratory data analysis: understanding data before splitting and modeling
Last reviewed: April 2026