
Methodology

Train-Test Split and Data Leakage

Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.

Core · Tier 1 · Stable · ~35 min

Why This Matters

The entire point of machine learning is to make predictions on unseen data. If your evaluation uses data that was available during training, your reported metrics are fiction. Data leakage is the most common source of unreproducible ML results, and it is often invisible until deployment.

[Figure: side-by-side comparison. Correct pipeline: split the full dataset into train and test, fit preprocessing on the train split, and apply the same transform to the test split; no information flows from test to train. Leaky pipeline: preprocessing is fit on all data before splitting, so test-set statistics contaminate the training transform.]

Mental Model

You have one dataset. You need to answer three questions:

  1. What hyperparameters work best? (validation set)
  2. How well will the final model perform? (test set)
  3. Where do the model's parameters come from? (training set)

Each question requires data that was not used to answer the other questions.

The Three-Way Split

Definition

Train-Validation-Test Split

Partition a dataset D into three disjoint subsets:

  • Training set D_train: used to fit model parameters
  • Validation set D_val: used to select hyperparameters and model architecture
  • Test set D_test: used exactly once to estimate final performance

Common ratios: 70/15/15 or 80/10/10. For large datasets (n > 100,000), even 98/1/1 can suffice because the validation and test sets are still large in absolute terms.
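As a concrete sketch, a three-way split can be done by shuffling indices once and slicing. This is a minimal standard-library version; in practice a helper such as scikit-learn's train_test_split, applied twice, does the same job.

```python
import random

def three_way_split(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices once, then carve out disjoint train/val/test index sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = three_way_split(1000)
# 700 / 150 / 150 disjoint indices covering all 1000 samples
```

Shuffling once and slicing guarantees the three subsets are disjoint by construction; shuffling separately for each subset would not.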

Why Two Held-Out Sets?

If you tune hyperparameters on the test set, you are fitting to the test set. The test error becomes an optimistic estimate of true performance. The validation set absorbs this optimization pressure, keeping the test set uncontaminated.

Stratified Splits

For classification with imbalanced classes, a random split may put few or no examples of a rare class in the validation or test set. A stratified split ensures each subset has approximately the same class distribution as the full dataset.
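A stratified split can be sketched by grouping indices by class and sampling proportionally from each group. This is a minimal version; scikit-learn's train_test_split(..., stratify=y) is the usual tool.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices so each class contributes test_frac of its examples to test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y, idx in by_class.items():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# A 5% rare class still gets its proportional share of test examples.
labels = ["common"] * 950 + ["rare"] * 50
train_idx, test_idx = stratified_split(labels)
```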

Temporal Splits for Time Series

Definition

Temporal Split

For data with a time dimension, split by time: train on data before time t1, validate on data from t1 to t2, test on data after t2. This prevents the model from seeing future information during training.

Random splits on time series data cause leakage because adjacent time points are correlated. A model trained on Monday and Wednesday data, tested on Tuesday data, has seen the future.
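The temporal split from the definition can be sketched as a pure function of timestamps: no shuffling, only the two time cutoffs t1 and t2.

```python
def temporal_split(timestamps, t1, t2):
    """Assign each index to train/val/test purely by time; never shuffle."""
    train = [i for i, t in enumerate(timestamps) if t < t1]
    val = [i for i, t in enumerate(timestamps) if t1 <= t < t2]
    test = [i for i, t in enumerate(timestamps) if t >= t2]
    return train, val, test

# Days 0-99: train on days before 70, validate on days 70-84, test on 85+.
train, val, test = temporal_split(list(range(100)), t1=70, t2=85)
```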

Data Leakage

Definition

Data Leakage

Data leakage occurs when information from outside the training set is used to create the model. This includes information from the validation set, test set, or real-world future data. Leakage produces overly optimistic performance estimates that do not generalize.

Proposition

Leakage Inflates Apparent Performance

Statement

Let R̂_leak be the empirical risk measured on a contaminated test set and R_true be the population risk on clean data. If the leakage provides ε bits of information about the label, then:

R̂_leak ≤ R_true − Δ(ε)

where Δ(ε) > 0 is a gap that grows with the information leaked. The apparent performance R̂_leak is strictly better than the true performance R_true.

Intuition

Leakage gives the model a cheat sheet. Performance on the cheat-sheet evaluation is better than performance on a fair evaluation. The more information leaked, the larger the gap between reported and actual performance.

Proof Sketch

The leaking feature provides mutual information I(X_leak; Y) = ε > 0 about the target. Any model that exploits this feature achieves lower empirical risk on the contaminated evaluation than on clean data. By Fano's inequality, the excess information reduces the Bayes error on the contaminated set relative to the clean set.

Why It Matters

Leakage can make a bad model look good. A Kaggle competition winner with leakage has zero value in production. Detecting leakage before deployment saves months of wasted engineering effort.

Failure Mode

If the leaking feature is uncorrelated with the target, or the model lacks capacity to exploit it, leakage may have no effect on measured performance. This does not mean the split is clean; it means the leakage happened to be harmless this time.

Common Leakage Sources

1. Preprocessing Before Splitting

Fitting a scaler (mean/standard deviation) or PCA on the full dataset before splitting leaks test set statistics into the training set. The correct procedure: split first, then fit preprocessing on the training set only, then apply the fitted transform to validation and test sets.
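The split-then-fit order looks like this for a simple standardizer. This is a minimal sketch; with scikit-learn, fitting a Pipeline on the train split enforces the same discipline automatically.

```python
from statistics import mean, stdev

def fit_scaler(train_col):
    """Learn standardization parameters from the training column ONLY."""
    return mean(train_col), stdev(train_col)

def apply_scaler(col, mu, sigma):
    """Apply the training statistics to any split; never refit on test data."""
    return [(x - mu) / sigma for x in col]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0]                     # deliberately different scale
mu, sigma = fit_scaler(train)           # statistics come from train only
train_z = apply_scaler(train, mu, sigma)
test_z = apply_scaler(test, mu, sigma)  # test values land far from 0: that is correct
```

If the test values look "badly scaled" after this transform, that is the honest picture; refitting the scaler on test data to make them look nicer is exactly the leakage this section warns about.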

2. Target Encoding with Test Data

Target encoding (replacing a categorical feature with the mean of the target for that category) computed on the full dataset leaks target information from the test set into the features.
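A leak-free version fits the encoding on training rows only and falls back to the global training mean for categories unseen in training. This is a minimal sketch; the category names and the fallback rule are illustrative.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Compute per-category target means from TRAINING rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {c: sums[c] / counts[c] for c in sums}
    return encoding, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    """Map categories using training statistics; unseen ones get the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

train_cats, train_y = ["a", "a", "b"], [1, 0, 1]
enc, gm = fit_target_encoding(train_cats, train_y)
test_encoded = apply_target_encoding(["a", "c"], enc, gm)  # "c" is unseen
```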

3. Temporal Leakage

Using features computed from future data. Example: predicting hospital readmission using the patient's next diagnosis code. The next diagnosis is recorded after the readmission event.

4. Duplicate or Near-Duplicate Rows

If the same data point appears in both training and test sets, the model memorizes it and test performance is inflated. This is common with image datasets (augmented versions of the same image) and web-scraped text.
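Exact duplicates can be caught by fingerprinting rows and intersecting the train and test fingerprint sets. This sketch handles exact matches only; near-duplicates such as augmented images require perceptual or fuzzy hashing instead.

```python
import hashlib

def row_fingerprints(rows):
    """Hash each row's canonical string form to detect exact duplicates."""
    return {hashlib.sha256(repr(tuple(r)).encode()).hexdigest() for r in rows}

train_rows = [(1.0, "x"), (2.0, "y")]
test_rows = [(2.0, "y"), (3.0, "z")]  # (2.0, "y") appears in both splits
overlap = row_fingerprints(train_rows) & row_fingerprints(test_rows)
# A non-empty overlap signals train/test contamination by exact duplicates.
```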

5. Group Leakage

Multiple data points from the same entity (patient, user, device) in both training and test sets. The model learns entity-specific patterns instead of generalizable patterns. Use group-aware splits to keep all data from one entity in the same subset.
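A group-aware split assigns whole entities to one side before collecting their rows, so no entity ever spans the boundary. This is a minimal sketch; scikit-learn's GroupShuffleSplit and GroupKFold implement the same idea.

```python
import random

def group_split(groups, test_frac=0.2, seed=0):
    """Split at the entity level so no group appears in both train and test."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test_groups = max(1, round(len(unique) * test_frac))
    test_groups = set(unique[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Three rows per patient; every patient lands entirely in one split.
groups = [f"patient_{i // 3}" for i in range(30)]
train_idx, test_idx = group_split(groups)
```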

Concrete Leakage Examples from Practice

Example

Medical prediction with future diagnosis codes

A hospital wants to predict 30-day readmission at the time of discharge. The feature set includes all diagnosis codes associated with the patient. However, some codes are added to the patient record after readmission (e.g., the admitting diagnosis of the second visit). Including these codes as features gives the model access to information that would not be available at the actual prediction time (discharge). The model achieves 92% AUC in offline evaluation but only 64% AUC in production. The 28-point gap is entirely due to temporal leakage.

Fix: filter all features to include only information available at the prediction time. This requires a careful audit of every feature's timestamp relative to the prediction event.

Example

Kaggle competition leakage via row IDs

In multiple Kaggle competitions, participants discovered that the row ID or file name contained information about the target. For instance, if images were stored as class_label_0001.jpg, the file name leaks the label. Even if the labels are not directly in the file name, sorted row IDs can correlate with time of collection, which correlates with the target through temporal patterns.

A competition to predict earthquake damage had row IDs assigned chronologically. A model that simply learned "later IDs have higher damage" achieved top-10 performance. This is leakage: the ID is a proxy for temporal information that would not be available in a real prediction setting.

Example

Target leakage in churn prediction

Predicting customer churn using features that include "number of calls to customer support in the last 30 days" seems reasonable. But if "churn" is defined as cancellation within 30 days, and the support calls happened after the customer decided to leave (calling to cancel), the feature is a consequence of the target, not a predictor of it. Including it gives the model a near-perfect signal that is unavailable at the time you would make the prediction.

The diagnostic: if a feature's importance is unreasonably high (e.g., a single feature gives 95% of the model's predictive power), suspect leakage. Investigate the causal relationship between the feature and the target.

How to Detect Leakage

  1. Suspiciously high performance: if your model achieves near-perfect accuracy on a non-trivial task, suspect leakage first.
  2. Feature importance analysis: if a feature has unreasonably high importance, check whether it could contain target information.
  3. Adversarial validation: train a classifier to distinguish training from test examples. If it succeeds, the distributions differ in ways that may indicate leakage.
  4. Temporal consistency check: verify that no feature uses information from after the prediction time.
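Step 3 can be approximated cheaply one feature at a time: label train rows 0 and test rows 1, then compute each feature's AUC as a train-vs-test discriminator. This is a toy sketch; full adversarial validation trains a real classifier on all features jointly.

```python
def rank_auc(scores, labels):
    """AUC via the rank statistic: probability a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One feature's values from the train and test splits, labeled by split membership.
train_feature = [0.1, 0.2, 0.3, 0.4]
test_feature = [0.9, 1.0, 1.1, 1.2]  # clearly shifted distribution
scores = train_feature + test_feature
labels = [0] * len(train_feature) + [1] * len(test_feature)
auc = rank_auc(scores, labels)
# AUC near 0.5 means the splits are indistinguishable on this feature;
# AUC near 1.0 means this feature alone separates them: investigate for shift or leakage.
```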

Common Confusions

Watch Out

Cross-validation does not eliminate the need for a test set

Cross-validation reuses data for training and validation efficiently, but the CV estimate is used for model selection. You still need a held-out test set that was never involved in any model selection decision. Otherwise, the CV estimate becomes optimistic.

Watch Out

Larger test sets are not always better

Making the test set larger shrinks the training set, potentially degrading model quality. The test set needs to be large enough for reliable performance estimates (a few thousand examples for classification with moderate class counts) but not so large that it starves training.

Exercises

ExerciseCore

Problem

You have 10,000 samples. You compute PCA on all 10,000 samples to reduce dimensionality, then split into 8,000 train and 2,000 test, then train a classifier. Is there leakage? If so, how do you fix it?

ExerciseAdvanced

Problem

You are predicting next-day stock returns. You randomly shuffle the dataset and split 80/20 into train and test. Your model achieves 65% directional accuracy on the test set. In production, it achieves 51% (barely above chance). Diagnose the problem.

References

Canonical:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 7.10
  • Kaufman et al., "Leakage in Data Mining" (2012), ACM TKDD

Current:

  • Kapoor & Narayanan, "Leakage and the Reproducibility Crisis in ML-based Science" (2023), Patterns
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 5-7

Next Topics

Last reviewed: April 2026
