Methodology
Model Evaluation Best Practices
Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.
Why This Matters
A model's reported performance is only meaningful if the evaluation methodology is sound. A model that appears to achieve 99% accuracy may be benefiting from data leakage. A model that beats a baseline by 0.5% may be within noise. Most ML papers compare single numbers on single splits, which tells you almost nothing about true performance differences.
Correct evaluation is not optional. It determines whether your model actually works.
Train / Validation / Test Split
Three-Way Split
Partition the dataset into three disjoint sets:
- Training set : used to fit model parameters
- Validation set : used for hyperparameter selection and early stopping
- Test set : used once for final performance estimation
The test set must be touched exactly once. Repeated evaluation on the test set (and selecting the best result) converts it into a validation set, invalidating the performance estimate.
Typical splits: 60/20/20 or 80/10/10. The exact ratio depends on dataset size. With 10M examples, even 1% is 100k examples, which is sufficient for tight confidence intervals.
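The three-way split described above can be sketched with scikit-learn by splitting twice (a minimal illustration on synthetic data; the 60/20/20 ratio and seed values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # toy features
y = rng.integers(0, 2, size=1000)       # toy binary labels

# First carve off the test set (20%), then split the remainder 75/25,
# giving an overall 60/20/20 partition.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The test partition is created first and set aside, which makes it harder to accidentally reuse it during model development.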
Cross-Validation
When data is limited, a single train/val split wastes data. Cross-validation reuses data for both training and validation.
K-Fold Cross-Validation
Partition the dataset $D$ (of size $n$) into $K$ equal folds $F_1, \dots, F_K$. For each $k = 1, \dots, K$: train on $D \setminus F_k$ and evaluate on $F_k$. The cross-validation estimate is:
$$\hat{R}_{\mathrm{CV}} = \frac{1}{K} \sum_{k=1}^{K} \hat{R}_k$$
where $\hat{R}_k$ is the loss on fold $F_k$.
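The estimator above can be computed directly (a sketch using a synthetic dataset and 0/1 loss; the model and fold count are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)

K = 5
fold_losses = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Train on D \ F_k, evaluate 0/1 loss on the held-out fold F_k.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_losses.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))

r_cv = np.mean(fold_losses)  # the cross-validation estimate
print(f"R_CV = {r_cv:.3f}")
```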
Bias-Variance of Cross-Validation
Statement
The $K$-fold cross-validation estimator $\hat{R}_{\mathrm{CV}}$ has the following properties:
- Bias: $\hat{R}_{\mathrm{CV}}$ is approximately unbiased for the risk of a model trained on $n(K-1)/K$ examples, not $n$ examples. When $K$ is large (leave-one-out, $K = n$), the bias is small.
- Variance: for large $K$, the folds overlap heavily (each pair of training sets shares $n(K-2)/K$ examples), causing the fold estimates $\hat{R}_k$ to be correlated. This increases the variance of $\hat{R}_{\mathrm{CV}}$.
The bias decreases with $K$ while the variance increases with $K$. The common choice $K = 5$ or $K = 10$ balances these two effects.
Intuition
Small $K$ (e.g., $K = 2$): each fold trains on only half the data, so the performance estimate is pessimistically biased (the model has less data than the final model will). Large $K$ (e.g., $K = n$): almost no bias, but fold estimates are nearly identical because any two training sets differ in only a single example, so the variance of the average is high.
Proof Sketch
The bias follows from the observation that training on $n(K-1)/K$ examples gives worse expected performance than training on $n$ examples (learning curves are monotonically decreasing in expectation). For variance, decompose $\mathrm{Var}(\hat{R}_{\mathrm{CV}}) = \frac{1}{K^2}\big[\sum_k \mathrm{Var}(\hat{R}_k) + \sum_{k \neq l} \mathrm{Cov}(\hat{R}_k, \hat{R}_l)\big]$. The covariance terms are positive and increase with $K$ because fold training sets overlap more. Detailed analysis by Bengio and Grandvalet (2004).
Why It Matters
Understanding this tradeoff prevents two common mistakes: using $K = 2$ (too much bias) or using leave-one-out $K = n$ (too much variance, and expensive). It also explains why you should not treat the cross-validation standard deviation as a confidence interval without correcting for fold correlation.
Failure Mode
Cross-validation assumes that the data is exchangeable (any example could appear in any fold). This fails for time series data (future data leaks into training folds) and for grouped data (examples from the same patient/user/session should not be split across folds). See the sections on temporal splits and grouped splits below.
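For the grouped-data case, scikit-learn's `GroupKFold` keeps all examples from one entity on a single side of each split (a sketch on synthetic data; the patient structure is illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
patients = np.repeat(np.arange(20), 5)  # 20 patients, 5 records each

# GroupKFold never splits one patient's records across train and validation,
# which prevents group leakage.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=patients):
    overlap = set(patients[train_idx]) & set(patients[val_idx])
    assert not overlap  # no patient appears on both sides

print("no group leakage across any fold")
```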
Stratified Splits for Imbalanced Data
When the positive class is rare (e.g., 2% fraud), a random split may produce folds with no positive examples. Stratified splitting ensures each fold has approximately the same class distribution as the full dataset.
For multi-label or regression tasks, stratification is harder. For regression, bin the target into quantiles and stratify by bin. For multi-label, use iterative stratification (Sechidis et al., 2011).
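For the single-label case, `StratifiedKFold` preserves the class ratio in every fold (a sketch with a synthetic ~2% positive rate, mirroring the fraud example above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.02).astype(int)   # ~2% positive class
X = rng.normal(size=(5000, 4))

# Each validation fold gets roughly the same positive rate as the full data,
# so no fold ends up with zero positives.
rates = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    rates.append(y[val_idx].mean())

print([f"{r:.3f}" for r in rates])
```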
Temporal Splits for Time Series
For data with a time component, random splitting introduces leakage: the model sees future data during training and predicts past data at test time.
Correct approach: train on data before time $t_1$, validate on data in $[t_1, t_2)$, test on data after $t_2$. This is sometimes called "expanding window" or "walk-forward" validation.
Incorrect approach: randomly shuffling time-stamped data and doing $K$-fold CV. This inflates performance because the model exploits temporal autocorrelation.
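Scikit-learn's `TimeSeriesSplit` implements the expanding-window scheme: every training window ends strictly before the validation window begins (a minimal sketch; the data are just row indices in chronological order):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # rows assumed sorted by time

for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()  # no future data in training
    print(f"train up to t={train_idx.max()}, "
          f"validate on t={val_idx.min()}..{val_idx.max()}")
```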
Data Leakage
Data leakage occurs when information from the test set (or from the prediction target) is available to the model during training. Leakage inflates performance estimates. Models with leakage appear to perform well in evaluation but fail in deployment.
Common sources of leakage:
Feature leakage. A feature that is a direct function of the target. Example: including "loan default date" as a feature when predicting loan default. This feature is only available after the event you are predicting.
Preprocessing leakage. Fitting a scaler, PCA, or imputer on the full dataset (including test data) before splitting. The correct approach: fit preprocessing only on training data, then transform validation and test data using the training-fitted parameters.
Temporal leakage. Using future data to predict past events. Even a single future feature (e.g., "next month's stock price") makes the model useless in production.
Group leakage. Examples from the same entity (patient, user, document) appear in both training and test sets. The model memorizes entity-specific patterns rather than learning generalizable features.
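The preprocessing-leakage fix is easiest to get right with a pipeline: when the scaler lives inside the pipeline, cross-validation refits it on each training fold only (a sketch on synthetic data; the model choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# The scaler is refit inside every CV training fold, so validation folds
# are always transformed with training-only statistics -- no leakage.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting is the leaky version of the same computation.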
Reporting: Mean, Standard Deviation, and Significance
Report: metric = mean +/- std over seeds on the same test set.
Running the same model with different random seeds (different weight initialization, different data shuffling) produces different results. A single run is a single sample from this distribution. Report the mean and standard deviation over at least 3 runs (5 or more is better).
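A minimal sketch of multi-seed reporting (synthetic data; the model, split, and number of seeds are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

accs = []
for seed in range(5):  # same data split, different model seeds
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    accs.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

# Report mean +/- sample standard deviation over seeds.
print(f"accuracy = {np.mean(accs):.3f} +/- {np.std(accs, ddof=1):.3f} (5 seeds)")
```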
Statistical Tests for Model Comparison
Paired Permutation Test for Model Comparison
Statement
Let $d_i = \ell_i^{A} - \ell_i^{B}$ be the per-example loss difference between models A and B on test example $i$. Under the null hypothesis that A and B have equal expected loss, the test statistic $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$ has expectation zero. A paired permutation test randomly flips the sign of each $d_i$ to generate the null distribution. The p-value is the fraction of permutations where $|\bar{d}_{\mathrm{perm}}| \ge |\bar{d}_{\mathrm{obs}}|$.
Alternatively, the paired t-test gives $t = \frac{\bar{d}}{s_d / \sqrt{n}}$, where $s_d$ is the sample standard deviation of the differences, with $n - 1$ degrees of freedom.
Intuition
By evaluating both models on the same test examples, you cancel out example-level difficulty. The question becomes: does model A consistently do better than model B on the same examples? This is more powerful than comparing aggregate scores because it removes the variance due to different test examples.
Proof Sketch
Under the null hypothesis (equal performance), the signs of the $d_i$ are equally likely to be positive or negative. Randomly flipping signs generates samples from the null distribution. The observed $\bar{d}$ is compared to this null. For the t-test version, apply the standard paired t-test derivation under the assumption that the $d_i$ are approximately normal (justified for large $n$ by the CLT).
Why It Matters
This prevents the common mistake of declaring model A "better" because it scored 85.3% vs 85.1%. Without a statistical test, you cannot distinguish signal from noise. The paired test is particularly important when improvements are small (0.1-0.5%), which is common in mature ML tasks.
Failure Mode
The independence assumption fails when test examples are correlated (e.g., multiple examples from the same user). The paired t-test assumes approximate normality of the $d_i$, which may not hold for binary loss (0/1). For binary outcomes, McNemar's test is more appropriate.
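The sign-flipping permutation test described above fits in a few lines of NumPy (a sketch; the simulated 0/1 losses and permutation count are illustrative):

```python
import numpy as np

def paired_permutation_test(loss_a, loss_b, n_perm=10000, seed=0):
    """Two-sided p-value for H0: models A and B have equal expected loss."""
    rng = np.random.default_rng(seed)
    d = np.asarray(loss_a) - np.asarray(loss_b)   # per-example differences d_i
    observed = abs(d.mean())
    # Under H0 the sign of each d_i is arbitrary: flip signs at random
    # and recompute the mean difference to build the null distribution.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    # Add-one smoothing keeps the p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
# Simulated 0/1 losses on 750 shared test examples; A is genuinely better here.
loss_a = (rng.random(750) < 0.08).astype(float)
loss_b = (rng.random(750) < 0.18).astype(float)
p = paired_permutation_test(loss_a, loss_b)
print(f"p = {p:.4f}")
```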
Evaluation Checklist
Before reporting any model performance number, verify each of the following:
Data integrity checks:
- Confirmed no duplicate rows spanning train and test sets
- Confirmed all preprocessing (scaling, PCA, imputation) was fit on training data only
- For time series: confirmed all training data precedes all test data
- For grouped data: confirmed all examples from the same entity are in the same split
- Checked for features that are proxies for the target label
Statistical rigor checks:
- Reported results over at least 3 random seeds (5 or more preferred)
- Included confidence intervals or standard deviations with point estimates
- Used paired statistical tests (permutation test or paired t-test) when claiming one model beats another
- Verified the test set was used exactly once for final evaluation
Metric selection checks:
- Reported more than one metric (accuracy alone is insufficient for imbalanced data)
- Included calibration metrics if the model outputs probabilities
- Checked for Simpson's paradox: aggregate improvement may mask subgroup degradation
- Verified that the reported metric matches the deployment objective
Practical deployment checks:
- Measured inference latency and throughput, not just quality metrics
- Tested on data from the expected deployment distribution, not just the benchmark distribution
- Checked model behavior on edge cases and out-of-distribution inputs
- Verified that reported improvements exceed the noise floor (standard deviation across seeds)
Model comparison done right
Task: binary classification on a medical dataset with 5000 examples (8% positive rate).
- Stratified 70/15/15 split, ensuring both classes appear in all splits
- Fit StandardScaler on training set only; apply to val and test
- Train models A (logistic regression) and B (random forest) using 5-fold CV on training set for hyperparameter selection
- Evaluate both on the same held-out test set (750 examples)
- Report: Model A accuracy = 91.2% +/- 0.4% (5 seeds), AUC = 0.843, F1 = 0.52. Model B accuracy = 91.5% +/- 0.6% (5 seeds), AUC = 0.861, F1 = 0.57.
- Paired permutation test on the 750 test examples: the accuracy difference is not significant ($p > 0.05$), while the AUC difference is ($p < 0.05$).
- Conclusion: AUC difference is statistically significant, but accuracy difference is not. Model B is better at ranking, but the classification threshold should be tuned separately.
Why Single-Number Comparisons Are Misleading
Reporting "Model A: 92.3%, Model B: 91.8%" invites the reader to conclude A is better. But:
- Variance across seeds: A might be 92.3 +/- 0.8 and B might be 91.8 +/- 0.5. The difference is within noise.
- Test set size: with 100 test examples, the standard error of accuracy is about 3%. The difference is meaningless.
- Subgroup performance: A might beat B overall while B beats A on every subgroup (Simpson's paradox).
- Cherry-picked metrics: accuracy, F1, AUC, and calibration can give different rankings.
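The test-set-size point can be checked directly: the standard error of an accuracy estimate $p$ on $n$ examples is $\sqrt{p(1-p)/n}$ (a small helper; the example values are illustrative):

```python
import math

def accuracy_se(p, n):
    # Binomial standard error of an accuracy estimate p on n test examples.
    return math.sqrt(p * (1 - p) / n)

print(f"{accuracy_se(0.92, 100):.3f}")    # n=100: ~0.027, i.e. roughly 3%
print(f"{accuracy_se(0.92, 10000):.4f}")  # n=10000: ~0.0027
```

With 100 test examples, a 0.5% gap between models is an order of magnitude smaller than the noise in either number.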
Common Confusions
Validation performance is not test performance
Hyperparameters selected to maximize validation performance will overfit to the validation set. The gap between validation and test performance grows with the number of hyperparameter configurations tried. This is why the test set must be used only once.
Cross-validation does not eliminate the need for a test set
Cross-validation estimates performance for model selection (choosing among architectures or hyperparameters). After selection, you still need a held-out test set to estimate the final model's true performance. Using the CV estimate as the final performance number is optimistic because the selected model won a competition among candidates.
More folds is not always better
Leave-one-out CV ($K = n$) minimizes bias but maximizes variance and computational cost. For most practical purposes, $K = 5$ or $K = 10$ gives a good bias-variance tradeoff and is $n/K$ times cheaper to compute.
Canonical Examples
Detecting preprocessing leakage
Task: predict house prices. The feature pipeline includes standardization ($z = (x - \mu)/\sigma$). If $\mu$ and $\sigma$ are computed on the full dataset (including test), test features are transformed using test-set statistics. This leaks test-set information into the features. The correct approach: compute $\mu$ and $\sigma$ on training data only, then apply the same $\mu$ and $\sigma$ to transform test data. The performance difference can be small (0.1-1%) but compounds with more preprocessing steps.
Summary
- Three-way split: train (fit parameters), validate (select hyperparameters), test (estimate final performance, used once)
- $K$-fold CV: $K = 5$ or $K = 10$ balances bias and variance
- Stratify folds for imbalanced data; use temporal splits for time series
- Data leakage inflates metrics and causes deployment failures
- Report mean +/- std over multiple seeds, not a single number
- Use paired statistical tests (permutation test or paired t-test) to compare models
- A 0.3% improvement means nothing without a significance test
Exercises
Problem
You fit a StandardScaler on your full dataset, then split into train/test, then train a model. Your test accuracy is 94%. After fixing the leakage (fitting the scaler on train only), test accuracy drops to 91%. Explain what happened.
Problem
Model A achieves 85.3% accuracy and Model B achieves 85.0% accuracy on a shared test set of $n$ examples. The paired differences have sample standard deviation $s_d$. Write the paired t-statistic in terms of $n$ and $s_d$, and determine for which values of $n$ and $s_d$ the difference is statistically significant at $\alpha = 0.05$.
References
Canonical:
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 7
- Bengio & Grandvalet, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (2004)
Current:
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (2021)
- Raschka, "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning" (2020)
- Kaufman et al., "Leakage in Data Mining" (2012)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Cross-validation theory: formal analysis of CV estimator properties
- Hypothesis testing for ML: statistical testing framework for model comparison
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.