Model Evaluation Best Practices

Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.

Why This Matters

A model's reported performance is only meaningful if the evaluation methodology is sound. A model that appears to achieve 99% accuracy may be benefiting from data leakage. A model that beats a baseline by 0.5% may be within noise. Most ML papers compare single numbers on single splits, which tells you almost nothing about true performance differences.

Correct evaluation is not optional. It determines whether your model actually works.

Train / Validation / Test Split

Definition

Three-Way Split

Partition the dataset $\mathcal{D}$ into three disjoint sets:

  • Training set $\mathcal{D}_{\text{train}}$: used to fit model parameters
  • Validation set $\mathcal{D}_{\text{val}}$: used for hyperparameter selection and early stopping
  • Test set $\mathcal{D}_{\text{test}}$: used once for final performance estimation

The test set must be touched exactly once. Repeated evaluation on the test set (and selecting the best result) converts it into a validation set, invalidating the performance estimate.

Typical splits: 60/20/20 or 80/10/10. The exact ratio depends on dataset size. With 10M examples, even 1% is 100k examples, which is sufficient for tight confidence intervals.
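A minimal sketch of an 80/10/10 split with scikit-learn, assuming a feature matrix X and label vector y; train_test_split only produces two partitions per call, so it is applied twice:

```python
from sklearn.model_selection import train_test_split

# First carve off the held-out test set (10% of the data).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# Then split the remainder into train (80% overall) and validation (10% overall).
# 0.10 / 0.90 of the remaining data goes to validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.10 / 0.90, random_state=0
)
```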

Cross-Validation

When data is limited, a single train/val split wastes data. Cross-validation reuses data for both training and validation.

Definition

K-Fold Cross-Validation

Partition $\mathcal{D}$ into $K$ equal folds $F_1, \ldots, F_K$. For each $k = 1, \ldots, K$: train on $\mathcal{D} \setminus F_k$ and evaluate on $F_k$. The cross-validation estimate is:

$$\hat{R}_{\text{CV}} = \frac{1}{K} \sum_{k=1}^K \hat{R}(F_k)$$

where $\hat{R}(F_k)$ is the loss on fold $k$.
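A minimal sketch of 5-fold cross-validation with scikit-learn, again assuming X and y and using logistic regression as a placeholder model; each example is used for validation exactly once:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# The mean is the CV estimate; the per-fold std is NOT a valid confidence
# interval because fold estimates are correlated (see the proposition below).
print(f"CV estimate: {scores.mean():.3f} (per-fold std {scores.std():.3f})")
```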

Proposition

Bias-Variance of Cross-Validation

Statement

The $K$-fold cross-validation estimator $\hat{R}_{\text{CV}}$ has the following properties:

  • Bias: $\hat{R}_{\text{CV}}$ is approximately unbiased for the risk of a model trained on $n(K-1)/K$ examples, not $n$ examples. When $K$ is large (leave-one-out, $K = n$), the bias is small.
  • Variance: for large $K$, the folds overlap heavily (each pair of training sets shares $n - 2n/K$ examples), causing the fold estimates to be correlated. This increases the variance of $\hat{R}_{\text{CV}}$.

The bias decreases with $K$ while the variance increases with $K$. The common choice $K = 5$ or $K = 10$ balances these two effects.

Intuition

Small $K$ (e.g., $K = 2$): each fold trains on only half the data, so the performance estimate is pessimistically biased (the model has less data than the final model will). Large $K$ (e.g., $K = n$): almost no bias, but the fold estimates are highly correlated because training sets differ by only one example, so averaging them does little to reduce variance and the variance of $\hat{R}_{\text{CV}}$ stays high.

Proof Sketch

The bias follows from the observation that training on $n(K-1)/K < n$ examples gives worse performance than training on $n$ examples (learning curves are monotonically decreasing in expectation). For variance, decompose $\text{Var}(\hat{R}_{\text{CV}}) = \frac{1}{K^2}\left[\sum_k \text{Var}(\hat{R}_k) + \sum_{j \neq k} \text{Cov}(\hat{R}_j, \hat{R}_k)\right]$. The covariance terms are positive and increase with $K$ because fold training sets overlap more. Detailed analysis by Bengio and Grandvalet (2004).

Why It Matters

Understanding this tradeoff prevents two common mistakes: using $K = 2$ (too much bias) or using leave-one-out (too much variance, and expensive). It also explains why you should not treat the cross-validation standard deviation as a confidence interval without correction for fold correlation.

Failure Mode

Cross-validation assumes that the data is exchangeable (any example could appear in any fold). This fails for time series data (future data leaks into training folds) and for grouped data (examples from the same patient/user/session should not be split across folds). See the sections on temporal splits and group leakage below.

Stratified Splits for Imbalanced Data

When the positive class is rare (e.g., 2% fraud), a random split may produce folds with no positive examples. Stratified splitting ensures each fold has approximately the same class distribution as the full dataset.

For multi-label or regression tasks, stratification is harder. For regression, bin the target into quantiles and stratify by bin. For multi-label, use iterative stratification (Sechidis et al., 2011).
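A minimal sketch of stratified splitting with scikit-learn, assuming X and a NumPy array y of binary labels; class proportions are preserved in every split:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified hold-out split: both parts keep the full dataset's positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified K-fold: every validation fold has roughly the same class balance.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    print(f"fold positive rate: {y[val_idx].mean():.3f}")
```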

Temporal Splits for Time Series

For data with a time component, random splitting introduces leakage: the model sees future data during training and predicts past data at test time.

Correct approach: train on data before time $t$, validate on data in $[t, t + \Delta)$, test on data after $t + \Delta$. This is sometimes called "expanding window" or "walk-forward" validation.

Incorrect approach: randomly shuffling time-stamped data and doing $K$-fold CV. This inflates performance because the model exploits temporal autocorrelation.
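A minimal sketch of walk-forward validation using scikit-learn's TimeSeriesSplit, assuming the rows of X are already sorted chronologically:

```python
from sklearn.model_selection import TimeSeriesSplit

# Expanding window: each split trains on an earlier prefix of the data and
# validates on the block that immediately follows it.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    assert train_idx.max() < val_idx.min()  # no future rows in training
    print(f"train up to row {train_idx.max()}, "
          f"validate rows {val_idx.min()}-{val_idx.max()}")
```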

Data Leakage

Definition

Data Leakage

Data leakage occurs when information from the test set (or from the prediction target) is available to the model during training. Leakage inflates performance estimates. Models with leakage appear to perform well in evaluation but fail in deployment.

Common sources of leakage:

Feature leakage. A feature that is a direct function of the target. Example: including "loan default date" as a feature when predicting loan default. This feature is only available after the event you are predicting.

Preprocessing leakage. Fitting a scaler, PCA, or imputer on the full dataset (including test data) before splitting. The correct approach: fit preprocessing only on training data, then transform validation and test data using the training-fitted parameters.
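A minimal sketch of leakage-safe preprocessing with a scikit-learn Pipeline, assuming X_train, y_train, X_test, y_test from an earlier split; the scaler is refit on each training fold during cross-validation, so its statistics never see the corresponding validation fold:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validation on training data only; preprocessing is fit inside each fold.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final fit on the training set; the fitted scaler parameters carry over to test.
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
```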

Temporal leakage. Using future data to predict past events. Even a single future feature (e.g., "next month's stock price") makes the model useless in production.

Group leakage. Examples from the same entity (patient, user, document) appear in both training and test sets. The model memorizes entity-specific patterns rather than learning generalizable features.
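A minimal sketch of a group-aware split with scikit-learn's GroupKFold, assuming a NumPy array groups holding one entity ID (e.g., a patient ID) per row; all rows sharing an ID land on the same side of every split:

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # No entity appears in both the training and validation folds.
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))
```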

Reporting: Mean, Standard Deviation, and Significance

Report: metric = mean +/- std over $S$ seeds on the same test set.

Running the same model with different random seeds (different weight initialization, different data shuffling) produces different results. A single run is a single sample from this distribution. Report the mean and standard deviation over at least 3 runs (5 or more is better).
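A minimal sketch of multi-seed reporting; train_and_evaluate is a hypothetical placeholder for your own training loop, which should retrain the model with the given seed and return the test metric:

```python
import numpy as np

seeds = [0, 1, 2, 3, 4]
# Placeholder: retrains with seed-dependent initialization and data shuffling,
# then evaluates on the fixed test set.
scores = np.array([train_and_evaluate(seed) for seed in seeds])

print(f"accuracy = {scores.mean():.3f} +/- {scores.std(ddof=1):.3f} "
      f"over {len(seeds)} seeds")
```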

Statistical Tests for Model Comparison

Proposition

Paired Permutation Test for Model Comparison

Statement

Let $d_i = \ell(f_A(x_i), y_i) - \ell(f_B(x_i), y_i)$ be the per-example loss difference between models A and B. Under the null hypothesis that A and B have equal expected loss, the test statistic $\bar{d} = \frac{1}{n}\sum_i d_i$ has expectation zero. A paired permutation test randomly flips the sign of each $d_i$ to generate the null distribution. The p-value is the fraction of permutations where $|\bar{d}_\pi| \geq |\bar{d}|$.

Alternatively, the paired t-test gives $t = \bar{d} / (s_d / \sqrt{n})$ where $s_d$ is the sample standard deviation of the differences, with $n-1$ degrees of freedom. A code sketch of both tests follows this proposition block.

Intuition

By evaluating both models on the same test examples, you cancel out example-level difficulty. The question becomes: does model A consistently do better than model B on the same examples? This is more powerful than comparing aggregate scores because it removes the variance due to different test examples.

Proof Sketch

Under the null hypothesis (equal performance), the signs of $d_i$ are equally likely to be positive or negative. Randomly flipping signs generates samples from the null distribution. The observed $\bar{d}$ is compared to this null. For the t-test version, apply the standard paired t-test derivation under the assumption that $d_i$ are approximately normal (justified for large $n$ by the CLT).

Why It Matters

This prevents the common mistake of declaring model A "better" because it scored 85.3% vs 85.1%. Without a statistical test, you cannot distinguish signal from noise. The paired test is particularly important when improvements are small (0.1-0.5%), which is common in mature ML tasks.

Failure Mode

The independence assumption fails when test examples are correlated (e.g., multiple examples from the same user). The paired t-test assumes normality of $d_i$, which may not hold for binary loss (0/1). For binary outcomes, McNemar's test is more appropriate.
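A minimal sketch of the sign-flip permutation test and the paired t-test described above, assuming loss_a and loss_b are NumPy arrays of per-example losses for the two models on the same test set:

```python
import numpy as np
from scipy import stats

def paired_permutation_test(loss_a, loss_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for H0: models A and B have equal expected loss."""
    rng = np.random.default_rng(seed)
    d = loss_a - loss_b
    observed = abs(d.mean())
    # Under H0 each d_i is symmetric around zero, so flip its sign at random.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, d.size))
    null_means = np.abs((signs * d).mean(axis=1))
    return (null_means >= observed).mean()

p_perm = paired_permutation_test(loss_a, loss_b)
t_stat, p_ttest = stats.ttest_rel(loss_a, loss_b)  # paired t-test
```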

Evaluation Checklist

Before reporting any model performance number, verify each of the following:

Data integrity checks:

  • Confirmed no duplicate rows spanning train and test sets
  • Confirmed all preprocessing (scaling, PCA, imputation) was fit on training data only
  • For time series: confirmed all training data precedes all test data
  • For grouped data: confirmed all examples from the same entity are in the same split
  • Checked for features that are proxies for the target label

Statistical rigor checks:

  • Reported results over at least 3 random seeds (5 or more preferred)
  • Included confidence intervals or standard deviations with point estimates
  • Used paired statistical tests (permutation test or paired t-test) when claiming one model beats another
  • Verified the test set was used exactly once for final evaluation

Metric selection checks:

  • Reported more than one metric (accuracy alone is insufficient for imbalanced data)
  • Included calibration metrics if the model outputs probabilities
  • Checked for Simpson's paradox: aggregate improvement may mask subgroup degradation
  • Verified that the reported metric matches the deployment objective

Practical deployment checks:

  • Measured inference latency and throughput, not just quality metrics
  • Tested on data from the expected deployment distribution, not just the benchmark distribution
  • Checked model behavior on edge cases and out-of-distribution inputs
  • Verified that reported improvements exceed the noise floor (standard deviation across seeds)

Example

Model comparison done right

Task: binary classification on a medical dataset with 5000 examples (8% positive rate).

  1. Stratified 70/15/15 split, ensuring both classes appear in all splits
  2. Fit StandardScaler on training set only; apply to val and test
  3. Train models A (logistic regression) and B (random forest) using 5-fold CV on training set for hyperparameter selection
  4. Evaluate both on the same held-out test set (750 examples)
  5. Report: Model A accuracy = 91.2% +/- 0.4% (5 seeds), AUC = 0.843, F1 = 0.52. Model B accuracy = 91.5% +/- 0.6% (5 seeds), AUC = 0.861, F1 = 0.57.
  6. Paired permutation test on 750 test examples: $p = 0.23$ for accuracy difference, $p = 0.04$ for AUC difference.
  7. Conclusion: AUC difference is statistically significant, but accuracy difference is not. Model B is better at ranking, but the classification threshold should be tuned separately.

Why Single-Number Comparisons Are Misleading

Reporting "Model A: 92.3%, Model B: 91.8%" invites the reader to conclude A is better. But:

  1. Variance across seeds: A might be 92.3 +/- 0.8 and B might be 91.8 +/- 0.5. The difference is within noise.
  2. Test set size: with 100 test examples, the standard error of accuracy is about 3%, so the difference is meaningless (see the quick check after this list).
  3. Subgroup performance: A might beat B overall while B beats A on every subgroup (Simpson's paradox).
  4. Cherry-picked metrics: accuracy, F1, AUC, and calibration can give different rankings.
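For item 2, a quick check of the noise floor using the binomial standard error of accuracy, assuming accuracy near 90%:

```python
import numpy as np

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate p on n test examples."""
    return np.sqrt(p * (1 - p) / n)

print(accuracy_standard_error(0.90, 100))    # ~0.030: a 0.5% gap is within noise
print(accuracy_standard_error(0.90, 10_000)) # ~0.003: a 0.5% gap is detectable
```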

Common Confusions

Watch Out

Validation performance is not test performance

Hyperparameters selected to maximize validation performance will overfit to the validation set. The gap between validation and test performance grows with the number of hyperparameter configurations tried. This is why the test set must be used only once.

Watch Out

Cross-validation does not eliminate the need for a test set

Cross-validation estimates performance for model selection (choosing among architectures or hyperparameters). After selection, you still need a held-out test set to estimate the final model's true performance. Using the CV estimate as the final performance number is optimistic because the selected model won a competition among candidates.

Watch Out

More folds is not always better

Leave-one-out CV ($K = n$) minimizes bias but maximizes variance and computational cost. For most practical purposes, $K = 5$ or $K = 10$ gives a good bias-variance tradeoff and is $n/K$ times cheaper to compute.

Canonical Examples

Example

Detecting preprocessing leakage

Task: predict house prices. Feature pipeline includes standardization ($z = (x - \mu)/\sigma$). If $\mu$ and $\sigma$ are computed on the full dataset (including test), the scaling parameters encode information about the test-set distribution, which leaks into the training features. The correct approach: compute $\mu$ and $\sigma$ on training data only, then apply the same $\mu$ and $\sigma$ to transform test data. The performance difference can be small (0.1-1%) but compounds with more preprocessing steps.

Summary

  • Three-way split: train (fit parameters), validate (select hyperparameters), test (estimate final performance, used once)
  • $K$-fold CV: $K = 5$ or $K = 10$ balances bias and variance
  • Stratify folds for imbalanced data; use temporal splits for time series
  • Data leakage inflates metrics and causes deployment failures
  • Report mean +/- std over multiple seeds, not a single number
  • Use paired statistical tests (permutation test or paired t-test) to compare models
  • A 0.3% improvement means nothing without a significance test

Exercises

ExerciseCore

Problem

You fit a StandardScaler on your full dataset, then split into train/test, then train a model. Your test accuracy is 94%. After fixing the leakage (fitting the scaler on train only), test accuracy drops to 91%. Explain what happened.

ExerciseAdvanced

Problem

Model A achieves 85.3% accuracy and Model B achieves 85.0% accuracy on a shared test set of $n = 500$ examples. The paired differences $d_i$ have sample standard deviation $s_d = 0.35$. Compute the paired t-statistic and determine if the difference is statistically significant at $\alpha = 0.05$.

References

Canonical:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 7
  • Bengio & Grandvalet, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (2004)

Current:

  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (2021)

  • Raschka, "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning" (2020)

  • Kaufman et al., "Leakage in Data Mining" (2012)

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
