Methodology
Model Evaluation Best Practices
Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.
Why This Matters
A model's reported performance is only meaningful if the evaluation methodology is sound. A model that appears to achieve 99% accuracy may be benefiting from data leakage. A model that beats a baseline by 0.5% may be within noise. Most ML papers compare single numbers on single splits, which tells you almost nothing about true performance differences.
Correct evaluation is not optional. It determines whether your model actually works.
Train / Validation / Test Split
Three-Way Split
Partition the dataset into three disjoint sets:
- Training set : used to fit model parameters
- Validation set : used for hyperparameter selection and early stopping
- Test set : used once for final performance estimation
The test set must be touched exactly once. Repeated evaluation on the test set (and selecting the best result) converts it into a validation set, invalidating the performance estimate.
Typical splits: 60/20/20 or 80/10/10. The exact ratio depends on dataset size. With 10M examples, even 1% is 100k examples, which is sufficient for tight confidence intervals.
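The three-way split described above can be sketched with scikit-learn by splitting twice (a minimal illustration on synthetic data; the 60/20/20 ratio and seed values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # toy features
y = rng.integers(0, 2, size=1000)       # toy binary labels

# First carve off the test set (20%), then split the remainder 75/25,
# giving an overall 60/20/20 partition.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The test partition is created first and set aside, which makes it harder to accidentally reuse it during model development.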
Cross-Validation
When data is limited, a single train/val split wastes data. Cross-validation reuses data for both training and validation.
K-Fold Cross-Validation
Partition the dataset $D$ (of size $n$) into $K$ equal folds $F_1, \dots, F_K$. For each $k = 1, \dots, K$: train on $D \setminus F_k$ and evaluate on $F_k$. The cross-validation estimate is:
$$\hat{R}_{\mathrm{CV}} = \frac{1}{K} \sum_{k=1}^{K} \hat{R}_k$$
where $\hat{R}_k$ is the loss on fold $F_k$.
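The estimator above can be computed directly (a sketch using a synthetic dataset and 0/1 loss; the model and fold count are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)

K = 5
fold_losses = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Train on D \ F_k, evaluate 0/1 loss on the held-out fold F_k.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_losses.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))

r_cv = np.mean(fold_losses)  # the cross-validation estimate
print(f"R_CV = {r_cv:.3f}")
```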
Bias-Variance of Cross-Validation
Statement
The $K$-fold cross-validation estimator $\hat{R}_{\mathrm{CV}}$ has the following properties:
- Bias: $\hat{R}_{\mathrm{CV}}$ is approximately unbiased for the risk of a model trained on $n(K-1)/K$ examples, not $n$ examples. When $K$ is large (leave-one-out, $K = n$), the bias is small.
- Variance: for large $K$, the folds overlap heavily (each pair of training sets shares $n(K-2)/K$ examples), causing the fold estimates $\hat{R}_k$ to be correlated. This increases the variance of $\hat{R}_{\mathrm{CV}}$.
The bias decreases with $K$ while the variance increases with $K$. The common choice $K = 5$ or $K = 10$ balances these two effects.
Intuition
Small $K$ (e.g., $K = 2$): each fold trains on only half the data, so the performance estimate is pessimistically biased (the model has less data than the final model will). Large $K$ (e.g., $K = n$): almost no bias, but fold estimates are nearly identical because any two training sets differ in only a single example, so the variance of the average is high.
Proof Sketch
The bias follows from the observation that training on $n(K-1)/K$ examples gives worse expected performance than training on $n$ examples (learning curves are monotonically decreasing in expectation). For variance, decompose $\mathrm{Var}(\hat{R}_{\mathrm{CV}}) = \frac{1}{K^2}\big[\sum_k \mathrm{Var}(\hat{R}_k) + \sum_{k \neq l} \mathrm{Cov}(\hat{R}_k, \hat{R}_l)\big]$. The covariance terms are positive and increase with $K$ because fold training sets overlap more. Detailed analysis by Bengio and Grandvalet (2004).
Why It Matters
Understanding this tradeoff prevents two common mistakes: using $K = 2$ (too much bias) or using leave-one-out $K = n$ (too much variance, and expensive). It also explains why you should not treat the cross-validation standard deviation as a confidence interval without correcting for fold correlation.
Failure Mode
Cross-validation assumes that the data is exchangeable (any example could appear in any fold). This fails for time series data (future data leaks into training folds) and for grouped data (examples from the same patient/user/session should not be split across folds). See the sections on temporal splits and grouped splits below.
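For the grouped-data case, scikit-learn's `GroupKFold` keeps all examples from one entity on a single side of each split (a sketch on synthetic data; the patient structure is illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
patients = np.repeat(np.arange(20), 5)  # 20 patients, 5 records each

# GroupKFold never splits one patient's records across train and validation,
# which prevents group leakage.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=patients):
    overlap = set(patients[train_idx]) & set(patients[val_idx])
    assert not overlap  # no patient appears on both sides

print("no group leakage across any fold")
```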
Stratified Splits for Imbalanced Data
When the positive class is rare (e.g., 2% fraud), a random split may produce folds with no positive examples. Stratified splitting ensures each fold has approximately the same class distribution as the full dataset.
For multi-label or regression tasks, stratification is harder. For regression, bin the target into quantiles and stratify by bin. For multi-label, use iterative stratification (Sechidis et al., 2011).
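For the single-label case, `StratifiedKFold` preserves the class ratio in every fold (a sketch with a synthetic ~2% positive rate, mirroring the fraud example above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.02).astype(int)   # ~2% positive class
X = rng.normal(size=(5000, 4))

# Each validation fold gets roughly the same positive rate as the full data,
# so no fold ends up with zero positives.
rates = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    rates.append(y[val_idx].mean())

print([f"{r:.3f}" for r in rates])
```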
Temporal Splits for Time Series
For data with a time component, random splitting introduces leakage: the model sees future data during training and predicts past data at test time.
Correct approach: train on data before time $t_1$, validate on data in $[t_1, t_2)$, test on data after $t_2$. This is sometimes called "expanding window" or "walk-forward" validation.
Incorrect approach: randomly shuffling time-stamped data and doing $K$-fold CV. This inflates performance because the model exploits temporal autocorrelation.
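Scikit-learn's `TimeSeriesSplit` implements the expanding-window scheme: every training window ends strictly before the validation window begins (a minimal sketch; the data are just row indices in chronological order):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # rows assumed sorted by time

for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()  # no future data in training
    print(f"train up to t={train_idx.max()}, "
          f"validate on t={val_idx.min()}..{val_idx.max()}")
```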
Data Leakage
Data leakage occurs when information from the test set (or from the prediction target) is available to the model during training. Leakage inflates performance estimates. Models with leakage appear to perform well in evaluation but fail in deployment.
Common sources of leakage:
Feature leakage. A feature that is a direct function of the target. Example: including "loan default date" as a feature when predicting loan default. This feature is only available after the event you are predicting.
Preprocessing leakage. Fitting a scaler, PCA, or imputer on the full dataset (including test data) before splitting. The correct approach: fit preprocessing only on training data, then transform validation and test data using the training-fitted parameters.
Temporal leakage. Using future data to predict past events. Even a single future feature (e.g., "next month's stock price") makes the model useless in production.
Group leakage. Examples from the same entity (patient, user, document) appear in both training and test sets. The model memorizes entity-specific patterns rather than learning generalizable features.
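The preprocessing-leakage fix is easiest to get right with a pipeline: when the scaler lives inside the pipeline, cross-validation refits it on each training fold only (a sketch on synthetic data; the model choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# The scaler is refit inside every CV training fold, so validation folds
# are always transformed with training-only statistics -- no leakage.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting is the leaky version of the same computation.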
Reporting: Mean, Standard Deviation, and Significance
Report: metric = mean +/- std over seeds on the same test set.
Running the same model with different random seeds (different weight initialization, different data shuffling) produces different results. A single run is a single sample from this distribution. Report the mean and standard deviation over at least 3 runs (5 or more is better).
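A minimal sketch of multi-seed reporting (synthetic data; the model, split, and number of seeds are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

accs = []
for seed in range(5):  # same data split, different model seeds
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    accs.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

# Report mean +/- sample standard deviation over seeds.
print(f"accuracy = {np.mean(accs):.3f} +/- {np.std(accs, ddof=1):.3f} (5 seeds)")
```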
Statistical Tests for Model Comparison
Paired Permutation Test for Model Comparison
Statement
Let $d_i = \ell_i^{A} - \ell_i^{B}$ be the per-example loss difference between models A and B on test example $i$. Under the null hypothesis that A and B have equal expected loss, the test statistic $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$ has expectation zero. A paired permutation test randomly flips the sign of each $d_i$ to generate the null distribution. The p-value is the fraction of permutations where $|\bar{d}_{\mathrm{perm}}| \ge |\bar{d}_{\mathrm{obs}}|$.
Alternatively, the paired t-test gives $t = \frac{\bar{d}}{s_d / \sqrt{n}}$, where $s_d$ is the sample standard deviation of the differences, with $n - 1$ degrees of freedom.
Intuition
By evaluating both models on the same test examples, you cancel out example-level difficulty. The question becomes: does model A consistently do better than model B on the same examples? This is more powerful than comparing aggregate scores because it removes the variance due to different test examples.
Proof Sketch
Under the null hypothesis (equal performance), the signs of the $d_i$ are equally likely to be positive or negative. Randomly flipping signs generates samples from the null distribution. The observed $\bar{d}$ is compared to this null. For the t-test version, apply the standard paired t-test derivation under the assumption that the $d_i$ are approximately normal (justified for large $n$ by the CLT).
Why It Matters
This prevents the common mistake of declaring model A "better" because it scored 85.3% vs 85.1%. Without a statistical test, you cannot distinguish signal from noise. The paired test is particularly important when improvements are small (0.1-0.5%), which is common in mature ML tasks.
Failure Mode
The independence assumption fails when test examples are correlated (e.g., multiple examples from the same user). The paired t-test assumes approximate normality of the $d_i$, which may not hold for binary loss (0/1). For binary outcomes, McNemar's test is more appropriate.
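The sign-flipping permutation test described above fits in a few lines of NumPy (a sketch; the simulated 0/1 losses and permutation count are illustrative):

```python
import numpy as np

def paired_permutation_test(loss_a, loss_b, n_perm=10000, seed=0):
    """Two-sided p-value for H0: models A and B have equal expected loss."""
    rng = np.random.default_rng(seed)
    d = np.asarray(loss_a) - np.asarray(loss_b)   # per-example differences d_i
    observed = abs(d.mean())
    # Under H0 the sign of each d_i is arbitrary: flip signs at random
    # and recompute the mean difference to build the null distribution.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    # Add-one smoothing keeps the p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
# Simulated 0/1 losses on 750 shared test examples; A is genuinely better here.
loss_a = (rng.random(750) < 0.08).astype(float)
loss_b = (rng.random(750) < 0.18).astype(float)
p = paired_permutation_test(loss_a, loss_b)
print(f"p = {p:.4f}")
```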
Evaluation Checklist
Before reporting any model performance number, verify each of the following:
Data integrity checks:
- Confirmed no duplicate rows spanning train and test sets
- Confirmed all preprocessing (scaling, PCA, imputation) was fit on training data only
- For time series: confirmed all training data precedes all test data
- For grouped data: confirmed all examples from the same entity are in the same split
- Checked for features that are proxies for the target label
Statistical rigor checks:
- Reported results over at least 3 random seeds (5 or more preferred)
- Included confidence intervals or standard deviations with point estimates
- Used paired statistical tests (permutation test or paired t-test) when claiming one model beats another
- Verified the test set was used exactly once for final evaluation
Metric selection checks:
- Reported more than one metric (accuracy alone is insufficient for imbalanced data)
- Included calibration metrics if the model outputs probabilities
- Checked for Simpson's paradox: aggregate improvement may mask subgroup degradation
- Verified that the reported metric matches the deployment objective
Practical deployment checks:
- Measured inference latency and throughput, not just quality metrics
- Tested on data from the expected deployment distribution, not just the benchmark distribution
- Checked model behavior on edge cases and out-of-distribution inputs
- Verified that reported improvements exceed the noise floor (standard deviation across seeds)
Model comparison done right
Task: binary classification on a medical dataset with 5000 examples (8% positive rate).
- Stratified 70/15/15 split, ensuring both classes appear in all splits
- Fit StandardScaler on training set only; apply to val and test
- Train models A (logistic regression) and B (random forest) using 5-fold CV on training set for hyperparameter selection
- Evaluate both on the same held-out test set (750 examples)
- Report: Model A accuracy = 91.2% +/- 0.4% (5 seeds), AUC = 0.843, F1 = 0.52. Model B accuracy = 91.5% +/- 0.6% (5 seeds), AUC = 0.861, F1 = 0.57.
- Paired permutation test on the 750 test examples: the accuracy difference is not significant ($p > 0.05$), while the AUC difference is ($p < 0.05$).
- Conclusion: AUC difference is statistically significant, but accuracy difference is not. Model B is better at ranking, but the classification threshold should be tuned separately.
Why Single-Number Comparisons Are Misleading
Reporting "Model A: 92.3%, Model B: 91.8%" invites the reader to conclude A is better. But:
- Variance across seeds: A might be 92.3 +/- 0.8 and B might be 91.8 +/- 0.5. The difference is within noise.
- Test set size: with 100 test examples, the standard error of accuracy is about 3%. The difference is meaningless.
- Subgroup performance: A might beat B overall while B beats A on every subgroup (Simpson's paradox).
- Cherry-picked metrics: accuracy, F1, AUC, and calibration can give different rankings.
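The test-set-size point can be checked directly: the standard error of an accuracy estimate $p$ on $n$ examples is $\sqrt{p(1-p)/n}$ (a small helper; the example values are illustrative):

```python
import math

def accuracy_se(p, n):
    # Binomial standard error of an accuracy estimate p on n test examples.
    return math.sqrt(p * (1 - p) / n)

print(f"{accuracy_se(0.92, 100):.3f}")    # n=100: ~0.027, i.e. roughly 3%
print(f"{accuracy_se(0.92, 10000):.4f}")  # n=10000: ~0.0027
```

With 100 test examples, a 0.5% gap between models is an order of magnitude smaller than the noise in either number.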
Common Confusions
Validation performance is not test performance
Hyperparameters selected to maximize validation performance will overfit to the validation set. The gap between validation and test performance grows with the number of hyperparameter configurations tried. This is why the test set must be used only once.
Cross-validation does not eliminate the need for a test set
Cross-validation estimates performance for model selection (choosing among architectures or hyperparameters). After selection, you still need a held-out test set to estimate the final model's true performance. Using the CV estimate as the final performance number is optimistic because the selected model won a competition among candidates.
More folds is not always better
Leave-one-out CV ($K = n$) minimizes bias but maximizes variance and computational cost. For most practical purposes, $K = 5$ or $K = 10$ gives a good bias-variance tradeoff and is $n/K$ times cheaper to compute.
Canonical Examples
Detecting preprocessing leakage
Task: predict house prices. The feature pipeline includes standardization ($z = (x - \mu)/\sigma$). If $\mu$ and $\sigma$ are computed on the full dataset (including test), test features are transformed using test-set statistics. This leaks test-set information into the features. The correct approach: compute $\mu$ and $\sigma$ on training data only, then apply the same $\mu$ and $\sigma$ to transform test data. The performance difference can be small (0.1-1%) but compounds with more preprocessing steps.
Summary
- Three-way split: train (fit parameters), validate (select hyperparameters), test (estimate final performance, used once)
- $K$-fold CV: $K = 5$ or $K = 10$ balances bias and variance
- Stratify folds for imbalanced data; use temporal splits for time series
- Data leakage inflates metrics and causes deployment failures
- Report mean +/- std over multiple seeds, not a single number
- Use paired statistical tests (permutation test or paired t-test) to compare models
- A 0.3% improvement means nothing without a significance test
Exercises
Problem
You fit a StandardScaler on your full dataset, then split into train/test, then train a model. Your test accuracy is 94%. After fixing the leakage (fitting the scaler on train only), test accuracy drops to 91%. Explain what happened.
Problem
Model A achieves 85.3% accuracy and Model B achieves 85.0% accuracy on a shared test set of $n$ examples. The paired differences have sample standard deviation $s_d$. Write the paired t-statistic in terms of $n$ and $s_d$, and determine for which values of $n$ and $s_d$ the difference is statistically significant at $\alpha = 0.05$.
References
Canonical:
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 7
- Bengio & Grandvalet, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (2004)
Current:
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (2021)
- Raschka, "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning" (2020)
- Kaufman et al., "Leakage in Data Mining" (2012)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Cross-validation theory: formal analysis of CV estimator properties
- Hypothesis testing for ML: statistical testing framework for model comparison
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.