

Cross-Validation Theory

The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.


Why This Matters

[Figure: 5-fold cross-validation. In each of 5 iterations, one fold (Fold 1 through Fold 5) is held out for validation while the other four are used for training. Each fold validates exactly once; the 5 scores are averaged.]

Cross-validation is the most widely used method for estimating how well a model will perform on unseen data. Every time you run cross_val_score in scikit-learn, you are using CV. Every time you select hyperparameters by "trying different values and picking the best one" (whether via grid search or Bayesian optimization), you are implicitly relying on CV theory.

But CV is not just a practical recipe --- it is a statistical estimator with its own bias-variance tradeoff and failure modes. Understanding these properties tells you when to trust CV, how many folds to use, and what can go wrong. For alternatives to CV for model selection, see AIC and BIC.

Mental Model

You have $n$ data points and a learning algorithm $\mathcal{A}$. You want to estimate how well a model trained by $\mathcal{A}$ on $n$ data points will perform on fresh data. The problem: you cannot use training data to evaluate test performance (overfitting), and you may not have a separate test set.

CV solves this by repeatedly holding out a subset of the data for testing and training on the rest. Each held-out subset gives one estimate of test performance. The average over all subsets is the CV estimate.

Formal Setup and Notation

Let $S = \{z_1, \ldots, z_n\}$ be the full training set, where $z_i = (x_i, y_i)$ are drawn i.i.d. from a distribution $\mathcal{D}$. Let $\mathcal{A}$ be a learning algorithm that takes a dataset and returns a hypothesis $h = \mathcal{A}(S)$, and let $\ell(h, z)$ denote the loss of hypothesis $h$ on point $z$.

The true generalization error (what we want to estimate) is:

$$R(\mathcal{A}, n) = \mathbb{E}_S\left[\mathbb{E}_{z \sim \mathcal{D}}[\ell(\mathcal{A}(S), z)]\right]$$

This is the expected loss of a model trained on a random sample of size $n$, evaluated on a fresh point from $\mathcal{D}$.

Definition

Leave-One-Out Cross-Validation (LOO-CV)

For each $i = 1, \ldots, n$, train on all data except $z_i$ and evaluate on $z_i$:

$$\text{CV}_{\text{LOO}} = \frac{1}{n} \sum_{i=1}^n \ell(\mathcal{A}(S^{(-i)}), z_i)$$

where $S^{(-i)} = S \setminus \{z_i\}$ is the dataset with the $i$-th point removed. This requires training the model $n$ times.

Definition

K-Fold Cross-Validation

Partition $S$ into $K$ disjoint subsets $S_1, \ldots, S_K$ of roughly equal size. For each fold $k$, train on $S \setminus S_k$ and evaluate on $S_k$:

$$\text{CV}_K = \frac{1}{n} \sum_{k=1}^K \sum_{z_i \in S_k} \ell(\mathcal{A}(S \setminus S_k), z_i)$$

Common choices: $K = 5$ or $K = 10$. LOO-CV is the special case $K = n$.
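As a concrete sketch, the K-fold estimator above takes only a few lines. This is illustrative Python with NumPy; the `fit`/`loss` callback convention is an assumption of this sketch, not a standard API.

```python
import numpy as np

def k_fold_cv(X, y, fit, loss, K=5, seed=0):
    """K-fold CV estimate: average held-out loss over all n points.

    `fit(X_tr, y_tr)` must return a predict function and `loss(y, y_hat)`
    a mean per-point loss -- conventions of this sketch only.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)            # K disjoint, roughly equal folds
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        predict = fit(X[train], y[train])     # train on S \ S_k
        total += loss(y[test], predict(X[test])) * len(test)
    return total / len(y)                     # (1/n) * sum over all held-out points

# Usage: ordinary least squares on synthetic linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def ols_fit(X_tr, y_tr):
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return lambda X_te: X_te @ beta

mse = lambda y_true, y_hat: float(np.mean((y_true - y_hat) ** 2))
print(k_fold_cv(X, y, ols_fit, mse, K=5))        # close to the noise variance 0.01
print(k_fold_cv(X, y, ols_fit, mse, K=len(y)))   # K = n recovers LOO-CV
```

Setting `K=len(y)` trains once per point, which is exactly the LOO-CV special case.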

Core Definitions

The CV estimator $\text{CV}_K$ is itself a random variable (it depends on the random partition into folds and the random training data). Like any estimator, it has bias and variance.

The bias of the CV estimator:

$$\text{Bias} = \mathbb{E}[\text{CV}_K] - R(\mathcal{A}, n)$$

Note the subtlety: CV trains on $n(1 - 1/K)$ points per fold, not $n$. So it estimates the generalization error of a model trained on a smaller dataset.

The variance of the CV estimator comes from two sources: (1) random partitioning into folds and (2) correlation between fold-specific estimates (which share most of their training data).

Main Theorems

Theorem

LOO-CV is Nearly Unbiased

Statement

LOO-CV is an almost unbiased estimator of $R(\mathcal{A}, n-1)$, the generalization error of a model trained on $n-1$ points:

$$\mathbb{E}[\text{CV}_{\text{LOO}}] = R(\mathcal{A}, n-1)$$

The bias relative to $R(\mathcal{A}, n)$ is:

$$\text{Bias} = R(\mathcal{A}, n-1) - R(\mathcal{A}, n) \geq 0$$

This is typically small and positive: training on $n-1$ points gives slightly worse performance than training on $n$ points, so LOO-CV slightly overestimates the true error.

Intuition

Each left-out point $z_i$ is independent of the training set $S^{(-i)}$ (because $z_i$ was excluded), so $\ell(\mathcal{A}(S^{(-i)}), z_i)$ is an unbiased estimate of the test error for a model trained on $n-1$ points. Averaging $n$ such estimates gives an unbiased estimator of $R(\mathcal{A}, n-1)$.

Proof Sketch

By symmetry and the i.i.d. assumption:

$$\mathbb{E}[\text{CV}_{\text{LOO}}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[\ell(\mathcal{A}(S^{(-i)}), z_i)]$$

Each term $\mathbb{E}[\ell(\mathcal{A}(S^{(-i)}), z_i)]$ equals $R(\mathcal{A}, n-1)$ because $S^{(-i)}$ is a random sample of size $n-1$ and $z_i$ is an independent test point. So the average equals $R(\mathcal{A}, n-1)$.

Why It Matters

LOO-CV's near-unbiasedness is its primary theoretical advantage. It gives an accurate estimate of generalization error on average. This makes it the gold standard for model selection in small-sample settings where bias matters most.

Failure Mode

LOO-CV has high variance. The $n$ estimates $\ell(\mathcal{A}(S^{(-i)}), z_i)$ are highly correlated because the training sets $S^{(-i)}$ and $S^{(-j)}$ overlap in $n-2$ points. High correlation between estimates means the average does not reduce variance as effectively as averaging independent quantities. This high variance makes LOO-CV unreliable for model selection when models have similar performance.

Theorem

CV Error and Algorithmic Stability

Statement

If the learning algorithm $\mathcal{A}$ is $\beta$-uniformly stable (replacing any single training point changes the loss by at most $\beta$) and the loss is bounded by $M$, then the LOO-CV estimate concentrates around the true generalization error:

$$\Pr\left[|\text{CV}_{\text{LOO}} - R(\mathcal{A}, n)| \geq \epsilon\right] \leq 2\exp\left(-\frac{n\epsilon^2}{2(M + n\beta)^2}\right)$$

For stable algorithms ($\beta = O(1/n)$), this gives meaningful concentration.

Intuition

Stable algorithms produce similar models regardless of which single point is removed. This means the nn LOO estimates, despite being correlated, are all estimating approximately the same quantity. Stability bounds the sensitivity of the algorithm to individual data points, which in turn bounds the variance of the CV estimator.

Bias-Variance of the CV Estimator

The bias-variance tradeoff of the CV estimator itself (not the model) depends on $K$:

Large $K$ (close to LOO):

  • Low bias (trains on $n - n/K \approx n$ points)
  • High variance (folds share most training data, estimates are correlated)

Small $K$ (e.g., $K = 2$):

  • High bias (trains on $n/2$ points, estimates error for a much smaller training set)
  • Low variance (folds share less data, estimates are less correlated)

The sweet spot is typically $K = 5$ or $K = 10$. Empirical studies show that 10-fold CV has a good balance of bias and variance for most learning algorithms.

| $K$ | Training size per fold | Bias | Variance | Computation |
|---|---|---|---|---|
| $2$ | $n/2$ | High | Low | 2 fits |
| $5$ | $4n/5$ | Moderate | Moderate | 5 fits |
| $10$ | $9n/10$ | Low | Moderate | 10 fits |
| $n$ (LOO) | $n-1$ | Very low | High | $n$ fits |
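The bias column above can be checked empirically. The sketch below (illustrative NumPy; the sample sizes and repeat count are arbitrary demo choices) averages 2-fold and 10-fold CV estimates for ordinary least squares over many simulated datasets; 2-fold trains on only half the data, so it overestimates error more.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 40, 5, 300   # arbitrary demo sizes

def cv_estimate(X, y, K):
    """K-fold CV error of OLS on one dataset (random partition)."""
    folds = np.array_split(rng.permutation(n), K)
    sse = 0.0
    for k in range(K):
        tr = np.concatenate(folds[:k] + folds[k + 1:])
        beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        sse += np.sum((y[folds[k]] - X[folds[k]] @ beta) ** 2)
    return sse / n

cv2_runs, cv10_runs = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = X @ np.ones(p) + rng.normal(size=n)   # noise variance 1
    cv2_runs.append(cv_estimate(X, y, 2))
    cv10_runs.append(cv_estimate(X, y, 10))

# Both overestimate the error of training on all n points (bias >= 0),
# and 2-fold overestimates more because it trains on only n/2 points.
print(np.mean(cv2_runs), np.mean(cv10_runs))
```

On average the 2-fold estimate exceeds the 10-fold estimate, which in turn exceeds the noise floor, matching the bias ordering in the table.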

Proof Ideas and Templates Used

The LOO-CV unbiasedness proof uses:

  1. Symmetry: All data points are i.i.d., so $\mathbb{E}[\ell(\mathcal{A}(S^{(-i)}), z_i)]$ is the same for all $i$.
  2. Independence of test point: After removing $z_i$ from the training set, $z_i$ is independent of the trained model.

The stability-based concentration bound uses:

  1. McDiarmid's inequality: CV is a function of $n$ i.i.d. random variables, and stability bounds how much changing one variable affects the function.

Canonical Examples

Example

Model selection with K-fold CV

To choose between ridge regression with $\lambda = 0.01, 0.1, 1, 10$: run 10-fold CV for each $\lambda$, compute the average validation error, and select the $\lambda$ with the lowest CV error. This estimates which $\lambda$ will generalize best without a separate test set.
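A hedged sketch of this procedure, using closed-form ridge regression on synthetic data rather than a library call; all helper names and sizes here are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge: beta = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, K=10, seed=0):
    """K-fold CV error for one value of lambda."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    sse = 0.0
    for k in range(K):
        tr = np.concatenate(folds[:k] + folds[k + 1:])
        beta = ridge_fit(X[tr], y[tr], lam)
        sse += np.sum((y[folds[k]] - X[folds[k]] @ beta) ** 2)
    return sse / len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=60)

scores = {lam: cv_error(X, y, lam) for lam in [0.01, 0.1, 1, 10]}
best = min(scores, key=scores.get)
print(best, scores)   # pick the lambda with the lowest CV error
```

Note that the same folds are reused for every $\lambda$, so the comparison between candidates is not confounded by partition noise.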

Example

LOO-CV for linear regression (Sherman-Morrison shortcut)

For linear regression with squared loss, the LOO-CV error can be computed from a single model fit using the hat matrix $H = X(X^T X)^{-1} X^T$:

$$\text{CV}_{\text{LOO}} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - H_{ii}}\right)^2$$

This costs $O(np^2)$ instead of $O(n^2 p^2)$, making LOO-CV practical for linear models.
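This identity is easy to verify numerically. The sketch below (NumPy, synthetic data) compares the single-fit shortcut against $n$ explicit leave-one-out refits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# One fit: hat matrix H = X (X^T X)^{-1} X^T, then the shortcut formula.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
loo_shortcut = np.mean((resid / (1.0 - np.diag(H))) ** 2)

# Brute force: n separate refits, each leaving one point out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errs.append(float((y[i] - X[i] @ beta) ** 2))
loo_brute = np.mean(errs)

print(loo_shortcut, loo_brute)   # identical up to floating-point error
```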

Common Confusions

Watch Out

CV estimates error for the algorithm, not the specific model

CV estimates $R(\mathcal{A}, n)$: the expected error of a model produced by algorithm $\mathcal{A}$ trained on $n$ points. It does not estimate the error of the specific model trained on all $n$ points. These are different quantities. The specific model's error could be better or worse than the average.

Watch Out

Nested CV is needed for unbiased evaluation

If you use CV to select a hyperparameter and then report the CV error of the selected model, you are optimistically biased. The CV error was used for selection, so it underestimates the true error. To get an unbiased estimate, use nested CV: an outer loop for evaluation and an inner loop for model selection.
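A minimal sketch of nested CV, again with closed-form ridge on synthetic data (all helper names and sizes are illustrative): the inner loop sees only the outer training folds, so the outer estimate is untainted by selection.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def make_folds(n, K, seed):
    return np.array_split(np.random.default_rng(seed).permutation(n), K)

def inner_select(X, y, lambdas, K=5):
    """Inner CV loop: pick the lambda with the lowest K-fold error."""
    folds = make_folds(len(y), K, seed=0)
    def cv_sse(lam):
        sse = 0.0
        for k in range(K):
            tr = np.concatenate(folds[:k] + folds[k + 1:])
            beta = ridge_fit(X[tr], y[tr], lam)
            sse += np.sum((y[folds[k]] - X[folds[k]] @ beta) ** 2)
        return sse
    return min(lambdas, key=cv_sse)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X[:, 0] + 0.3 * rng.normal(size=80)
lambdas = [0.01, 0.1, 1, 10]

# Outer loop: each outer test fold is touched only once, after selection.
outer = make_folds(len(y), K=5, seed=1)
outer_errs = []
for k in range(5):
    tr = np.concatenate(outer[:k] + outer[k + 1:])
    te = outer[k]
    lam = inner_select(X[tr], y[tr], lambdas)   # selection uses outer-train only
    beta = ridge_fit(X[tr], y[tr], lam)
    outer_errs.append(float(np.mean((y[te] - X[te] @ beta) ** 2)))

print(np.mean(outer_errs))   # honest estimate of the whole select-then-fit pipeline
```

The quantity being estimated is the error of the entire pipeline (select $\lambda$ by CV, then fit), not of any single $\lambda$.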

Watch Out

Stratification matters for imbalanced data

Standard K-fold randomly partitions the data, which can create folds with different class proportions for imbalanced datasets. Stratified K-fold preserves the class distribution in each fold, giving lower-variance CV estimates. Always use stratified K-fold for classification with imbalanced classes.
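A sketch of one simple stratification scheme (shuffle within each class, then deal indices round-robin across folds); this is an illustrative implementation, not scikit-learn's.

```python
import numpy as np

def stratified_folds(y, K, seed=0):
    """Shuffle indices within each class, then deal them round-robin
    across K folds so every fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(K)]
    for cls in np.unique(y):
        for j, i in enumerate(rng.permutation(np.flatnonzero(y == cls))):
            folds[j % K].append(int(i))
    return folds

# 90/10 imbalance: a purely random split could give a fold 0 positives,
# but the stratified split puts exactly 2 of the 10 positives in each fold.
y = np.array([0] * 90 + [1] * 10)
folds = stratified_folds(y, K=5)
print([int(sum(y[f])) for f in folds])   # -> [2, 2, 2, 2, 2]
```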

Summary

  • LOO-CV is nearly unbiased for $R(\mathcal{A}, n-1)$ but has high variance
  • K-fold CV with $K = 5$ or $K = 10$ balances bias and variance
  • Bias increases as $K$ decreases (training on smaller subsets)
  • Variance increases as $K$ increases (more correlation between folds)
  • Stability of the algorithm controls the concentration of CV estimates
  • Use nested CV when the same CV is used for both selection and evaluation

Exercises

ExerciseCore

Problem

Explain why 2-fold CV has higher bias than 10-fold CV for estimating $R(\mathcal{A}, n)$. What is the effective training set size in each case?

ExerciseAdvanced

Problem

Why does LOO-CV have high variance despite averaging $n$ estimates? What is the source of correlation between the estimates?

References

Canonical:

  • Stone, Cross-Validatory Choice and Assessment of Statistical Predictions (1974)
  • Arlot & Celisse, A Survey of Cross-Validation Procedures for Model Selection (2010)

Current:

  • Bousquet & Elisseeff, Stability and Generalization (JMLR 2002)



Last reviewed: April 2026
