

Cross-Validation Theory

The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.


Why This Matters

[Figure: 5-fold cross-validation. In each of 5 iterations, one fold (Fold 1 through Fold 5) is held out for validation while the other four are used for training. Each fold validates exactly once; the 5 scores are averaged.]

Cross-validation is the most widely used method for estimating how well a model will perform on unseen data. Every time you run cross_val_score in scikit-learn, you are using CV. Every time you select hyperparameters by "trying different values and picking the best one" (whether via grid search or Bayesian optimization), you are implicitly relying on CV theory.

But CV is not just a practical recipe --- it is a statistical estimator with its own bias-variance tradeoff and failure modes. Understanding these properties tells you when to trust CV, how many folds to use, and what can go wrong. For alternatives to CV for model selection, see AIC and BIC.

Mental Model

You have $n$ data points and a learning algorithm $\mathcal{A}$. You want to estimate how well a model trained by $\mathcal{A}$ on $n$ data points will perform on fresh data. The problem: you cannot use training data to evaluate test performance (overfitting), and you may not have a separate test set.

CV solves this by repeatedly holding out a subset of the data for testing and training on the rest. Each held-out subset gives one estimate of test performance. The average over all subsets is the CV estimate.

Formal Setup and Notation

Let $S = \{z_1, \ldots, z_n\}$ be the full training set, where $z_i = (x_i, y_i)$ are drawn i.i.d. from a distribution $\mathcal{D}$. Let $\mathcal{A}$ be a learning algorithm that takes a dataset and returns a hypothesis $h = \mathcal{A}(S)$, and let $\ell(h, z)$ denote the loss of hypothesis $h$ on point $z$.

The true generalization error (what we want to estimate) is:

$$R(\mathcal{A}, n) = \mathbb{E}_S\left[\mathbb{E}_{z \sim \mathcal{D}}[\ell(\mathcal{A}(S), z)]\right]$$

This is the expected loss of a model trained on a random sample of size $n$, evaluated on a fresh point from $\mathcal{D}$.

Definition

Leave-One-Out Cross-Validation (LOO-CV)

For each $i = 1, \ldots, n$, train on all data except $z_i$ and evaluate on $z_i$:

$$\text{CV}_{\text{LOO}} = \frac{1}{n} \sum_{i=1}^n \ell(\mathcal{A}(S^{(-i)}), z_i)$$

where $S^{(-i)} = S \setminus \{z_i\}$ is the dataset with the $i$-th point removed. This requires training the model $n$ times.

Definition

K-Fold Cross-Validation

Partition $S$ into $K$ disjoint subsets $S_1, \ldots, S_K$ of roughly equal size. For each fold $k$, train on $S \setminus S_k$ and evaluate on $S_k$:

$$\text{CV}_K = \frac{1}{n} \sum_{k=1}^K \sum_{z_i \in S_k} \ell(\mathcal{A}(S \setminus S_k), z_i)$$

Common choices: $K = 5$ or $K = 10$. LOO-CV is the special case $K = n$.
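As a concrete sketch, the K-fold estimator above takes only a few lines. This is illustrative Python with NumPy; the `fit`/`loss` callback convention is an assumption of this sketch, not a standard API.

```python
import numpy as np

def k_fold_cv(X, y, fit, loss, K=5, seed=0):
    """K-fold CV estimate: average held-out loss over all n points.

    `fit(X_tr, y_tr)` must return a predict function and `loss(y, y_hat)`
    a mean per-point loss -- conventions of this sketch only.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)            # K disjoint, roughly equal folds
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        predict = fit(X[train], y[train])     # train on S \ S_k
        total += loss(y[test], predict(X[test])) * len(test)
    return total / len(y)                     # (1/n) * sum over all held-out points

# Usage: ordinary least squares on synthetic linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def ols_fit(X_tr, y_tr):
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return lambda X_te: X_te @ beta

mse = lambda y_true, y_hat: float(np.mean((y_true - y_hat) ** 2))
print(k_fold_cv(X, y, ols_fit, mse, K=5))        # close to the noise variance 0.01
print(k_fold_cv(X, y, ols_fit, mse, K=len(y)))   # K = n recovers LOO-CV
```

Setting `K=len(y)` trains once per point, which is exactly the LOO-CV special case.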

Core Definitions

The CV estimator $\text{CV}_K$ is itself a random variable (it depends on the random partition into folds and the random training data). Like any estimator, it has bias and variance.

The bias of the CV estimator:

$$\text{Bias} = \mathbb{E}[\text{CV}_K] - R(\mathcal{A}, n)$$

Note the subtlety: CV trains on $n(1 - 1/K)$ points per fold, not $n$. So it estimates the generalization error of a model trained on a smaller dataset.

The variance of the CV estimator comes from two sources: (1) random partitioning into folds and (2) correlation between fold-specific estimates (which share most of their training data).

Main Theorems

Theorem

LOO-CV is Nearly Unbiased

Statement

LOO-CV is an almost unbiased estimator of $R(\mathcal{A}, n-1)$, the generalization error of a model trained on $n-1$ points:

$$\mathbb{E}[\text{CV}_{\text{LOO}}] = R(\mathcal{A}, n-1)$$

The bias relative to $R(\mathcal{A}, n)$ is:

$$\text{Bias} = R(\mathcal{A}, n-1) - R(\mathcal{A}, n) \geq 0$$

This is typically small and positive: training on $n-1$ points gives slightly worse performance than training on $n$ points, so LOO-CV slightly overestimates the true error.

Intuition

Each left-out point $z_i$ is independent of the training set $S^{(-i)}$ (because $z_i$ was excluded), so $\ell(\mathcal{A}(S^{(-i)}), z_i)$ is an unbiased estimate of the test error for a model trained on $n-1$ points. Averaging $n$ such estimates gives an unbiased estimator of $R(\mathcal{A}, n-1)$.

Proof Sketch

By symmetry and the i.i.d. assumption:

$$\mathbb{E}[\text{CV}_{\text{LOO}}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[\ell(\mathcal{A}(S^{(-i)}), z_i)]$$

Each term $\mathbb{E}[\ell(\mathcal{A}(S^{(-i)}), z_i)]$ equals $R(\mathcal{A}, n-1)$ because $S^{(-i)}$ is a random sample of size $n-1$ and $z_i$ is an independent test point. So the average equals $R(\mathcal{A}, n-1)$.

Why It Matters

LOO-CV's near-unbiasedness is its primary theoretical advantage. It gives an accurate estimate of generalization error on average. This makes it the gold standard for model selection in small-sample settings where bias matters most.

Failure Mode

LOO-CV has high variance. The $n$ estimates $\ell(\mathcal{A}(S^{(-i)}), z_i)$ are highly correlated because the training sets $S^{(-i)}$ and $S^{(-j)}$ overlap in $n-2$ points. High correlation between estimates means the average does not reduce variance as effectively as averaging independent quantities. This high variance makes LOO-CV unreliable for model selection when models have similar performance.

Theorem

CV Error and Algorithmic Stability

Statement

If the learning algorithm $\mathcal{A}$ is $\beta$-uniformly stable (replacing any single training point changes the loss by at most $\beta$) and the loss is bounded by $M$, then the LOO-CV estimate concentrates around the true generalization error:

$$\Pr\left[|\text{CV}_{\text{LOO}} - R(\mathcal{A}, n)| \geq \epsilon\right] \leq 2\exp\left(-\frac{n\epsilon^2}{2(M + n\beta)^2}\right)$$

For stable algorithms ($\beta = O(1/n)$), this gives meaningful concentration.

Intuition

Stable algorithms produce similar models regardless of which single point is removed. This means the nn LOO estimates, despite being correlated, are all estimating approximately the same quantity. Stability bounds the sensitivity of the algorithm to individual data points, which in turn bounds the variance of the CV estimator.

Bias-Variance of the CV Estimator

The bias-variance tradeoff of the CV estimator itself (not the model) depends on $K$:

Large $K$ (close to LOO):

  • Low bias (trains on $n - n/K \approx n$ points)
  • High variance (folds share most training data, estimates are correlated)

Small $K$ (e.g., $K = 2$):

  • High bias (trains on $n/2$ points, estimates error for a much smaller training set)
  • Low variance (folds share less data, estimates are less correlated)

The sweet spot is typically $K = 5$ or $K = 10$. Empirical studies show that 10-fold CV has a good balance of bias and variance for most learning algorithms.

| $K$ | Training size per fold | Bias | Variance | Computation |
|---|---|---|---|---|
| $2$ | $n/2$ | High | Low | 2 fits |
| $5$ | $4n/5$ | Moderate | Moderate | 5 fits |
| $10$ | $9n/10$ | Low | Moderate | 10 fits |
| $n$ (LOO) | $n-1$ | Very low | High | $n$ fits |
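The bias column above can be checked empirically. The sketch below (illustrative NumPy; the sample sizes and repeat count are arbitrary demo choices) averages 2-fold and 10-fold CV estimates for ordinary least squares over many simulated datasets; 2-fold trains on only half the data, so it overestimates error more.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 40, 5, 300   # arbitrary demo sizes

def cv_estimate(X, y, K):
    """K-fold CV error of OLS on one dataset (random partition)."""
    folds = np.array_split(rng.permutation(n), K)
    sse = 0.0
    for k in range(K):
        tr = np.concatenate(folds[:k] + folds[k + 1:])
        beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        sse += np.sum((y[folds[k]] - X[folds[k]] @ beta) ** 2)
    return sse / n

cv2_runs, cv10_runs = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = X @ np.ones(p) + rng.normal(size=n)   # noise variance 1
    cv2_runs.append(cv_estimate(X, y, 2))
    cv10_runs.append(cv_estimate(X, y, 10))

# Both overestimate the error of training on all n points (bias >= 0),
# and 2-fold overestimates more because it trains on only n/2 points.
print(np.mean(cv2_runs), np.mean(cv10_runs))
```

On average the 2-fold estimate exceeds the 10-fold estimate, which in turn exceeds the noise floor, matching the bias ordering in the table.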

Proof Ideas and Templates Used

The LOO-CV unbiasedness proof uses:

  1. Symmetry: All data points are i.i.d., so $\mathbb{E}[\ell(\mathcal{A}(S^{(-i)}), z_i)]$ is the same for all $i$.
  2. Independence of test point: After removing $z_i$ from the training set, $z_i$ is independent of the trained model.

The stability-based concentration bound uses:

  1. McDiarmid's inequality: CV is a function of $n$ i.i.d. random variables, and stability bounds how much changing one variable affects the function.

Canonical Examples

Example

Model selection with K-fold CV

To choose between ridge regression with $\lambda = 0.01, 0.1, 1, 10$: run 10-fold CV for each $\lambda$, compute the average validation error, and select the $\lambda$ with the lowest CV error. This estimates which $\lambda$ will generalize best without a separate test set.
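A hedged sketch of this procedure, using closed-form ridge regression on synthetic data rather than a library call; all helper names and sizes here are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge: beta = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, K=10, seed=0):
    """K-fold CV error for one value of lambda."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    sse = 0.0
    for k in range(K):
        tr = np.concatenate(folds[:k] + folds[k + 1:])
        beta = ridge_fit(X[tr], y[tr], lam)
        sse += np.sum((y[folds[k]] - X[folds[k]] @ beta) ** 2)
    return sse / len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=60)

scores = {lam: cv_error(X, y, lam) for lam in [0.01, 0.1, 1, 10]}
best = min(scores, key=scores.get)
print(best, scores)   # pick the lambda with the lowest CV error
```

Note that the same folds are reused for every $\lambda$, so the comparison between candidates is not confounded by partition noise.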

Example

LOO-CV for linear regression (Sherman-Morrison shortcut)

For linear regression with squared loss, the LOO-CV error can be computed from a single model fit using the hat matrix $H = X(X^T X)^{-1} X^T$:

$$\text{CV}_{\text{LOO}} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - H_{ii}}\right)^2$$

This costs $O(np^2)$ instead of $O(n^2 p^2)$, making LOO-CV practical for linear models.
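This identity is easy to verify numerically. The sketch below (NumPy, synthetic data) compares the single-fit shortcut against $n$ explicit leave-one-out refits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# One fit: hat matrix H = X (X^T X)^{-1} X^T, then the shortcut formula.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
loo_shortcut = np.mean((resid / (1.0 - np.diag(H))) ** 2)

# Brute force: n separate refits, each leaving one point out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errs.append(float((y[i] - X[i] @ beta) ** 2))
loo_brute = np.mean(errs)

print(loo_shortcut, loo_brute)   # identical up to floating-point error
```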

Common Confusions

Watch Out

CV estimates error for the algorithm, not the specific model

CV estimates $R(\mathcal{A}, n)$: the expected error of a model produced by algorithm $\mathcal{A}$ trained on $n$ points. It does not estimate the error of the specific model trained on all $n$ points. These are different quantities. The specific model's error could be better or worse than the average.

Watch Out

Nested CV is needed for unbiased evaluation

If you use CV to select a hyperparameter and then report the CV error of the selected model, you are optimistically biased. The CV error was used for selection, so it underestimates the true error. To get an unbiased estimate, use nested CV: an outer loop for evaluation and an inner loop for model selection.
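A minimal sketch of nested CV, again with closed-form ridge on synthetic data (all helper names and sizes are illustrative): the inner loop sees only the outer training folds, so the outer estimate is untainted by selection.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def make_folds(n, K, seed):
    return np.array_split(np.random.default_rng(seed).permutation(n), K)

def inner_select(X, y, lambdas, K=5):
    """Inner CV loop: pick the lambda with the lowest K-fold error."""
    folds = make_folds(len(y), K, seed=0)
    def cv_sse(lam):
        sse = 0.0
        for k in range(K):
            tr = np.concatenate(folds[:k] + folds[k + 1:])
            beta = ridge_fit(X[tr], y[tr], lam)
            sse += np.sum((y[folds[k]] - X[folds[k]] @ beta) ** 2)
        return sse
    return min(lambdas, key=cv_sse)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X[:, 0] + 0.3 * rng.normal(size=80)
lambdas = [0.01, 0.1, 1, 10]

# Outer loop: each outer test fold is touched only once, after selection.
outer = make_folds(len(y), K=5, seed=1)
outer_errs = []
for k in range(5):
    tr = np.concatenate(outer[:k] + outer[k + 1:])
    te = outer[k]
    lam = inner_select(X[tr], y[tr], lambdas)   # selection uses outer-train only
    beta = ridge_fit(X[tr], y[tr], lam)
    outer_errs.append(float(np.mean((y[te] - X[te] @ beta) ** 2)))

print(np.mean(outer_errs))   # honest estimate of the whole select-then-fit pipeline
```

The quantity being estimated is the error of the entire pipeline (select $\lambda$ by CV, then fit), not of any single $\lambda$.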

Watch Out

Stratification matters for imbalanced data

Standard K-fold randomly partitions the data, which can create folds with different class proportions for imbalanced datasets. Stratified K-fold preserves the class distribution in each fold, giving lower-variance CV estimates. Always use stratified K-fold for classification with imbalanced classes.
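A sketch of one simple stratification scheme (shuffle within each class, then deal indices round-robin across folds); this is an illustrative implementation, not scikit-learn's.

```python
import numpy as np

def stratified_folds(y, K, seed=0):
    """Shuffle indices within each class, then deal them round-robin
    across K folds so every fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(K)]
    for cls in np.unique(y):
        for j, i in enumerate(rng.permutation(np.flatnonzero(y == cls))):
            folds[j % K].append(int(i))
    return folds

# 90/10 imbalance: a purely random split could give a fold 0 positives,
# but the stratified split puts exactly 2 of the 10 positives in each fold.
y = np.array([0] * 90 + [1] * 10)
folds = stratified_folds(y, K=5)
print([int(sum(y[f])) for f in folds])   # -> [2, 2, 2, 2, 2]
```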

Summary

  • LOO-CV is nearly unbiased for $R(\mathcal{A}, n-1)$ but has high variance
  • K-fold CV with $K = 5$ or $K = 10$ balances bias and variance
  • Bias increases as $K$ decreases (training on smaller subsets)
  • Variance increases as $K$ increases (more correlation between folds)
  • Stability of the algorithm controls the concentration of CV estimates
  • Use nested CV when the same CV is used for both selection and evaluation

Exercises

ExerciseCore

Problem

Explain why 2-fold CV has higher bias than 10-fold CV for estimating $R(\mathcal{A}, n)$. What is the effective training set size in each case?

ExerciseAdvanced

Problem

Why does LOO-CV have high variance despite averaging $n$ estimates? What is the source of correlation between the estimates?

References

Canonical:

  • Stone, Cross-Validatory Choice and Assessment of Statistical Predictions (1974)
  • Arlot & Celisse, A Survey of Cross-Validation Procedures for Model Selection (2010)

Current:

  • Bousquet & Elisseeff, Stability and Generalization (JMLR 2002)



Last reviewed: April 2026
