Cross-Validation Theory
The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.
Prerequisites
Why This Matters
Cross-validation is the most widely used method for estimating how well a
model will perform on unseen data. Every time you run cross_val_score in
scikit-learn, you are using CV. Every time you select hyperparameters by
"trying different values and picking the best one" (whether via grid search or Bayesian optimization), you are implicitly
relying on CV theory.
But CV is not just a practical recipe: it is a statistical estimator with its own bias-variance tradeoff and failure modes. Understanding these properties tells you when to trust CV, how many folds to use, and what can go wrong. For alternatives to CV for model selection, see AIC and BIC.
Mental Model
You have $n$ data points and a learning algorithm $A$. You want to estimate how well a model trained by $A$ on $n$ data points will perform on fresh data. The problem: you cannot use training data to evaluate test performance (overfitting), and you may not have a separate test set.
CV solves this by repeatedly holding out a subset of the data for testing and training on the rest. Each held-out subset gives one estimate of test performance. The average over all subsets is the CV estimate.
Formal Setup and Notation
Let $S = \{z_1, \ldots, z_n\}$ be the full training set, where $z_i = (x_i, y_i) \sim \mathcal{D}$ i.i.d. Let $A$ be a learning algorithm that takes a dataset and returns a hypothesis $h = A(S)$.
The true generalization error (what we want to estimate) is:

$$R_n(A) = \mathbb{E}_{S \sim \mathcal{D}^n}\, \mathbb{E}_{z \sim \mathcal{D}} \big[ \ell(A(S), z) \big]$$

This is the expected loss of a model trained on a random sample of size $n$, evaluated on a fresh point $z$ from $\mathcal{D}$.
Leave-One-Out Cross-Validation (LOO-CV)
For each $i = 1, \ldots, n$, train on all data except $z_i$ and evaluate on $z_i$:

$$\widehat{R}_{\text{LOO}} = \frac{1}{n} \sum_{i=1}^{n} \ell\big(A(S^{\setminus i}), z_i\big)$$

where $S^{\setminus i}$ is the dataset with the $i$-th point removed. This requires training the model $n$ times.
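The definition above can be sketched directly. This is a minimal illustration, not library code: the `loo_cv` helper, the mean predictor, and the squared loss are hypothetical stand-ins for a real algorithm $A$ and loss $\ell$.

```python
import numpy as np

def loo_cv(y, fit, loss):
    """LOO-CV: for each i, train on S \\ {z_i}, evaluate on z_i, average."""
    n = len(y)
    errors = []
    for i in range(n):
        train = np.delete(y, i)        # S^{\i}: all points except z_i
        h = fit(train)                 # h_i = A(S^{\i}), one of n fits
        errors.append(loss(h, y[i]))   # loss on the single held-out point
    return float(np.mean(errors))

# Toy algorithm A: predict the training mean; squared loss.
fit = lambda train: train.mean()
loss = lambda pred, yi: (pred - yi) ** 2

y = np.array([1.0, 2.0, 3.0, 4.0])
print(loo_cv(y, fit, loss))  # 20/9 ≈ 2.222
```

Note that the loop performs $n$ separate fits, which is exactly the computational cost the text describes.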
K-Fold Cross-Validation
Partition $S$ into $K$ disjoint subsets (folds) $S_1, \ldots, S_K$ of roughly equal size. For each fold $k$, train on $S \setminus S_k$ and evaluate on $S_k$:

$$\widehat{R}_{K\text{-fold}} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{z_i \in S_k} \ell\big(A(S \setminus S_k), z_i\big)$$

Common choices: $K = 5$ or $K = 10$. LOO-CV is the special case $K = n$.
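In code, the partition-train-evaluate loop looks like the following sketch (assuming scikit-learn is available; the synthetic linear data is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # K = 5 disjoint folds
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Train on S \ S_k, evaluate on the held-out fold S_k
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

cv_estimate = np.mean(fold_errors)  # average over the K fold estimates
```

`cross_val_score` wraps exactly this loop; writing it out makes the per-fold training sets visible.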
Core Definitions
The CV estimator is itself a random variable (it depends on the random partition into folds and the random training data). Like any estimator, it has bias and variance.
The bias of the CV estimator:

$$\text{Bias} = \mathbb{E}\big[\widehat{R}_{\text{CV}}\big] - R_n(A)$$

Note the subtlety: K-fold CV trains on $n - n/K$ points per fold, not $n$. So it estimates the generalization error of a model trained on a smaller dataset.
The variance of the CV estimator comes from two sources: (1) random partitioning into folds and (2) correlation between fold-specific estimates (which share most of their training data).
Main Theorems
LOO-CV is Nearly Unbiased
Statement
LOO-CV is an almost unbiased estimator of $R_n(A)$, the generalization error of a model trained on $n$ points. It is exactly unbiased for $R_{n-1}(A)$:

$$\mathbb{E}\big[\widehat{R}_{\text{LOO}}\big] = R_{n-1}(A)$$

The bias relative to $R_n(A)$ is:

$$\mathbb{E}\big[\widehat{R}_{\text{LOO}}\big] - R_n(A) = R_{n-1}(A) - R_n(A)$$

This is typically small and positive: training on $n-1$ points gives slightly worse performance than training on $n$ points, so LOO-CV slightly overestimates the true error.
Intuition
Each left-out point $z_i$ is independent of the training set $S^{\setminus i}$ (because $z_i$ was excluded), so $\ell(A(S^{\setminus i}), z_i)$ is an unbiased estimate of the test error for a model trained on $n-1$ points. Averaging $n$ such estimates gives an unbiased estimator of $R_{n-1}(A)$.
Proof Sketch
By symmetry and the i.i.d. assumption:

$$\mathbb{E}\big[\widehat{R}_{\text{LOO}}\big] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\ell(A(S^{\setminus i}), z_i)\big]$$

Each term equals $R_{n-1}(A)$, because $S^{\setminus i}$ is a random sample of size $n-1$ and $z_i$ is an independent test point. So the average equals $R_{n-1}(A)$.
Why It Matters
LOO-CV's near-unbiasedness is its primary theoretical advantage. It gives an accurate estimate of generalization error on average. This makes it the gold standard for model selection in small-sample settings where bias matters most.
Failure Mode
LOO-CV has high variance. The estimates $\ell(A(S^{\setminus i}), z_i)$ are highly correlated because the training sets $S^{\setminus i}$ and $S^{\setminus j}$ overlap in $n-2$ points. High correlation between estimates means the average does not reduce variance as effectively as averaging independent quantities. This high variance makes LOO-CV unreliable for model selection when models have similar performance.
CV Error and Algorithmic Stability
Statement
If the learning algorithm $A$ is $\beta$-uniformly stable (replacing any single training point changes the loss by at most $\beta$) and the loss is bounded by $M$, then the LOO-CV estimate concentrates around the true generalization error (Bousquet & Elisseeff, 2002):

$$\Pr\Big(\big|R_n(A) - \widehat{R}_{\text{LOO}}\big| \geq \epsilon + \beta\Big) \leq 2\exp\left(-\frac{2n\epsilon^2}{(4n\beta + M)^2}\right)$$

For stable algorithms ($\beta = O(1/n)$), this gives meaningful concentration.
Intuition
Stable algorithms produce similar models regardless of which single point is removed. This means the LOO estimates, despite being correlated, are all estimating approximately the same quantity. Stability bounds the sensitivity of the algorithm to individual data points, which in turn bounds the variance of the CV estimator.
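To see why $\beta = O(1/n)$ matters, one can plug numbers into a concentration bound of the Bousquet-Elisseeff form, $2\exp\!\big(-2n\epsilon^2/(4n\beta + M)^2\big)$. The specific values of $n$, $M$, and $\epsilon$ below are illustrative assumptions:

```python
import math

def loo_concentration_bound(n, beta, M, eps):
    """Bousquet-Elisseeff-style bound on P(|R - R_loo| >= eps + beta)."""
    return 2 * math.exp(-2 * n * eps**2 / (4 * n * beta + M) ** 2)

n, M, eps = 10_000, 1.0, 0.05
stable = loo_concentration_bound(n, beta=1.0 / n, M=M, eps=eps)  # beta = O(1/n)
unstable = loo_concentration_bound(n, beta=0.1, M=M, eps=eps)    # beta constant

# With beta = O(1/n) the bound shrinks as n grows; with constant beta the
# denominator grows like n^2 and the bound stays near its maximum of 2 (vacuous).
print(stable, unstable)
```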
Bias-Variance of the CV Estimator
The bias-variance tradeoff of the CV estimator itself (not the model) depends on $K$:

Large $K$ (close to LOO, $K = n$):
- Low bias (trains on $n - n/K \approx n$ points)
- High variance (folds share most training data, so estimates are correlated)

Small $K$ (e.g., $K = 2$):
- High bias (trains on $n/2$ points, estimating error for a much smaller training set)
- Low variance (folds share less data, so estimates are less correlated)

The sweet spot is typically $K = 5$ or $K = 10$. Empirical studies show that 10-fold CV balances bias and variance well for most learning algorithms.
| K | Training size per fold | Bias | Variance | Computation |
|---|---|---|---|---|
| 2 | $n/2$ | High | Low | 2 fits |
| 5 | $4n/5$ | Moderate | Moderate | 5 fits |
| 10 | $9n/10$ | Low | Moderate | 10 fits |
| $n$ (LOO) | $n-1$ | Very low | High | $n$ fits |
Proof Ideas and Templates Used
The LOO-CV unbiasedness proof uses:
- Symmetry: all data points are i.i.d., so $\mathbb{E}\big[\ell(A(S^{\setminus i}), z_i)\big]$ is the same for all $i$.
- Independence of the test point: after removing $z_i$ from the training set, $z_i$ is independent of the trained model $A(S^{\setminus i})$.
The stability-based concentration bound uses:
- McDiarmid's inequality: the CV estimate is a function of $n$ i.i.d. random variables, and stability bounds how much changing a single variable can change that function.
Canonical Examples
Model selection with K-fold CV
To choose among ridge regression models with different regularization strengths $\lambda$: run 10-fold CV for each candidate $\lambda$, compute the average validation error, and select the $\lambda$ with the lowest CV error. This estimates which $\lambda$ will generalize best without a separate test set.
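A sketch of this procedure with scikit-learn; the candidate grid and the synthetic data are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

alphas = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate grid for lambda
cv_errors = {}
for alpha in alphas:
    # cross_val_score reports negated MSE, so negate it back to an error
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=10, scoring="neg_mean_squared_error")
    cv_errors[alpha] = -scores.mean()

best_alpha = min(cv_errors, key=cv_errors.get)  # lowest 10-fold CV error wins
```

Remember the caveat from the nested-CV discussion below: `cv_errors[best_alpha]` is an optimistic estimate of the selected model's error, because the same CV scores were used for selection.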
LOO-CV for linear regression (Sherman-Morrison shortcut)
For linear regression with squared loss, the LOO-CV error can be computed from a single model fit using the hat matrix $H = X(X^\top X)^{-1}X^\top$:

$$\widehat{R}_{\text{LOO}} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - H_{ii}} \right)^2$$

This costs one fit, $O(np^2)$, instead of $n$ fits, $O(n^2 p^2)$, making LOO-CV practical for linear models.
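The shortcut can be checked against the brute-force definition. This sketch assumes ordinary least squares on a full-rank design; the synthetic data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.2, size=n)

# Single fit: hat matrix H = X (X^T X)^{-1} X^T and its residuals
beta = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)
loo_fast = np.mean(((y - X @ beta) / (1 - np.diag(H))) ** 2)

# Brute force: n separate fits, leaving out one point at a time
errors = []
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errors.append((y[i] - X[i] @ b_i) ** 2)
loo_slow = np.mean(errors)

# The two agree exactly (up to floating point) for squared loss
print(np.isclose(loo_fast, loo_slow))
```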
Common Confusions
CV estimates error for the algorithm, not the specific model
CV estimates $R_m(A)$ with $m = n - n/K$: the expected error of a model produced by algorithm $A$ when trained on $m$ points. It does not estimate the error of the specific model $A(S)$ trained on all $n$ points. These are different quantities; the specific model's error could be better or worse than the average.
Nested CV is needed for unbiased evaluation
If you use CV to select a hyperparameter and then report the CV error of the selected model, you are optimistically biased. The CV error was used for selection, so it underestimates the true error. To get an unbiased estimate, use nested CV: an outer loop for evaluation and an inner loop for model selection.
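A nested-CV sketch with scikit-learn (the grid and data are illustrative assumptions): the inner `GridSearchCV` handles selection, and the outer loop evaluates the whole selection procedure on data it never saw.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))
y = X[:, 0] + rng.normal(scale=0.3, size=150)

# Inner loop: 5-fold CV over a hypothetical alpha grid (model selection)
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=5, scoring="neg_mean_squared_error")

# Outer loop: 5-fold CV around the entire selection procedure (evaluation)
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_mean_squared_error")
nested_error = -outer_scores.mean()  # unbiased estimate for "CV-selected Ridge"
```

Because `GridSearchCV` is itself an estimator, passing it to `cross_val_score` re-runs the full hyperparameter search inside every outer fold, which is exactly what nested CV requires.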
Stratification matters for imbalanced data
Standard K-fold randomly partitions the data, which can create folds with different class proportions for imbalanced datasets. Stratified K-fold preserves the class distribution in each fold, giving lower-variance CV estimates. Always use stratified K-fold for classification with imbalanced classes.
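The effect is easy to see on a synthetic 90/10 class split (labels below are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]

# Stratification preserves the 9:1 ratio: every fold of 20 gets exactly 2 positives
print(positives_per_fold)  # [2, 2, 2, 2, 2]
```

With plain `KFold` on the same labels, a shuffled fold can receive anywhere from 0 to 4 of the 10 positives, which is what inflates the variance of the CV estimate.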
Summary
- LOO-CV is nearly unbiased for $R_n(A)$ but has high variance
- K-fold CV with $K = 5$ or $K = 10$ balances bias and variance
- Bias increases as $K$ decreases (training on smaller subsets)
- Variance increases as $K$ increases (more correlation between fold estimates)
- Stability of the algorithm controls the concentration of CV estimates
- Use nested CV when the same CV is used for both selection and evaluation
Exercises
Problem
Explain why 2-fold CV has higher bias than 10-fold CV for estimating $R_n(A)$. What is the effective training set size in each case?
Problem
Why does LOO-CV have high variance despite averaging $n$ estimates? What is the source of correlation between the estimates?
References
Canonical:
- Stone, Cross-Validatory Choice and Assessment of Statistical Predictions (1974)
- Arlot & Celisse, A Survey of Cross-Validation Procedures for Model Selection (2010)
Current:
- Bousquet & Elisseeff, Stability and Generalization (JMLR 2002)
Next Topics
The natural next steps from cross-validation:
- Algorithmic stability: the theoretical property that makes CV work
- Bootstrap methods: an alternative resampling approach to estimating generalization
Last reviewed: April 2026