Optimization Function Classes
Bias-Variance Tradeoff
The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.
Why This Matters
The bias-variance decomposition is one of the core frameworks in classical statistics and supervised learning. It explains why model selection matters: simple models underfit (high bias), complex models overfit (high variance), and the best classical model balances the two. The decomposition is exact for squared-error loss; its classical U-curve prediction breaks down in the overparameterized regime (see double descent).
Every regularization technique (ridge regression, lasso, dropout, early stopping) can be understood as a mechanism for controlling the bias-variance tradeoff, though the equivalences are regime-specific. Dropout reduces to a data-dependent penalty in linear regression (Wager, Wang, Liang 2013, arXiv 1307.1493; Baldi and Sadowski 2013, arXiv 1312.6197), but does not admit a clean bias-variance decomposition for deep non-linear networks. Early stopping on full-batch GD with squared loss is approximately equivalent to $\ell_2$ regularization (Yao, Rosasco, Caponnetto 2007, "On Early Stopping in Gradient Descent Learning"), but this breaks for SGD on non-convex losses. Every model selection procedure (cross-validation, AIC/BIC) is trying to find the optimal tradeoff point.
Understanding the bias-variance tradeoff is also essential for understanding where it breaks. In modern overparameterized ML, the U-shaped test error curve gives way to double descent, and the classical tradeoff becomes only half the story.
Mental Model
Imagine throwing darts at a target:
- Low bias, low variance: Darts cluster tightly around the center. The predictions are both accurate (close to the truth) and consistent (similar across different training sets).
- High bias, low variance: Darts cluster tightly but away from the center. The predictions are consistently wrong in the same way. Underfitting.
- Low bias, high variance: Darts are scattered around the center. On average they are right, but any individual prediction can be far off. Overfitting.
- High bias, high variance: Darts are scattered and away from the center. The worst case.
The tradeoff arises because reducing bias (making the model more flexible) typically increases variance (the model becomes more sensitive to the specific training data), and vice versa.
The Formal Decomposition
Bias-Variance Decomposition of MSE
Statement
Let $f(x)$ be the true regression function, and let $\hat{f}_D(x)$ be the prediction of a model trained on a random training set $D$. The expected squared error at a point $x$ is:

$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$$

In short:

$$\text{MSE} = \text{Bias}^2 + \text{Variance} + \sigma^2$$

The expectation is over random draws of the training set $D$ (and the noise in $y$). The bias measures systematic error; the variance measures sensitivity to the training set; the noise is inherent and cannot be reduced by any model.
Intuition
The total prediction error comes from three sources:
- Bias: How far off is the average prediction (averaged over all possible training sets) from the truth? A model that is too simple will consistently miss the true function. This is a systematic error.
- Variance: How much does the prediction change when you swap in a different training set? A model that is too complex will give very different predictions depending on which particular training examples it saw. This is instability.
- Noise: Even with the perfect model, the data has inherent randomness ($\varepsilon$ with variance $\sigma^2$). No model can predict better than the noise floor.
The decomposition is exact: these three quantities sum to the total MSE with no cross-terms.
Proof Sketch
Fix the test point $x$. Let $y = f(x) + \varepsilon$ with $\varepsilon$ the fresh test-point noise, and let $D$ be the training set, with the standard assumptions $\mathbb{E}[\varepsilon] = 0$, $\operatorname{Var}(\varepsilon) = \sigma^2$, and $\varepsilon \perp D$. The outer expectation is jointly over $D$ and $\varepsilon$.
Add and subtract $\mathbb{E}_D[\hat{f}_D(x)]$ inside the squared error:

$$y - \hat{f}_D(x) = \varepsilon + \big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big) + \big(\mathbb{E}_D[\hat{f}_D(x)] - \hat{f}_D(x)\big)$$

Expand the square. The cross-terms vanish because:
- $\varepsilon$ is independent of $\hat{f}_D(x)$ (the test noise is independent of $D$, and $\hat{f}_D(x)$ depends only on $D$)
- $\mathbb{E}_D[\hat{f}_D(x)] - \hat{f}_D(x)$ has zero $D$-mean by construction
- $\varepsilon$ has zero mean
This leaves three squared terms: noise, bias squared, and variance. If one also averages over $x$, apply an outer expectation to each term. The training-set noise is distinct from $\varepsilon$ and is already packaged inside the distribution of $D$.
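The decomposition can be checked numerically. The sketch below (a deliberately misspecified linear fit to a quadratic truth; the function, noise level, and sample sizes are illustrative choices, not from the text) estimates each term by Monte Carlo and confirms that bias squared, variance, and noise sum to the MSE:

```python
# Monte Carlo check that MSE = Bias^2 + Variance + sigma^2 at a fixed test point.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                 # true regression function (illustrative)
sigma = 0.5                        # noise std
n, trials, x0 = 30, 20000, 0.8     # training size, MC draws, test point

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(-1, 1, n)
    y = f(x) + sigma * rng.normal(size=n)
    a, b = np.polyfit(x, y, deg=1)          # misspecified linear fit
    preds[t] = a * x0 + b

y0 = f(x0) + sigma * rng.normal(size=trials)   # fresh test labels
mse      = np.mean((y0 - preds) ** 2)
bias_sq  = (f(x0) - preds.mean()) ** 2
variance = preds.var()
print(mse, bias_sq + variance + sigma**2)      # the two agree closely
```

The misspecification shows up as a nonzero bias term; with a degree-2 fit instead, the bias term would shrink toward zero and the variance term would grow.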
Why It Matters
This decomposition is the theoretical foundation for model selection. It tells you that there are exactly two knobs you can turn (bias and variance), and they work against each other. Every practical technique for improving generalization can be understood through this lens: regularization increases bias but decreases variance; more data decreases variance without affecting bias; ensembling decreases variance; feature selection can decrease both.
Failure Mode
The decomposition above is specific to squared loss. For 0-1 loss there is a clean additive decomposition using different definitions of bias and variance: Domingos 2000, "A Unified Bias-Variance Decomposition and its Applications," gives a single framework covering squared and 0-1 loss; see also Kohavi and Wolpert 1996, "Bias Plus Variance Decomposition for Zero-One Loss Functions," and Breiman 1998, "Arcing Classifiers." The 0-1 decomposition has non-obvious features (variance can reduce error when it flips systematically-wrong predictions), so the framework is different, not merely "more complex." Also, this decomposition describes the classical picture. In the overparameterized regime, the variance is non-monotonic (it decreases past the interpolation threshold under isotropic design), breaking the simple U-shaped tradeoff.
Definitions
Bias
The bias of a model at point $x$ is the difference between the true function and the expected prediction:

$$\text{Bias}(x) = f(x) - \mathbb{E}_D[\hat{f}_D(x)]$$
Bias measures systematic error: how far the average model prediction is from the truth. High bias means the model class is too restrictive to capture the true relationship.
Variance
The variance of a model at point $x$ is the expected squared deviation of the prediction from its mean:

$$\text{Var}(x) = \mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]$$
Variance measures instability: how much the prediction changes across different training sets. High variance means the model is too sensitive to the specific training data.
Irreducible Error
The irreducible error (or noise) is:

$$\sigma^2 = \mathbb{E}\big[(y - f(x))^2\big]$$
This is the inherent randomness in the data. No model, no matter how flexible, can predict better than this. It sets the floor for achievable test error.
The U-Shaped Curve
The classical bias-variance tradeoff produces a U-shaped test error curve as model complexity increases:
- Low complexity (underfitting): Bias is high (the model cannot capture the true function). Variance is low (a simple model gives similar predictions regardless of training data). Test error is dominated by bias.
- Increasing complexity: Bias decreases as the model becomes more expressive. Variance increases as the model becomes more sensitive to training data. Test error decreases as bias reduction outweighs variance increase.
- Optimal complexity: The minimum of the U-curve. Bias and variance contributions are balanced.
- High complexity (overfitting): Bias is low (the model can fit the true function well). Variance is very high (the model fits the noise in each specific training set). Test error is dominated by variance.
Training error monotonically decreases with complexity. Test error first decreases then increases. The gap between training and test error is the generalization gap.
Canonical Examples
Polynomial regression
Fit a polynomial of degree $d$ to data generated from $y = \sin(2\pi x) + \varepsilon$ with Gaussian noise:
- $d = 1$ (linear): high bias (a line cannot capture a sine wave), low variance. Underfitting.
- Moderate $d$ (e.g. $d = 3$): moderate bias, moderate variance. Good fit.
- Large $d$ (e.g. $d = 15$): low bias (the polynomial can approximate any smooth function), high variance (the polynomial oscillates wildly between data points). Overfitting.
The optimal degree depends on the sample size $n$: more data allows a higher degree without excessive variance.
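This experiment is easy to run. The sketch below (sample size, noise level, and the specific degrees are illustrative) averages test error over many training draws and exhibits the U-shape across degrees:

```python
# Polynomial-degree sweep: average test error for data from y = sin(2*pi*x) + noise.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
n, sigma, trials = 20, 0.3, 300
degrees = [1, 3, 15]
xt = rng.uniform(0, 1, 500)           # fixed test inputs

test_err = {}
for d in degrees:
    errs = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = f(x) + sigma * rng.normal(size=n)
        coef = np.polyfit(x, y, deg=d)
        errs.append(np.mean((np.polyval(coef, xt) - f(xt)) ** 2))
    test_err[d] = np.mean(errs)

print(test_err)   # d=3 beats both d=1 (high bias) and d=15 (high variance)
```

Increasing `n` flattens the right side of the U: with more data, the high-degree fit's variance drops and larger degrees become viable.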
Bias-Variance for Specific Models
Linear Regression
For linear regression with $p$ features and $n$ samples (assuming $n > p$ and the true model is linear):

$$\text{Bias} = 0, \qquad \text{Variance} \approx \frac{\sigma^2 p}{n}$$
The zero-bias claim requires three conditions: (a) correct specification (the true conditional mean is linear in the same features), (b) OLS with no regularization, and (c) the expectation taken over the training set with the design fixed or exchangeable. Under (a), $\mathbb{E}[\hat{\beta} \mid X] = \beta$, and correct specification then gives zero prediction bias at every $x$.
The pointwise variance at a fixed test point $x_0$ is $\sigma^2\, x_0^\top (X^\top X)^{-1} x_0$. The familiar form

$$\text{Variance} \approx \frac{\sigma^2 p}{n}$$

arises as an asymptotic / averaged quantity: either as the average of $\sigma^2\, x_0^\top (X^\top X)^{-1} x_0$ over a test point $x_0$ drawn from the same distribution as the training inputs (so that $\mathbb{E}[x_0 x_0^\top]$ matches the design moment $\tfrac{1}{n} X^\top X$), or under a random Gaussian design with $p/n$ held fixed as $n \to \infty$. That is the regime used throughout this page. See Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2nd ed.), §7.3.
In this averaged form the variance scales linearly with the number of parameters $p$ and inversely with $n$. This is why adding features (increasing $p$) increases variance, and collecting more data (increasing $n$) decreases it.
If the true model is not linear, there is also a bias term that depends on how well the linear class approximates the truth.
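The averaged variance formula can be verified directly. The sketch below (Gaussian design, correctly specified linear model; all sizes are illustrative) estimates the pointwise variance over a test sample and compares it to $\sigma^2 p / n$:

```python
# Numerical check of the averaged OLS variance sigma^2 * p / n.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma, trials = 200, 10, 1.0, 2000
beta = rng.normal(size=p)

X = rng.normal(size=(n, p))            # design held fixed across trials
x0 = rng.normal(size=(500, p))         # test points from the same distribution
preds = np.empty((trials, 500))
for t in range(trials):
    y = X @ beta + sigma * rng.normal(size=n)
    bhat = np.linalg.lstsq(X, y, rcond=None)[0]
    preds[t] = x0 @ bhat

variance = preds.var(axis=0).mean()    # average pointwise prediction variance
print(variance, sigma**2 * p / n)      # both close to 0.05
```

Doubling `p` roughly doubles the measured variance; doubling `n` roughly halves it, matching the $\sigma^2 p / n$ scaling.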
Ridge Regression
Ridge regression adds $\ell_2$ regularization: minimize $\|y - X\beta\|^2 + \lambda \|\beta\|^2$. The estimator is $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
To read off bias and variance cleanly, diagonalize $X^\top X$. Let $X^\top X = V S V^\top$ with $S = \operatorname{diag}(s_1, \dots, s_p)$ and rotate into that eigenbasis (define $\hat{\theta} = V^\top \hat{\beta}_\lambda$, and $\theta = V^\top \beta$ for the true parameter $\beta$). Then each coordinate shrinks independently:

$$\mathbb{E}[\hat{\theta}_j] = \frac{s_j}{s_j + \lambda}\,\theta_j, \qquad \operatorname{Var}(\hat{\theta}_j) = \frac{\sigma^2 s_j}{(s_j + \lambda)^2}$$
Summing over coordinates:

$$\text{Bias}^2 = \sum_{j=1}^{p} \left(\frac{\lambda}{s_j + \lambda}\right)^2 \theta_j^2, \qquad \text{Variance} = \sigma^2 \sum_{j=1}^{p} \frac{s_j}{(s_j + \lambda)^2}$$
In words: directions with small eigenvalue $s_j$ (poorly identified features) get shrunk hard, trading a lot of variance for a small amount of bias. Directions with large $s_j$ are barely touched.
A useful special case: if $X^\top X = I$ (orthonormal design), every $s_j = 1$ and the formulas simplify to

$$\text{Bias}^2 = \left(\frac{\lambda}{1 + \lambda}\right)^2 \|\beta\|^2, \qquad \text{Variance} = \frac{\sigma^2 p}{(1 + \lambda)^2}$$

So bias grows and variance shrinks with $\lambda$, but both through the factor $\tfrac{1}{1+\lambda}$; bias is not simply proportional to $\lambda$ unless you take $\lambda \to 0$.
This is why cross-validation over $\lambda$ works: it finds the operating point on the tradeoff curve that minimizes test error for the actual eigenspectrum of your data.
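The eigenbasis formulas make the tradeoff curve directly computable. The sketch below (the eigenvalue spectrum and true coefficients are made-up illustrative values) evaluates the closed forms over a grid of $\lambda$ and locates the minimum of the total error:

```python
# Bias^2 / variance curves for ridge regression, computed from the eigenbasis
# closed forms above.
import numpy as np

s = np.array([100.0, 10.0, 1.0, 0.1])    # eigenvalues of X^T X (illustrative)
theta = np.array([1.0, 1.0, 1.0, 1.0])   # true coefficients in the eigenbasis
sigma2 = 1.0

def bias2(lam):
    return np.sum((lam / (s + lam)) ** 2 * theta ** 2)

def var(lam):
    return sigma2 * np.sum(s / (s + lam) ** 2)

lams = np.logspace(-3, 3, 200)
mse = np.array([bias2(l) + var(l) for l in lams])
best = lams[mse.argmin()]
print(best, bias2(0.0), var(0.0))   # lam=0: zero bias, maximal variance
```

Note how the small eigenvalue (0.1) dominates the unregularized variance: most of the benefit of $\lambda > 0$ comes from suppressing that one poorly identified direction.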
K-Nearest Neighbors
Bias-Variance for K-Nearest Neighbors
Statement
For $k$-nearest neighbors regression at a point $x$, with nearest neighbors $x_{(1)}, \dots, x_{(k)}$:

$$\mathbb{E}\big[(y - \hat{f}_k(x))^2\big] = \sigma^2 + \left(f(x) - \frac{1}{k}\sum_{i=1}^{k} f(x_{(i)})\right)^2 + \frac{\sigma^2}{k}$$
The variance is conditional on the neighbor positions: averaging over $k$ neighbors reduces the noise component by a factor of $1/k$. The nearest neighbors are themselves random (they depend on the training design $x_1, \dots, x_n$), which introduces additional variance that this simplification suppresses. The bias scales as $(k/n)^{1/d}$ because the distance from $x$ to its $k$-th neighbor scales as $(k/n)^{1/d}$ under a bounded-below density, and the Lipschitz assumption converts distance into function-value error. See Györfi, Kohler, Krzyżak, Walk, A Distribution-Free Theory of Nonparametric Regression, Theorem 6.2, and Biau and Devroye, Lectures on the Nearest Neighbor Method.
Balancing $\text{Bias}^2 \asymp (k/n)^{2/d}$ against $\text{Variance} \asymp \sigma^2/k$ over $k$ gives

$$k^* \asymp n^{\frac{2}{2+d}}, \qquad \text{optimal excess risk} \asymp n^{-\frac{2}{2+d}}$$

derived from $(k/n)^{2/d} \asymp 1/k$, i.e. $k^{1 + 2/d} \asymp n^{2/d}$, so $k^* \asymp n^{2/(2+d)}$. This depends heavily on the dimension $d$ (the curse of dimensionality). At $d = 1$ this gives a rate of $n^{-2/3}$; at $d = 2$, $n^{-1/2}$; at large $d$, the rate approaches $n^0$ (no improvement with sample size).
Intuition
With $k = 1$: the prediction is the label of the single nearest neighbor. Zero bias (asymptotically) but maximum variance ($\sigma^2$): the prediction depends entirely on one noisy observation.
With $k = n$: the prediction is the average of all labels. Maximum bias (ignores all local structure) but minimum variance ($\sigma^2/n$): the prediction is the same regardless of the query point.
Increasing $k$ averages over more neighbors, reducing variance but smoothing out local structure (increasing bias).
Why It Matters
KNN provides the cleanest example of the bias-variance tradeoff because the "complexity parameter" ($k$) maps directly to variance ($\sigma^2/k$). It also illustrates the curse of dimensionality: in high dimensions, the bias term $(k/n)^{1/d}$ grows slowly with $k$, meaning you need very large $k$ (and thus very large $n$) for the bias to dominate.
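The $\sigma^2/k$ variance is exact once the neighbor positions are held fixed, which makes it easy to confirm by simulation. In the sketch below (a 1-D setup with an illustrative target function and noise level), the design and hence the neighbor set of the query point are frozen, and only the label noise is redrawn:

```python
# Check that k-NN regression variance at a fixed query is sigma^2 / k
# when the neighbor positions are held fixed.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x)                # illustrative target function
n, sigma, trials, x_query = 200, 1.0, 5000, 0.5
x = np.sort(rng.uniform(0, 1, n))          # design held fixed across trials
order = np.argsort(np.abs(x - x_query))    # neighbors of the query point

for k in (1, 10, 50):
    idx = order[:k]
    preds = np.array([
        np.mean(f(x[idx]) + sigma * rng.normal(size=k))
        for _ in range(trials)
    ])
    print(k, preds.var(), sigma**2 / k)    # empirical variance vs sigma^2 / k
```

Redrawing the design `x` each trial as well would add the neighbor-position variance that the simplified decomposition above deliberately suppresses.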
Why the Classical Picture Breaks
The U-shaped curve assumes that variance increases monotonically with model complexity. In the overparameterized regime ($p > n$), this assumption fails:
- At the interpolation threshold ($p \approx n$): variance peaks (the system is exactly determined and ill-conditioned).
- Past the threshold ($p > n$): under isotropic Gaussian design, variance decreases as $p$ increases further (Hastie, Montanari, Rosset, Tibshirani 2022, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation"). The minimum-norm interpolator spreads its weights across more dimensions, reducing the variance contribution. This clean picture is specific to the isotropic or near-isotropic spectrum; for anisotropic feature covariance the variance can be non-monotonic at very large $p$, and Nakkiran et al. 2019 show empirically that the variance curve need not be monotone even in the overparameterized regime.
This produces the double descent curve: the classical U-shape followed by a second descent. The bias-variance decomposition is still mathematically valid in the overparameterized regime; what changes is the behavior of the variance term (it is no longer monotonically increasing).
The classical tradeoff is a correct description of the underparameterized world. It is incomplete for the overparameterized world where modern deep learning operates. Understanding both regimes is essential.
Common Confusions
The optimum is not where the bias and variance curves cross
The minimum of test MSE sits at the minimum of $\text{Bias}^2(c) + \text{Var}(c)$, not where the two curves intersect. Formally, the first-order optimality condition at complexity parameter $c$ is

$$\frac{d}{dc}\,\text{Bias}^2(c) + \frac{d}{dc}\,\text{Var}(c) = 0$$

That is, the two slopes cancel. Intersection of the two curves ($\text{Bias}^2(c) = \text{Var}(c)$) is a different equation and generally sits at a different point. Intersection coincides with the minimum only when both curves are locally symmetric around that point, which is a special case, not the rule. If variance rises faster than bias falls near the intersection, the optimum sits to the left of the crossing; if bias falls faster than variance rises, the optimum sits to the right. Always read the total-error curve, never the crossing point.
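A toy pair of curves makes the gap concrete. With the illustrative choices $\text{Bias}^2(c) = 1/c$ and $\text{Var}(c) = c^2$, the curves cross at $c = 1$, but the total $1/c + c^2$ is minimized at $c = (1/2)^{1/3} \approx 0.794$:

```python
# Toy curves showing the optimum need not sit at the bias/variance crossing.
import numpy as np

c = np.linspace(0.1, 2.0, 10000)
bias2, var = 1.0 / c, c ** 2
total = bias2 + var

c_cross = c[np.argmin(np.abs(bias2 - var))]   # where the curves intersect
c_opt   = c[np.argmin(total)]                 # where total error is minimal
print(c_cross, c_opt)   # ~1.0 vs ~0.794
```

Here variance rises faster than bias falls near the crossing, so the optimum sits to the left of it, matching the rule of thumb above.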
Bias-variance tradeoff is about the model class, not a single model
The bias and variance are defined over the randomness of the training set. For a single fixed model trained on a single fixed dataset, there is no tradeoff. The model either performs well or it does not. The tradeoff is a property of the procedure: how would this modeling approach perform across many possible training sets drawn from the same distribution?
Low training error does not mean low bias
Training error measures how well the model fits the data it was trained on. Bias measures how well the average model (over all possible training sets) approximates the true function. A model can have low training error because it memorizes the training data (low bias but also fitting noise), or because it genuinely captures the underlying pattern. These are different situations with different implications.
More data reduces variance, not bias
Collecting more training data helps by reducing variance (the model's predictions become more stable). It does not reduce bias: if the model class cannot represent the true function, no amount of data will fix that. A linear model fitting a quadratic function has irreducible bias that more data cannot eliminate.
The decomposition is specific to squared loss
The clean additive decomposition MSE = Bias² + Variance + Noise above holds for squared loss. For classification with 0-1 loss there is a different additive decomposition due to Domingos 2000, using different definitions of bias and variance; see also Kohavi and Wolpert 1996 and Breiman 1998. Under these definitions variance can sometimes help (reducing error when instability flips systematically-wrong predictions toward the correct label). The framework is different from the squared-loss one, not just "more complex."
Summary
- MSE = Bias² + Variance + Irreducible Noise (exact decomposition)
- Bias: systematic error from model limitations (too simple)
- Variance: instability from sensitivity to training data (too complex)
- Noise: inherent data randomness, cannot be reduced
- Classical tradeoff: increasing complexity decreases bias, increases variance, producing a U-shaped test error curve
- Regularization controls the tradeoff: trades bias for reduced variance
- KNN: variance = $\sigma^2/k$, bias $\asymp (k/n)^{1/d}$, optimal $k$ balances both
- This picture breaks in overparameterized regimes: variance is non-monotonic, leading to double descent
Exercises
Problem
A dataset has irreducible noise $\sigma^2$. You fit a linear regression with 10 features to $n$ samples. Assuming the true model is linear, compute the expected test MSE averaged over a fresh test point drawn from the same distribution as the training inputs. How does it change if you use 50 features? Use the averaged variance form $\text{Variance} \approx \sigma^2 p / n$.
Problem
Derive the bias-variance decomposition for $k$-nearest neighbors regression. Show that the variance is exactly $\sigma^2/k$ when the noise is independent with variance $\sigma^2$.
Problem
Ridge regression with penalty $\lambda$ produces estimates $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$. Show that as $\lambda \to 0$, the estimate approaches OLS (low bias, high variance), and as $\lambda \to \infty$, the estimate approaches zero (high bias, zero variance). Explain qualitatively why there exists an optimal $\lambda$.
Problem
The classical bias-variance tradeoff predicts a U-shaped test error curve. Double descent shows a second descent past the interpolation threshold. Explain, using the bias-variance decomposition, how the variance term can be non-monotonic. Specifically, why does variance decrease in the overparameterized regime despite the model having more parameters?
References
Canonical:
- Geman, Bienenstock, Doursat, "Neural Networks and the Bias/Variance Dilemma" (Neural Computation, 1992). The paper that made the decomposition central to ML; derives it for the squared-error regression setting and discusses why variance dominates for flexible estimators.
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd ed., 2009), §7.3 (bias-variance decomposition), §7.10 (cross-validation for the tradeoff), §7.12 (conditional vs. expected test error).
- Wasserman, All of Statistics (2004), §6.3 (MSE decomposition), §20.3 (bias-variance for nonparametric regression).
Classical and textbook extensions:
- Bishop, Pattern Recognition and Machine Learning (2006), §3.2 (explicit bias-variance decomposition with a Bayesian lens).
- Domingos, "A Unified Bias-Variance Decomposition and its Applications" (ICML, 2000). Extends the decomposition beyond squared error to 0/1 loss and general loss functions.
Modern / overparameterized regime:
- Belkin, Hsu, Ma, Mandal, "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-off" (PNAS, 2019). Where the U-curve breaks.
- Neal, Mittal, Baratin, Tantia, Scicluna, Lacoste-Julien, Mitliagkas, "A Modern Take on the Bias-Variance Tradeoff in Neural Networks" (ICLR, 2019). Empirical variance decomposition across widths.
- Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent: Where Bigger Models and More Data Hurt" (ICLR, 2020). Connects bias-variance to the double-descent curve.
- Adlam & Pennington, "Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition" (NeurIPS, 2020). Splits variance into initialization, sampling, and label-noise components; the key finding is a multi-peak structure where the three variance components peak at different points along the complexity axis, so the total variance curve is not captured by any single coarse decomposition.
Next Topics
The natural next steps from the bias-variance tradeoff:
- Ridge regression: explicit bias-variance control via regularization
- Double descent: where the classical U-curve fails and a second descent appears
- Implicit bias and modern generalization: why the full picture requires understanding the algorithm, not just the hypothesis class
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
Builds on This
- Cross-Validation Theory (Layer 2)
- Decision Trees and Ensembles (Layer 2)
- Overfitting and Underfitting (Layer 1)
- Regularization Theory (Layer 2)