Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Optimization Function Classes

Bias-Variance Tradeoff

The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.

Core · Tier 2 · Stable · ~55 min

Why This Matters

[Figure: Bias², variance, and total error versus model complexity. Total error is U-shaped: bias² dominates in the underfitting region, variance in the overfitting region, with the optimum in between; irreducible noise sets the floor.]

The bias-variance decomposition is one of the core frameworks in classical statistics and supervised learning. It explains why model selection matters: simple models underfit (high bias), complex models overfit (high variance), and the best classical model balances the two. The decomposition is exact for squared-error loss; its classical U-curve prediction breaks down in the overparameterized regime (see double descent).

Every regularization technique (ridge regression, lasso, dropout, early stopping) can be understood as a mechanism for controlling the bias-variance tradeoff, though the equivalences are regime-specific. Dropout reduces to a data-dependent L_2 penalty in linear regression (Wager, Wang, Liang 2013, arXiv 1307.1493; Baldi and Sadowski 2013, arXiv 1312.6197), but does not admit a clean bias-variance decomposition for deep non-linear networks. Early stopping on full-batch GD with squared loss is approximately L_2-equivalent (Yao, Rosasco, Caponnetto 2007, "On Early Stopping in Gradient Descent Learning"), but this breaks for SGD on non-convex losses. Every model selection procedure (cross-validation, AIC/BIC) is trying to find the optimal tradeoff point.

Understanding the bias-variance tradeoff is also essential for understanding where it breaks. In modern overparameterized ML, the U-shaped test error curve gives way to double descent, and the classical tradeoff becomes only half the story.

[Interactive figure: training and test error (MSE) versus polynomial degree (0–15), alongside the model fit at degree 3 against the true function. Degree 3 gives a good fit.]

Mental Model

Imagine throwing darts at a target:

  • Low bias, low variance: Darts cluster tightly around the center. The predictions are both accurate (close to the truth) and consistent (similar across different training sets).
  • High bias, low variance: Darts cluster tightly but away from the center. The predictions are consistently wrong in the same way. Underfitting.
  • Low bias, high variance: Darts are scattered around the center. On average they are right, but any individual prediction can be far off. Overfitting.
  • High bias, high variance: Darts are scattered and away from the center. The worst case.

The tradeoff arises because reducing bias (making the model more flexible) typically increases variance (the model becomes more sensitive to the specific training data), and vice versa.

The Formal Decomposition

Theorem

Bias-Variance Decomposition of MSE

Statement

Let f(x) = \mathbb{E}[Y | X = x] be the true regression function, and let \hat{f}_S(x) be the prediction of a model trained on a random training set S. The expected squared error at a point x is:

\mathbb{E}_S\left[(Y - \hat{f}_S(x))^2\right] = \underbrace{\left(f(x) - \mathbb{E}_S[\hat{f}_S(x)]\right)^2}_{\text{Bias}^2(x)} + \underbrace{\mathbb{E}_S\left[(\hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)])^2\right]}_{\text{Variance}(x)} + \underbrace{\mathbb{E}\left[(Y - f(x))^2\right]}_{\sigma^2(x)}

In short:

\text{MSE}(x) = \text{Bias}^2(x) + \text{Variance}(x) + \text{Irreducible Noise}

The expectation is over random draws of the training set S (and the noise in Y). The bias measures systematic error; the variance measures sensitivity to the training set; the noise is inherent and cannot be reduced by any model.

Intuition

The total prediction error comes from three sources:

  1. Bias: How far off is the average prediction (averaged over all possible training sets) from the truth? A model that is too simple will consistently miss the true function. This is a systematic error.

  2. Variance: How much does the prediction change when you swap in a different training set? A model that is too complex will give very different predictions depending on which particular training examples it saw. This is instability.

  3. Noise: Even with the perfect model, the data has inherent randomness (Y = f(x) + \epsilon). No model can predict better than the noise floor.

The decomposition is exact: these three quantities sum to the total MSE with no cross-terms.

Proof Sketch

Fix the test point x. Let Y = f(x) + \epsilon_{\text{test}} with \epsilon_{\text{test}} the fresh test-point noise, and let S be the training set, with the standard assumption S \perp \epsilon_{\text{test}}. The outer expectation \mathbb{E} is jointly over S and \epsilon_{\text{test}}.

Add and subtract \mathbb{E}_S[\hat{f}_S(x)] inside the squared error:

\mathbb{E}[(Y - \hat{f}_S(x))^2] = \mathbb{E}[(Y - f(x) + f(x) - \mathbb{E}_S[\hat{f}_S(x)] + \mathbb{E}_S[\hat{f}_S(x)] - \hat{f}_S(x))^2]

Expand the square. The cross-terms vanish because:

  • Y - f(x) = \epsilon_{\text{test}} is independent of \hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)] (the test noise is independent of S, and \hat{f}_S depends only on S)
  • \mathbb{E}_S[\hat{f}_S(x)] - \hat{f}_S(x) has zero S-mean by construction
  • \epsilon_{\text{test}} has zero mean

This leaves three squared terms: noise, bias squared, and variance. To obtain the decomposition averaged over x, apply an outer expectation \mathbb{E}_x to each term. The training-set noise is distinct from \epsilon_{\text{test}} and is already packaged inside the distribution of S.
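The decomposition can be checked numerically. The sketch below (a Monte Carlo simulation under illustrative assumptions: a sine target, Gaussian noise, and a degree-3 polynomial fit) retrains on many independent training sets, estimates bias², variance, and noise at a fixed test point, and compares their sum to the directly measured expected squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3          # noise std, so sigma^2 = 0.09
x0 = 1.0             # fixed test point
n, degree = 30, 3    # training-set size, model complexity
trials = 10000

preds = np.empty(trials)
for t in range(trials):
    xs = rng.uniform(0, 2 * np.pi, n)                 # a fresh training set S
    ys = np.sin(xs) + sigma * rng.normal(size=n)
    preds[t] = np.polyval(np.polyfit(xs, ys, degree), x0)

bias2 = (np.sin(x0) - preds.mean()) ** 2              # (f(x0) - E_S[f_S(x0)])^2
variance = preds.var()                                # Var_S(f_S(x0))
noise = sigma ** 2                                    # irreducible sigma^2

y_test = np.sin(x0) + sigma * rng.normal(size=trials) # fresh noisy test labels at x0
mse = np.mean((y_test - preds) ** 2)                  # direct expected squared error

print(bias2 + variance + noise, mse)                  # agree up to Monte Carlo error
```

With enough trials the two printed numbers match closely, confirming that the three terms sum to the total MSE with no cross-terms.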

Why It Matters

This decomposition is the theoretical foundation for model selection. It tells you that there are exactly two knobs you can turn (bias and variance), and they work against each other. Every practical technique for improving generalization can be understood through this lens: regularization increases bias but decreases variance; more data decreases variance without affecting bias; ensembling decreases variance; feature selection can decrease both.

Failure Mode

The decomposition above is specific to squared loss. For 0-1 loss there is a clean additive decomposition using different definitions of bias and variance: Domingos 2000, "A Unified Bias-Variance Decomposition and its Applications," gives a single framework covering squared and 0-1 loss; see also Kohavi and Wolpert 1996, "Bias Plus Variance Decomposition for Zero-One Loss Functions," and Breiman 1998, "Arcing Classifiers." The 0-1 decomposition has non-obvious features (variance can reduce error when it flips systematically-wrong predictions), so the framework is different, not merely "more complex." Also, this decomposition describes the classical picture. In the overparameterized regime, the variance is non-monotonic (it decreases past the interpolation threshold under isotropic design), breaking the simple U-shaped tradeoff.

Definitions

Definition

Bias

The bias of a model at point x is the difference between the true function and the expected prediction:

\text{Bias}(x) = f(x) - \mathbb{E}_S[\hat{f}_S(x)]

Bias measures systematic error: how far the average model prediction is from the truth. High bias means the model class is too restrictive to capture the true relationship.

Definition

Variance

The variance of a model at point x is the expected squared deviation of the prediction from its mean:

\text{Var}(x) = \mathbb{E}_S\left[(\hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)])^2\right]

Variance measures instability: how much the prediction changes across different training sets. High variance means the model is too sensitive to the specific training data.

Definition

Irreducible Error

The irreducible error (or noise) is:

\sigma^2(x) = \mathbb{E}[(Y - f(x))^2] = \text{Var}(\epsilon | X = x)

This is the inherent randomness in the data. No model, no matter how flexible, can predict better than this. It sets the floor for achievable test error.

The U-Shaped Curve

The classical bias-variance tradeoff produces a U-shaped test error curve as model complexity increases:

  1. Low complexity (underfitting): Bias is high (the model cannot capture the true function). Variance is low (a simple model gives similar predictions regardless of training data). Test error is dominated by bias.

  2. Increasing complexity: Bias decreases as the model becomes more expressive. Variance increases as the model becomes more sensitive to training data. Test error decreases as bias reduction outweighs variance increase.

  3. Optimal complexity: The minimum of the U-curve. Bias and variance contributions are balanced.

  4. High complexity (overfitting): Bias is low (the model can fit the true function well). Variance is very high (the model fits the noise in each specific training set). Test error is dominated by variance.

Training error monotonically decreases with complexity. Test error first decreases then increases. The gap between training and test error is the generalization gap.

Canonical Examples

Example

Polynomial regression

Fit a polynomial of degree d to data generated from f(x) = \sin(x) with Gaussian noise:

  • d = 1 (linear): high bias (a line cannot capture a sine wave), low variance. Underfitting.
  • d = 3: moderate bias, moderate variance. Good fit.
  • d = 15: low bias (the polynomial can approximate any smooth function), high variance (the polynomial oscillates wildly between data points). Overfitting.

The optimal degree depends on the sample size n: more data allows higher degree without excessive variance.

Bias-Variance for Specific Models

Linear Regression

For linear regression with d features and n samples (assuming d < n and the true model is linear):

\text{Bias}^2 = 0, \qquad \text{Variance}(x) = \sigma^2 \cdot x^\top (X^\top X)^{-1} x

The zero-bias claim requires three conditions: (a) correct specification (the true conditional mean is linear in the same features), (b) OLS with no regularization, and (c) the expectation taken over the training set S with the design X fixed or exchangeable. Under (a), \mathbb{E}_S[\hat\beta] = \beta, and correct specification then gives zero prediction bias at every x.

The pointwise variance at a fixed test point x is \sigma^2 \, x^\top(X^\top X)^{-1} x. The familiar form

\overline{\text{Variance}} = \sigma^2 \cdot \frac{d}{n}

arises as an asymptotic / averaged quantity: either as the average of \sigma^2 x^\top(X^\top X)^{-1} x over a test point drawn from the same distribution as the training x_i (so that \mathbb{E}[xx^\top] = \Sigma matches the design moment), or under a random Gaussian design with x_i \sim \mathcal{N}(0, \Sigma) in the n \to \infty regime. That is the regime used throughout this page. See Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2nd ed.), §7.3.

In this averaged form the variance scales linearly with the number of parameters d and inversely with n. This is why adding features (increasing d) increases variance, and collecting more data (increasing n) decreases it.

If the true model is not linear, there is also a bias term that depends on how well the linear class approximates the truth.
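The averaged form can be sanity-checked numerically. A sketch under the isotropic Gaussian design assumption (the sizes are arbitrary): draw one design matrix, then average the pointwise variance formula over many fresh test points from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma2 = 500, 10, 0.5

X = rng.normal(size=(n, d))              # isotropic Gaussian design
G_inv = np.linalg.inv(X.T @ X)

# average the pointwise variance sigma^2 * x^T (X^T X)^{-1} x over fresh test points
x_test = rng.normal(size=(50000, d))
pointwise = sigma2 * np.einsum('ij,jk,ik->i', x_test, G_inv, x_test)

print(pointwise.mean(), sigma2 * d / n)  # both close to 0.01, up to an O(d/n) correction
```

The small residual gap between the two numbers is the finite-sample correction (exactly d/(n-d-1) versus d/n for a Gaussian design), which vanishes as n grows.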

Ridge Regression

Ridge regression adds L_2 regularization: minimize \|Xw - y\|^2 + \lambda\|w\|^2. The estimator is \hat w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y.

To read off bias and variance cleanly, diagonalize X^\top X. Let X^\top X = V D V^\top with D = \mathrm{diag}(d_1, \dots, d_p) and rotate into that eigenbasis (define \tilde w = V^\top w and \tilde\beta = V^\top \beta for the true parameter). Then each coordinate i shrinks independently:

\mathbb{E}[\tilde{\hat w}_{\lambda,i}] = \frac{d_i}{d_i + \lambda}\, \tilde\beta_i, \qquad \mathrm{Var}(\tilde{\hat w}_{\lambda,i}) = \sigma^2 \frac{d_i}{(d_i + \lambda)^2}.

Summing over coordinates:

\text{Bias}^2(\hat w_\lambda) = \sum_{i=1}^{p} \left(\frac{\lambda}{d_i + \lambda}\right)^2 \tilde\beta_i^2, \qquad \text{Variance}(\hat w_\lambda) = \sigma^2 \sum_{i=1}^{p} \frac{d_i}{(d_i + \lambda)^2}.

In words: directions with small eigenvalue d_i (poorly identified features) get shrunk hard, trading a lot of variance for a small amount of bias. Directions with large d_i are barely touched.

A useful special case: if X^\top X = I (orthonormal design), every d_i = 1 and the formulas simplify to

\text{Bias}^2 = \|\beta\|^2 \cdot \frac{\lambda^2}{(1+\lambda)^2}, \qquad \text{Variance} = \frac{p \sigma^2}{(1+\lambda)^2}.

So bias grows and variance shrinks with \lambda, but both through the factor 1/(1+\lambda)^2; bias is not simply \propto \lambda^2 unless you take \lambda \to 0.

This is why cross-validation over \lambda works: it finds the operating point on the tradeoff curve that minimizes test error for the actual eigenspectrum of your data.
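The eigenbasis formulas make the tradeoff curve easy to compute directly. A sketch with a hand-picked eigenvalue spectrum and true coefficients (all values are illustrative):

```python
import numpy as np

d = np.array([10.0, 5.0, 1.0, 0.1, 0.01])    # eigenvalues of X^T X (illustrative)
beta = np.array([1.0, -0.5, 0.8, 2.0, 1.5])  # true coefficients in the eigenbasis
sigma2 = 1.0

def bias2(lam):
    # sum_i (lam / (d_i + lam))^2 * beta_i^2
    return np.sum((lam / (d + lam)) ** 2 * beta ** 2)

def variance(lam):
    # sigma^2 * sum_i d_i / (d_i + lam)^2
    return sigma2 * np.sum(d / (d + lam) ** 2)

lams = np.logspace(-3, 2, 400)
total = np.array([bias2(l) + variance(l) for l in lams])
best = lams[np.argmin(total)]
print(f"optimal lambda ~ {best:.3f}")
```

Bias² is increasing in λ, variance is decreasing, and because the small eigenvalues (0.1, 0.01) carry most of the variance, the total-error minimum sits at an interior λ rather than at either extreme.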

K-Nearest Neighbors

Proposition

Bias-Variance for K-Nearest Neighbors

Statement

For K-nearest neighbors regression:

\text{Bias}^2 \propto \left(\frac{K}{n}\right)^{2/d} \quad \text{(increases with } K \text{, decreases with } n \text{)}

\text{Variance} = \frac{\sigma^2}{K} \quad \text{(decreases with } K \text{, conditional on fixed neighbors)}

The variance is \sigma^2 / K conditional on the neighbor positions: averaging over K neighbors reduces the noise component by a factor of K. The K nearest neighbors are themselves random (they depend on the training design X), which introduces additional variance that this simplification suppresses. The bias scales as (K/n)^{2/d} because the distance from x to its K-th neighbor scales as (K/n)^{1/d} under a bounded-below density, and the Lipschitz assumption converts distance into function-value error. See Györfi, Kohler, Krzyżak, Walk, A Distribution-Free Theory of Nonparametric Regression, Theorem 6.2, and Biau and Devroye, Lectures on the Nearest Neighbor Method.

Balancing (K/n)^{2/d} + \sigma^2/K over K gives

K^* \propto n^{2/(d+2)},

derived from \frac{d}{dK}\left[(K/n)^{2/d} + \sigma^2/K\right] = 0, i.e. K^{2/d+1} \propto n^{2/d}, so K^* \propto n^{2/(d+2)}. This depends heavily on the dimension d (the curse of dimensionality). At d=1 this gives K^* \propto n^{2/3}; at d=2, K^* \propto n^{1/2}; at large d, the exponent 2/(d+2) tends to zero while the dimension-dependent constant grows, pushing K^* toward n: the estimator can do little better than the global average.

Intuition

With K = 1: the prediction is the label of the single nearest neighbor. Zero bias (asymptotically) but maximum variance (\sigma^2): the prediction depends entirely on one noisy observation.

With K = n: the prediction is the average of all labels. Maximum bias (ignores all local structure) but minimum variance (\sigma^2/n): the prediction is the same regardless of the query point.

Increasing K averages over more neighbors, reducing variance but smoothing out local structure (increasing bias).

Why It Matters

KNN provides the cleanest example of the bias-variance tradeoff because the "complexity parameter" (K) maps directly to variance (\sigma^2/K). It also illustrates the curse of dimensionality: in high dimensions, the bias term (K/n)^{2/d} grows slowly with K, meaning you need very large K (and thus very large n) for the bias to dominate.
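The conditional variance claim is easy to verify: hold the design (and hence the neighbor set) fixed, redraw only the label noise, and the prediction variance comes out as σ²/K. A minimal sketch with illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, sigma = 200, 10, 0.5
x0 = 0.5                                      # query point

xs = rng.uniform(0, 1, n)                     # design held fixed: neighbors don't change
nbr = np.argsort(np.abs(xs - x0))[:K]         # indices of the K nearest neighbors of x0

trials = 20000
ys = np.sin(2 * np.pi * xs) + sigma * rng.normal(size=(trials, n))  # fresh labels, same xs
preds = ys[:, nbr].mean(axis=1)               # KNN prediction at x0 in each trial

print(preds.var(), sigma**2 / K)              # both ~ 0.025
```

Redrawing the design xs each trial as well would add the neighbor-position variance that the σ²/K simplification suppresses.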

Why the Classical Picture Breaks

The U-shaped curve assumes that variance increases monotonically with model complexity. In the overparameterized regime (d > n), this assumption fails:

  1. At the interpolation threshold (d = n): variance peaks (the system is exactly determined and ill-conditioned).
  2. Past the threshold (d > n): under isotropic Gaussian design, variance decreases as d increases further (Hastie, Montanari, Rosset, Tibshirani 2022, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation"). The minimum-norm interpolator spreads its weights across more dimensions, reducing the variance contribution. This clean picture is specific to the isotropic or near-isotropic spectrum; for anisotropic feature covariance the variance can be non-monotonic at very large d, and Nakkiran et al. 2019 show empirically that the variance curve need not be monotone even in the overparameterized regime.

This produces the double descent curve: the classical U-shape followed by a second descent. The bias-variance decomposition is still mathematically valid in the overparameterized regime; what changes is the behavior of the variance term (it is no longer monotonically increasing).

The classical tradeoff is a correct description of the underparameterized world. It is incomplete for the overparameterized world where modern deep learning operates. Understanding both regimes is essential.
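The peak at the interpolation threshold and the second descent can be reproduced in a few lines with minimum-norm least squares (np.linalg.pinv). This is a sketch under an illustrative misspecified linear model: the labels depend on 200 features, but the fit uses only the first d of them; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_total, sigma = 40, 200, 0.5
beta = rng.normal(size=d_total) / np.sqrt(d_total)   # signal spread over all 200 features

def avg_test_mse(d, trials=200, n_test=200):
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, d_total))
        y = X @ beta + sigma * rng.normal(size=n)
        w = np.linalg.pinv(X[:, :d]) @ y             # min-norm least squares on d features
        Xt = rng.normal(size=(n_test, d_total))
        yt = Xt @ beta + sigma * rng.normal(size=n_test)
        errs.append(np.mean((Xt[:, :d] @ w - yt) ** 2))
    return float(np.mean(errs))

mse = {d: avg_test_mse(d) for d in (10, 40, 200)}    # under-, at-, over-parameterized
for d, m in mse.items():
    print(f"d = {d:3d}: average test MSE = {m:.2f}")
```

Test error is moderate at d = 10, spikes at the interpolation threshold d = n = 40 (the design matrix is square and ill-conditioned, so the min-norm solution blows up), and falls again at d = 200.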

Common Confusions

Watch Out

The optimum is not where the bias and variance curves cross

The minimum of test MSE sits at the minimum of \mathrm{Bias}^2 + \mathrm{Variance}, not where the two curves intersect. Formally, the first-order optimality condition at complexity parameter K is

\frac{d}{dK}\bigl[\mathrm{Bias}^2(K) + \mathrm{Variance}(K)\bigr] = 0, \quad \text{i.e.} \quad \frac{d \, \mathrm{Bias}^2}{dK} = -\frac{d \, \mathrm{Variance}}{dK}.

That is, the two slopes cancel. Intersection of the two curves (\mathrm{Bias}^2 = \mathrm{Variance}) is a different equation and generally sits at a different point. Intersection coincides with the minimum only when both curves are locally symmetric around that point, which is a special case, not the rule. If variance rises faster than bias falls near the intersection, the optimum sits to the left of the crossing; if bias falls faster than variance rises, the optimum sits to the right. Always read the total-error curve, never the crossing point.
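A quick numeric illustration with made-up curves (bias² rising in a complexity knob K, variance falling like 1/K) shows the two points are genuinely different:

```python
import numpy as np

K = np.linspace(0.1, 20, 20000)     # complexity knob (continuous for illustration)
bias2 = 0.01 * K**2                 # made-up: bias^2 rising in K
var = 1.0 / K                       # made-up: variance falling like 1/K
total = bias2 + var

k_opt = K[np.argmin(total)]                   # minimum of the total-error curve
k_cross = K[np.argmin(np.abs(bias2 - var))]   # where the two curves intersect

print(k_opt, k_cross)  # ~3.68 vs ~4.64: the optimum is not the crossing
```

Analytically, the optimum solves 0.02K = 1/K², giving K = 50^{1/3} ≈ 3.68, while the crossing solves 0.01K² = 1/K, giving K = 100^{1/3} ≈ 4.64.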

Watch Out

Bias-variance tradeoff is about the model class, not a single model

The bias and variance are defined over the randomness of the training set. For a single fixed model trained on a single fixed dataset, there is no tradeoff. The model either performs well or it does not. The tradeoff is a property of the procedure: how would this modeling approach perform across many possible training sets drawn from the same distribution?

Watch Out

Low training error does not mean low bias

Training error measures how well the model fits the data it was trained on. Bias measures how well the average model (over all possible training sets) approximates the true function. A model can have low training error because it memorizes the training data (low bias but also fitting noise), or because it genuinely captures the underlying pattern. These are different situations with different implications.

Watch Out

More data reduces variance, not bias

Collecting more training data helps by reducing variance (the model's predictions become more stable). It does not reduce bias: if the model class cannot represent the true function, no amount of data will fix that. A linear model fitting a quadratic function has irreducible bias that more data cannot eliminate.

Watch Out

The decomposition is specific to squared loss

The clean additive decomposition MSE = Bias^2 + Variance + Noise above holds for squared loss. For classification with 0-1 loss there is a different additive decomposition due to Domingos 2000, using different definitions of bias and variance; see also Kohavi and Wolpert 1996 and Breiman 1998. Under these definitions variance can sometimes help (reducing error when instability flips systematically-wrong predictions toward the correct label). The framework is different from the squared-loss one, not just "more complex."

Summary

  • MSE = Bias^2 + Variance + Irreducible Noise (exact decomposition)
  • Bias: systematic error from model limitations (too simple)
  • Variance: instability from sensitivity to training data (too complex)
  • Noise: inherent data randomness, cannot be reduced
  • Classical tradeoff: increasing complexity decreases bias, increases variance, producing a U-shaped test error curve
  • Regularization controls the tradeoff: \lambda trades bias for reduced variance
  • KNN: variance = \sigma^2/K, bias \propto (K/n)^{2/d}, optimal K balances both
  • This picture breaks in overparameterized regimes: variance is non-monotonic, leading to double descent

Exercises

ExerciseCore

Problem

A dataset has irreducible noise \sigma^2 = 0.5. You fit a linear regression with 10 features to n = 100 samples. Assuming the true model is linear, compute the expected test MSE averaged over a fresh test point x drawn from the same distribution as the training inputs. How does it change if you use 50 features? Use the averaged variance form \sigma^2 d/n.

ExerciseCore

Problem

Derive the bias-variance decomposition for K-nearest neighbors regression. Show that the variance is exactly \sigma^2 / K when the noise is independent with variance \sigma^2.

ExerciseAdvanced

Problem

Ridge regression with penalty \lambda produces estimates \hat{w}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y. Show that as \lambda \to 0, the estimate approaches OLS (low bias, high variance), and as \lambda \to \infty, the estimate approaches zero (high bias, zero variance). Explain qualitatively why there exists an optimal \lambda^*.

ExerciseResearch

Problem

The classical bias-variance tradeoff predicts a U-shaped test error curve. Double descent shows a second descent past the interpolation threshold. Explain, using the bias-variance decomposition, how the variance term can be non-monotonic. Specifically, why does variance decrease in the overparameterized regime despite the model having more parameters?

References

Canonical:

  • Geman, Bienenstock, Doursat, "Neural Networks and the Bias/Variance Dilemma" (Neural Computation, 1992). The paper that made the decomposition central to ML; derives it for the squared-error regression setting and discusses why variance dominates for flexible estimators.
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd ed., 2009), §7.3 (bias-variance decomposition), §7.10 (cross-validation for the tradeoff), §7.12 (conditional vs. expected test error).
  • Wasserman, All of Statistics (2004), §6.3 (MSE decomposition), §20.3 (bias-variance for nonparametric regression).

Classical and textbook extensions:

  • Bishop, Pattern Recognition and Machine Learning (2006), §3.2 (explicit bias-variance decomposition with a Bayesian lens).
  • Domingos, "A Unified Bias-Variance Decomposition and its Applications" (ICML, 2000). Extends the decomposition beyond squared error to 0/1 loss and general loss functions.

Modern / overparameterized regime:

  • Belkin, Hsu, Ma, Mandal, "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-off" (PNAS, 2019). Where the U-curve breaks.
  • Neal, Mittal, Baratin, Tantia, Scicluna, Lacoste-Julien, Mitliagkas, "A Modern Take on the Bias-Variance Tradeoff in Neural Networks" (ICLR, 2019). Empirical variance decomposition across widths.
  • Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent: Where Bigger Models and More Data Hurt" (ICLR, 2020). Connects bias-variance to the double-descent curve.
  • Adlam & Pennington, "Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition" (NeurIPS, 2020). Splits variance into initialization, sampling, and label-noise components; the key finding is a multi-peak structure where the three variance components peak at different points along the complexity axis, so the total variance curve is not captured by any single coarse decomposition.

Next Topics

The natural next steps from the bias-variance tradeoff:

  • Ridge regression: explicit bias-variance control via L_2 regularization
  • Double descent: where the classical U-curve fails and a second descent appears
  • Implicit bias and modern generalization: why the full picture requires understanding the algorithm, not just the hypothesis class

Last reviewed: April 2026
