Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Optimization Function Classes

Bias-Variance Tradeoff

The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.

Core · Tier 2 · Stable · ~55 min

Why This Matters

[Figure: Bias², variance, and total error versus model complexity. Total error is U-shaped: bias² dominates in the underfitting region, variance in the overfitting region, with the optimum in between; irreducible noise sets the floor.]

The bias-variance decomposition is one of the core frameworks in classical statistics and supervised learning. It explains why model selection matters: simple models underfit (high bias), complex models overfit (high variance), and the best classical model balances the two. The decomposition is exact for squared-error loss; its classical U-curve prediction breaks down in the overparameterized regime (see double descent).

Every regularization technique (ridge regression, lasso, dropout, early stopping) can be understood as a mechanism for controlling the bias-variance tradeoff, though the equivalences are regime-specific. Dropout reduces to a data-dependent L_2 penalty in linear regression (Wager, Wang, Liang 2013, arXiv 1307.1493; Baldi and Sadowski 2013, arXiv 1312.6197), but does not admit a clean bias-variance decomposition for deep non-linear networks. Early stopping on full-batch GD with squared loss is approximately L_2-equivalent (Yao, Rosasco, Caponnetto 2007, "On Early Stopping in Gradient Descent Learning"), but this breaks for SGD on non-convex losses. Every model selection procedure (cross-validation, AIC/BIC) is trying to find the optimal tradeoff point.

Understanding the bias-variance tradeoff is also essential for understanding where it breaks. In modern overparameterized ML, the U-shaped test error curve gives way to double descent, and the classical tradeoff becomes only half the story.

[Interactive figure: training and test error (MSE) versus polynomial degree (0–15), alongside the model fit at degree 3 against the true function. Degree 3 gives a good fit.]

Mental Model

Imagine throwing darts at a target:

  • Low bias, low variance: Darts cluster tightly around the center. The predictions are both accurate (close to the truth) and consistent (similar across different training sets).
  • High bias, low variance: Darts cluster tightly but away from the center. The predictions are consistently wrong in the same way. Underfitting.
  • Low bias, high variance: Darts are scattered around the center. On average they are right, but any individual prediction can be far off. Overfitting.
  • High bias, high variance: Darts are scattered and away from the center. The worst case.

The tradeoff arises because reducing bias (making the model more flexible) typically increases variance (the model becomes more sensitive to the specific training data), and vice versa.

The Formal Decomposition

Theorem

Bias-Variance Decomposition of MSE

Statement

Let f(x) = \mathbb{E}[Y | X = x] be the true regression function, and let \hat{f}_S(x) be the prediction of a model trained on a random training set S. The expected squared error at a point x is:

\mathbb{E}_S\left[(Y - \hat{f}_S(x))^2\right] = \underbrace{\left(f(x) - \mathbb{E}_S[\hat{f}_S(x)]\right)^2}_{\text{Bias}^2(x)} + \underbrace{\mathbb{E}_S\left[(\hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)])^2\right]}_{\text{Variance}(x)} + \underbrace{\mathbb{E}\left[(Y - f(x))^2\right]}_{\sigma^2(x)}

In short:

\text{MSE}(x) = \text{Bias}^2(x) + \text{Variance}(x) + \text{Irreducible Noise}

The expectation is over random draws of the training set S (and the noise in Y). The bias measures systematic error; the variance measures sensitivity to the training set; the noise is inherent and cannot be reduced by any model.

Intuition

The total prediction error comes from three sources:

  1. Bias: How far off is the average prediction (averaged over all possible training sets) from the truth? A model that is too simple will consistently miss the true function. This is a systematic error.

  2. Variance: How much does the prediction change when you swap in a different training set? A model that is too complex will give very different predictions depending on which particular training examples it saw. This is instability.

  3. Noise: Even with the perfect model, the data has inherent randomness (Y = f(x) + \epsilon). No model can predict better than the noise floor.

The decomposition is exact: these three quantities sum to the total MSE with no cross-terms.

Proof Sketch

Fix the test point x. Let Y = f(x) + \epsilon_{\text{test}} with \epsilon_{\text{test}} the fresh test-point noise, and let S be the training set, with the standard assumption S \perp \epsilon_{\text{test}}. The outer expectation \mathbb{E} is jointly over S and \epsilon_{\text{test}}.

Add and subtract \mathbb{E}_S[\hat{f}_S(x)] inside the squared error:

\mathbb{E}[(Y - \hat{f}_S(x))^2] = \mathbb{E}[(Y - f(x) + f(x) - \mathbb{E}_S[\hat{f}_S(x)] + \mathbb{E}_S[\hat{f}_S(x)] - \hat{f}_S(x))^2]

Expand the square. The cross-terms vanish because:

  • Y - f(x) = \epsilon_{\text{test}} is independent of \hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)] (the test noise is independent of S, and \hat{f}_S depends only on S)
  • \mathbb{E}_S[\hat{f}_S(x)] - \hat{f}_S(x) has zero S-mean by construction
  • \epsilon_{\text{test}} has zero mean

This leaves three squared terms: noise, bias squared, and variance. To obtain the decomposition averaged over x, apply an outer expectation \mathbb{E}_x to each term. The training-set noise is distinct from \epsilon_{\text{test}} and is already packaged inside the distribution of S.
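The decomposition can be checked numerically. The sketch below (a Monte Carlo simulation under illustrative assumptions: a sine target, Gaussian noise, and a degree-3 polynomial fit) retrains on many independent training sets, estimates bias², variance, and noise at a fixed test point, and compares their sum to the directly measured expected squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3          # noise std, so sigma^2 = 0.09
x0 = 1.0             # fixed test point
n, degree = 30, 3    # training-set size, model complexity
trials = 10000

preds = np.empty(trials)
for t in range(trials):
    xs = rng.uniform(0, 2 * np.pi, n)                 # a fresh training set S
    ys = np.sin(xs) + sigma * rng.normal(size=n)
    preds[t] = np.polyval(np.polyfit(xs, ys, degree), x0)

bias2 = (np.sin(x0) - preds.mean()) ** 2              # (f(x0) - E_S[f_S(x0)])^2
variance = preds.var()                                # Var_S(f_S(x0))
noise = sigma ** 2                                    # irreducible sigma^2

y_test = np.sin(x0) + sigma * rng.normal(size=trials) # fresh noisy test labels at x0
mse = np.mean((y_test - preds) ** 2)                  # direct expected squared error

print(bias2 + variance + noise, mse)                  # agree up to Monte Carlo error
```

With enough trials the two printed numbers match closely, confirming that the three terms sum to the total MSE with no cross-terms.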

Why It Matters

This decomposition is the theoretical foundation for model selection. It tells you that there are exactly two knobs you can turn (bias and variance), and they work against each other. Every practical technique for improving generalization can be understood through this lens: regularization increases bias but decreases variance; more data decreases variance without affecting bias; ensembling decreases variance; feature selection can decrease both.

Failure Mode

The decomposition above is specific to squared loss. For 0-1 loss there is a clean additive decomposition using different definitions of bias and variance: Domingos 2000, "A Unified Bias-Variance Decomposition and its Applications," gives a single framework covering squared and 0-1 loss; see also Kohavi and Wolpert 1996, "Bias Plus Variance Decomposition for Zero-One Loss Functions," and Breiman 1998, "Arcing Classifiers." The 0-1 decomposition has non-obvious features (variance can reduce error when it flips systematically-wrong predictions), so the framework is different, not merely "more complex." Also, this decomposition describes the classical picture. In the overparameterized regime, the variance is non-monotonic (it decreases past the interpolation threshold under isotropic design), breaking the simple U-shaped tradeoff.

Definitions

Definition

Bias

The bias of a model at point x is the difference between the true function and the expected prediction:

\text{Bias}(x) = f(x) - \mathbb{E}_S[\hat{f}_S(x)]

Bias measures systematic error: how far the average model prediction is from the truth. High bias means the model class is too restrictive to capture the true relationship.

Definition

Variance

The variance of a model at point x is the expected squared deviation of the prediction from its mean:

\text{Var}(x) = \mathbb{E}_S\left[(\hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)])^2\right]

Variance measures instability: how much the prediction changes across different training sets. High variance means the model is too sensitive to the specific training data.

Definition

Irreducible Error

The irreducible error (or noise) is:

\sigma^2(x) = \mathbb{E}[(Y - f(x))^2] = \text{Var}(\epsilon | X = x)

This is the inherent randomness in the data. No model, no matter how flexible, can predict better than this. It sets the floor for achievable test error.

The U-Shaped Curve

The classical bias-variance tradeoff produces a U-shaped test error curve as model complexity increases:

  1. Low complexity (underfitting): Bias is high (the model cannot capture the true function). Variance is low (a simple model gives similar predictions regardless of training data). Test error is dominated by bias.

  2. Increasing complexity: Bias decreases as the model becomes more expressive. Variance increases as the model becomes more sensitive to training data. Test error decreases as bias reduction outweighs variance increase.

  3. Optimal complexity: The minimum of the U-curve. Bias and variance contributions are balanced.

  4. High complexity (overfitting): Bias is low (the model can fit the true function well). Variance is very high (the model fits the noise in each specific training set). Test error is dominated by variance.

Training error monotonically decreases with complexity. Test error first decreases then increases. The gap between training and test error is the generalization gap.

Canonical Examples

Example

Polynomial regression

Fit a polynomial of degree d to data generated from f(x) = \sin(x) with Gaussian noise:

  • d = 1 (linear): high bias (a line cannot capture a sine wave), low variance. Underfitting.
  • d = 3: moderate bias, moderate variance. Good fit.
  • d = 15: low bias (the polynomial can approximate any smooth function), high variance (the polynomial oscillates wildly between data points). Overfitting.

The optimal degree depends on the sample size n: more data allows higher degree without excessive variance.

Bias-Variance for Specific Models

Linear Regression

For linear regression with d features and n samples (assuming d < n and the true model is linear):

\text{Bias}^2 = 0, \qquad \text{Variance}(x) = \sigma^2 \cdot x^\top (X^\top X)^{-1} x

The zero-bias claim requires three conditions: (a) correct specification (the true conditional mean is linear in the same features), (b) OLS with no regularization, and (c) the expectation taken over the training set S with the design X fixed or exchangeable. Under (a), \mathbb{E}_S[\hat\beta] = \beta, and correct specification then gives zero prediction bias at every x.

The pointwise variance at a fixed test point x is \sigma^2 \, x^\top(X^\top X)^{-1} x. The familiar form

\overline{\text{Variance}} = \sigma^2 \cdot \frac{d}{n}

arises as an asymptotic / averaged quantity: either as the average of \sigma^2 x^\top(X^\top X)^{-1} x over a test point drawn from the same distribution as the training x_i (so that \mathbb{E}[xx^\top] = \Sigma matches the design moment), or under a random Gaussian design with x_i \sim \mathcal{N}(0, \Sigma) in the n \to \infty regime. That is the regime used throughout this page. See Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2nd ed.), §7.3.

In this averaged form the variance scales linearly with the number of parameters d and inversely with n. This is why adding features (increasing d) increases variance, and collecting more data (increasing n) decreases it.

If the true model is not linear, there is also a bias term that depends on how well the linear class approximates the truth.
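The averaged form can be sanity-checked numerically. A sketch under the isotropic Gaussian design assumption (the sizes are arbitrary): draw one design matrix, then average the pointwise variance formula over many fresh test points from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma2 = 500, 10, 0.5

X = rng.normal(size=(n, d))              # isotropic Gaussian design
G_inv = np.linalg.inv(X.T @ X)

# average the pointwise variance sigma^2 * x^T (X^T X)^{-1} x over fresh test points
x_test = rng.normal(size=(50000, d))
pointwise = sigma2 * np.einsum('ij,jk,ik->i', x_test, G_inv, x_test)

print(pointwise.mean(), sigma2 * d / n)  # both close to 0.01, up to an O(d/n) correction
```

The small residual gap between the two numbers is the finite-sample correction (exactly d/(n-d-1) versus d/n for a Gaussian design), which vanishes as n grows.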

Ridge Regression

Ridge regression adds L_2 regularization: minimize \|Xw - y\|^2 + \lambda\|w\|^2. The estimator is \hat w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y.

To read off bias and variance cleanly, diagonalize X^\top X. Let X^\top X = V D V^\top with D = \mathrm{diag}(d_1, \dots, d_p) and rotate into that eigenbasis (define \tilde w = V^\top w and \tilde\beta = V^\top \beta for the true parameter). Then each coordinate i shrinks independently:

\mathbb{E}[\tilde{\hat w}_{\lambda,i}] = \frac{d_i}{d_i + \lambda}\, \tilde\beta_i, \qquad \mathrm{Var}(\tilde{\hat w}_{\lambda,i}) = \sigma^2 \frac{d_i}{(d_i + \lambda)^2}.

Summing over coordinates:

\text{Bias}^2(\hat w_\lambda) = \sum_{i=1}^{p} \left(\frac{\lambda}{d_i + \lambda}\right)^2 \tilde\beta_i^2, \qquad \text{Variance}(\hat w_\lambda) = \sigma^2 \sum_{i=1}^{p} \frac{d_i}{(d_i + \lambda)^2}.

In words: directions with small eigenvalue d_i (poorly identified features) get shrunk hard, trading a lot of variance for a small amount of bias. Directions with large d_i are barely touched.

A useful special case: if X^\top X = I (orthonormal design), every d_i = 1 and the formulas simplify to

\text{Bias}^2 = \|\beta\|^2 \cdot \frac{\lambda^2}{(1+\lambda)^2}, \qquad \text{Variance} = \frac{p \sigma^2}{(1+\lambda)^2}.

So bias grows and variance shrinks with \lambda, but both through the factor 1/(1+\lambda)^2; bias is not simply \propto \lambda^2 unless you take \lambda \to 0.

This is why cross-validation over \lambda works: it finds the operating point on the tradeoff curve that minimizes test error for the actual eigenspectrum of your data.
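The eigenbasis formulas make the tradeoff curve easy to compute directly. A sketch with a hand-picked eigenvalue spectrum and true coefficients (all values are illustrative):

```python
import numpy as np

d = np.array([10.0, 5.0, 1.0, 0.1, 0.01])    # eigenvalues of X^T X (illustrative)
beta = np.array([1.0, -0.5, 0.8, 2.0, 1.5])  # true coefficients in the eigenbasis
sigma2 = 1.0

def bias2(lam):
    # sum_i (lam / (d_i + lam))^2 * beta_i^2
    return np.sum((lam / (d + lam)) ** 2 * beta ** 2)

def variance(lam):
    # sigma^2 * sum_i d_i / (d_i + lam)^2
    return sigma2 * np.sum(d / (d + lam) ** 2)

lams = np.logspace(-3, 2, 400)
total = np.array([bias2(l) + variance(l) for l in lams])
best = lams[np.argmin(total)]
print(f"optimal lambda ~ {best:.3f}")
```

Bias² is increasing in λ, variance is decreasing, and because the small eigenvalues (0.1, 0.01) carry most of the variance, the total-error minimum sits at an interior λ rather than at either extreme.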

K-Nearest Neighbors

Proposition

Bias-Variance for K-Nearest Neighbors

Statement

For K-nearest neighbors regression:

\text{Bias}^2 \propto \left(\frac{K}{n}\right)^{2/d} \quad \text{(increases with } K \text{, decreases with } n \text{)}

\text{Variance} = \frac{\sigma^2}{K} \quad \text{(decreases with } K \text{, conditional on fixed neighbors)}

The variance is \sigma^2 / K conditional on the neighbor positions: averaging over K neighbors reduces the noise component by a factor of K. The K nearest neighbors are themselves random (they depend on the training design X), which introduces additional variance that this simplification suppresses. The bias scales as (K/n)^{2/d} because the distance from x to its K-th neighbor scales as (K/n)^{1/d} under a bounded-below density, and the Lipschitz assumption converts distance into function-value error. See Györfi, Kohler, Krzyżak, Walk, A Distribution-Free Theory of Nonparametric Regression, Theorem 6.2, and Biau and Devroye, Lectures on the Nearest Neighbor Method.

Balancing (K/n)^{2/d} + \sigma^2/K over K gives

K^* \propto n^{2/(d+2)},

derived from \frac{d}{dK}\left[(K/n)^{2/d} + \sigma^2/K\right] = 0, i.e. K^{2/d+1} \propto n^{2/d}, so K^* \propto n^{2/(d+2)}. This depends heavily on the dimension d (the curse of dimensionality). At d=1 this gives K^* \propto n^{2/3}; at d=2, K^* \propto n^{1/2}; at large d, the exponent 2/(d+2) tends to zero while the dimension-dependent constant grows, pushing K^* toward n: the estimator can do little better than the global average.

Intuition

With K = 1: the prediction is the label of the single nearest neighbor. Zero bias (asymptotically) but maximum variance (\sigma^2): the prediction depends entirely on one noisy observation.

With K = n: the prediction is the average of all labels. Maximum bias (ignores all local structure) but minimum variance (\sigma^2/n): the prediction is the same regardless of the query point.

Increasing K averages over more neighbors, reducing variance but smoothing out local structure (increasing bias).

Why It Matters

KNN provides the cleanest example of the bias-variance tradeoff because the "complexity parameter" (K) maps directly to variance (\sigma^2/K). It also illustrates the curse of dimensionality: in high dimensions, the bias term (K/n)^{2/d} grows slowly with K, meaning you need very large K (and thus very large n) for the bias to dominate.
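The conditional variance claim is easy to verify: hold the design (and hence the neighbor set) fixed, redraw only the label noise, and the prediction variance comes out as σ²/K. A minimal sketch with illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, sigma = 200, 10, 0.5
x0 = 0.5                                      # query point

xs = rng.uniform(0, 1, n)                     # design held fixed: neighbors don't change
nbr = np.argsort(np.abs(xs - x0))[:K]         # indices of the K nearest neighbors of x0

trials = 20000
ys = np.sin(2 * np.pi * xs) + sigma * rng.normal(size=(trials, n))  # fresh labels, same xs
preds = ys[:, nbr].mean(axis=1)               # KNN prediction at x0 in each trial

print(preds.var(), sigma**2 / K)              # both ~ 0.025
```

Redrawing the design xs each trial as well would add the neighbor-position variance that the σ²/K simplification suppresses.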

Why the Classical Picture Breaks

The U-shaped curve assumes that variance increases monotonically with model complexity. In the overparameterized regime (d > n), this assumption fails:

  1. At the interpolation threshold (d = n): variance peaks (the system is exactly determined and ill-conditioned).
  2. Past the threshold (d > n): under isotropic Gaussian design, variance decreases as d increases further (Hastie, Montanari, Rosset, Tibshirani 2022, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation"). The minimum-norm interpolator spreads its weights across more dimensions, reducing the variance contribution. This clean picture is specific to the isotropic or near-isotropic spectrum; for anisotropic feature covariance the variance can be non-monotonic at very large d, and Nakkiran et al. 2019 show empirically that the variance curve need not be monotone even in the overparameterized regime.

This produces the double descent curve: the classical U-shape followed by a second descent. The bias-variance decomposition is still mathematically valid in the overparameterized regime; what changes is the behavior of the variance term (it is no longer monotonically increasing).

The classical tradeoff is a correct description of the underparameterized world. It is incomplete for the overparameterized world where modern deep learning operates. Understanding both regimes is essential.
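The peak at the interpolation threshold and the second descent can be reproduced in a few lines with minimum-norm least squares (np.linalg.pinv). This is a sketch under an illustrative misspecified linear model: the labels depend on 200 features, but the fit uses only the first d of them; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_total, sigma = 40, 200, 0.5
beta = rng.normal(size=d_total) / np.sqrt(d_total)   # signal spread over all 200 features

def avg_test_mse(d, trials=200, n_test=200):
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, d_total))
        y = X @ beta + sigma * rng.normal(size=n)
        w = np.linalg.pinv(X[:, :d]) @ y             # min-norm least squares on d features
        Xt = rng.normal(size=(n_test, d_total))
        yt = Xt @ beta + sigma * rng.normal(size=n_test)
        errs.append(np.mean((Xt[:, :d] @ w - yt) ** 2))
    return float(np.mean(errs))

mse = {d: avg_test_mse(d) for d in (10, 40, 200)}    # under-, at-, over-parameterized
for d, m in mse.items():
    print(f"d = {d:3d}: average test MSE = {m:.2f}")
```

Test error is moderate at d = 10, spikes at the interpolation threshold d = n = 40 (the design matrix is square and ill-conditioned, so the min-norm solution blows up), and falls again at d = 200.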

Common Confusions

Watch Out

The optimum is not where the bias and variance curves cross

The minimum of test MSE sits at the minimum of \mathrm{Bias}^2 + \mathrm{Variance}, not where the two curves intersect. Formally, the first-order optimality condition at complexity parameter K is

\frac{d}{dK}\bigl[\mathrm{Bias}^2(K) + \mathrm{Variance}(K)\bigr] = 0, \quad \text{i.e.} \quad \frac{d \, \mathrm{Bias}^2}{dK} = -\frac{d \, \mathrm{Variance}}{dK}.

That is, the two slopes cancel. Intersection of the two curves (\mathrm{Bias}^2 = \mathrm{Variance}) is a different equation and generally sits at a different point. Intersection coincides with the minimum only when both curves are locally symmetric around that point, which is a special case, not the rule. If variance rises faster than bias falls near the intersection, the optimum sits to the left of the crossing; if bias falls faster than variance rises, the optimum sits to the right. Always read the total-error curve, never the crossing point.
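A quick numeric illustration with made-up curves (bias² rising in a complexity knob K, variance falling like 1/K) shows the two points are genuinely different:

```python
import numpy as np

K = np.linspace(0.1, 20, 20000)     # complexity knob (continuous for illustration)
bias2 = 0.01 * K**2                 # made-up: bias^2 rising in K
var = 1.0 / K                       # made-up: variance falling like 1/K
total = bias2 + var

k_opt = K[np.argmin(total)]                   # minimum of the total-error curve
k_cross = K[np.argmin(np.abs(bias2 - var))]   # where the two curves intersect

print(k_opt, k_cross)  # ~3.68 vs ~4.64: the optimum is not the crossing
```

Analytically, the optimum solves 0.02K = 1/K², giving K = 50^{1/3} ≈ 3.68, while the crossing solves 0.01K² = 1/K, giving K = 100^{1/3} ≈ 4.64.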

Watch Out

Bias-variance tradeoff is about the model class, not a single model

The bias and variance are defined over the randomness of the training set. For a single fixed model trained on a single fixed dataset, there is no tradeoff. The model either performs well or it does not. The tradeoff is a property of the procedure: how would this modeling approach perform across many possible training sets drawn from the same distribution?

Watch Out

Low training error does not mean low bias

Training error measures how well the model fits the data it was trained on. Bias measures how well the average model (over all possible training sets) approximates the true function. A model can have low training error because it memorizes the training data (low bias but also fitting noise), or because it genuinely captures the underlying pattern. These are different situations with different implications.

Watch Out

More data reduces variance, not bias

Collecting more training data helps by reducing variance (the model's predictions become more stable). It does not reduce bias: if the model class cannot represent the true function, no amount of data will fix that. A linear model fitting a quadratic function has irreducible bias that more data cannot eliminate.

Watch Out

The decomposition is specific to squared loss

The clean additive decomposition MSE = Bias^2 + Variance + Noise above holds for squared loss. For classification with 0-1 loss there is a different additive decomposition due to Domingos 2000, using different definitions of bias and variance; see also Kohavi and Wolpert 1996 and Breiman 1998. Under these definitions variance can sometimes help (reducing error when instability flips systematically-wrong predictions toward the correct label). The framework is different from the squared-loss one, not just "more complex."

Summary

  • MSE = Bias^2 + Variance + Irreducible Noise (exact decomposition)
  • Bias: systematic error from model limitations (too simple)
  • Variance: instability from sensitivity to training data (too complex)
  • Noise: inherent data randomness, cannot be reduced
  • Classical tradeoff: increasing complexity decreases bias, increases variance, producing a U-shaped test error curve
  • Regularization controls the tradeoff: \lambda trades bias for reduced variance
  • KNN: variance = \sigma^2/K, bias \propto (K/n)^{2/d}, optimal K balances both
  • This picture breaks in overparameterized regimes: variance is non-monotonic, leading to double descent

Exercises

ExerciseCore

Problem

A dataset has irreducible noise \sigma^2 = 0.5. You fit a linear regression with 10 features to n = 100 samples. Assuming the true model is linear, compute the expected test MSE averaged over a fresh test point x drawn from the same distribution as the training inputs. How does it change if you use 50 features? Use the averaged variance form \sigma^2 d/n.

ExerciseCore

Problem

Derive the bias-variance decomposition for K-nearest neighbors regression. Show that the variance is exactly \sigma^2 / K when the noise is independent with variance \sigma^2.

ExerciseAdvanced

Problem

Ridge regression with penalty \lambda produces estimates \hat{w}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y. Show that as \lambda \to 0, the estimate approaches OLS (low bias, high variance), and as \lambda \to \infty, the estimate approaches zero (high bias, zero variance). Explain qualitatively why there exists an optimal \lambda^*.

ExerciseResearch

Problem

The classical bias-variance tradeoff predicts a U-shaped test error curve. Double descent shows a second descent past the interpolation threshold. Explain, using the bias-variance decomposition, how the variance term can be non-monotonic. Specifically, why does variance decrease in the overparameterized regime despite the model having more parameters?

References

Canonical:

  • Geman, Bienenstock, Doursat, "Neural Networks and the Bias/Variance Dilemma" (Neural Computation, 1992). The paper that made the decomposition central to ML; derives it for the squared-error regression setting and discusses why variance dominates for flexible estimators.
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd ed., 2009), §7.3 (bias-variance decomposition), §7.10 (cross-validation for the tradeoff), §7.12 (conditional vs. expected test error).
  • Wasserman, All of Statistics (2004), §6.3 (MSE decomposition), §20.3 (bias-variance for nonparametric regression).

Classical and textbook extensions:

  • Bishop, Pattern Recognition and Machine Learning (2006), §3.2 (explicit bias-variance decomposition with a Bayesian lens).
  • Domingos, "A Unified Bias-Variance Decomposition and its Applications" (ICML, 2000). Extends the decomposition beyond squared error to 0/1 loss and general loss functions.

Modern / overparameterized regime:

  • Belkin, Hsu, Ma, Mandal, "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-off" (PNAS, 2019). Where the U-curve breaks.
  • Neal, Mittal, Baratin, Tantia, Scicluna, Lacoste-Julien, Mitliagkas, "A Modern Take on the Bias-Variance Tradeoff in Neural Networks" (ICLR, 2019). Empirical variance decomposition across widths.
  • Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent: Where Bigger Models and More Data Hurt" (ICLR, 2020). Connects bias-variance to the double-descent curve.
  • Adlam & Pennington, "Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition" (NeurIPS, 2020). Splits variance into initialization, sampling, and label-noise components; the key finding is a multi-peak structure where the three variance components peak at different points along the complexity axis, so the total variance curve is not captured by any single coarse decomposition.

Next Topics

The natural next steps from the bias-variance tradeoff:

  • Ridge regression: explicit bias-variance control via L_2 regularization
  • Double descent: where the classical U-curve fails and a second descent appears
  • Implicit bias and modern generalization: why the full picture requires understanding the algorithm, not just the hypothesis class

Last reviewed: April 2026
