ML Methods
Overfitting and Underfitting
The two failure modes of supervised learning: models that memorize noise versus models too simple to capture signal. Diagnosis via training-validation gaps.
Prerequisites
Why This Matters
Before tuning hyperparameters, adding regularization, or collecting more data, you need to know which failure mode you are in. Overfitting and underfitting are the first diagnostic every practitioner should check. The treatments are opposite: reducing model complexity helps against overfitting but worsens underfitting. Getting the diagnosis wrong wastes time and compute.
Mental Model
A model that memorizes every training example, including noise, will score perfectly on training data but fail on new inputs. That is overfitting. A model so rigid it cannot capture the true pattern will perform poorly on both training and test data. That is underfitting. The goal is the region between these two extremes where the model captures signal without memorizing noise. Cross-validation is the standard diagnostic tool, and techniques like dropout help prevent overfitting in deep networks.
Formal Setup
Let $R(h)$ denote the population risk and $\hat{R}_n(h)$ the empirical risk of hypothesis $h$:

$$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\ell(h(x), y)\right], \qquad \hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i).$$

The learner selects a model $\hat{h}$ from a hypothesis class $\mathcal{H}$.
Overfitting
A model $\hat{h}$ overfits when its population risk $R(\hat{h})$ is substantially larger than its empirical risk $\hat{R}_n(\hat{h})$. The model has learned patterns specific to the training sample that do not generalize. Equivalently, the generalization gap $R(\hat{h}) - \hat{R}_n(\hat{h})$ is large.
Underfitting
A model underfits when both its population risk and empirical risk are high relative to the Bayes risk $R^*$. The hypothesis class $\mathcal{H}$ is too restrictive to approximate the true relationship. The approximation error dominates.
Diagnostic Framework
The key diagnostic is comparing training loss to validation loss.
| Training Loss | Validation Loss | Diagnosis |
|---|---|---|
| High | High | Underfitting |
| Low | High | Overfitting |
| Low | Low | Good fit |
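The table can be turned into a trivial decision rule. The helper below is a sketch of that rule; the thresholds `high` and `gap` are illustrative assumptions, not universal constants — in practice they depend on the loss scale of your problem.

```python
def diagnose(train_loss, val_loss, high=1.0, gap=0.5):
    """Classify fit quality from training and validation loss.

    `high` and `gap` are illustrative thresholds; choose them relative
    to the loss scale of your own problem.
    """
    if train_loss > high and val_loss > high:
        return "underfitting"   # both losses high
    if val_loss - train_loss > gap:
        return "overfitting"    # large generalization gap
    return "good fit"           # both low, small gap

print(diagnose(train_loss=2.4, val_loss=2.5))    # underfitting
print(diagnose(train_loss=0.05, val_loss=1.8))   # overfitting
print(diagnose(train_loss=0.10, val_loss=0.15))  # good fit
```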
This is directly connected to the bias-variance tradeoff. Underfitting corresponds to high bias. Overfitting corresponds to high variance.
Main Theorems
Generalization Gap Decomposition
Statement
For any hypothesis $\hat{h}$ selected by ERM over $\mathcal{H}$, with $h^* = \arg\min_{h \in \mathcal{H}} R(h)$, the excess risk decomposes as:

$$R(\hat{h}) - R^* = \underbrace{R(h^*) - R^*}_{\text{approximation error}} + \underbrace{R(\hat{h}) - R(h^*)}_{\text{estimation error}}.$$
Underfitting occurs when the approximation error dominates. Overfitting occurs when the estimation error dominates due to a large generalization gap.
Intuition
If the best possible model in your class is already bad, no amount of data will help. That is underfitting. If the best model is good but ERM picks a different one that happens to fit training noise, that is overfitting.
Proof Sketch
This is an algebraic identity: add and subtract $R(h^*)$, the risk of the best hypothesis in $\mathcal{H}$. The content is in recognizing which term causes failure in a given situation.
Why It Matters
This decomposition tells you what to fix. High approximation error means you need a richer model class. High estimation error means you need more data, regularization, or a simpler model class.
Failure Mode
In modern deep learning with interpolating models, the generalization gap $R(\hat{h}) - \hat{R}_n(\hat{h})$ can be large (the model memorizes the training data, driving empirical risk to zero) yet the population risk is still low. This "benign overfitting" phenomenon violates the classical intuition. See the benign overfitting and double descent topics.
Learning Curves
A learning curve plots training and validation loss as a function of training set size $n$.
- Underfitting signature: both curves plateau at a high loss. Adding more data does not help because the model class cannot represent the true function.
- Overfitting signature: training loss is low, validation loss is high, and the gap shrinks slowly as $n$ increases. More data helps but may require very large $n$.
- Good fit signature: both curves converge to a low value with a small gap.
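The underfitting signature is easy to reproduce numerically: fit a straight line to data from a sine function at increasing sample sizes and watch both errors plateau together. A minimal numpy sketch (the data-generating function, noise level, and sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy sine data; a straight line cannot represent this function.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)

x_val, y_val = make_data(2000)

curve = {}
for n in (10, 100, 1000):
    x_tr, y_tr = make_data(n)
    coef = np.polyfit(x_tr, y_tr, deg=1)  # degree-1 (linear) fit
    train_err = float(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    val_err = float(np.mean((np.polyval(coef, x_val) - y_val) ** 2))
    curve[n] = (train_err, val_err)
    print(f"n={n:4d}  train={train_err:.3f}  val={val_err:.3f}")
```

As $n$ grows, the train-validation gap closes, but both errors settle at roughly the same high plateau (the part of the sine's variance a line cannot explain): more data cannot lower the plateau, only a richer model class can.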
Validation Curves
A validation curve plots training and validation loss as a function of model complexity (e.g., polynomial degree, number of hidden units, regularization strength).
- As complexity increases from zero: training loss decreases, validation loss first decreases (reducing underfitting) then increases (increasing overfitting).
- The minimum of the validation curve identifies the optimal complexity.
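The same U-shape appears when complexity is controlled by regularization strength: larger $\lambda$ means a simpler effective model, so reading the $\lambda$ axis right-to-left traverses the validation curve left-to-right. Below is a sketch using closed-form ridge regression on fixed degree-9 Chebyshev polynomial features; the degree, sample sizes, noise level, and $\lambda$ grid are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(1)

def features(x, deg=9):
    # Degree-9 Chebyshev basis on [-1, 1]; well conditioned, unlike
    # a raw Vandermonde matrix at this degree.
    return cheb.chebvander(2 * x - 1, deg)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X'X + lam * I)^{-1} X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_tr = rng.uniform(0, 1, 25)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.2, 25)
x_val = rng.uniform(0, 1, 500)
y_val = np.sin(2 * np.pi * x_val) + rng.normal(0, 0.2, 500)
X_tr, X_val = features(x_tr), features(x_val)

scores = {}
for lam in (1e-8, 1e-4, 1e0, 1e4):
    w = ridge_fit(X_tr, y_tr, lam)
    tr = float(np.mean((X_tr @ w - y_tr) ** 2))
    va = float(np.mean((X_val @ w - y_val) ** 2))
    scores[lam] = (tr, va)
    print(f"lambda={lam:.0e}  train={tr:.3f}  val={va:.3f}")
```

Large $\lambda$ shrinks every coefficient toward zero and underfits (both errors high); at very small $\lambda$ the training error is lowest; the validation minimum sits at an intermediate $\lambda$.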
Common Confusions
Zero training loss does not always mean overfitting
In classical statistics, interpolating the training data (zero training loss) implies overfitting. In modern deep learning, models routinely interpolate training data and still generalize well. The relationship between training loss and generalization is more nuanced than the simple U-shaped validation curve suggests. See the double descent topic for details.
More data does not fix underfitting
If your model class cannot represent the true function, collecting more data will not help. The approximation error is a property of the hypothesis class $\mathcal{H}$, not of the sample size $n$. You need a richer model class or better features.
Regularization is not always the answer
Regularization reduces overfitting by constraining the hypothesis class. But if you are underfitting, adding regularization makes things worse. Diagnose first, then prescribe. See model evaluation best practices for systematic evaluation procedures.
Canonical Examples
Polynomial regression diagnostic
Fit polynomials of degree 1, 5, and 20 to noisy samples from $y = \sin(2\pi x) + \varepsilon$. Degree 1: both train and test error are high (underfitting; the line cannot capture the sine curve). Degree 5: train error is moderate and test error is similar (good fit). Degree 20: train error is near zero and test error is much larger (overfitting; the polynomial oscillates wildly between data points).
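A minimal numpy version of this experiment. The sample size and noise level are illustrative choices, and the fit uses a Chebyshev basis mapped to $[-1, 1]$ so the degree-20 least-squares problem stays numerically well conditioned:

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_tr, y_tr = make_data(25)
x_te, y_te = make_data(500)

errs = {}
for deg in (1, 5, 20):
    # Least-squares polynomial fit in the Chebyshev basis on [-1, 1].
    c = cheb.chebfit(2 * x_tr - 1, y_tr, deg)
    tr = float(np.mean((cheb.chebval(2 * x_tr - 1, c) - y_tr) ** 2))
    te = float(np.mean((cheb.chebval(2 * x_te - 1, c) - y_te) ** 2))
    errs[deg] = (tr, te)
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The degrees are nested, so training error can only decrease with degree; the interesting quantity is the train-test gap, which is small at degrees 1 and 5 and large at degree 20.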
Exercises
Problem
You train a linear model and observe training MSE of 15.2 and validation MSE of 16.1. You then train a 10-layer neural network and observe training MSE of 0.3 and validation MSE of 12.8. Diagnose each model.
Problem
A colleague argues: "Our model has 100x more parameters than training examples, so it must be overfitting." Under what conditions is this reasoning correct, and when does it fail?
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2, 5
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 7
Current:
- Belkin et al., "Reconciling modern machine learning practice and the bias-variance tradeoff" (2019)
- Nakkiran et al., "Deep Double Descent" (2019)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Next Topics
- Cross-validation theory: systematic method for estimating the generalization gap
- Regularization theory: formal tools for controlling overfitting
Last reviewed: April 2026