ML Methods
Overfitting and Underfitting
The two failure modes of supervised learning: models that memorize noise versus models too simple to capture signal. Diagnosis via training-validation gaps.
Prerequisites
Why This Matters
Before tuning hyperparameters, adding regularization, or collecting more data, you need to know which failure mode you are in. Overfitting and underfitting are the first diagnostic every practitioner should check. The treatments are opposite: reducing model complexity helps against overfitting but worsens underfitting. Getting the diagnosis wrong wastes time and compute.
Mental Model
A model that memorizes every training example, including noise, will score perfectly on training data but fail on new inputs. That is overfitting. A model so rigid it cannot capture the true pattern will perform poorly on both training and test data. That is underfitting. The goal is the region between these two extremes where the model captures signal without memorizing noise. Cross-validation is the standard diagnostic tool, and techniques like dropout help prevent overfitting in deep networks.
Formal Setup
Let $R(h)$ denote the population risk and $\hat{R}_n(h)$ the empirical risk of hypothesis $h$:

$$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\ell(h(x), y)\right], \qquad \hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i).$$

The learner selects a model $\hat{h}$ from a hypothesis class $\mathcal{H}$.
Overfitting
A model $\hat{h}$ overfits when its population risk $R(\hat{h})$ is substantially larger than its empirical risk $\hat{R}_n(\hat{h})$. The model has learned patterns specific to the training sample that do not generalize. Equivalently, the generalization gap $R(\hat{h}) - \hat{R}_n(\hat{h})$ is large.
Underfitting
A model underfits when both its population risk and empirical risk are high relative to the Bayes risk $R^*$. The hypothesis class $\mathcal{H}$ is too restrictive to approximate the true relationship. The approximation error dominates.
Diagnostic Framework
The key diagnostic is comparing training loss to validation loss.
| Training Loss | Validation Loss | Diagnosis |
|---|---|---|
| High | High | Underfitting |
| Low | High | Overfitting |
| Low | Low | Good fit |
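The table can be turned into a trivial decision rule. The helper below is a sketch of that rule; the thresholds `high` and `gap` are illustrative assumptions, not universal constants — in practice they depend on the loss scale of your problem.

```python
def diagnose(train_loss, val_loss, high=1.0, gap=0.5):
    """Classify fit quality from training and validation loss.

    `high` and `gap` are illustrative thresholds; choose them relative
    to the loss scale of your own problem.
    """
    if train_loss > high and val_loss > high:
        return "underfitting"   # both losses high
    if val_loss - train_loss > gap:
        return "overfitting"    # large generalization gap
    return "good fit"           # both low, small gap

print(diagnose(train_loss=2.4, val_loss=2.5))    # underfitting
print(diagnose(train_loss=0.05, val_loss=1.8))   # overfitting
print(diagnose(train_loss=0.10, val_loss=0.15))  # good fit
```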
This is directly connected to the bias-variance tradeoff. Underfitting corresponds to high bias. Overfitting corresponds to high variance.
Main Theorems
Generalization Gap Decomposition
Statement
For any hypothesis $\hat{h}$ selected by ERM over $\mathcal{H}$, with $h^* = \arg\min_{h \in \mathcal{H}} R(h)$, the excess risk decomposes as:

$$R(\hat{h}) - R^* = \underbrace{R(h^*) - R^*}_{\text{approximation error}} + \underbrace{R(\hat{h}) - R(h^*)}_{\text{estimation error}}.$$
Underfitting occurs when the approximation error dominates. Overfitting occurs when the estimation error dominates due to a large generalization gap.
Intuition
If the best possible model in your class is already bad, no amount of data will help. That is underfitting. If the best model is good but ERM picks a different one that happens to fit training noise, that is overfitting.
Proof Sketch
This is an algebraic identity: add and subtract $R(h^*)$, the risk of the best hypothesis in $\mathcal{H}$. The content is in recognizing which term causes failure in a given situation.
Why It Matters
This decomposition tells you what to fix. High approximation error means you need a richer model class. High estimation error means you need more data, regularization, or a simpler model class.
Failure Mode
In modern deep learning with interpolating models, the generalization gap $R(\hat{h}) - \hat{R}_n(\hat{h})$ can be large (the model memorizes the training data, driving empirical risk to zero) yet the population risk is still low. This "benign overfitting" phenomenon violates the classical intuition. See the benign overfitting and double descent topics.
Learning Curves
A learning curve plots training and validation loss as a function of training set size $n$.
- Underfitting signature: both curves plateau at a high loss. Adding more data does not help because the model class cannot represent the true function.
- Overfitting signature: training loss is low, validation loss is high, and the gap shrinks slowly as $n$ increases. More data helps but may require very large $n$.
- Good fit signature: both curves converge to a low value with a small gap.
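The underfitting signature is easy to reproduce numerically: fit a straight line to data from a sine function at increasing sample sizes and watch both errors plateau together. A minimal numpy sketch (the data-generating function, noise level, and sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy sine data; a straight line cannot represent this function.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)

x_val, y_val = make_data(2000)

curve = {}
for n in (10, 100, 1000):
    x_tr, y_tr = make_data(n)
    coef = np.polyfit(x_tr, y_tr, deg=1)  # degree-1 (linear) fit
    train_err = float(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    val_err = float(np.mean((np.polyval(coef, x_val) - y_val) ** 2))
    curve[n] = (train_err, val_err)
    print(f"n={n:4d}  train={train_err:.3f}  val={val_err:.3f}")
```

As $n$ grows, the train-validation gap closes, but both errors settle at roughly the same high plateau (the part of the sine's variance a line cannot explain): more data cannot lower the plateau, only a richer model class can.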
Validation Curves
A validation curve plots training and validation loss as a function of model complexity (e.g., polynomial degree, number of hidden units, regularization strength).
- As complexity increases from zero: training loss decreases, validation loss first decreases (reducing underfitting) then increases (increasing overfitting).
- The minimum of the validation curve identifies the optimal complexity.
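The same U-shape appears when complexity is controlled by regularization strength: larger $\lambda$ means a simpler effective model, so reading the $\lambda$ axis right-to-left traverses the validation curve left-to-right. Below is a sketch using closed-form ridge regression on fixed degree-9 Chebyshev polynomial features; the degree, sample sizes, noise level, and $\lambda$ grid are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(1)

def features(x, deg=9):
    # Degree-9 Chebyshev basis on [-1, 1]; well conditioned, unlike
    # a raw Vandermonde matrix at this degree.
    return cheb.chebvander(2 * x - 1, deg)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X'X + lam * I)^{-1} X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_tr = rng.uniform(0, 1, 25)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.2, 25)
x_val = rng.uniform(0, 1, 500)
y_val = np.sin(2 * np.pi * x_val) + rng.normal(0, 0.2, 500)
X_tr, X_val = features(x_tr), features(x_val)

scores = {}
for lam in (1e-8, 1e-4, 1e0, 1e4):
    w = ridge_fit(X_tr, y_tr, lam)
    tr = float(np.mean((X_tr @ w - y_tr) ** 2))
    va = float(np.mean((X_val @ w - y_val) ** 2))
    scores[lam] = (tr, va)
    print(f"lambda={lam:.0e}  train={tr:.3f}  val={va:.3f}")
```

Large $\lambda$ shrinks every coefficient toward zero and underfits (both errors high); at very small $\lambda$ the training error is lowest; the validation minimum sits at an intermediate $\lambda$.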
Common Confusions
Zero training loss does not always mean overfitting
In classical statistics, interpolating the training data (zero training loss) implies overfitting. In modern deep learning, models routinely interpolate training data and still generalize well. The relationship between training loss and generalization is more nuanced than the simple U-shaped validation curve suggests. See the double descent topic for details.
More data does not fix underfitting
If your model class cannot represent the true function, collecting more data will not help. The approximation error is a property of the hypothesis class $\mathcal{H}$, not of the sample size $n$. You need a richer model class or better features.
Regularization is not always the answer
Regularization reduces overfitting by constraining the hypothesis class. But if you are underfitting, adding regularization makes things worse. Diagnose first, then prescribe. See model evaluation best practices for systematic evaluation procedures.
Canonical Examples
Polynomial regression diagnostic
Fit polynomials of degree 1, 5, and 20 to noisy samples from $y = \sin(2\pi x) + \varepsilon$. Degree 1: both train and test error are high (underfitting; the line cannot capture the sine curve). Degree 5: train error is moderate and test error is similar (good fit). Degree 20: train error is near zero and test error is much larger (overfitting; the polynomial oscillates wildly between data points).
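A minimal numpy version of this experiment. The sample size and noise level are illustrative choices, and the fit uses a Chebyshev basis mapped to $[-1, 1]$ so the degree-20 least-squares problem stays numerically well conditioned:

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_tr, y_tr = make_data(25)
x_te, y_te = make_data(500)

errs = {}
for deg in (1, 5, 20):
    # Least-squares polynomial fit in the Chebyshev basis on [-1, 1].
    c = cheb.chebfit(2 * x_tr - 1, y_tr, deg)
    tr = float(np.mean((cheb.chebval(2 * x_tr - 1, c) - y_tr) ** 2))
    te = float(np.mean((cheb.chebval(2 * x_te - 1, c) - y_te) ** 2))
    errs[deg] = (tr, te)
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The degrees are nested, so training error can only decrease with degree; the interesting quantity is the train-test gap, which is small at degrees 1 and 5 and large at degree 20.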
Exercises
Problem
You train a linear model and observe training MSE of 15.2 and validation MSE of 16.1. You then train a 10-layer neural network and observe training MSE of 0.3 and validation MSE of 12.8. Diagnose each model.
Problem
A colleague argues: "Our model has 100x more parameters than training examples, so it must be overfitting." Under what conditions is this reasoning correct, and when does it fail?
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2, 5
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 7
Current:
- Belkin et al., "Reconciling modern machine learning practice and the bias-variance tradeoff" (2019)
- Nakkiran et al., "Deep Double Descent" (2019)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Next Topics
- Cross-validation theory: systematic method for estimating the generalization gap
- Regularization theory: formal tools for controlling overfitting
Last reviewed: April 2026