Bagging
Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.
Why This Matters
Bagging is the simplest ensemble method and the foundation for random forests. The idea is so clean it can be stated in one sentence: train many copies of an unstable learner on different bootstrap samples and average the results. Yet this simple trick can dramatically improve prediction accuracy for high-variance models like decision trees.
Understanding bagging teaches you the fundamental principle behind all averaging ensembles: variance reduction through diversity.
Mental Model
Imagine you ask 100 people to independently estimate the number of jellybeans in a jar. Each individual estimate is noisy, but their average is remarkably close to the truth. This is the wisdom-of-crowds effect, and it works because individual errors are partially independent and cancel when averaged.
Bagging does the same thing with models. Each bootstrap sample creates a slightly different training set, producing a slightly different model. The models make different errors, and averaging smooths these errors out.
The Bagging Algorithm
Bagging (Bootstrap Aggregation)
Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and a base learner $\hat{f}$:
- For $b = 1, \dots, B$:
- Draw a bootstrap sample $\mathcal{D}_b$ of size $n$ with replacement from the training data
- Train a base learner $\hat{f}_b$ on $\mathcal{D}_b$
- The bagged prediction is:
Regression: $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$
Classification: $\hat{f}_{\mathrm{bag}}(x) = \operatorname{majority\ vote}\{\hat{f}_b(x)\}_{b=1}^{B}$
The base learner is typically a decision tree grown without pruning (high variance, low bias). But bagging can be applied to any learner.
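The loop above fits in a few lines of code. The following is an illustrative sketch (pure NumPy, not a library implementation); the helper names `bag_predict` and `fit_stump` are invented for this example, and the base learner here is a hand-rolled depth-1 regression stump rather than the deep tree a real application would use:

```python
import numpy as np

def fit_stump(X, y):
    """Hypothetical base learner: depth-1 regression tree on the first feature."""
    x = X[:, 0]
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    best_sse, best_params = np.inf, None
    for i in range(1, len(x_s)):
        t = (x_s[i - 1] + x_s[i]) / 2                  # candidate split point
        left, right = y_s[:i].mean(), y_s[i:].mean()   # leaf predictions
        sse = ((y_s[:i] - left) ** 2).sum() + ((y_s[i:] - right) ** 2).sum()
        if sse < best_sse:
            best_sse, best_params = sse, (t, left, right)
    t, left, right = best_params
    return lambda Xq: np.where(Xq[:, 0] <= t, left, right)

def bag_predict(X_train, y_train, X_test, fit, B=100, seed=0):
    """Bagging: train B models on bootstrap samples, average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # draw n points with replacement
        model = fit(X_train[idx], y_train[idx])   # train base learner on D_b
        preds.append(model(X_test))
    return np.mean(preds, axis=0)                 # regression: average the B outputs
```

For classification, the final average would be replaced by a majority vote across the $B$ predictions.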
Why Bagging Reduces Variance
Variance of Averaged Estimators
Statement
Let $\hat{f}_1, \dots, \hat{f}_B$ be estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is:
$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b\right) = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$
Special cases:
- If $\rho = 0$ (independent estimators): variance $\sigma^2 / B$, perfect reduction
- If $\rho = 1$ (identical estimators): variance $\sigma^2$, no reduction
- As $B \to \infty$: variance $\to \rho\sigma^2$, the irreducible limit
Intuition
Averaging $B$ independent noisy estimates reduces noise by a factor of $B$. But bagged estimators are not independent: they are trained on overlapping bootstrap samples (a given training point appears in a given sample with probability $1 - (1 - 1/n)^n \approx 0.632$, so each pair of samples shares about $40\%$ of the training points in expectation). This correlation creates a floor below which variance cannot fall, no matter how many models you average.
Proof Sketch
There are $B$ variance terms, each equal to $\sigma^2$, and $B(B-1)$ covariance terms, each equal to $\rho\sigma^2$; dividing by $B^2$:
$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b\right) = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\,\rho\sigma^2 = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$
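The formula is easy to check numerically. The Monte Carlo sketch below (assumed example values for $B$, $\sigma^2$, $\rho$; pure NumPy) draws equicorrelated Gaussian "predictions" and compares the empirical variance of their average to $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, rho, trials = 25, 4.0, 0.3, 100_000   # assumed example values

# Equicorrelated estimators: Z_b = sigma * (sqrt(rho)*S + sqrt(1-rho)*E_b),
# so Var(Z_b) = sigma2 and Corr(Z_a, Z_b) = rho for a != b.
S = rng.standard_normal(trials)                  # shared component
E = rng.standard_normal((trials, B))             # independent components
Z = np.sqrt(sigma2) * (np.sqrt(rho) * S[:, None] + np.sqrt(1 - rho) * E)

empirical = Z.mean(axis=1).var()                 # variance of the B-average
theory = rho * sigma2 + (1 - rho) * sigma2 / B   # the variance formula
print(empirical, theory)                         # both ≈ 1.31
```

With these values the floor $\rho\sigma^2 = 1.2$ dominates: averaging 25 models leaves far more variance than the independent-case value $\sigma^2/B = 0.16$.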
Why It Matters
This formula explains when bagging works and when it does not. Bagging is most effective for high-variance learners (large $\sigma^2$) with moderate correlation (moderate $\rho$). For stable learners like $k$-nearest neighbors with large $k$, $\sigma^2$ is already small and bagging helps little. For decision trees, $\sigma^2$ is large and $\rho$ is moderate, so bagging gives substantial gains. This formula also motivates random forests: feature subsampling further reduces $\rho$.
Failure Mode
The formula assumes equal variances and uniform pairwise correlation. In practice, some bootstrap samples produce better models than others. Bagging also does not reduce bias at all: the expectation of the bagged estimator equals the expectation of a single estimator (each $\hat{f}_b$ is identically distributed). If the base learner is biased (e.g., a shallow tree), bagging cannot fix this.
Bagging Does Not Reduce Bias
This is the most important theoretical limitation of bagging. The bias of the bagged estimator is:
$$\mathbb{E}\big[\hat{f}_{\mathrm{bag}}(x)\big] - f(x) = \mathbb{E}\big[\hat{f}_b(x)\big] - f(x)$$
Averaging preserves the mean. If each base learner systematically under- or over-predicts, the average will too. This is why bagging uses deep trees (low bias, high variance) rather than shallow trees (high bias, low variance). The variance reduction from averaging compensates for the high individual tree variance, while the low bias is preserved.
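One way to see this empirically is to bag a deliberately biased base learner. In the sketch below (illustrative, pure NumPy), the base learner estimates a mean of $\mu = 10$ but shrinks its answer by half; the bagged average inherits exactly the same bias:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, n, B, trials = 10.0, 50, 50, 500   # assumed example values

bagged_estimates = []
for _ in range(trials):
    y = rng.normal(mu, 1.0, size=n)
    # Deliberately biased base learner: half the bootstrap-sample mean.
    ests = [0.5 * y[rng.integers(0, n, size=n)].mean() for _ in range(B)]
    bagged_estimates.append(np.mean(ests))   # bagged estimate for this dataset

avg = np.mean(bagged_estimates)
print(avg)   # ≈ 0.5 * mu = 5.0: averaging reduced variance but preserved the bias
```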
Out-of-Bag Error Estimation
Out-of-Bag (OOB) Error
Each bootstrap sample omits about $36.8\%$ ($\approx e^{-1}$) of the training points. For training point $i$, let $B_{-i}$ be the set of trees that did not train on $(x_i, y_i)$. The OOB prediction is:
$$\hat{f}_{\mathrm{OOB}}(x_i) = \frac{1}{|B_{-i}|} \sum_{b \in B_{-i}} \hat{f}_b(x_i)$$
The OOB error is the average loss over all training points using only their OOB predictions.
The OOB error is akin to leave-one-out cross-validation, performed for free during bagging. Each training point is predicted by the roughly $e^{-1}B \approx 0.368\,B$ trees that never saw it. As $B$ grows, the OOB error becomes an increasingly accurate estimate of the true test error.
This means bagging provides a built-in validation mechanism without requiring a held-out set or cross-validation.
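A minimal sketch of the OOB bookkeeping, using a bagged 1-nearest-neighbour regressor on 1-D data (the base learner and the function name `oob_error` are illustrative choices, not a standard API):

```python
import numpy as np

def oob_error(X, y, B=200, seed=0):
    """OOB estimate of test MSE for bagged 1-nearest-neighbour regression (1-D X)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pred_sum = np.zeros(n)   # running sum of OOB predictions per training point
    pred_cnt = np.zeros(n)   # how many models held each point out
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # bootstrap sample indices
        in_bag = np.zeros(n, dtype=bool)
        in_bag[idx] = True
        for i in np.where(~in_bag)[0]:        # ~36.8% of points each round
            # predict x_i using only the in-bag data (1-NN on the bootstrap sample)
            j = idx[np.argmin(np.abs(X[idx] - X[i]))]
            pred_sum[i] += y[j]
            pred_cnt[i] += 1
    mask = pred_cnt > 0                       # virtually all points when B is large
    oob_pred = pred_sum[mask] / pred_cnt[mask]
    return float(np.mean((oob_pred - y[mask]) ** 2))
```

Note that each point's OOB prediction averages only the models that excluded it, which is exactly why no held-out set is needed.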
When Bagging Helps and When It Hurts
Bagging helps when the base learner is:
- Unstable: small changes in the training data cause large changes in the model (e.g., deep decision trees, neural networks)
- Low bias: the base learner can approximate the target function well, but with high variance
Bagging helps little when the base learner is:
- Stable: predictions barely change with different training samples (e.g., linear regression, $k$-NN with large $k$)
- High bias: the base learner systematically misses the target, and no amount of averaging can fix this (e.g., decision stumps)
Canonical Examples
Bagged decision trees on a noisy regression
Consider predicting $y = f(x) + \varepsilon$ for a smooth target $f$ and zero-mean noise $\varepsilon$, from $n$ training points.
A single unpruned tree fits the noise closely, producing a jagged step function with high variance across different training sets. With $B$ bagged trees, the averaged prediction is a smooth approximation of $f$. The individual trees still overfit, but their overfitting patterns differ and cancel in the average.
Why bagging a linear model does nothing
Linear regression is a stable learner. Different bootstrap samples produce nearly identical regression coefficients (perturbed by sampling noise). Averaging nearly identical predictions gives nearly the same prediction as a single model. The variance reduction is negligible because $\rho \approx 1$.
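This is straightforward to verify. The sketch below (illustrative synthetic data, NumPy least squares) fits ordinary linear regression once on the full sample and again as a bagged average over bootstrap fits; the two coefficient vectors are nearly identical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 200, 100
x = rng.uniform(-1, 1, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=n)   # true intercept 2, slope 3
A = np.column_stack([np.ones(n), x])             # design matrix

single = np.linalg.lstsq(A, y, rcond=None)[0]    # one OLS fit on all the data

coefs = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)             # bootstrap resample
    coefs.append(np.linalg.lstsq(A[idx], y[idx], rcond=None)[0])
bagged = np.mean(coefs, axis=0)                  # bagged OLS coefficients

print(single, bagged)   # nearly identical: rho ≈ 1, so bagging buys nothing
```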
Common Confusions
Bagging is NOT the same as training on more data
Bagging reuses the same data points in different combinations. It does not increase the effective sample size. Each bootstrap sample has the same number of unique observations (about $0.632\,n$ on average). The benefit comes from averaging diverse models, not from seeing more data.
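The "about $0.632\,n$ unique points" figure follows from $1 - (1 - 1/n)^n \to 1 - e^{-1}$, and a quick simulation confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 2000
# Fraction of distinct training points appearing in each of B bootstrap samples
uniq = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(B)]
print(np.mean(uniq))   # ≈ 1 - 1/e ≈ 0.632
```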
Bagging does not solve underfitting
If your base learner is too simple (high bias), bagging will average many biased predictions and produce a biased ensemble. Bagging a depth-1 decision stump on a complex nonlinear problem will fail. You need a sufficiently expressive base learner for bagging to work.
Summary
- Bagging = train models on bootstrap samples, average predictions
- Variance of the average: $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$
- Perfect variance reduction ($\sigma^2/B$) only if models are independent ($\rho = 0$)
- Bagging does not reduce bias; use low-bias base learners (deep trees)
- OOB error gives a free estimate of test error
- Bagging helps unstable, low-bias learners; it does little for stable learners
- Random forests improve on bagging by adding feature subsampling to reduce $\rho$
Exercises
Problem
You train $B$ bagged decision trees. Each tree has prediction variance $\sigma^2$ and the average pairwise correlation between trees is $\rho$. Compute the ensemble variance in terms of $B$, $\sigma^2$, and $\rho$. How does this compare to a single tree?
Problem
Prove that bagging does not reduce bias. Specifically, show that $\mathbb{E}[\hat{f}_{\mathrm{bag}}(x)] = \mathbb{E}[\hat{f}_1(x)]$, where the expectation is over the training data and bootstrap randomness.
References
Canonical:
- Breiman, "Bagging Predictors" (Machine Learning, 1996). The original paper.
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 8.7
Current:
- Biau, Devroye, Lugosi, "Consistency of Random Forests and Other Averaging Classifiers" (JMLR, 2008)
- Grandvalet, "Bagging Equalizes Influence" (Machine Learning, 2004)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Next Topics
The natural next steps from bagging:
- Random forests: bagging plus feature subsampling for further decorrelation
- Gradient boosting: the complementary ensemble strategy that reduces bias
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Bootstrap Methods (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
Builds on This
- Ensemble Methods Theory (Layer 2)