Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Bagging

Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.


Why This Matters

Bagging is the simplest ensemble method and the foundation for random forests. The idea is so clean it can be stated in one sentence: train many copies of an unstable learner on different bootstrap samples and average the results. Yet this simple trick can dramatically improve prediction accuracy for high-variance models like decision trees.

Understanding bagging teaches you the fundamental principle behind all averaging ensembles: variance reduction through diversity.

Mental Model

Imagine you ask 100 people to independently estimate the number of jellybeans in a jar. Each individual estimate is noisy, but their average is remarkably close to the truth. This is the wisdom-of-crowds effect, and it works because individual errors are partially independent and cancel when averaged.

Bagging does the same thing with models. Each bootstrap sample creates a slightly different training set, producing a slightly different model. The models make different errors, and averaging smooths these errors out.

The Bagging Algorithm

Definition

Bagging (Bootstrap Aggregation)

Given training data $\{(x_i, y_i)\}_{i=1}^n$ and a base learner $\hat{f}$:

  1. For $b = 1, \ldots, B$:
    • Draw a bootstrap sample $S_b$ of size $n$ with replacement from the training data
    • Train a base learner $\hat{f}_b$ on $S_b$
  2. The bagged prediction is:

Regression: $\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$

Classification: $\hat{f}_{\text{bag}}(x) = \text{majority vote of } \hat{f}_1(x), \ldots, \hat{f}_B(x)$

The base learner is typically a decision tree grown without pruning (high variance, low bias). But bagging can be applied to any learner.
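The two steps above are short enough to sketch directly. The following is a minimal NumPy sketch of bagging for regression; it uses a depth-1 stump as the base learner purely to keep the example self-contained (in practice the base learner would be a deep, unpruned tree), and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Depth-1 regression stump: split at the median, predict leaf means.
    (A stand-in base learner; bagging normally uses deep, unpruned trees.)"""
    t = np.median(X)
    lo = y[X <= t].mean()
    hi = y[X > t].mean() if np.any(X > t) else y.mean()
    return t, lo, hi

def predict_stump(model, X):
    t, lo, hi = model
    return np.where(X <= t, lo, hi)

def bag(X, y, B=100):
    """Step 1: train B base learners on bootstrap samples of size n."""
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        models.append(fit_stump(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    """Step 2 (regression): average the B predictions."""
    return np.mean([predict_stump(m, X) for m in models], axis=0)
```

For classification, the final step would instead take a majority vote over the $B$ predicted labels.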

Why Bagging Reduces Variance

Theorem

Variance of Averaged Estimators

Statement

Let $\hat{f}_1, \ldots, \hat{f}_B$ be $B$ estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

Special cases:

  • If $\rho = 0$ (independent estimators): variance $= \sigma^2/B$, a perfect $1/B$ reduction
  • If $\rho = 1$ (identical estimators): variance $= \sigma^2$, no reduction
  • As $B \to \infty$: variance $\to \rho\sigma^2$, the irreducible limit

Intuition

Averaging independent noisy estimates reduces noise by a factor of $1/B$. But bagged estimators are not independent: they are trained on overlapping bootstrap samples. A given training point appears in any one bootstrap sample with probability $1 - (1-1/n)^n \approx 1 - 1/e \approx 0.632$, so in expectation about $0.632^2 \approx 40\%$ of the training points appear in both members of any pair of samples. This correlation $\rho > 0$ creates a floor below which variance cannot fall, no matter how many models you average.

Proof Sketch

$$\text{Var}\left(\frac{1}{B}\sum_b \hat{f}_b\right) = \frac{1}{B^2}\left[\sum_b \text{Var}(\hat{f}_b) + \sum_{b \neq b'}\text{Cov}(\hat{f}_b, \hat{f}_{b'})\right]$$

There are $B$ variance terms, each equal to $\sigma^2$, and $B(B-1)$ covariance terms, each equal to $\rho\sigma^2$:

$$= \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2 = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{B}$$

Why It Matters

This formula explains when bagging works and when it does not. Bagging is most effective for high-variance learners (large $\sigma^2$) with moderate correlation (moderate $\rho$). For stable learners like $k$-nearest neighbors with large $k$, $\sigma^2$ is already small and bagging helps little. For decision trees, $\sigma^2$ is large and $\rho$ is moderate, so bagging gives substantial gains. This formula also motivates random forests: feature subsampling further reduces $\rho$.
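Plugging numbers into the formula makes this concrete. A sketch with hypothetical values ($\sigma^2 = 4$, the other numbers are arbitrary): adding trees only shrinks the $1/B$ term, while lowering $\rho$ moves the floor itself.

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the average of B estimators, each with variance sigma2
    and uniform pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

# Adding trees only shrinks the second term; it cannot touch rho * sigma2.
for B in (10, 100, 1000):
    print(B, ensemble_variance(4.0, 0.4, B))   # floor at 0.4 * 4.0 = 1.6

# Decorrelating the trees (the random-forest strategy) lowers the floor.
for rho in (0.6, 0.4, 0.2):
    print(rho, ensemble_variance(4.0, rho, 100))
```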

Failure Mode

The formula assumes equal variances and uniform pairwise correlation. In practice, some bootstrap samples produce better models than others. Bagging also does not reduce bias at all: the expectation of the bagged estimator equals the expectation of a single estimator (since the bootstrap distribution is symmetric). If the base learner is biased (e.g., a shallow tree), bagging cannot fix this.

Bagging Does Not Reduce Bias

This is the most important theoretical limitation of bagging. The bias of the bagged estimator is:

$$\text{Bias}(\hat{f}_{\text{bag}}) = \mathbb{E}[\hat{f}_{\text{bag}}(x)] - f(x) = \mathbb{E}[\hat{f}(x)] - f(x) = \text{Bias}(\hat{f})$$

Averaging preserves the mean. If each base learner systematically under- or over-predicts, the average will too. This is why bagging uses deep trees (low bias, high variance) rather than shallow trees (high bias, low variance). The variance reduction from averaging compensates for the high individual tree variance, while the low bias is preserved.

Out-of-Bag Error Estimation

Definition

Out-of-Bag (OOB) Error

Each bootstrap sample omits about 36.8% of the training points. For training point $(x_i, y_i)$, let $\mathcal{B}_i = \{b : (x_i, y_i) \notin S_b\}$ be the set of trees that did not train on $(x_i, y_i)$. The OOB prediction is:

$$\hat{y}_i^{\text{OOB}} = \frac{1}{|\mathcal{B}_i|} \sum_{b \in \mathcal{B}_i} \hat{f}_b(x_i)$$

The OOB error is the average loss over all training points using only their OOB predictions.

The OOB error behaves like leave-one-out cross-validation performed for free during bagging: each training point is predicted only by the roughly $B/e$ trees that never saw it. As $B$ grows, the OOB error becomes an increasingly accurate estimate of the true test error.

This means bagging provides a built-in validation mechanism without requiring a held-out set or cross-validation.
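The bookkeeping behind OOB error is straightforward to sketch. This sketch reuses a depth-1 stump as the base learner for brevity; the names are illustrative, not a library API.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y):
    # Depth-1 stump stand-in for a tree (illustrative base learner)
    t = np.median(X)
    lo = y[X <= t].mean()
    hi = y[X > t].mean() if np.any(X > t) else y.mean()
    return t, lo, hi

def predict_stump(model, X):
    t, lo, hi = model
    return np.where(X <= t, lo, hi)

def oob_error(X, y, B=200):
    n = len(X)
    preds = np.empty((B, n))
    in_bag = np.zeros((B, n), dtype=bool)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        in_bag[b] = np.bincount(idx, minlength=n) > 0   # points this model saw
        preds[b] = predict_stump(fit_stump(X[idx], y[idx]), X)
    oob_pred = np.empty(n)
    for i in range(n):
        out = ~in_bag[:, i]          # the ~B/e models that never saw point i
        # For large B every point has some OOB models; fall back just in case
        oob_pred[i] = preds[out, i].mean() if out.any() else preds[:, i].mean()
    return np.mean((oob_pred - y) ** 2)   # squared-error loss
```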

When Bagging Helps and When It Hurts

Bagging helps when the base learner is:

  • Unstable: small changes in the training data cause large changes in the model (e.g., deep decision trees, neural networks)
  • Low bias: the base learner can approximate the target function well, but with high variance

Bagging helps little when the base learner is:

  • Stable: predictions barely change with different training samples (e.g., linear regression, $k$-NN with large $k$)
  • High bias: the base learner systematically misses the target, and no amount of averaging can fix this (e.g., decision stumps)

Canonical Examples

Example

Bagged decision trees on a noisy regression

Consider predicting $y = \sin(x) + \epsilon$ where $\epsilon \sim N(0, 0.3)$, with $n = 100$ training points.

A single unpruned tree fits the noise closely, producing a jagged step function with high variance across different training sets. With $B = 100$ bagged trees, the averaged prediction is a smooth approximation of $\sin(x)$. The individual trees still overfit, but their overfitting patterns differ and cancel in the average.
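This example can be reproduced in a few lines, assuming scikit-learn is available for the unpruned `DecisionTreeRegressor`; the resampling and averaging are written out by hand so the bagging step stays visible.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, B = 100, 100
X = rng.uniform(0.0, 2 * np.pi, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=n)
grid = np.linspace(0.0, 2 * np.pi, 200).reshape(-1, 1)
truth = np.sin(grid[:, 0])

# One unpruned tree interpolates the noise: a jagged step function.
single = DecisionTreeRegressor(random_state=0).fit(X, y).predict(grid)

# B trees on bootstrap samples, averaged: the jaggedness cancels.
preds = []
for b in range(B):
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    preds.append(tree.predict(grid))
bagged = np.mean(preds, axis=0)

print("single-tree MSE:", np.mean((single - truth) ** 2))
print("bagged MSE:     ", np.mean((bagged - truth) ** 2))
```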

Example

Why bagging a linear model does nothing

Linear regression is a stable learner. Different bootstrap samples produce nearly identical regression coefficients (perturbed only by sampling noise). Averaging nearly identical predictions gives nearly the same prediction as a single model. The variance reduction is negligible because $\rho \approx 1$.
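A quick simulation of this claim (a sketch: the test point $x = 1$ and the data-generating line are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, B = 200, 50
X = rng.normal(size=n)
y = 2.0 * X + rng.normal(0.0, 0.5, size=n)

# Single OLS fit on the full data, evaluated at x = 1
slope, intercept = np.polyfit(X, y, 1)
single = slope * 1.0 + intercept

# Bagged: B OLS fits on bootstrap samples, averaged at x = 1
preds = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    s, c = np.polyfit(X[idx], y[idx], 1)
    preds.append(s * 1.0 + c)
preds = np.array(preds)

# The bootstrap fits barely differ, so their average is essentially
# the single-model prediction: bagging has changed nothing.
print("single:", single, "bagged:", preds.mean(), "spread:", preds.std())
```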

Common Confusions

Watch Out

Bagging is NOT the same as training on more data

Bagging reuses the same $n$ data points in different combinations. It does not increase the effective sample size. Each bootstrap sample contains about $0.632n$ distinct observations on average. The benefit comes from averaging diverse models, not from seeing more data.
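The 0.632 figure is easy to verify by simulation (a sketch; the sample size and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 1000, 200

# Fraction of distinct original points in each bootstrap sample
fracs = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(trials)]

# Converges to 1 - (1 - 1/n)^n, which approaches 1 - 1/e ~= 0.632
print(np.mean(fracs))
```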

Watch Out

Bagging does not solve underfitting

If your base learner is too simple (high bias), bagging will average many biased predictions and produce a biased ensemble. Bagging a depth-1 decision stump on a complex nonlinear problem will fail. You need a sufficiently expressive base learner for bagging to work.

Summary

  • Bagging = train $B$ models on bootstrap samples, average predictions
  • Variance of the average: $\rho\sigma^2 + (1-\rho)\sigma^2/B$
  • Perfect variance reduction ($1/B$) only if models are independent ($\rho = 0$)
  • Bagging does not reduce bias; use low-bias base learners (deep trees)
  • OOB error gives a free estimate of test error
  • Bagging helps unstable, low-bias learners; it does little for stable learners
  • Random forests improve on bagging by adding feature subsampling to reduce $\rho$

Exercises

ExerciseCore

Problem

You train $B = 200$ bagged decision trees. Each tree has prediction variance $\sigma^2 = 4.0$ and the average pairwise correlation is $\rho = 0.4$. Compute the ensemble variance. How does this compare to a single tree?

ExerciseAdvanced

Problem

Prove that bagging does not reduce bias. Specifically, show that $\mathbb{E}[\hat{f}_{\text{bag}}(x)] = \mathbb{E}[\hat{f}(x)]$ where the expectation is over the training data and bootstrap randomness.

References

Canonical:

  • Breiman, "Bagging Predictors" (Machine Learning, 1996). The original paper
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 8.7

Current:

  • Biau, Devroye, Lugosi, "Consistency of Random Forests and Other Averaging Classifiers" (JMLR, 2008)

  • Grandvalet, "Bagging Equalizes Influence" (Machine Learning, 2004)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14 (Combining Models)

  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Next Topics

The natural next steps from bagging:

  • Random forests: bagging plus feature subsampling for further decorrelation
  • Gradient boosting: the complementary ensemble strategy that reduces bias

Last reviewed: April 2026
