Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Bagging

Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.


Why This Matters

Bagging is the simplest ensemble method and the foundation for random forests. The idea is so clean it can be stated in one sentence: train many copies of an unstable learner on different bootstrap samples and average the results. Yet this simple trick can dramatically improve prediction accuracy for high-variance models like decision trees.

Understanding bagging teaches you the fundamental principle behind all averaging ensembles: variance reduction through diversity.

Mental Model

Imagine you ask 100 people to independently estimate the number of jellybeans in a jar. Each individual estimate is noisy, but their average is remarkably close to the truth. This is the wisdom-of-crowds effect, and it works because individual errors are partially independent and cancel when averaged.

Bagging does the same thing with models. Each bootstrap sample creates a slightly different training set, producing a slightly different model. The models make different errors, and averaging smooths these errors out.

The Bagging Algorithm

Definition

Bagging (Bootstrap Aggregation)

Given training data $\{(x_i, y_i)\}_{i=1}^n$ and a base learner $\hat{f}$:

  1. For $b = 1, \ldots, B$:
    • Draw a bootstrap sample $S_b$ of size $n$ with replacement from the training data
    • Train a base learner $\hat{f}_b$ on $S_b$
  2. The bagged prediction is:

Regression: $\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$

Classification: $\hat{f}_{\text{bag}}(x) = \text{majority vote of } \hat{f}_1(x), \ldots, \hat{f}_B(x)$

The base learner is typically a decision tree grown without pruning (high variance, low bias). But bagging can be applied to any learner.
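The two steps above are short enough to sketch directly. The following is a minimal NumPy sketch of bagging for regression; it uses a depth-1 stump as the base learner purely to keep the example self-contained (in practice the base learner would be a deep, unpruned tree), and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Depth-1 regression stump: split at the median, predict leaf means.
    (A stand-in base learner; bagging normally uses deep, unpruned trees.)"""
    t = np.median(X)
    lo = y[X <= t].mean()
    hi = y[X > t].mean() if np.any(X > t) else y.mean()
    return t, lo, hi

def predict_stump(model, X):
    t, lo, hi = model
    return np.where(X <= t, lo, hi)

def bag(X, y, B=100):
    """Step 1: train B base learners on bootstrap samples of size n."""
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        models.append(fit_stump(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    """Step 2 (regression): average the B predictions."""
    return np.mean([predict_stump(m, X) for m in models], axis=0)
```

For classification, the final step would instead take a majority vote over the $B$ predicted labels.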

Why Bagging Reduces Variance

Theorem

Variance of Averaged Estimators

Statement

Let $\hat{f}_1, \ldots, \hat{f}_B$ be $B$ estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

Special cases:

  • If $\rho = 0$ (independent estimators): variance $= \sigma^2/B$, a perfect $1/B$ reduction
  • If $\rho = 1$ (identical estimators): variance $= \sigma^2$, no reduction
  • As $B \to \infty$: variance $\to \rho\sigma^2$, the irreducible limit

Intuition

Averaging independent noisy estimates reduces noise by a factor of $1/B$. But bagged estimators are not independent: they are trained on overlapping bootstrap samples. A given training point appears in any one bootstrap sample with probability $1 - (1-1/n)^n \approx 1 - 1/e \approx 0.632$, so in expectation about $0.632^2 \approx 40\%$ of the training points appear in both members of any pair of samples. This correlation $\rho > 0$ creates a floor below which variance cannot fall, no matter how many models you average.

Proof Sketch

$$\text{Var}\left(\frac{1}{B}\sum_b \hat{f}_b\right) = \frac{1}{B^2}\left[\sum_b \text{Var}(\hat{f}_b) + \sum_{b \neq b'}\text{Cov}(\hat{f}_b, \hat{f}_{b'})\right]$$

There are $B$ variance terms, each equal to $\sigma^2$, and $B(B-1)$ covariance terms, each equal to $\rho\sigma^2$:

$$= \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2 = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{B}$$

Why It Matters

This formula explains when bagging works and when it does not. Bagging is most effective for high-variance learners (large $\sigma^2$) with moderate correlation (moderate $\rho$). For stable learners like $k$-nearest neighbors with large $k$, $\sigma^2$ is already small and bagging helps little. For decision trees, $\sigma^2$ is large and $\rho$ is moderate, so bagging gives substantial gains. This formula also motivates random forests: feature subsampling further reduces $\rho$.
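Plugging numbers into the formula makes this concrete. A sketch with hypothetical values ($\sigma^2 = 4$, the other numbers are arbitrary): adding trees only shrinks the $1/B$ term, while lowering $\rho$ moves the floor itself.

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the average of B estimators, each with variance sigma2
    and uniform pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

# Adding trees only shrinks the second term; it cannot touch rho * sigma2.
for B in (10, 100, 1000):
    print(B, ensemble_variance(4.0, 0.4, B))   # floor at 0.4 * 4.0 = 1.6

# Decorrelating the trees (the random-forest strategy) lowers the floor.
for rho in (0.6, 0.4, 0.2):
    print(rho, ensemble_variance(4.0, rho, 100))
```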

Failure Mode

The formula assumes equal variances and uniform pairwise correlation. In practice, some bootstrap samples produce better models than others. Bagging also does not reduce bias at all: the expectation of the bagged estimator equals the expectation of a single estimator (since the bootstrap distribution is symmetric). If the base learner is biased (e.g., a shallow tree), bagging cannot fix this.

Bagging Does Not Reduce Bias

This is the most important theoretical limitation of bagging. The bias of the bagged estimator is:

$$\text{Bias}(\hat{f}_{\text{bag}}) = \mathbb{E}[\hat{f}_{\text{bag}}(x)] - f(x) = \mathbb{E}[\hat{f}(x)] - f(x) = \text{Bias}(\hat{f})$$

Averaging preserves the mean. If each base learner systematically under- or over-predicts, the average will too. This is why bagging uses deep trees (low bias, high variance) rather than shallow trees (high bias, low variance). The variance reduction from averaging compensates for the high individual tree variance, while the low bias is preserved.

Out-of-Bag Error Estimation

Definition

Out-of-Bag (OOB) Error

Each bootstrap sample omits about 36.8% of the training points. For training point $(x_i, y_i)$, let $\mathcal{B}_i = \{b : (x_i, y_i) \notin S_b\}$ be the set of trees that did not train on $(x_i, y_i)$. The OOB prediction is:

$$\hat{y}_i^{\text{OOB}} = \frac{1}{|\mathcal{B}_i|} \sum_{b \in \mathcal{B}_i} \hat{f}_b(x_i)$$

The OOB error is the average loss over all training points using only their OOB predictions.

The OOB error behaves like leave-one-out cross-validation performed for free during bagging: each training point is predicted only by the roughly $B/e$ trees that never saw it. As $B$ grows, the OOB error becomes an increasingly accurate estimate of the true test error.

This means bagging provides a built-in validation mechanism without requiring a held-out set or cross-validation.
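The bookkeeping behind OOB error is straightforward to sketch. This sketch reuses a depth-1 stump as the base learner for brevity; the names are illustrative, not a library API.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y):
    # Depth-1 stump stand-in for a tree (illustrative base learner)
    t = np.median(X)
    lo = y[X <= t].mean()
    hi = y[X > t].mean() if np.any(X > t) else y.mean()
    return t, lo, hi

def predict_stump(model, X):
    t, lo, hi = model
    return np.where(X <= t, lo, hi)

def oob_error(X, y, B=200):
    n = len(X)
    preds = np.empty((B, n))
    in_bag = np.zeros((B, n), dtype=bool)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        in_bag[b] = np.bincount(idx, minlength=n) > 0   # points this model saw
        preds[b] = predict_stump(fit_stump(X[idx], y[idx]), X)
    oob_pred = np.empty(n)
    for i in range(n):
        out = ~in_bag[:, i]          # the ~B/e models that never saw point i
        # For large B every point has some OOB models; fall back just in case
        oob_pred[i] = preds[out, i].mean() if out.any() else preds[:, i].mean()
    return np.mean((oob_pred - y) ** 2)   # squared-error loss
```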

When Bagging Helps and When It Hurts

Bagging helps when the base learner is:

  • Unstable: small changes in the training data cause large changes in the model (e.g., deep decision trees, neural networks)
  • Low bias: the base learner can approximate the target function well, but with high variance

Bagging helps little when the base learner is:

  • Stable: predictions barely change with different training samples (e.g., linear regression, $k$-NN with large $k$)
  • High bias: the base learner systematically misses the target, and no amount of averaging can fix this (e.g., decision stumps)

Canonical Examples

Example

Bagged decision trees on a noisy regression

Consider predicting $y = \sin(x) + \epsilon$ where $\epsilon \sim N(0, 0.3)$, with $n = 100$ training points.

A single unpruned tree fits the noise closely, producing a jagged step function with high variance across different training sets. With $B = 100$ bagged trees, the averaged prediction is a smooth approximation of $\sin(x)$. The individual trees still overfit, but their overfitting patterns differ and cancel in the average.
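This example can be reproduced in a few lines, assuming scikit-learn is available for the unpruned `DecisionTreeRegressor`; the resampling and averaging are written out by hand so the bagging step stays visible.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, B = 100, 100
X = rng.uniform(0.0, 2 * np.pi, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=n)
grid = np.linspace(0.0, 2 * np.pi, 200).reshape(-1, 1)
truth = np.sin(grid[:, 0])

# One unpruned tree interpolates the noise: a jagged step function.
single = DecisionTreeRegressor(random_state=0).fit(X, y).predict(grid)

# B trees on bootstrap samples, averaged: the jaggedness cancels.
preds = []
for b in range(B):
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    preds.append(tree.predict(grid))
bagged = np.mean(preds, axis=0)

print("single-tree MSE:", np.mean((single - truth) ** 2))
print("bagged MSE:     ", np.mean((bagged - truth) ** 2))
```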

Example

Why bagging a linear model does nothing

Linear regression is a stable learner. Different bootstrap samples produce nearly identical regression coefficients (perturbed only by sampling noise). Averaging nearly identical predictions gives nearly the same prediction as a single model. The variance reduction is negligible because $\rho \approx 1$.
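A quick simulation of this claim (a sketch: the test point $x = 1$ and the data-generating line are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, B = 200, 50
X = rng.normal(size=n)
y = 2.0 * X + rng.normal(0.0, 0.5, size=n)

# Single OLS fit on the full data, evaluated at x = 1
slope, intercept = np.polyfit(X, y, 1)
single = slope * 1.0 + intercept

# Bagged: B OLS fits on bootstrap samples, averaged at x = 1
preds = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    s, c = np.polyfit(X[idx], y[idx], 1)
    preds.append(s * 1.0 + c)
preds = np.array(preds)

# The bootstrap fits barely differ, so their average is essentially
# the single-model prediction: bagging has changed nothing.
print("single:", single, "bagged:", preds.mean(), "spread:", preds.std())
```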

Common Confusions

Watch Out

Bagging is NOT the same as training on more data

Bagging reuses the same $n$ data points in different combinations. It does not increase the effective sample size. Each bootstrap sample contains about $0.632n$ distinct observations on average. The benefit comes from averaging diverse models, not from seeing more data.
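The 0.632 figure is easy to verify by simulation (a sketch; the sample size and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 1000, 200

# Fraction of distinct original points in each bootstrap sample
fracs = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(trials)]

# Converges to 1 - (1 - 1/n)^n, which approaches 1 - 1/e ~= 0.632
print(np.mean(fracs))
```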

Watch Out

Bagging does not solve underfitting

If your base learner is too simple (high bias), bagging will average many biased predictions and produce a biased ensemble. Bagging a depth-1 decision stump on a complex nonlinear problem will fail. You need a sufficiently expressive base learner for bagging to work.

Summary

  • Bagging = train $B$ models on bootstrap samples, average predictions
  • Variance of the average: $\rho\sigma^2 + (1-\rho)\sigma^2/B$
  • Perfect variance reduction ($1/B$) only if models are independent ($\rho = 0$)
  • Bagging does not reduce bias; use low-bias base learners (deep trees)
  • OOB error gives a free estimate of test error
  • Bagging helps unstable, low-bias learners; it does little for stable learners
  • Random forests improve on bagging by adding feature subsampling to reduce $\rho$

Exercises

ExerciseCore

Problem

You train $B = 200$ bagged decision trees. Each tree has prediction variance $\sigma^2 = 4.0$ and the average pairwise correlation is $\rho = 0.4$. Compute the ensemble variance. How does this compare to a single tree?

ExerciseAdvanced

Problem

Prove that bagging does not reduce bias. Specifically, show that $\mathbb{E}[\hat{f}_{\text{bag}}(x)] = \mathbb{E}[\hat{f}(x)]$ where the expectation is over the training data and bootstrap randomness.

References

Canonical:

  • Breiman, "Bagging Predictors" (Machine Learning, 1996). The original paper
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 8.7

Current:

  • Biau, Devroye, Lugosi, "Consistency of Random Forests and Other Averaging Classifiers" (JMLR, 2008)

  • Grandvalet, "Bagging Equalizes Influence" (Machine Learning, 2004)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14 (Combining Models)

  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Next Topics

The natural next steps from bagging:

  • Random forests: bagging plus feature subsampling for further decorrelation
  • Gradient boosting: the complementary ensemble strategy that reduces bias

Last reviewed: April 2026
