

Ensemble Methods Theory

Why combining multiple models outperforms any single model: bias-variance decomposition for ensembles, diversity conditions, and the theoretical foundations of bagging, boosting, and stacking.


Why This Matters

Random forests, gradient boosting, and model stacking dominate tabular data competitions and production ML systems. Understanding why combining models works, and when it does not, requires decomposing ensemble error into bias, variance, and covariance terms. The central insight: averaging reduces variance only when the models being averaged make different errors.

Mental Model

Consider asking 100 people to estimate the weight of an ox. If each person's error is independent, the average of their guesses will be much closer to the true weight than any individual guess. This is the "wisdom of crowds" effect. But if everyone copies the same person's answer, averaging 100 copies gives no improvement. The benefit of averaging is proportional to the diversity of the individual estimators.

Formal Setup

Let $f(x)$ be the true function. We have $M$ base models $h_1, \ldots, h_M$ and form an ensemble prediction:

$$\bar{h}(x) = \frac{1}{M}\sum_{m=1}^{M} h_m(x)$$

Each base model is trained on a (possibly different) sample from distribution $\mathcal{D}$.

Definition

Ensemble Ambiguity

The ambiguity (or diversity) of an ensemble at point $x$ is the average squared deviation of individual predictions from the ensemble prediction:

$$A(x) = \frac{1}{M}\sum_{m=1}^{M}\left(h_m(x) - \bar{h}(x)\right)^2$$

Higher ambiguity means the models disagree more, which is necessary (but not sufficient) for ensemble improvement.
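As a quick numerical illustration (the predictions below are made up), ambiguity is simply the variance of the base models' outputs around their mean:

```python
# Hypothetical predictions h_m(x) from M = 4 base models at one point x.
preds = [2.1, 1.8, 2.5, 2.0]

M = len(preds)
h_bar = sum(preds) / M                                  # ensemble prediction
ambiguity = sum((p - h_bar) ** 2 for p in preds) / M    # A(x)

print(h_bar)      # ≈ 2.1
print(ambiguity)  # ≈ 0.065
```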

Definition

Ensemble Correlation

The average pairwise correlation between base model errors is:

$$\bar{\rho} = \frac{1}{M(M-1)}\sum_{m \neq m'}\text{Corr}(\epsilon_m, \epsilon_{m'})$$

where $\epsilon_m(x) = h_m(x) - f(x)$. Lower correlation means more effective variance reduction.

Main Theorems

Theorem

Variance Reduction by Averaging

Statement

If each base model has variance $\sigma^2$ and the average pairwise correlation of errors is $\bar{\rho}$, then the variance of the ensemble average is:

$$\text{Var}(\bar{h}(x)) = \bar{\rho}\sigma^2 + \frac{1 - \bar{\rho}}{M}\sigma^2$$

As $M \to \infty$, the ensemble variance converges to $\bar{\rho}\sigma^2$.

Intuition

The first term $\bar{\rho}\sigma^2$ is the irreducible variance from correlated errors. The second term $\frac{(1-\bar{\rho})\sigma^2}{M}$ vanishes as you add more models. When models are uncorrelated ($\bar{\rho} = 0$), variance decreases as $\sigma^2/M$. When they are perfectly correlated ($\bar{\rho} = 1$), averaging does nothing.

Proof Sketch

Write $\text{Var}(\bar{h}) = \frac{1}{M^2}\sum_{m,m'}\text{Cov}(\epsilon_m, \epsilon_{m'})$. Separate the diagonal terms ($m = m'$, contributing $\sigma^2/M$) from the off-diagonal terms (contributing $\frac{M-1}{M}\bar{\rho}\sigma^2$). Combine and simplify.
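The theorem can be checked by Monte Carlo. In the sketch below, every model's error shares a common Gaussian component so that each pair has correlation $\rho$; the parameter values are illustrative:

```python
import random
import statistics as st

random.seed(0)
rho, sigma, M, trials = 0.3, 2.0, 10, 50_000

ens_errors = []
for _ in range(trials):
    z = random.gauss(0, 1)  # shared component -> pairwise correlation rho
    eps = [sigma * (rho ** 0.5 * z + (1 - rho) ** 0.5 * random.gauss(0, 1))
           for _ in range(M)]
    ens_errors.append(sum(eps) / M)

empirical = st.pvariance(ens_errors)
theory = rho * sigma ** 2 + (1 - rho) * sigma ** 2 / M
print(empirical, theory)  # both close to 1.48
```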

Why It Matters

This theorem explains why bagging works. Bootstrap sampling introduces randomness that reduces $\bar{\rho}$. Random forests go further by randomizing feature subsets at each split, which reduces correlation below what bagging alone achieves.

Failure Mode

When base models are highly correlated (e.g., slight perturbations of the same model), $\bar{\rho} \approx 1$ and averaging provides negligible improvement. This happens when the base learner is stable (insensitive to training data perturbations), such as k-NN with large $k$.

Proposition

Boosting Reduces Bias

Statement

Given $M$ weak classifiers, each with weighted training error $\epsilon_m < 1/2$ on the reweighted distribution at round $m$, the AdaBoost training error satisfies:

$$\hat{R}(\bar{h}) \leq \exp\left(-2\sum_{m=1}^{M}\gamma_m^2\right)$$

where $\gamma_m = 1/2 - \epsilon_m$ is the edge of the $m$-th weak learner. The training error decreases exponentially in the number of rounds.

Intuition

Each round of boosting focuses on the examples the current ensemble gets wrong. The next weak learner corrects these errors, reducing bias. The exponential decrease means even weak learners (barely better than random) combine into a strong classifier given enough rounds.

Proof Sketch

Track the normalization factor $Z_m$ of the weight distribution at each round. Show that $Z_m = 2\sqrt{\epsilon_m(1-\epsilon_m)} \leq \exp(-2\gamma_m^2)$. The training error is bounded by $\prod_m Z_m$.
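A small numeric check of the bound, with made-up per-round weighted errors: the product of the $Z_m$ bounds the training error, and each $Z_m \leq \exp(-2\gamma_m^2)$, so the product is at most the exponential bound:

```python
import math

# Hypothetical weighted errors eps_m of each weak learner at its round.
eps = [0.30, 0.35, 0.25, 0.40, 0.20]
gammas = [0.5 - e for e in eps]

Z = [2 * math.sqrt(e * (1 - e)) for e in eps]   # per-round normalizers
train_err_bound = math.prod(Z)                  # bounds the training error
exp_bound = math.exp(-2 * sum(g * g for g in gammas))

print(train_err_bound)  # ≈ 0.593
print(exp_bound)        # ≈ 0.638
```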

Why It Matters

Bagging reduces variance. Boosting reduces bias. This explains why gradient boosting with shallow trees (high bias, low variance base learners) often outperforms bagging with deep trees (low bias, high variance base learners). Boosting is more effective when the base learner is weak.

Failure Mode

Boosting can overfit noisy data. The reweighting mechanism concentrates weight on hard examples, which may be mislabeled or inherently noisy. In high-noise settings, the exponential training error decrease does not translate to test error decrease. Early stopping is the standard remedy.

Stacking

Stacking (stacked generalization) trains a meta-learner on the outputs of base models. Given base models $h_1, \ldots, h_M$, the meta-learner $g$ produces:

$$\hat{y} = g(h_1(x), h_2(x), \ldots, h_M(x))$$

The meta-learner is trained on out-of-fold predictions to avoid overfitting. Stacking can learn nonlinear combinations of base models, while simple averaging assumes equal, additive contributions.

The theoretical advantage: stacking adapts to the relative strengths of base models across different regions of the input space. If model 1 is better for certain inputs and model 2 for others, the meta-learner can learn this.
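A minimal sketch of the out-of-fold protocol, using two toy base learners (a mean predictor and a 1-nearest-neighbour) and a single blending weight as the meta-learner. Everything here is illustrative, not a production stacking implementation:

```python
import random

random.seed(1)

# Toy data: y = x^2 plus noise, with two deliberately different base models.
xs = [i / 20 for i in range(100)]
ys = [x * x + random.gauss(0, 0.05) for x in xs]

def model_a(train, x):
    # High-bias baseline: always predicts the training-set mean.
    return sum(y for _, y in train) / len(train)

def model_b(train, x):
    # Low-bias, high-variance baseline: 1-nearest-neighbour.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Out-of-fold predictions: each point is predicted by base models fitted
# on the other folds, so the meta-learner never sees leaked targets.
K = 5
p_a, p_b = [], []
for i, x in enumerate(xs):
    train = [(xs[j], ys[j]) for j in range(len(xs)) if j % K != i % K]
    p_a.append(model_a(train, x))
    p_b.append(model_b(train, x))

# Meta-learner: a single blending weight w for w*p_a + (1-w)*p_b,
# fitted by least squares in closed form.
num = sum((y - b) * (a - b) for y, a, b in zip(ys, p_a, p_b))
den = sum((a - b) ** 2 for a, b in zip(p_a, p_b))
w = num / den
print(w)  # near 0 here: the meta-learner leans on the better base model
```

With a richer meta-learner $g$ (e.g., one weight per region of the input space), the same out-of-fold pattern lets stacking exploit which base model is better where.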

Diversity: The Key Ingredient

Definition

Ambiguity Decomposition

The ensemble error equals the average individual error minus the average ambiguity:

$$\mathbb{E}[(\bar{h}(x) - f(x))^2] = \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}[(h_m(x) - f(x))^2] - \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}[(h_m(x) - \bar{h}(x))^2]$$

The ensemble is always at least as good as the average individual model (since ambiguity is non-negative).

This identity shows that maximizing diversity (ambiguity) is as important as minimizing individual error. Methods for increasing diversity: bootstrap sampling (bagging), random feature subsets (random forests), different model architectures, and different training algorithms.
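The identity can be verified directly at a single point with made-up numbers:

```python
# Verify the ambiguity decomposition at one point x.
f_x = 3.0                         # true value f(x)
preds = [2.4, 3.3, 2.9, 3.6]      # hypothetical h_m(x)

M = len(preds)
h_bar = sum(preds) / M
ens_err = (h_bar - f_x) ** 2
avg_err = sum((p - f_x) ** 2 for p in preds) / M
ambiguity = sum((p - h_bar) ** 2 for p in preds) / M

print(ens_err, avg_err - ambiguity)  # equal, as the identity requires
```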

Common Confusions

Watch Out

More models is not always better

Adding more models shrinks the $\frac{(1-\bar{\rho})\sigma^2}{M}$ term, but the floor $\bar{\rho}\sigma^2$ is approached quickly. In practice, random forests show diminishing returns past 100-500 trees: the computational cost grows linearly in the number of trees while the accuracy improvement flattens out.
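Plugging increasing $M$ into the variance formula (with illustrative values $\bar{\rho} = 0.2$, $\sigma^2 = 1$) shows how quickly the floor is reached:

```python
def ens_var(M, rho=0.2, sigma2=1.0):
    # Ensemble variance: rho*sigma2 + (1 - rho)*sigma2 / M
    return rho * sigma2 + (1 - rho) * sigma2 / M

for M in (1, 10, 100, 500, 1000):
    print(M, round(ens_var(M), 4))
# 1 -> 1.0, 10 -> 0.28, 100 -> 0.208, 500 -> 0.2016, 1000 -> 0.2008:
# the floor rho*sigma2 = 0.2 dominates well before 1000 models
```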

Watch Out

Ensembles of strong learners can underperform

If all base models are identical (perfectly correlated), the ensemble is no better than a single model. Ensembling five copies of the same neural network trained from the same initialization with the same data gives $\bar{\rho} \approx 1$. Diversity requires genuine differences in training procedure, data, or architecture.

Watch Out

Boosting and bagging solve different problems

Bagging reduces variance and works best with unstable, low-bias base learners (deep trees). Boosting reduces bias and works best with stable, high-bias base learners (stumps or shallow trees). Applying bagging to a high-bias model or boosting to a low-bias model is suboptimal.

Canonical Examples

Example

Variance reduction with uncorrelated models

Suppose each of $M = 25$ base models has variance $\sigma^2 = 1.0$ and the average pairwise correlation is $\bar{\rho} = 0.2$. The ensemble variance is $0.2 \cdot 1.0 + \frac{0.8}{25} \cdot 1.0 = 0.2 + 0.032 = 0.232$, a 76.8% reduction from a single model. If $\bar{\rho} = 0.8$ instead: $0.8 + 0.008 = 0.808$, only a 19.2% reduction. The correlation dominates the ensemble's effectiveness.
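The same arithmetic as a short sketch:

```python
M, sigma2 = 25, 1.0

def ens_var(rho):
    return rho * sigma2 + (1 - rho) * sigma2 / M

for rho in (0.2, 0.8):
    v = ens_var(rho)
    print(rho, v, f"{1 - v / sigma2:.1%} reduction")
# 0.2 -> 0.232 (76.8% reduction); 0.8 -> 0.808 (19.2% reduction)
```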

Exercises

ExerciseCore

Problem

You have 10 regression models, each with variance $\sigma^2 = 4.0$. If the average pairwise correlation of their errors is $\bar{\rho} = 0.3$, what is the variance of their simple average? What is the minimum achievable variance as $M \to \infty$?

ExerciseAdvanced

Problem

Prove the ambiguity decomposition: that the squared error of the ensemble average equals the average squared error of the individual models minus the average squared deviation of individual predictions from the ensemble mean.

ExerciseCore

Problem

In AdaBoost, if each of $M = 20$ weak learners has edge $\gamma_m = 0.1$ (accuracy 60% on the reweighted distribution), what is the upper bound on training error?

References

Canonical:

  • Breiman, "Bagging Predictors" (1996), Machine Learning
  • Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning" (1997)
  • Krogh & Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning" (1995), NeurIPS

Current:

  • Zhou, Ensemble Methods: Foundations and Algorithms (2012), Chapters 2-4, 8
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Chapter 10 (Boosting), Chapter 15 (Random Forests)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14 (Combining Models)

Last reviewed: April 2026