Ensemble Methods Theory
Why combining multiple models outperforms any single model: bias-variance decomposition for ensembles, diversity conditions, and the theoretical foundations of bagging, boosting, and stacking.
Why This Matters
Random forests, gradient boosting, and model stacking dominate tabular data competitions and production ML systems. Understanding why combining models works, and when it does not, requires decomposing ensemble error into bias, variance, and covariance terms. The central insight: averaging reduces variance only when the models being averaged make different errors.
Mental Model
Consider asking 100 people to estimate the weight of an ox. If each person's error is independent, the average of their guesses will be much closer to the true weight than any individual guess. This is the "wisdom of crowds" effect. But if everyone copies the same person's answer, averaging 100 copies gives no improvement. The benefit of averaging is proportional to the diversity of the individual estimators.
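This averaging effect is easy to simulate. The sketch below (plain NumPy, with made-up numbers for the ox's weight and the guessers' noise) compares independent guesses against everyone copying one person's answer:

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight = 540.0    # hypothetical ox weight in kg (illustrative value)
n_people = 100

# Independent errors: each guess is the true weight plus independent noise.
guesses = true_weight + rng.normal(0.0, 50.0, size=n_people)
avg_error = abs(guesses.mean() - true_weight)
typical_individual_error = np.mean(np.abs(guesses - true_weight))

# Perfectly correlated "crowd": everyone copies one person's guess,
# so averaging 100 copies changes nothing.
copied = np.full(n_people, guesses[0])
copied_error = abs(copied.mean() - true_weight)

print(f"typical individual error: {typical_individual_error:.1f} kg")
print(f"error of the average:     {avg_error:.1f} kg")
print(f"error when all copy:      {copied_error:.1f} kg")
```

With independent errors the average is far more accurate than a typical individual; with copied answers it is exactly as bad as the copied guess.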
Formal Setup
Let $f(x)$ be the true function. We have $M$ base models $h_1, \dots, h_M$ and form an ensemble prediction:

$$\bar{h}(x) = \frac{1}{M} \sum_{m=1}^{M} h_m(x)$$

Each base model $h_m$ is trained on a (possibly different) sample from distribution $\mathcal{D}$.
Ensemble Ambiguity
The ambiguity (or diversity) of an ensemble at point $x$ is the average squared deviation of individual predictions from the ensemble prediction:

$$A(x) = \frac{1}{M} \sum_{m=1}^{M} \left( h_m(x) - \bar{h}(x) \right)^2$$
Higher ambiguity means the models disagree more, which is necessary (but not sufficient) for ensemble improvement.
Ensemble Correlation
The average pairwise correlation between base model errors is:

$$\bar{\rho} = \frac{2}{M(M-1)} \sum_{i < j} \mathrm{Corr}(\epsilon_i, \epsilon_j)$$

where $\epsilon_m(x) = h_m(x) - f(x)$ is the error of model $m$. Lower correlation means more effective variance reduction.
Main Theorems
Variance Reduction by Averaging
Statement
If each base model has variance $\sigma^2$ and the average pairwise correlation of errors is $\rho$, then the variance of the ensemble average is:

$$\mathrm{Var}\left(\bar{h}\right) = \rho \sigma^2 + \frac{1 - \rho}{M} \sigma^2$$

As $M \to \infty$, the ensemble variance converges to $\rho \sigma^2$.
Intuition
The first term, $\rho \sigma^2$, is the irreducible variance from correlated errors. The second term, $\frac{1-\rho}{M}\sigma^2$, vanishes as you add more models. When models are uncorrelated ($\rho = 0$), variance decreases as $\sigma^2 / M$. When perfectly correlated ($\rho = 1$), averaging does nothing.
Proof Sketch
Write $\mathrm{Var}\left(\bar{h}\right) = \frac{1}{M^2} \sum_{i,j} \mathrm{Cov}(h_i, h_j)$. Separate diagonal terms ($i = j$, contributing $M \sigma^2$) from off-diagonal terms (contributing $M(M-1)\rho\sigma^2$). Combine and simplify.
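The formula can be checked by Monte Carlo. This NumPy sketch uses assumed demo values ($M = 25$, $\sigma^2 = 1$, $\rho = 0.2$), draws equicorrelated errors, and compares the empirical variance of the average against the theoretical value:

```python
import numpy as np

rng = np.random.default_rng(1)
M, sigma2, rho = 25, 1.0, 0.2          # assumed values for the demo
n_trials = 200_000

# Equicorrelated unit-variance errors: a shared component plus an
# independent component gives pairwise correlation exactly rho.
shared = rng.normal(size=(n_trials, 1))
indep = rng.normal(size=(n_trials, M))
errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * indep

ensemble_err = errors.mean(axis=1)      # error of the M-model average
empirical = ensemble_err.var()
theoretical = rho * sigma2 + (1.0 - rho) * sigma2 / M

print(f"empirical:   {empirical:.4f}")
print(f"theoretical: {theoretical:.4f}")
```

The shared-plus-independent construction is just one convenient way to realize a given average correlation; any error distribution with the same $\sigma^2$ and $\rho$ gives the same ensemble variance.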
Why It Matters
This theorem explains why bagging works. Bootstrap sampling introduces randomness that reduces $\rho$. Random forests go further by randomizing feature subsets at each split, which reduces correlation below what bagging alone achieves.
Failure Mode
When base models are highly correlated (e.g., slight perturbations of the same model), $\rho \approx 1$ and averaging provides negligible improvement. This happens when the base learner is stable (insensitive to training data perturbations), such as k-NN with large $k$.
Boosting Reduces Bias
Statement
Given $T$ weak classifiers, each with weighted training error $\epsilon_t < 1/2$ on the reweighted distribution at round $t$, the AdaBoost training error satisfies:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ H(x_i) \neq y_i \right] \le \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} \le \exp\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right)$$

where $\gamma_t = \frac{1}{2} - \epsilon_t$ is the edge of the $t$-th weak learner. The training error decreases exponentially in the number of rounds.
Intuition
Each round of boosting focuses on the examples the current ensemble gets wrong. The next weak learner corrects these errors, reducing bias. The exponential decrease means even weak learners (barely better than random) combine into a strong classifier given enough rounds.
Proof Sketch
Track the normalization factor $Z_t$ of the weight distribution at each round. Show that $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$. The training error is bounded by $\prod_{t=1}^{T} Z_t$.
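A minimal AdaBoost on decision stumps makes the bound concrete. This is an illustrative NumPy sketch on synthetic data, not a production implementation; it tracks the product of the normalizers $Z_t$ and checks that the training error stays below it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary labels in {-1, +1} with a noisy linear boundary.
n = 200
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n))
y[y == 0] = 1

def best_stump(X, y, w):
    """Exhaustive search for the weighted-error-minimizing threshold stump."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

T = 20
w = np.full(n, 1.0 / n)   # example weights, start uniform
F = np.zeros(n)           # running weighted vote of the stumps
bound = 1.0               # product of the normalizers Z_t
for t in range(T):
    err, j, thr, sign = best_stump(X, y, w)
    err = np.clip(err, 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)
    pred = sign * np.where(X[:, j] <= thr, 1, -1)
    F += alpha * pred
    bound *= 2.0 * np.sqrt(err * (1 - err))   # Z_t = 2 sqrt(eps_t (1 - eps_t))
    w = w * np.exp(-alpha * y * pred)         # up-weight the mistakes
    w = w / w.sum()

train_err = np.mean(np.sign(F) != y)
print(f"training error: {train_err:.3f}  bound on it: {bound:.3f}")
```

The inequality between the training error and the product of the $Z_t$ holds deterministically for any data, which is exactly the content of the theorem.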
Why It Matters
Bagging reduces variance. Boosting reduces bias. This explains why gradient boosting with shallow trees (high bias, low variance base learners) often outperforms bagging with deep trees (low bias, high variance base learners). Boosting is more effective when the base learner is weak.
Failure Mode
Boosting can overfit noisy data. The reweighting mechanism concentrates weight on hard examples, which may be mislabeled or inherently noisy. In high-noise settings, the exponential training error decrease does not translate to test error decrease. Early stopping is the standard remedy.
Stacking
Stacking (stacked generalization) trains a meta-learner on the outputs of base models. Given $M$ base models $h_1, \dots, h_M$, the meta-learner $g$ produces:

$$\hat{y}(x) = g\left( h_1(x), \dots, h_M(x) \right)$$
The meta-learner is trained on out-of-fold predictions to avoid overfitting. Stacking can learn nonlinear combinations of base models, while simple averaging assumes equal, additive contributions.
The theoretical advantage: stacking adapts to the relative strengths of base models across different regions of the input space. If model 1 is better for certain inputs and model 2 for others, the meta-learner can learn this.
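A minimal stacking sketch, assuming NumPy-only base learners (a hypothetical linear fit and a 1-nearest-neighbor lookup) and a linear meta-learner trained on out-of-fold predictions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D regression data (all names and values here are illustrative).
n = 400
X = rng.uniform(-2.0, 2.0, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def fit_linear(Xtr, ytr):
    """Ordinary least squares with intercept; returns a predict function."""
    A = np.c_[Xtr, np.ones(len(Xtr))]
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return lambda Xq: np.c_[Xq, np.ones(len(Xq))] @ coef

def fit_1nn(Xtr, ytr):
    """1-nearest-neighbor lookup; returns a predict function."""
    def predict(Xq):
        d = np.abs(Xq[:, 0][:, None] - Xtr[:, 0][None, :])
        return ytr[d.argmin(axis=1)]
    return predict

# Out-of-fold predictions: each base model predicts only on data it never saw.
K = 5
oof = np.zeros((n, 2))
for fold in np.array_split(rng.permutation(n), K):
    mask = np.ones(n, dtype=bool)
    mask[fold] = False
    for m, fit in enumerate((fit_linear, fit_1nn)):
        oof[fold, m] = fit(X[mask], y[mask])(X[fold])

# Meta-learner: linear regression (with intercept) on the base predictions.
A = np.c_[oof, np.ones(n)]
meta, *_ = np.linalg.lstsq(A, y, rcond=None)
stack_pred = A @ meta

mse = lambda p: np.mean((p - y) ** 2)
print(f"linear base MSE: {mse(oof[:, 0]):.4f}")
print(f"1-NN base MSE:   {mse(oof[:, 1]):.4f}")
print(f"stacked MSE:     {mse(stack_pred):.4f}")
```

Because the meta-features are out-of-fold, the meta-learner never sees a base model's prediction on that model's own training data, which is what keeps it from simply trusting an overfit base model.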
Diversity: The Key Ingredient
Ambiguity Decomposition
The ensemble error equals the average individual error minus the average ambiguity:

$$\left( \bar{h}(x) - f(x) \right)^2 = \frac{1}{M} \sum_{m=1}^{M} \left( h_m(x) - f(x) \right)^2 - \frac{1}{M} \sum_{m=1}^{M} \left( h_m(x) - \bar{h}(x) \right)^2$$

The ensemble is always at least as good as the average individual model (since ambiguity is non-negative).
This identity shows that maximizing diversity (ambiguity) is as important as minimizing individual error. Methods for increasing diversity: bootstrap sampling (bagging), random feature subsets (random forests), different model architectures, and different training algorithms.
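The decomposition is an exact algebraic identity and is easy to verify numerically at a single point $x$ (NumPy sketch with arbitrary made-up predictions):

```python
import numpy as np

rng = np.random.default_rng(4)
f_x = 2.0                               # true value f(x) at one point
preds = rng.normal(2.0, 0.5, size=7)    # M = 7 individual predictions h_m(x)

ens = preds.mean()                      # ensemble prediction h-bar(x)
ensemble_err = (ens - f_x) ** 2
avg_individual_err = np.mean((preds - f_x) ** 2)
ambiguity = np.mean((preds - ens) ** 2)

# The identity holds exactly, for any predictions and any target.
print(ensemble_err, avg_individual_err - ambiguity)
```

Any choice of target and predictions gives equality, so the ensemble can never be worse than the average of its members at any single point.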
Common Confusions
More models is not always better
Adding more models decreases the $\frac{1-\rho}{M}\sigma^2$ term, but the floor $\rho \sigma^2$ is approached quickly. In practice, random forests show diminishing returns past 100-500 trees. The computational cost grows linearly in the number of trees while the remaining reducible variance shrinks only as $1/M$.
Ensembles of strong learners can underperform
If all base models are identical (perfectly correlated), the ensemble is no better than a single model. Ensembling five copies of the same neural network trained from the same initialization with the same data gives $\rho = 1$. Diversity requires genuine differences in training procedure, data, or architecture.
Boosting and bagging solve different problems
Bagging reduces variance and works best with unstable, low-bias base learners (deep trees). Boosting reduces bias and works best with stable, high-bias base learners (stumps or shallow trees). Applying bagging to a high-bias model or boosting to a low-bias model is suboptimal.
Canonical Examples
Variance reduction with uncorrelated models
Suppose each of $M = 25$ base models has variance $\sigma^2 = 1$ and the average pairwise correlation is $\rho = 0.2$. The ensemble variance is $0.2 + \frac{0.8}{25} = 0.232$. This is a 76.8% reduction from a single model. If instead $\rho = 0.8$, the variance is $0.8 + \frac{0.2}{25} = 0.808$, only a 19.2% reduction. The correlation dominates the ensemble's effectiveness.
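The quoted reductions follow directly from the variance formula. A few lines of Python reproduce them, assuming $M = 25$ models of unit variance (values consistent with the percentages above):

```python
def ensemble_var(M, sigma2, rho):
    """Variance of the average of M models: rho*sigma2 + (1 - rho)*sigma2/M."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / M

for rho in (0.2, 0.8):
    v = ensemble_var(M=25, sigma2=1.0, rho=rho)
    print(f"rho={rho}: ensemble variance {v:.3f}, reduction {100 * (1 - v):.1f}%")
# rho=0.2: ensemble variance 0.232, reduction 76.8%
# rho=0.8: ensemble variance 0.808, reduction 19.2%
```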
Exercises
Problem
You have 10 regression models, each with variance $\sigma^2$. If the average pairwise correlation of their errors is $\rho$, what is the variance of their simple average? What is the minimum achievable variance as $M \to \infty$?
Problem
Prove the ambiguity decomposition: that the squared error of the ensemble average equals the average squared error of the individual models minus the average squared deviation of individual predictions from the ensemble mean.
Problem
In AdaBoost, if each of $T$ weak learners has edge $\gamma_t = 0.1$ (accuracy 60% on the reweighted distribution), what is the upper bound on training error?
References
Canonical:
- Breiman, "Bagging Predictors" (1996), Machine Learning
- Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" (1997), Journal of Computer and System Sciences
- Krogh & Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning" (1995), NeurIPS
Current:
- Zhou, Ensemble Methods: Foundations and Algorithms (2012), Chapters 2-4, 8
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Chapter 10 (Boosting), Chapter 15 (Random Forests)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14 (Combining Models)
Next Topics
- Decision trees and ensembles: practical construction of tree-based ensembles
- Cross-validation theory: out-of-fold prediction for stacking
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Bagging (Layer 2)
- Bootstrap Methods (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Gradient Boosting (Layer 2)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Operations and Properties (Layer 0A)