Ensemble Methods Theory
Why combining multiple models outperforms any single model: bias-variance decomposition for ensembles, diversity conditions, and the theoretical foundations of bagging, boosting, and stacking.
Why This Matters
Random forests, gradient boosting, and model stacking dominate tabular data competitions and production ML systems. Understanding why combining models works, and when it does not, requires decomposing ensemble error into bias, variance, and covariance terms. The central insight: averaging reduces variance only when the models being averaged make different errors.
Mental Model
Consider asking 100 people to estimate the weight of an ox. If each person's error is independent, the average of their guesses will be much closer to the true weight than any individual guess. This is the "wisdom of crowds" effect. But if everyone copies the same person's answer, averaging 100 copies gives no improvement. The benefit of averaging is proportional to the diversity of the individual estimators.
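This averaging effect is easy to simulate. The sketch below (plain NumPy, with made-up numbers for the ox's weight and the guessers' noise) compares independent guesses against everyone copying one person's answer:

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight = 540.0    # hypothetical ox weight in kg (illustrative value)
n_people = 100

# Independent errors: each guess is the true weight plus independent noise.
guesses = true_weight + rng.normal(0.0, 50.0, size=n_people)
avg_error = abs(guesses.mean() - true_weight)
typical_individual_error = np.mean(np.abs(guesses - true_weight))

# Perfectly correlated "crowd": everyone copies one person's guess,
# so averaging 100 copies changes nothing.
copied = np.full(n_people, guesses[0])
copied_error = abs(copied.mean() - true_weight)

print(f"typical individual error: {typical_individual_error:.1f} kg")
print(f"error of the average:     {avg_error:.1f} kg")
print(f"error when all copy:      {copied_error:.1f} kg")
```

With independent errors the average is far more accurate than a typical individual; with copied answers it is exactly as bad as the copied guess.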
Formal Setup
Let $f(x)$ be the true function. We have $M$ base models $h_1, \dots, h_M$ and form an ensemble prediction:

$$\bar{h}(x) = \frac{1}{M} \sum_{m=1}^{M} h_m(x)$$

Each base model $h_m$ is trained on a (possibly different) sample from distribution $\mathcal{D}$.
Ensemble Ambiguity
The ambiguity (or diversity) of an ensemble at point $x$ is the average squared deviation of individual predictions from the ensemble prediction:

$$A(x) = \frac{1}{M} \sum_{m=1}^{M} \left( h_m(x) - \bar{h}(x) \right)^2$$
Higher ambiguity means the models disagree more, which is necessary (but not sufficient) for ensemble improvement.
Ensemble Correlation
The average pairwise correlation between base model errors is:

$$\bar{\rho} = \frac{2}{M(M-1)} \sum_{i < j} \mathrm{Corr}(\epsilon_i, \epsilon_j)$$

where $\epsilon_m(x) = h_m(x) - f(x)$ is the error of model $m$. Lower correlation means more effective variance reduction.
Main Theorems
Variance Reduction by Averaging
Statement
If each base model has variance $\sigma^2$ and the average pairwise correlation of errors is $\rho$, then the variance of the ensemble average is:

$$\mathrm{Var}\left(\bar{h}\right) = \rho \sigma^2 + \frac{1 - \rho}{M} \sigma^2$$

As $M \to \infty$, the ensemble variance converges to $\rho \sigma^2$.
Intuition
The first term, $\rho \sigma^2$, is the irreducible variance from correlated errors. The second term, $\frac{1-\rho}{M}\sigma^2$, vanishes as you add more models. When models are uncorrelated ($\rho = 0$), variance decreases as $\sigma^2 / M$. When perfectly correlated ($\rho = 1$), averaging does nothing.
Proof Sketch
Write $\mathrm{Var}\left(\bar{h}\right) = \frac{1}{M^2} \sum_{i,j} \mathrm{Cov}(h_i, h_j)$. Separate diagonal terms ($i = j$, contributing $M \sigma^2$) from off-diagonal terms (contributing $M(M-1)\rho\sigma^2$). Combine and simplify.
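The formula can be checked by Monte Carlo. This NumPy sketch uses assumed demo values ($M = 25$, $\sigma^2 = 1$, $\rho = 0.2$), draws equicorrelated errors, and compares the empirical variance of the average against the theoretical value:

```python
import numpy as np

rng = np.random.default_rng(1)
M, sigma2, rho = 25, 1.0, 0.2          # assumed values for the demo
n_trials = 200_000

# Equicorrelated unit-variance errors: a shared component plus an
# independent component gives pairwise correlation exactly rho.
shared = rng.normal(size=(n_trials, 1))
indep = rng.normal(size=(n_trials, M))
errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * indep

ensemble_err = errors.mean(axis=1)      # error of the M-model average
empirical = ensemble_err.var()
theoretical = rho * sigma2 + (1.0 - rho) * sigma2 / M

print(f"empirical:   {empirical:.4f}")
print(f"theoretical: {theoretical:.4f}")
```

The shared-plus-independent construction is just one convenient way to realize a given average correlation; any error distribution with the same $\sigma^2$ and $\rho$ gives the same ensemble variance.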
Why It Matters
This theorem explains why bagging works. Bootstrap sampling introduces randomness that reduces $\rho$. Random forests go further by randomizing feature subsets at each split, which reduces correlation below what bagging alone achieves.
Failure Mode
When base models are highly correlated (e.g., slight perturbations of the same model), $\rho \approx 1$ and averaging provides negligible improvement. This happens when the base learner is stable (insensitive to training data perturbations), such as k-NN with large $k$.
Boosting Reduces Bias
Statement
Given $T$ weak classifiers, each with weighted training error $\epsilon_t < 1/2$ on the reweighted distribution at round $t$, the AdaBoost training error satisfies:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ H(x_i) \neq y_i \right] \le \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} \le \exp\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right)$$

where $\gamma_t = \frac{1}{2} - \epsilon_t$ is the edge of the $t$-th weak learner. The training error decreases exponentially in the number of rounds.
Intuition
Each round of boosting focuses on the examples the current ensemble gets wrong. The next weak learner corrects these errors, reducing bias. The exponential decrease means even weak learners (barely better than random) combine into a strong classifier given enough rounds.
Proof Sketch
Track the normalization factor $Z_t$ of the weight distribution at each round. Show that $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$. The training error is bounded by $\prod_{t=1}^{T} Z_t$.
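A minimal AdaBoost on decision stumps makes the bound concrete. This is an illustrative NumPy sketch on synthetic data, not a production implementation; it tracks the product of the normalizers $Z_t$ and checks that the training error stays below it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary labels in {-1, +1} with a noisy linear boundary.
n = 200
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n))
y[y == 0] = 1

def best_stump(X, y, w):
    """Exhaustive search for the weighted-error-minimizing threshold stump."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

T = 20
w = np.full(n, 1.0 / n)   # example weights, start uniform
F = np.zeros(n)           # running weighted vote of the stumps
bound = 1.0               # product of the normalizers Z_t
for t in range(T):
    err, j, thr, sign = best_stump(X, y, w)
    err = np.clip(err, 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)
    pred = sign * np.where(X[:, j] <= thr, 1, -1)
    F += alpha * pred
    bound *= 2.0 * np.sqrt(err * (1 - err))   # Z_t = 2 sqrt(eps_t (1 - eps_t))
    w = w * np.exp(-alpha * y * pred)         # up-weight the mistakes
    w = w / w.sum()

train_err = np.mean(np.sign(F) != y)
print(f"training error: {train_err:.3f}  bound on it: {bound:.3f}")
```

The inequality between the training error and the product of the $Z_t$ holds deterministically for any data, which is exactly the content of the theorem.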
Why It Matters
Bagging reduces variance. Boosting reduces bias. This explains why gradient boosting with shallow trees (high bias, low variance base learners) often outperforms bagging with deep trees (low bias, high variance base learners). Boosting is more effective when the base learner is weak.
Failure Mode
Boosting can overfit noisy data. The reweighting mechanism concentrates weight on hard examples, which may be mislabeled or inherently noisy. In high-noise settings, the exponential training error decrease does not translate to test error decrease. Early stopping is the standard remedy.
Stacking
Stacking (stacked generalization) trains a meta-learner on the outputs of base models. Given $M$ base models $h_1, \dots, h_M$, the meta-learner $g$ produces:

$$\hat{y}(x) = g\left( h_1(x), \dots, h_M(x) \right)$$
The meta-learner is trained on out-of-fold predictions to avoid overfitting. Stacking can learn nonlinear combinations of base models, while simple averaging assumes equal, additive contributions.
The theoretical advantage: stacking adapts to the relative strengths of base models across different regions of the input space. If model 1 is better for certain inputs and model 2 for others, the meta-learner can learn this.
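A minimal stacking sketch, assuming NumPy-only base learners (a hypothetical linear fit and a 1-nearest-neighbor lookup) and a linear meta-learner trained on out-of-fold predictions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D regression data (all names and values here are illustrative).
n = 400
X = rng.uniform(-2.0, 2.0, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def fit_linear(Xtr, ytr):
    """Ordinary least squares with intercept; returns a predict function."""
    A = np.c_[Xtr, np.ones(len(Xtr))]
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return lambda Xq: np.c_[Xq, np.ones(len(Xq))] @ coef

def fit_1nn(Xtr, ytr):
    """1-nearest-neighbor lookup; returns a predict function."""
    def predict(Xq):
        d = np.abs(Xq[:, 0][:, None] - Xtr[:, 0][None, :])
        return ytr[d.argmin(axis=1)]
    return predict

# Out-of-fold predictions: each base model predicts only on data it never saw.
K = 5
oof = np.zeros((n, 2))
for fold in np.array_split(rng.permutation(n), K):
    mask = np.ones(n, dtype=bool)
    mask[fold] = False
    for m, fit in enumerate((fit_linear, fit_1nn)):
        oof[fold, m] = fit(X[mask], y[mask])(X[fold])

# Meta-learner: linear regression (with intercept) on the base predictions.
A = np.c_[oof, np.ones(n)]
meta, *_ = np.linalg.lstsq(A, y, rcond=None)
stack_pred = A @ meta

mse = lambda p: np.mean((p - y) ** 2)
print(f"linear base MSE: {mse(oof[:, 0]):.4f}")
print(f"1-NN base MSE:   {mse(oof[:, 1]):.4f}")
print(f"stacked MSE:     {mse(stack_pred):.4f}")
```

Because the meta-features are out-of-fold, the meta-learner never sees a base model's prediction on that model's own training data, which is what keeps it from simply trusting an overfit base model.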
Diversity: The Key Ingredient
Ambiguity Decomposition
The ensemble error equals the average individual error minus the average ambiguity:

$$\left( \bar{h}(x) - f(x) \right)^2 = \frac{1}{M} \sum_{m=1}^{M} \left( h_m(x) - f(x) \right)^2 - \frac{1}{M} \sum_{m=1}^{M} \left( h_m(x) - \bar{h}(x) \right)^2$$

The ensemble is always at least as good as the average individual model (since ambiguity is non-negative).
This identity shows that maximizing diversity (ambiguity) is as important as minimizing individual error. Methods for increasing diversity: bootstrap sampling (bagging), random feature subsets (random forests), different model architectures, and different training algorithms.
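The decomposition is an exact algebraic identity and is easy to verify numerically at a single point $x$ (NumPy sketch with arbitrary made-up predictions):

```python
import numpy as np

rng = np.random.default_rng(4)
f_x = 2.0                               # true value f(x) at one point
preds = rng.normal(2.0, 0.5, size=7)    # M = 7 individual predictions h_m(x)

ens = preds.mean()                      # ensemble prediction h-bar(x)
ensemble_err = (ens - f_x) ** 2
avg_individual_err = np.mean((preds - f_x) ** 2)
ambiguity = np.mean((preds - ens) ** 2)

# The identity holds exactly, for any predictions and any target.
print(ensemble_err, avg_individual_err - ambiguity)
```

Any choice of target and predictions gives equality, so the ensemble can never be worse than the average of its members at any single point.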
Common Confusions
More models is not always better
Adding more models decreases the $\frac{1-\rho}{M}\sigma^2$ term, but the floor $\rho \sigma^2$ is approached quickly. In practice, random forests show diminishing returns past 100-500 trees. The computational cost grows linearly in the number of trees while the remaining reducible variance shrinks only as $1/M$.
Ensembles of strong learners can underperform
If all base models are identical (perfectly correlated), the ensemble is no better than a single model. Ensembling five copies of the same neural network trained from the same initialization with the same data gives $\rho = 1$. Diversity requires genuine differences in training procedure, data, or architecture.
Boosting and bagging solve different problems
Bagging reduces variance and works best with unstable, low-bias base learners (deep trees). Boosting reduces bias and works best with stable, high-bias base learners (stumps or shallow trees). Applying bagging to a high-bias model or boosting to a low-bias model is suboptimal.
Canonical Examples
Variance reduction with uncorrelated models
Suppose each of $M = 25$ base models has variance $\sigma^2 = 1$ and the average pairwise correlation is $\rho = 0.2$. The ensemble variance is $0.2 + \frac{0.8}{25} = 0.232$. This is a 76.8% reduction from a single model. If instead $\rho = 0.8$, the variance is $0.8 + \frac{0.2}{25} = 0.808$, only a 19.2% reduction. The correlation dominates the ensemble's effectiveness.
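The quoted reductions follow directly from the variance formula. A few lines of Python reproduce them, assuming $M = 25$ models of unit variance (values consistent with the percentages above):

```python
def ensemble_var(M, sigma2, rho):
    """Variance of the average of M models: rho*sigma2 + (1 - rho)*sigma2/M."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / M

for rho in (0.2, 0.8):
    v = ensemble_var(M=25, sigma2=1.0, rho=rho)
    print(f"rho={rho}: ensemble variance {v:.3f}, reduction {100 * (1 - v):.1f}%")
# rho=0.2: ensemble variance 0.232, reduction 76.8%
# rho=0.8: ensemble variance 0.808, reduction 19.2%
```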
Exercises
Problem
You have 10 regression models, each with variance $\sigma^2$. If the average pairwise correlation of their errors is $\rho$, what is the variance of their simple average? What is the minimum achievable variance as $M \to \infty$?
Problem
Prove the ambiguity decomposition: that the squared error of the ensemble average equals the average squared error of the individual models minus the average squared deviation of individual predictions from the ensemble mean.
Problem
In AdaBoost, if each of $T$ weak learners has edge $\gamma_t = 0.1$ (accuracy 60% on the reweighted distribution), what is the upper bound on training error?
References
Canonical:
- Breiman, "Bagging Predictors" (1996), Machine Learning
- Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" (1997), Journal of Computer and System Sciences
- Krogh & Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning" (1995), NeurIPS
Current:
- Zhou, Ensemble Methods: Foundations and Algorithms (2012), Chapters 2-4, 8
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Chapter 10 (Boosting), Chapter 15 (Random Forests)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14 (Combining Models)
Next Topics
- Decision trees and ensembles: practical construction of tree-based ensembles
- Cross-validation theory: out-of-fold prediction for stacking
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Bagging (Layer 2)
- Bootstrap Methods (Layer 2)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Gradient Boosting (Layer 2)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Operations and Properties (Layer 0A)