
ML Methods

Decision Trees and Ensembles

Greedy recursive partitioning with splitting criteria, pruning, and why combining weak learners via bagging (random forests) and boosting (gradient boosting) yields strong predictors.


Why This Matters

Decision trees are the building block for the most practically successful ML methods. Random forests and gradient-boosted trees (XGBoost, LightGBM) dominate tabular data competitions and production systems. Understanding the theory (why single trees overfit, why bagging reduces variance, why boosting reduces bias) gives you the mental model for choosing and tuning these methods correctly.

Mental Model

A decision tree partitions feature space into axis-aligned rectangles, then predicts a constant in each rectangle. Training is greedy: at each node, pick the split that most reduces impurity. A single deep tree memorizes noise (high variance, low bias). Ensembles fix this by averaging many trees (bagging) or sequentially correcting errors (boosting).

Splitting Criteria

Definition

Gini Impurity

For a classification node with $K$ classes and class proportions $p_1, \ldots, p_K$, the Gini impurity is:

$$G(S) = \sum_{k=1}^{K} p_k(1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

Gini impurity is zero when all samples belong to one class (pure node) and maximized when classes are uniformly distributed.
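The definition translates directly into a few lines of NumPy; a minimal sketch (function name and examples are illustrative, not from the original):

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum(p_k^2) from an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a balanced binary node has impurity 0.5.
print(gini(np.array([0, 0, 0, 0])))  # 0.0
print(gini(np.array([0, 0, 1, 1])))  # 0.5
```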

Definition

Information Gain (Entropy)

The entropy of a node is:

$$H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

A split on feature $j$ at threshold $t$ divides $S$ into $S_L$ and $S_R$. The information gain is:

$$\text{IG}(S, j, t) = H(S) - \frac{|S_L|}{|S|} H(S_L) - \frac{|S_R|}{|S|} H(S_R)$$

The greedy algorithm picks the $(j, t)$ maximizing information gain.
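The weighted-entropy formula can be checked on a toy split; a sketch with hypothetical helper names (in a perfectly separating split, both children are pure, so the gain equals the parent entropy):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy -sum(p_k log2 p_k) from an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y, t):
    """Information gain of splitting feature values x at threshold t against labels y."""
    left, right = y[x <= t], y[x > t]
    w_l, w_r = len(left) / len(y), len(right) / len(y)
    return entropy(y) - w_l * entropy(left) - w_r * entropy(right)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(information_gain(x, y, 2.0))  # 1.0 — the split at t=2 separates the classes perfectly
```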

Definition

Variance Reduction (Regression)

For regression trees, the impurity of a node is the variance of the targets:

$$V(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2$$

A split is chosen to maximize the reduction in weighted variance across child nodes. The prediction at each leaf is the mean of the targets in that leaf.
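The same weighted-reduction pattern applies; a small illustrative check (variable names assumed): a split that sends the low targets left and the high targets right removes all of the parent variance.

```python
import numpy as np

def variance_reduction(x, y, t):
    """Weighted variance reduction for splitting feature x at threshold t (regression)."""
    left, right = y[x <= t], y[x > t]
    w_l, w_r = len(left) / len(y), len(right) / len(y)
    return np.var(y) - w_l * np.var(left) - w_r * np.var(right)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 5.0, 5.0])
# Parent variance is 4 (mean 3, deviations of ±2); both children have variance 0.
print(variance_reduction(x, y, 2.0))  # 4.0
```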

Greedy Recursive Partitioning

The tree-building algorithm is inherently greedy:

  1. At the current node, evaluate all features $j$ and all thresholds $t$
  2. Pick the $(j, t)$ that maximally reduces impurity
  3. Split the data into left ($x_j \leq t$) and right ($x_j > t$)
  4. Recurse on each child until a stopping condition is met

Stopping conditions include maximum depth, minimum samples per leaf, or minimum impurity decrease. Without stopping, a tree can achieve zero training error by creating one leaf per sample: pure memorization.
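The four steps above can be sketched as a toy recursive builder, using the Gini criterion and exhaustive threshold search (all names are illustrative; real implementations sort each feature once and update split statistics incrementally rather than recomputing impurity from scratch):

```python
import numpy as np

def gini(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Greedy recursive partitioning with max-depth / min-samples / purity stopping."""
    if depth == max_depth or len(y) < min_samples or gini(y) == 0.0:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}            # majority-class leaf
    best = None
    for j in range(X.shape[1]):                               # 1. evaluate all features...
        for t in np.unique(X[:, j])[:-1]:                     # ...and all thresholds
            mask = X[:, j] <= t
            w = mask.mean()
            score = w * gini(y[mask]) + (1 - w) * gini(y[~mask])
            if best is None or score < best[0]:               # 2. pick the best (j, t)
                best = (score, j, t, mask)
    if best is None:                                          # no valid split exists
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    _, j, t, mask = best
    return {"feature": j, "threshold": t,                     # 3./4. split and recurse
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y)
print(tree)  # splits at x <= 2.0; both children are pure leaves
```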

Pruning

Pre-pruning (early stopping) limits tree growth during construction. Post-pruning (cost-complexity pruning) grows a full tree, then removes subtrees that do not sufficiently reduce a penalized loss:

$$R_\alpha(T) = \hat{R}(T) + \alpha |T|$$

where $|T|$ is the number of leaves and $\alpha$ controls the penalty. The optimal $\alpha$ is chosen by cross-validation.
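Cost-complexity pruning is available directly in scikit-learn; a sketch of choosing $\alpha$ by cross-validation, assuming scikit-learn is installed (dataset and `random_state` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas along the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha with the best cross-validated accuracy.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(f"best alpha = {best_alpha:.5f}")
```

Each `ccp_alpha` value corresponds to collapsing one more subtree of the full tree; $\alpha = 0$ keeps the unpruned tree, and larger values trade training accuracy for fewer leaves.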

Random Forests

Definition

Random Forest

A random forest builds $B$ trees, each trained on a bootstrap sample of size $n$ drawn with replacement from the training data. At each split, only a random subset of $m$ features (typically $m \approx \sqrt{d}$) is considered. The final prediction averages the individual tree predictions (regression) or takes a majority vote (classification).

The two sources of randomness (bootstrap sampling, i.e. bagging, and feature subsampling) decorrelate the trees, which is essential for variance reduction.
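Both knobs appear directly in scikit-learn's implementation; a sketch on synthetic data, assuming scikit-learn is available (sample sizes and `random_state` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# B = 200 trees, each fit on a bootstrap sample, m = sqrt(d) features per split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X_tr, y_tr)
print(f"test accuracy: {rf.score(X_te, y_te):.3f}")
```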

Proposition

Variance Reduction via Averaging

Statement

If $B$ estimators each have variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b\right) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2$$

Intuition

As $B \to \infty$, the second term vanishes but the first remains. The irreducible term $\rho\sigma^2$ is why decorrelation matters: reducing $\rho$ (via feature subsampling) directly reduces the ensemble variance. With perfectly correlated trees ($\rho = 1$), averaging gives no benefit.

Proof Sketch

Expand $\text{Var}\left(\frac{1}{B}\sum_b f_b\right) = \frac{1}{B^2}\left[\sum_b \text{Var}(f_b) + \sum_{b \neq b'} \text{Cov}(f_b, f_{b'})\right]$. There are $B$ variance terms and $B(B-1)$ covariance terms. Substituting $\text{Var}(f_b) = \sigma^2$ and $\text{Cov}(f_b, f_{b'}) = \rho\sigma^2$, the result follows by algebra.

Why It Matters

This formula explains the entire design of random forests. Bagging (bootstrap) creates diverse trees. Feature subsampling reduces $\rho$. Together they make averaging effective. It also explains why random forests are hard to overfit by adding more trees: more trees only shrink the second term.

Failure Mode

The formula assumes identical variances and pairwise correlations across trees. In practice, tree quality varies. Also, bootstrap samples overlap (each contains ~63% of the unique data points), so trees are not truly independent: the effective $\rho$ is not zero.
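The variance formula can be verified by simulation; a sketch with illustrative parameters, constructing estimators $f_b = \sqrt{\rho}\,z + \sqrt{1-\rho}\,\varepsilon_b$ so that each has variance $\sigma^2 = 1$ and pairwise correlation $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, rho = 50, 1.0, 0.3

# Shared component z induces correlation rho; independent noise eps_b supplies the rest.
n_trials = 200_000
z = rng.standard_normal(n_trials)
eps = rng.standard_normal((B, n_trials))
f = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps   # shape (B, n_trials)
avg = f.mean(axis=0)                            # ensemble average per trial

empirical = avg.var()
predicted = rho * sigma2 + (1 - rho) / B * sigma2
print(f"empirical {empirical:.4f} vs predicted {predicted:.4f}")  # both close to 0.314
```

Note that the predicted value $0.3 + 0.7/50 = 0.314$ is dominated by the irreducible $\rho\sigma^2$ term; doubling $B$ barely helps once $\rho$ is nonzero.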

Gradient Boosting

Definition

Gradient Boosting

Gradient boosting builds an ensemble sequentially. At round $m$, it fits a new tree $h_m$ to the negative gradient of the loss evaluated at the current ensemble prediction:

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$

where $h_m$ is fitted to the pseudo-residuals $r_i^{(m)} = -\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}$, and $\eta$ is the learning rate (shrinkage).

For squared loss, pseudo-residuals are just the ordinary residuals $y_i - F_{m-1}(x_i)$.

Gradient boosting is functional gradient descent: each tree is a step in function space that moves the ensemble toward lower loss. Small learning rates ($\eta \ll 1$) with many rounds typically outperform large rates with few rounds, at the cost of more computation.
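The squared-loss case fits into a few lines; a toy sketch using shallow scikit-learn trees as weak learners (data, depth, $\eta$, and round count are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

eta, n_rounds = 0.1, 100
F = np.full_like(y, y.mean())              # F_0: best constant prediction
trees = []
for m in range(n_rounds):
    residuals = y - F                       # pseudo-residuals for squared loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + eta * h.predict(X)              # F_m = F_{m-1} + eta * h_m
    trees.append(h)

print(f"training MSE after {n_rounds} rounds: {np.mean((y - F) ** 2):.4f}")
```

Prediction on new points sums the same series: start from `y.mean()` and add `eta * h.predict(X_new)` over all stored trees.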

Bagging vs. Boosting

Bagging and boosting attack the bias-variance tradeoff from opposite ends:

  • Bagging (random forests): takes high-variance low-bias learners (deep trees) and reduces variance by averaging. Bias stays roughly the same.
  • Boosting: takes high-bias low-variance learners (shallow trees, often depth 1-6) and reduces bias by sequential correction. Each round slightly increases variance but substantially reduces bias.

This is why random forests use deep trees (no pruning) while gradient boosting uses shallow trees (depth 3-8 typically).

Canonical Examples

Example

Gini vs. entropy on a binary split

Consider a node with 300 samples from class 0 and 100 from class 1 ($p_1 = 0.25$). Gini impurity is $2(0.75)(0.25) = 0.375$. Entropy is $-0.75\log_2(0.75) - 0.25\log_2(0.25) \approx 0.811$. After a split producing a pure left child (200 class 0) and a mixed right child (100 class 0, 100 class 1), both criteria prefer the same split; in practice, Gini and entropy almost always agree on the best split.
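The two numbers in this example are easy to reproduce:

```python
import numpy as np

p = np.array([0.75, 0.25])          # class proportions: 300 vs 100 samples
g = 1 - np.sum(p ** 2)              # Gini: 1 - (0.75^2 + 0.25^2)
h = -np.sum(p * np.log2(p))         # entropy in bits
print(f"Gini = {g:.3f}, entropy = {h:.3f}")  # Gini = 0.375, entropy = 0.811
```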

Example

Why a single deep tree overfits

On a dataset of 1000 points with 20 features, an unpruned tree can create up to 1000 leaves, fitting every training point exactly. The training error is zero but test error is high: the tree has memorized the noise. Random forests fix this: 500 such trees, each trained on a bootstrap sample with $m = \lfloor\sqrt{20}\rfloor = 4$ features per split, yield smooth averaged predictions with much lower test error.
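This contrast is easy to reproduce on synthetic data; a sketch assuming scikit-learn, with label noise (`flip_y`) standing in for the memorized noise (all parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)    # unpruned
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X_tr, y_tr)

# The single tree reaches 100% training accuracy but generalizes worse.
print(f"tree:   train {single.score(X_tr, y_tr):.3f}, test {single.score(X_te, y_te):.3f}")
print(f"forest: train {forest.score(X_tr, y_tr):.3f}, test {forest.score(X_te, y_te):.3f}")
```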

Common Confusions

Watch Out

More trees never hurt, but they cannot fix bias

The variance formula shows that adding trees never hurts: the $\frac{1-\rho}{B}$ term only decreases. But this holds only for the statistical variance; in practice, computational cost grows linearly with $B$, and returns diminish rapidly. Also, bagging does not reduce bias, so if individual trees are biased (e.g., very shallow), more trees will not help.

Watch Out

Boosting is not bagging done sequentially

Bagging builds trees independently in parallel; boosting builds them sequentially with each tree correcting the previous ensemble. They have structurally different theoretical justifications: bagging is variance reduction via averaging, boosting is bias reduction via functional gradient descent.

Summary

  • Single decision trees are greedy partitioners: they split to maximize impurity reduction (Gini, entropy, or variance)
  • Unpruned trees have low bias but high variance: they memorize noise
  • Random forests decorrelate trees via bootstrap + feature subsampling, then average to reduce variance
  • Ensemble variance $= \rho\sigma^2 + (1-\rho)\sigma^2/B$; decorrelation (small $\rho$) is the key ingredient
  • Gradient boosting sequentially fits trees to pseudo-residuals; it is functional gradient descent in function space
  • Bagging reduces variance; boosting reduces bias

Exercises

ExerciseCore

Problem

Compute the Gini impurity of a node with 60% class A and 40% class B. Then compute the entropy.

ExerciseAdvanced

Problem

Derive the variance of the average of $B$ estimators with pairwise correlation $\rho$ and individual variance $\sigma^2$. Show that as $B \to \infty$, the variance approaches $\rho\sigma^2$.

ExerciseResearch

Problem

Gradient boosting with squared loss fits trees to residuals $y_i - F_{m-1}(x_i)$. What are the pseudo-residuals for logistic loss $L(y, F) = \log(1 + e^{-yF})$ where $y \in \{-1, +1\}$? How does this relate to Newton boosting (XGBoost)?

References

Canonical:

  • Breiman, "Random Forests" (2001). The original paper
  • Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapters 9-10

Current:

  • Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (2016)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28

Next Topics

The natural next steps from trees and ensembles:

  • Neural network fundamentals: a different approach to function approximation
  • Model selection: how to choose between forests, boosting, and other methods

Last reviewed: April 2026
