
ML Methods

Decision Trees and Ensembles

Greedy recursive partitioning with splitting criteria, pruning, and why combining weak learners via bagging (random forests) and boosting (gradient boosting) yields strong predictors.


Why This Matters

Decision trees are the building block for the most practically successful ML methods. Random forests and gradient-boosted trees (XGBoost, LightGBM) dominate tabular data competitions and production systems. Understanding the theory (why single trees overfit, why bagging reduces variance, why boosting reduces bias) gives you the mental model for choosing and tuning these methods correctly.

Mental Model

A decision tree partitions feature space into axis-aligned rectangles, then predicts a constant in each rectangle. Training is greedy: at each node, pick the split that most reduces impurity. A single deep tree memorizes noise (high variance, low bias). Ensembles fix this by averaging many trees (bagging) or sequentially correcting errors (boosting).

Splitting Criteria

Definition

Gini Impurity

For a classification node with $K$ classes and class proportions $p_1, \ldots, p_K$, the Gini impurity is:

$$G(S) = \sum_{k=1}^{K} p_k(1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

Gini impurity is zero when all samples belong to one class (pure node) and maximized when classes are uniformly distributed.
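The definition translates directly into a few lines of NumPy; a minimal sketch (function name and examples are illustrative, not from the original):

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum(p_k^2) from an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a balanced binary node has impurity 0.5.
print(gini(np.array([0, 0, 0, 0])))  # 0.0
print(gini(np.array([0, 0, 1, 1])))  # 0.5
```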

Definition

Information Gain (Entropy)

The entropy of a node is:

$$H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

A split on feature $j$ at threshold $t$ divides $S$ into $S_L$ and $S_R$. The information gain is:

$$\text{IG}(S, j, t) = H(S) - \frac{|S_L|}{|S|} H(S_L) - \frac{|S_R|}{|S|} H(S_R)$$

The greedy algorithm picks the $(j, t)$ maximizing information gain.
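The weighted-entropy formula can be checked on a toy split; a sketch with hypothetical helper names (in a perfectly separating split, both children are pure, so the gain equals the parent entropy):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy -sum(p_k log2 p_k) from an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y, t):
    """Information gain of splitting feature values x at threshold t against labels y."""
    left, right = y[x <= t], y[x > t]
    w_l, w_r = len(left) / len(y), len(right) / len(y)
    return entropy(y) - w_l * entropy(left) - w_r * entropy(right)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(information_gain(x, y, 2.0))  # 1.0 — the split at t=2 separates the classes perfectly
```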

Definition

Variance Reduction (Regression)

For regression trees, the impurity of a node is the variance of the targets:

$$V(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2$$

A split is chosen to maximize the reduction in weighted variance across child nodes. The prediction at each leaf is the mean of the targets in that leaf.
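The same weighted-reduction pattern applies; a small illustrative check (variable names assumed): a split that sends the low targets left and the high targets right removes all of the parent variance.

```python
import numpy as np

def variance_reduction(x, y, t):
    """Weighted variance reduction for splitting feature x at threshold t (regression)."""
    left, right = y[x <= t], y[x > t]
    w_l, w_r = len(left) / len(y), len(right) / len(y)
    return np.var(y) - w_l * np.var(left) - w_r * np.var(right)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 5.0, 5.0])
# Parent variance is 4 (mean 3, deviations of ±2); both children have variance 0.
print(variance_reduction(x, y, 2.0))  # 4.0
```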

Greedy Recursive Partitioning

The tree-building algorithm is inherently greedy:

  1. At the current node, evaluate all features $j$ and all thresholds $t$
  2. Pick the $(j, t)$ that maximally reduces impurity
  3. Split the data into left ($x_j \leq t$) and right ($x_j > t$)
  4. Recurse on each child until a stopping condition is met

Stopping conditions include maximum depth, minimum samples per leaf, or minimum impurity decrease. Without stopping, a tree can achieve zero training error by creating one leaf per sample: pure memorization.
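The four steps above can be sketched as a toy recursive builder, using the Gini criterion and exhaustive threshold search (all names are illustrative; real implementations sort each feature once and update split statistics incrementally rather than recomputing impurity from scratch):

```python
import numpy as np

def gini(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Greedy recursive partitioning with max-depth / min-samples / purity stopping."""
    if depth == max_depth or len(y) < min_samples or gini(y) == 0.0:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}            # majority-class leaf
    best = None
    for j in range(X.shape[1]):                               # 1. evaluate all features...
        for t in np.unique(X[:, j])[:-1]:                     # ...and all thresholds
            mask = X[:, j] <= t
            w = mask.mean()
            score = w * gini(y[mask]) + (1 - w) * gini(y[~mask])
            if best is None or score < best[0]:               # 2. pick the best (j, t)
                best = (score, j, t, mask)
    if best is None:                                          # no valid split exists
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    _, j, t, mask = best
    return {"feature": j, "threshold": t,                     # 3./4. split and recurse
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y)
print(tree)  # splits at x <= 2.0; both children are pure leaves
```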

Pruning

Pre-pruning (early stopping) limits tree growth during construction. Post-pruning (cost-complexity pruning) grows a full tree, then removes subtrees that do not sufficiently reduce a penalized loss:

$$R_\alpha(T) = \hat{R}(T) + \alpha |T|$$

where $|T|$ is the number of leaves and $\alpha$ controls the penalty. The optimal $\alpha$ is chosen by cross-validation.
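Cost-complexity pruning is available directly in scikit-learn; a sketch of choosing $\alpha$ by cross-validation, assuming scikit-learn is installed (dataset and `random_state` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas along the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha with the best cross-validated accuracy.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(f"best alpha = {best_alpha:.5f}")
```

Each `ccp_alpha` value corresponds to collapsing one more subtree of the full tree; $\alpha = 0$ keeps the unpruned tree, and larger values trade training accuracy for fewer leaves.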

Random Forests

Definition

Random Forest

A random forest builds $B$ trees, each trained on a bootstrap sample of size $n$ drawn with replacement from the training data. At each split, only a random subset of $m$ features (typically $m \approx \sqrt{d}$) is considered. The final prediction averages the individual tree predictions (regression) or takes a majority vote (classification).

The two sources of randomness (bootstrap sampling, i.e. bagging, and feature subsampling) decorrelate the trees, which is essential for variance reduction.
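Both knobs appear directly in scikit-learn's implementation; a sketch on synthetic data, assuming scikit-learn is available (sample sizes and `random_state` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# B = 200 trees, each fit on a bootstrap sample, m = sqrt(d) features per split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X_tr, y_tr)
print(f"test accuracy: {rf.score(X_te, y_te):.3f}")
```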

Proposition

Variance Reduction via Averaging

Statement

If $B$ estimators each have variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b\right) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2$$

Intuition

As $B \to \infty$, the second term vanishes but the first remains. The irreducible term $\rho\sigma^2$ is why decorrelation matters: reducing $\rho$ (via feature subsampling) directly reduces the ensemble variance. With perfectly correlated trees ($\rho = 1$), averaging gives no benefit.

Proof Sketch

Expand $\text{Var}\left(\frac{1}{B}\sum_b f_b\right) = \frac{1}{B^2}\left[\sum_b \text{Var}(f_b) + \sum_{b \neq b'} \text{Cov}(f_b, f_{b'})\right]$. There are $B$ variance terms and $B(B-1)$ covariance terms. Substituting $\text{Var}(f_b) = \sigma^2$ and $\text{Cov}(f_b, f_{b'}) = \rho\sigma^2$, the result follows by algebra.

Why It Matters

This formula explains the entire design of random forests. Bagging (bootstrap) creates diverse trees. Feature subsampling reduces $\rho$. Together they make averaging effective. It also explains why random forests are hard to overfit by adding more trees: more trees only shrink the second term.

Failure Mode

The formula assumes identical variances and pairwise correlations across trees. In practice, tree quality varies. Also, bootstrap samples overlap (each contains ~63% of the unique data points), so trees are not truly independent: the effective $\rho$ is not zero.
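The variance formula can be verified by simulation; a sketch with illustrative parameters, constructing estimators $f_b = \sqrt{\rho}\,z + \sqrt{1-\rho}\,\varepsilon_b$ so that each has variance $\sigma^2 = 1$ and pairwise correlation $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, rho = 50, 1.0, 0.3

# Shared component z induces correlation rho; independent noise eps_b supplies the rest.
n_trials = 200_000
z = rng.standard_normal(n_trials)
eps = rng.standard_normal((B, n_trials))
f = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps   # shape (B, n_trials)
avg = f.mean(axis=0)                            # ensemble average per trial

empirical = avg.var()
predicted = rho * sigma2 + (1 - rho) / B * sigma2
print(f"empirical {empirical:.4f} vs predicted {predicted:.4f}")  # both close to 0.314
```

Note that the predicted value $0.3 + 0.7/50 = 0.314$ is dominated by the irreducible $\rho\sigma^2$ term; doubling $B$ barely helps once $\rho$ is nonzero.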

Gradient Boosting

Definition

Gradient Boosting

Gradient boosting builds an ensemble sequentially. At round $m$, it fits a new tree $h_m$ to the negative gradient of the loss evaluated at the current ensemble prediction:

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$

where $h_m$ is fitted to the pseudo-residuals $r_i^{(m)} = -\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}$, and $\eta$ is the learning rate (shrinkage).

For squared loss, pseudo-residuals are just the ordinary residuals $y_i - F_{m-1}(x_i)$.

Gradient boosting is functional gradient descent: each tree is a step in function space that moves the ensemble toward lower loss. Small learning rates ($\eta \ll 1$) with many rounds typically outperform large rates with few rounds, at the cost of more computation.
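The squared-loss case fits into a few lines; a toy sketch using shallow scikit-learn trees as weak learners (data, depth, $\eta$, and round count are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

eta, n_rounds = 0.1, 100
F = np.full_like(y, y.mean())              # F_0: best constant prediction
trees = []
for m in range(n_rounds):
    residuals = y - F                       # pseudo-residuals for squared loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + eta * h.predict(X)              # F_m = F_{m-1} + eta * h_m
    trees.append(h)

print(f"training MSE after {n_rounds} rounds: {np.mean((y - F) ** 2):.4f}")
```

Prediction on new points sums the same series: start from `y.mean()` and add `eta * h.predict(X_new)` over all stored trees.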

Bagging vs. Boosting

Bagging and boosting attack the bias-variance tradeoff from opposite ends:

  • Bagging (random forests): takes high-variance low-bias learners (deep trees) and reduces variance by averaging. Bias stays roughly the same.
  • Boosting: takes high-bias low-variance learners (shallow trees, often depth 1-6) and reduces bias by sequential correction. Each round slightly increases variance but substantially reduces bias.

This is why random forests use deep trees (no pruning) while gradient boosting uses shallow trees (depth 3-8 typically).

Canonical Examples

Example

Gini vs. entropy on a binary split

Consider a node with 300 samples from class 0 and 100 from class 1 ($p_1 = 0.25$). Gini impurity is $2(0.75)(0.25) = 0.375$. Entropy is $-0.75\log_2(0.75) - 0.25\log_2(0.25) \approx 0.811$. After a split producing a pure left child (200 class 0) and a mixed right child (100 class 0, 100 class 1), both criteria prefer the same split; in practice, Gini and entropy almost always agree on the best split.
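The two numbers in this example are easy to reproduce:

```python
import numpy as np

p = np.array([0.75, 0.25])          # class proportions: 300 vs 100 samples
g = 1 - np.sum(p ** 2)              # Gini: 1 - (0.75^2 + 0.25^2)
h = -np.sum(p * np.log2(p))         # entropy in bits
print(f"Gini = {g:.3f}, entropy = {h:.3f}")  # Gini = 0.375, entropy = 0.811
```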

Example

Why a single deep tree overfits

On a dataset of 1000 points with 20 features, an unpruned tree can create up to 1000 leaves, fitting every training point exactly. The training error is zero but test error is high: the tree has memorized the noise. Random forests fix this: 500 such trees, each trained on a bootstrap sample with $m = \lfloor\sqrt{20}\rfloor = 4$ features per split, yield smooth averaged predictions with much lower test error.
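This contrast is easy to reproduce on synthetic data; a sketch assuming scikit-learn, with label noise (`flip_y`) standing in for the memorized noise (all parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)    # unpruned
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X_tr, y_tr)

# The single tree reaches 100% training accuracy but generalizes worse.
print(f"tree:   train {single.score(X_tr, y_tr):.3f}, test {single.score(X_te, y_te):.3f}")
print(f"forest: train {forest.score(X_tr, y_tr):.3f}, test {forest.score(X_te, y_te):.3f}")
```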

Common Confusions

Watch Out

More trees never hurt, but they cannot fix bias

The variance formula shows that adding trees never hurts: the $\frac{1-\rho}{B}$ term only decreases. But this holds only for the statistical variance; in practice, computational cost grows linearly with $B$, and returns diminish rapidly. Also, bagging does not reduce bias, so if individual trees are biased (e.g., very shallow), more trees will not help.

Watch Out

Boosting is not bagging done sequentially

Bagging builds trees independently in parallel; boosting builds them sequentially with each tree correcting the previous ensemble. They have structurally different theoretical justifications: bagging is variance reduction via averaging, boosting is bias reduction via functional gradient descent.

Summary

  • Single decision trees are greedy partitioners: they split to maximize impurity reduction (Gini, entropy, or variance)
  • Unpruned trees have low bias but high variance: they memorize noise
  • Random forests decorrelate trees via bootstrap + feature subsampling, then average to reduce variance
  • Ensemble variance $= \rho\sigma^2 + (1-\rho)\sigma^2/B$; decorrelation (small $\rho$) is the key ingredient
  • Gradient boosting sequentially fits trees to pseudo-residuals; it is functional gradient descent in function space
  • Bagging reduces variance; boosting reduces bias

Exercises

ExerciseCore

Problem

Compute the Gini impurity of a node with 60% class A and 40% class B. Then compute the entropy.

ExerciseAdvanced

Problem

Derive the variance of the average of $B$ estimators with pairwise correlation $\rho$ and individual variance $\sigma^2$. Show that as $B \to \infty$, the variance approaches $\rho\sigma^2$.

ExerciseResearch

Problem

Gradient boosting with squared loss fits trees to residuals $y_i - F_{m-1}(x_i)$. What are the pseudo-residuals for logistic loss $L(y, F) = \log(1 + e^{-yF})$ where $y \in \{-1, +1\}$? How does this relate to Newton boosting (XGBoost)?

References

Canonical:

  • Breiman, "Random Forests" (2001). The original paper
  • Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapters 9-10

Current:

  • Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (2016)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
  • Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28

Next Topics

The natural next steps from trees and ensembles:

  • Neural network fundamentals: a different approach to function approximation
  • Model selection: how to choose between forests, boosting, and other methods

Last reviewed: April 2026
