ML Methods
Decision Trees and Ensembles
Greedy recursive partitioning with splitting criteria, pruning, and why combining weak learners via bagging (random forests) and boosting (gradient boosting) yields strong predictors.
Why This Matters
Decision trees are the building block for some of the most practically successful ML methods. Random forests and gradient-boosted trees (XGBoost, LightGBM) dominate tabular data competitions and production systems. Understanding the theory (why single trees overfit, why bagging reduces variance, why boosting reduces bias) gives you the mental model for choosing and tuning these methods correctly.
Mental Model
A decision tree partitions feature space into axis-aligned rectangles, then predicts a constant in each rectangle. Training is greedy: at each node, pick the split that most reduces impurity. A single deep tree memorizes noise (high variance, low bias). Ensembles fix this by averaging many trees (bagging) or sequentially correcting errors (boosting).
Splitting Criteria
Gini Impurity
For a classification node with $K$ classes and class proportions $p_1, \dots, p_K$, the Gini impurity is:

$$G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
Gini impurity is zero when all samples belong to one class (pure node) and maximized when classes are uniformly distributed.
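As a quick numerical check of both endpoints, Gini impurity is a few lines of NumPy (a sketch; the function name is ours, not a library API):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pure = gini([0, 0, 0, 0])     # one class only -> impurity 0.0
uniform = gini([0, 0, 1, 1])  # uniform two-class node -> 0.5, the maximum for K=2
```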
Information Gain (Entropy)
The entropy of a node is:

$$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$

A split on feature $j$ at threshold $t$ divides a node $S$ into $S_L = \{i \in S : x_{ij} \le t\}$ and $S_R = \{i \in S : x_{ij} > t\}$. The information gain is:

$$IG(S, j, t) = H(S) - \frac{|S_L|}{|S|} H(S_L) - \frac{|S_R|}{|S|} H(S_R)$$

The greedy algorithm picks the pair $(j, t)$ maximizing information gain.
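The greedy threshold search over one feature can be sketched directly from the definitions (illustrative helper names, not a library API):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y, t):
    """Entropy drop from splitting labels y on a single feature x at threshold t."""
    left, right = y[x <= t], y[x > t]
    n = len(y)
    return entropy(y) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

def best_threshold(x, y):
    """Greedy step: scan candidate thresholds (midpoints of sorted unique values)."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2
    gains = [information_gain(x, y, t) for t in candidates]
    return candidates[int(np.argmax(gains))], max(gains)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
t, gain = best_threshold(x, y)  # perfect split at 2.5, gain of 1 bit
```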
Variance Reduction (Regression)
For regression trees, the impurity of a node $S$ is the variance of the targets:

$$\mathrm{Var}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2$$
A split is chosen to maximize the reduction in weighted variance across child nodes. The prediction at each leaf is the mean of the targets in that leaf.
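The weighted-variance criterion for a single candidate split is a one-liner in NumPy (a sketch; the helper name is ours):

```python
import numpy as np

def variance_reduction(x, y, t):
    """Drop in weighted target variance from splitting at threshold t of feature x."""
    left, right = y[x <= t], y[x > t]
    n = len(y)
    weighted = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
    return np.var(y) - weighted

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 5.0, 5.0])
drop = variance_reduction(x, y, 2.5)  # total variance 4.0 falls to 0.0 in each child
```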
Greedy Recursive Partitioning
The tree-building algorithm is inherently greedy:
- At the current node, evaluate all features $j$ and all thresholds $t$
- Pick the pair $(j, t)$ that maximally reduces impurity
- Split the data into left ($x_j \le t$) and right ($x_j > t$)
- Recurse on each child until a stopping condition is met
Stopping conditions include maximum depth, minimum samples per leaf, or minimum impurity decrease. Without stopping, a tree can achieve zero training error by creating one leaf per sample: pure memorization.
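The whole greedy recursion fits in a short sketch, assuming a Gini criterion and nested-dict nodes (illustrative, not an optimized implementation):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Greedy recursive partitioning; returns a tree of nested dicts."""
    # Stopping conditions: depth limit, too few samples, or a pure node.
    if depth >= max_depth or len(y) < min_samples or gini(y) == 0.0:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}          # majority class
    best = None
    for j in range(X.shape[1]):                             # all features
        for t in np.unique(X[:, j])[:-1]:                   # all thresholds
            mask = X[:, j] <= t
            w = mask.mean()
            # Impurity decrease of this candidate split.
            score = gini(y) - w * gini(y[mask]) - (1 - w) * gini(y[~mask])
            if best is None or score > best[0]:
                best = (score, j, t, mask)
    if best is None:                                        # no valid split exists
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    _, j, t, mask = best
    return {"feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

def predict_one(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# XOR: no single split reduces impurity, but the recursion still solves it at depth 2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
tree = build_tree(X, y)
preds = [predict_one(tree, x) for x in X]
```

The XOR example shows a subtlety of greediness: the first split looks useless on its own (both children stay 50/50), yet recursing on the children recovers the pattern.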
Pruning
Pre-pruning (early stopping) limits tree growth during construction. Post-pruning (cost-complexity pruning) grows a full tree, then removes subtrees that do not sufficiently reduce a penalized loss:

$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$

where $|T|$ is the number of leaves and $\alpha$ controls the penalty. The optimal $\alpha$ is chosen by cross-validation.
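The penalized loss itself is simple arithmetic; the sketch below compares a full tree against a pruned subtree using hypothetical SSE and leaf counts (the numbers are made up for illustration):

```python
def cost_complexity(sse, n_leaves, alpha):
    """C_alpha(T): training SSE plus alpha times the number of leaves."""
    return sse + alpha * n_leaves

# The full tree fits training data better, but the pruned subtree
# wins once the leaf penalty is counted (hypothetical numbers).
full = cost_complexity(sse=2.0, n_leaves=10, alpha=0.5)    # 2.0 + 5.0 = 7.0
pruned = cost_complexity(sse=4.0, n_leaves=3, alpha=0.5)   # 4.0 + 1.5 = 5.5
```

In scikit-learn this procedure is exposed via the `ccp_alpha` parameter and `cost_complexity_pruning_path` on the tree estimators.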
Random Forests
Random Forest
A random forest builds $B$ trees, each trained on a bootstrap sample of size $n$ drawn with replacement from the training data. At each split, only a random subset of $m$ features (typically $m \approx \sqrt{d}$ of the $d$ available) is considered. The final prediction averages the individual tree predictions (regression) or takes a majority vote (classification).
The two sources of randomness, bootstrap sampling (bagging) and feature subsampling, decorrelate the trees, which is essential for variance reduction.
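Both sources of randomness are easy to demonstrate numerically; the snippet below checks the expected unique-sample fraction of a bootstrap (about $1 - 1/e \approx 0.632$) and the typical $\sqrt{d}$ feature-subset size, using arbitrary values of $n$ and $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20                            # hypothetical dataset size and feature count

# Randomness source 1: bootstrap sampling (n draws with replacement).
sample = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(sample)) / n  # expected ~ 1 - 1/e ~ 0.632

# Randomness source 2: feature subsampling at each split.
m = max(1, int(np.sqrt(d)))                   # common classification default: sqrt(d)
```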
Variance Reduction via Averaging
Statement
If $B$ estimators each have variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:

$$\mathrm{Var}\left(\frac{1}{B} \sum_{b=1}^{B} \hat{f}_b\right) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$$
Intuition
As $B \to \infty$, the second term vanishes but the first remains. The irreducible term $\rho \sigma^2$ is why decorrelation matters: reducing $\rho$ (via feature subsampling) directly reduces the ensemble variance. With perfectly correlated trees ($\rho = 1$), averaging gives no benefit.
Proof Sketch
Expand $\mathrm{Var}\left(\frac{1}{B} \sum_b \hat{f}_b\right) = \frac{1}{B^2} \sum_{b, b'} \mathrm{Cov}(\hat{f}_b, \hat{f}_{b'})$. There are $B$ variance terms and $B(B-1)$ covariance terms. With $\mathrm{Var}(\hat{f}_b) = \sigma^2$ and $\mathrm{Cov}(\hat{f}_b, \hat{f}_{b'}) = \rho \sigma^2$ for $b \ne b'$, the result follows by algebra.
Why It Matters
This formula explains the entire design of random forests. Bagging (bootstrap) creates diverse trees. Feature subsampling reduces $\rho$. Together they make averaging effective. It also explains why random forests are hard to overfit by adding more trees: more trees only shrink the second term.
Failure Mode
The formula assumes identical variances and pairwise correlations. In practice, trees have different qualities. Also, bootstrap samples overlap (~63% unique data), so trees are not truly independent. The effective $\rho$ is not zero.
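A Monte Carlo sanity check of the formula, under its idealized equal-variance, equal-correlation assumptions (simulated Gaussians standing in for trees, with arbitrarily chosen $B$, $\sigma^2$, $\rho$):

```python
import numpy as np

rng = np.random.default_rng(42)
B, sigma2, rho, trials = 50, 1.0, 0.3, 200_000

# Build B estimators with variance sigma2 and pairwise correlation rho:
# a shared Gaussian component plus independent per-estimator noise.
shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(trials, 1))
indep = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(trials, B))
avg = (shared + indep).mean(axis=1)                  # average of the B estimators

empirical = avg.var()
predicted = rho * sigma2 + (1 - rho) / B * sigma2    # 0.3 + 0.7/50 = 0.314
```

Note how close the variance of the average stays to the irreducible $\rho \sigma^2 = 0.3$ even with 50 estimators.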
Gradient Boosting
Gradient Boosting
Gradient boosting builds an ensemble sequentially. At round $m$, it fits a new tree to the negative gradient of the loss evaluated at the current ensemble prediction:

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$

where $h_m$ is fitted to the pseudo-residuals $r_i = -\left[ \partial L(y_i, F(x_i)) / \partial F(x_i) \right]_{F = F_{m-1}}$, and $\eta$ is the learning rate (shrinkage).

For squared loss, the pseudo-residuals are just the ordinary residuals $y_i - F_{m-1}(x_i)$.
Gradient boosting is functional gradient descent: each tree is a step in function space that moves the ensemble toward lower loss. Small learning rates (e.g., $\eta$ around 0.01 to 0.1) with many rounds typically outperform large rates with few rounds, at the cost of more computation.
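A toy sketch of this loop with squared loss and depth-1 stumps (illustrative and unoptimized; function names are ours):

```python
import numpy as np

def boost_stumps(X, y, n_rounds=100, lr=0.1):
    """Squared-loss gradient boosting: each round fits one single-split stump."""
    pred = np.full(len(y), y.mean())          # F_0: the best constant model
    for _ in range(n_rounds):
        r = y - pred                          # pseudo-residuals = ordinary residuals
        best = None
        for j in range(X.shape[1]):           # fit one stump h_m to r by least squares
            for t in np.unique(X[:, j])[:-1]:
                mask = X[:, j] <= t
                fit = np.where(mask, r[mask].mean(), r[~mask].mean())
                sse = np.sum((r - fit) ** 2)
                if best is None or sse < best[0]:
                    best = (sse, fit)
        pred = pred + lr * best[1]            # F_m = F_{m-1} + eta * h_m
    return pred

# Piecewise-constant target: boosting shrinks the residuals geometrically.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.0, 3.0, 3.0])
pred = boost_stumps(X, y)
```

With $\eta = 0.1$, the remaining residual shrinks by a factor of 0.9 per round here, which is why small learning rates need many rounds.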
Bagging vs. Boosting
Bagging and boosting attack the bias-variance tradeoff from opposite ends:
- Bagging (random forests): takes high-variance low-bias learners (deep trees) and reduces variance by averaging. Bias stays roughly the same.
- Boosting: takes high-bias low-variance learners (shallow trees, often depth 1-6) and reduces bias by sequential correction. Each round slightly increases variance but substantially reduces bias.
This is why random forests use deep trees (no pruning) while gradient boosting uses shallow trees (depth 3-8 typically).
Canonical Examples
Gini vs. entropy on a binary split
Consider a node with 300 samples from class 0 and 100 from class 1 ($p_0 = 0.75$, $p_1 = 0.25$). Gini impurity is $1 - (0.75^2 + 0.25^2) = 0.375$. Entropy is $-(0.75 \log_2 0.75 + 0.25 \log_2 0.25) \approx 0.811$ bits. After a split producing a pure left child (200 class 0) and a mixed right child (100 class 0, 100 class 1), both criteria prefer the same split; in practice, Gini and entropy almost always agree on the best split.
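The impurities of the 300/100 node can be checked directly:

```python
import numpy as np

p = np.array([300, 100]) / 400        # class proportions 0.75 and 0.25
gini = 1 - np.sum(p ** 2)             # 0.375
entropy = -np.sum(p * np.log2(p))     # ~0.811 bits
```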
Why a single deep tree overfits
On a dataset of 1000 points with 20 features, an unpruned tree can create up to 1000 leaves, fitting every training point exactly. The training error is zero but test error is high. The tree has memorized the noise. Random forests fix this: 500 such trees, each trained on a bootstrap sample with $\sqrt{20} \approx 4$ features per split, yield smooth averaged predictions with much lower test error.
Common Confusions
Adding more trees does not make a random forest overfit
The variance formula shows that adding trees never hurts: the $\frac{1-\rho}{B} \sigma^2$ term only decreases. But this holds for the statistical variance; in practice, computational cost grows linearly with $B$, and returns diminish rapidly. Also, bagging does not reduce bias, so if individual trees are biased (e.g., very shallow), more trees will not help.
Boosting is not bagging done sequentially
Bagging builds trees independently in parallel; boosting builds them sequentially with each tree correcting the previous ensemble. They have structurally different theoretical justifications: bagging is variance reduction via averaging, boosting is bias reduction via functional gradient descent.
Summary
- Single decision trees are greedy partitioners: they split to maximize impurity reduction (Gini, entropy, or variance)
- Unpruned trees have low bias but high variance. They memorize noise
- Random forests decorrelate trees via bootstrap + feature subsampling, then average to reduce variance
- Ensemble variance is $\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2$; decorrelation (small $\rho$) is the key ingredient
- Gradient boosting sequentially fits trees to pseudo-residuals. It is functional gradient descent in function space
- Bagging reduces variance; boosting reduces bias
Exercises
Problem
Compute the Gini impurity of a node with 60% class A and 40% class B. Then compute the entropy.
Problem
Derive the variance of the average of $B$ estimators with pairwise correlation $\rho$ and individual variance $\sigma^2$. Show that as $B \to \infty$, the variance approaches $\rho \sigma^2$.
Problem
Gradient boosting with squared loss fits trees to residuals $y_i - F_{m-1}(x_i)$. What are the pseudo-residuals for logistic loss $L(y, F) = \log(1 + e^{-yF})$ where $y \in \{-1, +1\}$? How does this relate to Newton boosting (XGBoost)?
References
Canonical:
- Breiman, "Random Forests" (2001). The original paper
- Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001)
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapters 9-10
Current:
- Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (2016)
- Bishop, Pattern Recognition and Machine Learning (2006)
- Murphy, Machine Learning: A Probabilistic Perspective (2012)
Next Topics
The natural next steps from trees and ensembles:
- Neural network fundamentals: a different approach to function approximation
- Model selection: how to choose between forests, boosting, and other methods
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
Builds on This
- AdaBoost (Layer 2)
- Bagging (Layer 2)
- Cubist and Model Trees (Layer 2)
- Gradient Boosting (Layer 2)
- Random Forests (Layer 2)