ML Methods
Random Forests
Random forests combine bagging with random feature subsampling to decorrelate trees, reducing ensemble variance beyond what pure bagging achieves. This section covers out-of-bag estimation, variable importance, consistency theory, and practical strengths and weaknesses.
Why This Matters
Random forests are one of the most reliable ensemble methods in practice. On tabular data, they consistently perform well with minimal tuning. They handle mixed feature types, are robust to outliers, require few hyperparameters, and provide built-in variable importance measures. Understanding why they work, not just that they work, requires understanding how feature subsampling decorrelates decision trees and why this reduces ensemble variance.
Mental Model
A single deep decision tree memorizes noise (high variance, low bias). Bagging reduces variance by averaging many trees trained on bootstrap samples. But bagged trees are correlated: if one feature is strongly predictive, all trees split on it first, producing similar trees. Feature subsampling forces each tree to ignore most features at each split, creating diverse trees that make different errors. The errors cancel when you average.
Random forests = bagging + feature subsampling. Bagging gives you diverse training sets. Feature subsampling gives you diverse trees.
The Random Forest Algorithm
Random Forest
Given training data \((x_1, y_1), \dots, (x_n, y_n)\) with \(x_i \in \mathbb{R}^p\):
- For \(b = 1, \dots, B\):
  - Draw a bootstrap sample \(D_b\) of size \(n\) with replacement
  - Grow a tree \(T_b\) on \(D_b\), where at each node:
    - Select \(m\) features uniformly at random from the \(p\) available features
    - Find the best split among only those \(m\) features
    - Split and recurse until a stopping condition is met
- Predict by averaging (regression) or majority vote (classification):

\[ \hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x) \quad \text{or} \quad \hat{y}(x) = \operatorname{majority}\{T_b(x)\}_{b=1}^{B} \]

Typical choices: \(m = \lfloor \sqrt{p} \rfloor\) for classification, \(m = \lfloor p/3 \rfloor\) for regression.
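The steps above can be sketched in plain NumPy. This is an illustrative toy implementation, not any library's API: a small regression forest with bootstrap sampling and per-node feature subsampling, with all function names and defaults invented for the example.

```python
import numpy as np

# Toy regression forest implementing the recipe above: bootstrap samples
# plus m randomly chosen candidate features at every node. All names and
# defaults are invented for this sketch; it is not a library API.

def grow_tree(X, y, m, depth, min_leaf=3, rng=None):
    """Recursively grow a regression tree, considering only m random
    features at each node."""
    if depth == 0 or len(y) < 2 * min_leaf or np.all(y == y[0]):
        return ("leaf", y.mean())
    feats = rng.choice(X.shape[1], size=m, replace=False)
    best = None                                   # (sse, feature, threshold)
    for j in feats:
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:
        return ("leaf", y.mean())
    _, j, t = best
    left = X[:, j] <= t
    return ("node", j, t,
            grow_tree(X[left], y[left], m, depth - 1, min_leaf, rng),
            grow_tree(X[~left], y[~left], m, depth - 1, min_leaf, rng))

def predict_tree(tree, x):
    while tree[0] == "node":
        _, j, t, l, r = tree
        tree = l if x[j] <= t else r
    return tree[1]

def random_forest(X, y, B, m, depth, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample
        trees.append(grow_tree(X[idx], y[idx], m, depth, rng=rng))
    return trees

def predict_forest(trees, X):
    return np.array([[predict_tree(t, x) for t in trees] for x in X]).mean(axis=1)

# Toy data: y depends mostly on feature 0.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(150, 5))
y = X[:, 0] + 0.05 * rng.normal(size=150)
forest = random_forest(X, y, B=20, m=2, depth=3)
pred = predict_forest(forest, X)
print(float(np.mean((pred - y) ** 2)))   # training MSE, well below var(y)
```

Note how the per-node `rng.choice` is the only difference from plain bagging; removing it recovers a bagged-trees ensemble.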
Why Feature Subsampling Works
Random Forest Variance Reduction
Statement
The variance of the random forest prediction is:

\[ \operatorname{Var}(\hat{f}(x)) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2 \]

where \(\rho\) is the average pairwise correlation between individual tree predictions and \(\sigma^2\) is the average variance of a single tree.
Feature subsampling reduces \(\rho\) compared to pure bagging. As \(B \to \infty\), the second term vanishes and the ensemble variance converges to \(\rho \sigma^2\).
Intuition
Pure bagging produces trees that all split on the same strong features, making \(\rho\) large. Feature subsampling forces each tree to explore different features, decorrelating their predictions. This reduces \(\rho\) at the cost of slightly increasing the individual tree variance \(\sigma^2\) (since each tree uses a suboptimal subset of features). But the net effect is positive: the reduction in \(\rho\) more than compensates for the increase in \(\sigma^2\).
When \(m = 1\) (extreme subsampling), trees are nearly uncorrelated but each tree is noisy. When \(m = p\) (no subsampling), you get pure bagging with higher correlation. The optimal \(m\) balances decorrelation against individual tree quality.
Proof Sketch
Let \(T_1, \dots, T_B\) be the tree predictions. Each \(T_b\) has variance \(\sigma^2\) and \(\operatorname{Corr}(T_b, T_{b'}) = \rho\) for \(b \neq b'\). Then

\[ \operatorname{Var}\!\left( \frac{1}{B} \sum_{b=1}^{B} T_b \right) = \frac{1}{B^2} \left( B \sigma^2 + B(B-1) \rho \sigma^2 \right). \]

Rearranging: \(\operatorname{Var} = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2\).
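The formula is easy to verify numerically: simulate \(B\) equicorrelated "tree predictions" (a shared component plus independent noise, an assumption made purely for this simulation) and compare the empirical variance of their average to \(\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\).

```python
import numpy as np

# Simulate B equicorrelated "tree predictions" (shared signal + independent
# noise -- an assumption made only for this simulation) and check the
# ensemble-variance formula rho*sigma^2 + (1 - rho)/B * sigma^2.
rng = np.random.default_rng(0)
B, rho, sigma2, trials = 25, 0.3, 1.0, 200_000

shared = np.sqrt(rho * sigma2) * rng.normal(size=(trials, 1))
noise = np.sqrt((1 - rho) * sigma2) * rng.normal(size=(trials, B))
trees = shared + noise                 # column b = prediction of "tree" b
ensemble = trees.mean(axis=1)          # random forest average

empirical = float(ensemble.var())
theoretical = rho * sigma2 + (1 - rho) / B * sigma2
print(empirical, theoretical)          # the two values should be close
```

Increasing `B` here shrinks only the \(\frac{1-\rho}{B}\sigma^2\) term; the floor \(\rho\sigma^2 = 0.3\) remains.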
Why It Matters
This formula is the theoretical foundation of random forests. It explains every design choice: (1) grow deep trees to keep bias low and \(\sigma^2\) manageable, (2) use feature subsampling to reduce \(\rho\), (3) use many trees to shrink the \(\frac{1-\rho}{B}\sigma^2\) term. It also explains why random forests cannot overfit by adding more trees: increasing \(B\) only reduces the second term, never increases variance.
Failure Mode
The formula assumes identical variances and uniform pairwise correlation. In practice, some trees are better than others, and correlations vary. If a single feature dominates even after subsampling (\(m\) is too large), \(\rho\) stays high and the forest gains little from averaging. Also, feature subsampling does not reduce bias. If the relevant features are rarely selected, bias increases.
Out-of-Bag Estimation
Out-of-Bag (OOB) Error
Each bootstrap sample of size \(n\) includes roughly 63.2% of the original training data, since the probability that a given point appears is \(1 - (1 - 1/n)^n \to 1 - e^{-1} \approx 0.632\). The remaining 36.8% is out-of-bag (OOB) for that tree.
For each training point \((x_i, y_i)\), collect predictions only from the trees for which \(i\) was OOB (not in the bootstrap sample). Average these predictions to form the OOB prediction \(\hat{f}_{\text{OOB}}(x_i)\). The OOB error is:

\[ \operatorname{Err}_{\text{OOB}} = \frac{1}{n} \sum_{i=1}^{n} L\!\left( y_i, \hat{f}_{\text{OOB}}(x_i) \right) \]

where \(L\) is the loss function (0-1 loss for classification, squared error for regression).
The OOB error is a nearly unbiased estimate of the test error. It acts like leave-one-out cross-validation for free: each point is evaluated only by trees that never saw it during training. This eliminates the need for a separate validation set, which is especially valuable when data is scarce.
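A quick simulation confirms the 63.2% / 36.8% split and shows the bookkeeping behind OOB predictions (variable names here are illustrative):

```python
import numpy as np

# Each bootstrap draw of size n captures ~63.2% of the distinct points and
# leaves the other ~36.8% out-of-bag. Variable names are illustrative.
rng = np.random.default_rng(0)
n, B = 1000, 300

# Average fraction of distinct points captured by one bootstrap sample.
in_bag_frac = np.mean([len(np.unique(rng.integers(0, n, n))) / n
                       for _ in range(B)])
print(in_bag_frac)                     # near 1 - 1/e ~= 0.632

# OOB bookkeeping: for each point, count the trees for which it is OOB.
oob_counts = np.zeros(n, dtype=int)
for _ in range(B):
    in_bag = np.unique(rng.integers(0, n, n))
    oob = np.setdiff1d(np.arange(n), in_bag)
    oob_counts[oob] += 1
print(oob_counts.mean() / B)           # near 1/e ~= 0.368
```

Each point's OOB prediction would average the roughly \(0.368 B\) trees that skipped it, which is why OOB error is slightly pessimistic relative to the full forest.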
Variable Importance
Permutation Importance
Permutation Importance
To measure the importance of feature \(j\):
- Compute the OOB error of the forest: \(\operatorname{Err}_{\text{OOB}}\)
- For each tree \(b\), permute the values of feature \(j\) in its OOB data
- Compute the OOB error with the permuted feature: \(\operatorname{Err}_{\text{OOB}}^{(j)}\)
- Importance of feature \(j\) is: \(\operatorname{Imp}(j) = \operatorname{Err}_{\text{OOB}}^{(j)} - \operatorname{Err}_{\text{OOB}}\)
If feature \(j\) is important, permuting it destroys its predictive signal and increases error substantially.
Permutation importance measures how much the model relies on each feature. It is model-agnostic (can be applied to any model) and respects feature interactions.
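A minimal model-agnostic sketch, using a stand-in linear "model" instead of a trained forest so the example stays self-contained (all names are invented for the illustration):

```python
import numpy as np

# Model-agnostic permutation importance, sketched with a stand-in linear
# "model" (a fitted forest's predict function would slot in the same way).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)
predict = lambda X_: 2.0 * X_[:, 0] + 0.5 * X_[:, 1]  # ignores feature 2

def permutation_importance(predict, X, y, rng):
    base = np.mean((predict(X) - y) ** 2)       # baseline squared error
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])    # destroy feature j's signal
        scores.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(scores)

imp = permutation_importance(predict, X, y, rng)
print(imp)   # feature 0 largest, feature 1 smaller, feature 2 exactly 0
```

Because the stand-in model never reads feature 2, permuting that column changes nothing and its importance is exactly zero, which is the behavior you want from an irrelevant feature.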
Impurity-Based Importance
Impurity-Based Importance (MDI)
The Mean Decrease in Impurity for feature \(j\) sums the impurity reduction from all splits on feature \(j\) across all trees:

\[ \operatorname{MDI}(j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_b :\, v(t) = j} p(t)\, \Delta i(t) \]

where \(p(t)\) is the proportion of samples reaching node \(t\), \(v(t)\) is the feature split at node \(t\), and \(\Delta i(t)\) is the impurity decrease at that split.
Impurity importance is fast (computed during training) but biased: it favors high-cardinality features and features with many possible split points. A random feature with many unique values will show nonzero impurity importance simply because it can create arbitrary splits. Permutation importance largely avoids this bias, though it has issues of its own when features are strongly correlated.
Consistency Results
Random Forest Consistency
Statement
Under regularity conditions, random forests are consistent: as the sample size \(n \to \infty\), the random forest prediction converges in probability to the true regression function:

\[ \hat{f}_n(x) \xrightarrow{\;P\;} f(x) = \mathbb{E}[Y \mid X = x] \]

This holds when (1) each tree is grown with sufficient depth (increasing with \(n\)) so that bias vanishes, and (2) the subsampling rate ensures each tree sees enough data to control variance.
Intuition
Consistency means the forest gets arbitrarily good with enough data. Deep trees have low bias (they can approximate any function), and averaging many decorrelated trees controls variance. As \(n\) grows, each tree becomes more accurate and the average converges to the truth.
Proof Sketch
The proof by Scornet, Biau, and Vert (2015) for additive models proceeds by: (1) showing that each tree partition refines sufficiently as \(n\) grows, so the bias of each tree vanishes; (2) using the forest averaging to control variance via the decorrelation bound; (3) combining bias and variance convergence. Earlier results by Biau (2012) proved consistency for simplified random forests with purely random splits.
Why It Matters
Consistency is the minimum theoretical requirement for any learning method. It guarantees that with enough data, the method converges to the truth. Not all popular methods are proven consistent; the fact that random forests have consistency guarantees (under conditions) gives them a theoretical foundation beyond their empirical success.
Failure Mode
Consistency is an asymptotic result. It says nothing about the rate of convergence. In high dimensions, random forests can converge slowly because the trees partition a high-dimensional space. The minimax rate for Lipschitz functions in \(d\) dimensions is \(n^{-2/(d+2)}\), which is slow for large \(d\). In practice, random forests often beat this rate because real data has lower effective dimensionality.
Strengths and Weaknesses
Strengths:
- Robust to outliers and noise: averaging many trees smooths out noise
- Handles mixed feature types (numeric, categorical) without preprocessing
- Few hyperparameters: \(m\) (features per split), \(B\) (number of trees), and minimum leaf size are the main knobs
- Built-in variable importance and OOB error estimation
- Embarrassingly parallel: each tree is trained independently
- Hard to overfit by adding more trees (variance only decreases with \(B\))
Weaknesses:
- Slow prediction: making a prediction requires traversing all \(B\) trees
- Limited extrapolation: trees can only predict within the range of training targets (a forest trained on values in \([0, 100]\) cannot predict 150)
- Memory-heavy: storing full trees can require significant memory
- Not competitive with gradient boosting on many tabular benchmarks (boosting reduces bias more effectively)
- No built-in uncertainty quantification beyond OOB variance
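The extrapolation limit is easy to see in miniature. A forest prediction is an average of training targets, so it can never leave \([\min(y), \max(y)]\); the local-average `leaf_predict` below is a stand-in for a tree leaf, invented for this illustration, not a real tree.

```python
import numpy as np

# A forest prediction is an average of training targets, so it can never
# leave [min(y), max(y)]. The local-average "leaf_predict" below is a
# stand-in for a tree leaf, invented for this illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=300)
y = x.copy()                      # targets lie in [0, 100]

def leaf_predict(x_query, x_train, y_train, k=10):
    # A leaf behaves like an average of the nearby training targets.
    nearest = np.argsort(np.abs(x_train - x_query))[:k]
    return float(y_train[nearest].mean())

print(leaf_predict(150.0, x, y))  # capped near max(y) = 100, never 150
```

A linear model fit to the same data would happily return 150 at \(x = 150\); averaging-based predictors cannot.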
Common Confusions
Random forests CAN overfit in some sense
Adding more trees does not cause overfitting (variance only decreases). But growing trees too deep with too few samples per leaf can overfit individual trees. The ensemble averages out some of this overfitting, but not all of it. With very noisy data and very deep trees, the irreducible correlation term \(\rho \sigma^2\) can be large.
Feature importance does not imply causation
A feature can have high importance because it is correlated with a causal feature, not because it is causal itself. Permutation importance measures predictive relevance in the model, not causal effect. Two highly correlated features may split importance between them, making both appear less important than they are.
OOB error is not the same as test error on truly new data
OOB error is a valid estimate of generalization error under the assumption that future data comes from the same distribution. If the test distribution differs (covariate shift, temporal drift), OOB error can be optimistic.
Summary
- Random forest = bagging + feature subsampling at each split
- Feature subsampling reduces pairwise correlation between trees, which is the key to variance reduction beyond pure bagging
- Ensemble variance = \(\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\); decorrelation (\(\rho\) small) matters more than adding trees (\(B\) large)
- OOB error provides a free estimate of test error
- Permutation importance avoids the cardinality bias of impurity importance, which favors high-cardinality features
- Random forests are consistent: they converge to the true function as \(n \to \infty\) under regularity conditions
- Strengths: robust, few hyperparameters, parallel. Weaknesses: slow prediction, no extrapolation, outperformed by boosting on many tasks
Exercises
Problem
A dataset has \(p = 100\) features. You build a random forest for classification with the default \(m = \sqrt{p} = 10\). At each split, each tree considers only 10 out of 100 features. If one feature is very strong, what is the probability it is available at a given split? How does this create diversity?
Problem
You have two random forests: Forest A uses \(m = 1\) (extremely random trees) and Forest B uses \(m = p\) (pure bagging, no feature subsampling). Both use the same number of trees \(B\). Using the variance formula \(\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\), explain which forest has lower variance and under what conditions the other forest might win.
Problem
Random forests cannot extrapolate: a forest trained on targets in \([0, 100]\) can never predict 150 because leaf predictions are averages of training targets. Propose a modification that enables extrapolation, and discuss what theoretical properties it might lose.
References
Canonical:
- Breiman, "Random Forests" (2001). The original paper
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 15
Current:
- Scornet, Biau, Vert, "Consistency of Random Forests" (Annals of Statistics, 2015)
- Strobl et al., "Bias in Random Forest Variable Importance Measures" (2007)
- Biau & Scornet, "A Random Forest Guided Tour" (TEST, 2016)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
The natural next steps from random forests:
- Gradient boosting: the complementary ensemble approach that reduces bias rather than variance
- Cross-validation theory: alternative ways to estimate generalization error
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Bootstrap Methods (Layer 2)