
ML Methods

Random Forests

Random forests combine bagging with random feature subsampling to decorrelate trees, reducing ensemble variance beyond what pure bagging achieves. This section covers out-of-bag estimation, variable importance, consistency theory, and practical strengths and weaknesses.


Why This Matters

[Figure: training data (n samples, d features) is bootstrapped into five samples; each tree splits on a random feature subset (e.g., Tree 1 uses f1, f3, f7) and casts a vote; the majority vote, "cat" (3/5), is the forest prediction.]

Random forests are one of the most reliable ensemble methods in practice. On tabular data, they consistently perform well with minimal tuning. They handle mixed feature types, are robust to outliers, require few hyperparameters, and provide built-in variable importance measures. Understanding why they work, not just that they work, requires understanding how feature subsampling decorrelates decision trees and why this reduces ensemble variance.

Mental Model

A single deep decision tree memorizes noise (high variance, low bias). Bagging reduces variance by averaging many trees trained on bootstrap samples. But bagged trees are correlated: if one feature is strongly predictive, all trees split on it first, producing similar trees. Feature subsampling forces each tree to ignore most features at each split, creating diverse trees that make different errors. The errors cancel when you average.

Random forests = bagging + feature subsampling. Bagging gives you diverse training sets. Feature subsampling gives you diverse trees.

The Random Forest Algorithm

Definition

Random Forest

Given training data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$:

  1. For $b = 1, \ldots, B$:
    • Draw a bootstrap sample $S_b$ of size $n$ with replacement
    • Grow a tree $T_b$ on $S_b$, where at each node:
      • Select $m$ features uniformly at random from the $d$ available features
      • Find the best split among only those $m$ features
      • Split and recurse until a stopping condition is met
  2. Predict by averaging (regression) or majority vote (classification):

$$\hat{f}_{\text{RF}}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$$

Typical choices: $m = \lfloor\sqrt{d}\rfloor$ for classification, $m = \lfloor d/3 \rfloor$ for regression.
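As a concrete illustration, the algorithm above maps directly onto scikit-learn's `RandomForestClassifier`. The dataset and hyperparameter values here are illustrative, not prescriptive:

```python
# Sketch of the algorithm above with scikit-learn; dataset and
# hyperparameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # B: number of trees
    max_features="sqrt",  # m = floor(sqrt(d)) features tried per split
    bootstrap=True,       # each tree trains on a bootstrap sample
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))  # majority vote over the 100 trees
```

Each fitted tree is accessible via `forest.estimators_`, which makes the "ensemble of independently trained trees" structure explicit.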

Why Feature Subsampling Works

Proposition

Random Forest Variance Reduction

Statement

The variance of the random forest prediction is:

$$\text{Var}(\hat{f}_{\text{RF}}) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$$

where $\rho$ is the average pairwise correlation between individual tree predictions and $\sigma^2$ is the average variance of a single tree.

Feature subsampling reduces $\rho$ compared to pure bagging. As $B \to \infty$, the second term vanishes and the ensemble variance converges to $\rho\sigma^2$.

Intuition

Pure bagging produces trees that all split on the same strong features, making $\rho$ large. Feature subsampling forces each tree to explore different features, decorrelating their predictions. This reduces $\rho$ at the cost of slightly increasing individual tree variance $\sigma^2$ (since each tree uses a suboptimal subset of features). But the net effect is positive: the reduction in $\rho$ more than compensates for the increase in $\sigma^2$.

When $m = 1$ (extreme subsampling), trees are nearly uncorrelated but each tree is noisy. When $m = d$ (no subsampling), you get pure bagging with higher correlation. The optimal $m$ balances decorrelation against individual tree quality.
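The trade-off can be made concrete by plugging illustrative numbers into the variance formula. The $\rho$ values below are assumptions chosen for the example, not correlations measured from real forests:

```python
# Illustrative numbers for Var = rho*sigma^2 + (1 - rho)*sigma^2 / B.
# The rho values are assumptions for the example, not measured quantities.
def ensemble_variance(rho, sigma2, B):
    return rho * sigma2 + (1 - rho) * sigma2 / B

sigma2 = 1.0
for rho in (0.7, 0.3):          # 0.7 ~ pure bagging, 0.3 ~ with subsampling
    for B in (10, 100, 10_000):
        print(f"rho={rho}, B={B}: Var={ensemble_variance(rho, sigma2, B):.4f}")
# The variance floor is rho*sigma^2: past a few hundred trees,
# only lowering rho (via smaller m) helps further.
```

Note how the $B = 100$ and $B = 10{,}000$ rows are nearly identical for a fixed $\rho$, while changing $\rho$ moves the floor itself.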

Proof Sketch

Let $T_1(x), \ldots, T_B(x)$ be the tree predictions. Each has variance $\sigma^2$ and $\text{Cov}(T_b, T_{b'}) = \rho \sigma^2$ for $b \neq b'$.

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2$$

Rearranging: $\rho\sigma^2 + \frac{(1-\rho)\sigma^2}{B}$.

Why It Matters

This formula is the theoretical foundation of random forests. It explains every design choice: (1) grow deep trees to keep bias low and $\sigma^2$ manageable, (2) use feature subsampling to reduce $\rho$, (3) use many trees to shrink the $(1-\rho)/B$ term. It also explains why random forests cannot overfit by adding more trees: increasing $B$ only reduces the second term, never increases variance.

Failure Mode

The formula assumes identical variances and uniform pairwise correlation. In practice, some trees are better than others, and correlations vary. If a single feature dominates even after subsampling ($m$ is too large), $\rho$ stays high and the forest gains little from averaging. Also, feature subsampling does not reduce bias; if the relevant features are rarely selected, it increases bias.

Out-of-Bag Estimation

Definition

Out-of-Bag (OOB) Error

Each bootstrap sample includes roughly 63.2% of the original training data. The remaining 36.8% is out-of-bag (OOB) for that tree.

For each training point $x_i$, collect predictions only from trees where $x_i$ was OOB (not in the bootstrap sample). Average these predictions to form the OOB prediction $\hat{y}_i^{\text{OOB}}$. The OOB error is:

$$\text{OOB Error} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i^{\text{OOB}})$$

where $L$ is the loss function (0-1 loss for classification, squared error for regression).

The OOB error is an approximately unbiased estimate of the test error. It works like cross-validation for free: each point is evaluated only by trees that never saw it during training. This eliminates the need for a separate validation set, which is especially valuable when data is scarce.
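A minimal sketch of OOB estimation with scikit-learn on synthetic data; `oob_score=True` performs exactly this per-point aggregation during `fit` (the attribute reports accuracy, so the OOB error is one minus it):

```python
# Sketch of OOB estimation with scikit-learn; oob_score=True aggregates,
# for each point, only the trees whose bootstrap sample excluded it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")  # OOB error = 1 - oob_score_

# Fraction of points inside a bootstrap sample: 1 - (1 - 1/n)^n -> 1 - 1/e
n = 1000
print(f"in-bag fraction: {1 - (1 - 1/n) ** n:.3f}")  # ~0.632
```

The second print recovers the 63.2%/36.8% split quoted above from first principles.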

Variable Importance

Permutation Importance

Definition

Permutation Importance

To measure the importance of feature $j$:

  1. Compute the OOB error of the forest: $\text{Err}^{\text{OOB}}$
  2. For each tree $T_b$, permute the values of feature $j$ in its OOB data
  3. Compute the OOB error with the permuted feature: $\text{Err}_j^{\text{perm}}$
  4. Importance of feature $j$ is:

$$\text{VI}_j = \frac{1}{B} \sum_{b=1}^{B} \left(\text{Err}_{j,b}^{\text{perm}} - \text{Err}_b^{\text{OOB}}\right)$$

If feature $j$ is important, permuting it destroys its predictive signal and increases error substantially.

Permutation importance measures how much the model relies on each feature. It is model-agnostic (can be applied to any model) and respects feature interactions.
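A short sketch using scikit-learn's model-agnostic `permutation_importance` helper. Note one practical deviation from the definition above: it permutes features on a held-out split rather than on per-tree OOB data, which is a common stand-in:

```python
# Sketch of permutation importance via scikit-learn's model-agnostic helper.
# A held-out split stands in for the per-tree OOB data of the definition.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

result = permutation_importance(forest, X_te, y_te, n_repeats=10,
                                random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```

Because the helper only needs `predict`, the same call works for any fitted model, which is the model-agnostic property claimed above.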

Impurity-Based Importance

Definition

Impurity-Based Importance (MDI)

The Mean Decrease in Impurity for feature $j$ sums the impurity reduction from all splits on feature $j$ across all trees:

$$\text{MDI}_j = \frac{1}{B} \sum_{b=1}^{B} \sum_{\text{node } t \in T_b \text{ splitting on } j} p(t) \cdot \Delta I(t)$$

where $p(t)$ is the proportion of samples reaching node $t$ and $\Delta I(t)$ is the impurity decrease at that split.

Impurity importance is fast (computed during training) but biased: it favors high-cardinality features and features with many possible split points. A random feature with many unique values will show nonzero impurity importance simply because it can create arbitrary splits. Permutation importance does not have this bias.
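The cardinality bias is easy to reproduce on synthetic data (the setup below is illustrative): an ID-like feature of pure noise with many unique values still collects nonzero MDI importance.

```python
# Reproducing the cardinality bias: an ID-like noise feature with n unique
# values still collects nonzero impurity (MDI) importance. Synthetic setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)                  # genuinely predictive feature
noise_id = rng.permutation(n).astype(float)  # pure noise, n unique values
X = np.column_stack([signal, noise_id])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
print("MDI importances [signal, noise_id]:", forest.feature_importances_)
# The noise feature shows clearly nonzero MDI despite carrying no signal.
```

With its many unique values, the noise column can always manufacture splits that look locally informative, which is exactly the bias described above.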

Consistency Results

Theorem

Random Forest Consistency

Statement

Under regularity conditions, random forests are consistent: as the sample size $n \to \infty$, the random forest prediction converges in probability to the true regression function:

$$\hat{f}_{\text{RF}}(x) \xrightarrow{P} f(x) = \mathbb{E}[Y \mid X = x]$$

This holds when (1) each tree is grown with sufficient depth (increasing with $n$) so that bias vanishes, and (2) the subsampling rate ensures each tree sees enough data to control variance.

Intuition

Consistency means the forest gets arbitrarily good with enough data. Deep trees have low bias (they can approximate any function), and averaging many decorrelated trees controls variance. As $n$ grows, each tree becomes more accurate and the average converges to the truth.

Proof Sketch

The proof by Scornet, Biau, and Vert (2015) for additive models proceeds by: (1) showing that each tree partition refines sufficiently as $n$ grows, so the bias of each tree vanishes; (2) using the forest averaging to control variance via the decorrelation bound; (3) combining bias and variance convergence. Earlier results by Biau (2012) proved consistency for simplified random forests with purely random splits.

Why It Matters

Consistency is the minimum theoretical requirement for any learning method. It guarantees that with enough data, the method converges to the truth. Not all popular methods are proven consistent; the fact that random forests have consistency guarantees (under conditions) gives them a theoretical foundation beyond their empirical success.

Failure Mode

Consistency is an asymptotic result. It says nothing about the rate of convergence. In high dimensions, random forests can converge slowly because the trees partition a high-dimensional space. The minimax rate for Lipschitz functions is $O(n^{-2/(2+d)})$, which is slow for large $d$. In practice, random forests often beat this rate because real data has lower effective dimensionality.

Strengths and Weaknesses

Strengths:

  • Robust to outliers and noise: averaging many trees smooths out noise
  • Handles mixed feature types (numeric, categorical) without preprocessing
  • Few hyperparameters: $m$ (features per split), $B$ (number of trees), and minimum leaf size are the main knobs
  • Built-in variable importance and OOB error estimation
  • Embarrassingly parallel: each tree is trained independently
  • Hard to overfit by adding more trees (variance only decreases with $B$)

Weaknesses:

  • Slow prediction: making a prediction requires traversing $B$ trees
  • Limited extrapolation: trees can only predict within the range of training targets (a forest trained on values in $[0, 100]$ cannot predict 150)
  • Memory-heavy: storing $B$ full trees can require significant memory
  • Not competitive with gradient boosting on many tabular benchmarks (boosting reduces bias more effectively)
  • No built-in uncertainty quantification beyond OOB variance
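The extrapolation limit in particular is easy to demonstrate on a synthetic one-feature regression (illustrative setup): leaf predictions are averages of training targets, so the forest's output is capped by the training range.

```python
# The extrapolation limit: leaf predictions are averages of training targets,
# so a forest trained on y in [0, 100] predicts inside that range everywhere.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 1))
y = X.ravel()                                # identity target, y in [0, 100]

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
pred = forest.predict([[150.0]])[0]
print(f"prediction at x=150: {pred:.1f}")    # capped near max(y), never 150
```

A linear model would extrapolate to 150 here; the forest flatlines at the edge of the training range.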

Common Confusions

Watch Out

Random forests CAN overfit in some sense

Adding more trees does not cause overfitting (variance only decreases). But growing trees too deep with too few samples per leaf can overfit individual trees. The ensemble averages out some of this overfitting, but not all of it. With very noisy data and very deep trees, the irreducible correlation term $\rho\sigma^2$ can be large.

Watch Out

Feature importance does not imply causation

A feature can have high importance because it is correlated with a causal feature, not because it is causal itself. Permutation importance measures predictive relevance in the model, not causal effect. Two highly correlated features may split importance between them, making both appear less important than they are.
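The importance-splitting effect can be sketched on synthetic data (illustrative setup): adding a near-copy of a predictive feature spreads the impurity importance across both copies, making each look weaker than the feature is on its own.

```python
# Importance splitting between correlated features, on synthetic data:
# a near-duplicate of a predictive feature shares the MDI importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
f = rng.normal(size=n)
y = (f > 0).astype(int)

imp_alone = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    f.reshape(-1, 1), y).feature_importances_
X_dup = np.column_stack([f, f + 0.01 * rng.normal(size=n)])  # near-copy
imp_dup = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_dup, y).feature_importances_

print("alone:     ", imp_alone)   # all importance on the single feature
print("duplicated:", imp_dup)     # roughly shared between the two copies
```

Neither copy's score reflects what dropping both features would cost, which is why importance rankings on correlated inputs need care.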

Watch Out

OOB error is not the same as test error on truly new data

OOB error is a valid estimate of generalization error under the assumption that future data comes from the same distribution. If the test distribution differs (covariate shift, temporal drift), OOB error can be optimistic.

Summary

  • Random forest = bagging + feature subsampling at each split
  • Feature subsampling reduces the pairwise correlation $\rho$ between trees, which is the key to variance reduction beyond pure bagging
  • Ensemble variance = $\rho\sigma^2 + (1-\rho)\sigma^2/B$; decorrelation ($\rho$ small) matters more than adding trees ($B$ large)
  • OOB error provides a free estimate of test error
  • Permutation importance avoids the cardinality bias that makes impurity importance favor high-cardinality features
  • Random forests are consistent: they converge to the true function as $n \to \infty$ under regularity conditions
  • Strengths: robust, few hyperparameters, parallel. Weaknesses: slow prediction, no extrapolation, outperformed by boosting on many tasks

Exercises

Exercise (Core)

Problem

A dataset has $d = 100$ features. You build a random forest for classification with the default $m = \lfloor\sqrt{d}\rfloor = 10$. At each split, each tree considers only 10 out of 100 features. If one feature is very strong, what is the probability it is available at a given split? How does this create diversity?

Exercise (Advanced)

Problem

You have two random forests: Forest A uses $m = 1$ (extremely random trees) and Forest B uses $m = d$ (pure bagging, no feature subsampling). Both use $B = 500$ trees. Using the variance formula $\text{Var} = \rho\sigma^2 + (1-\rho)\sigma^2/B$, explain which forest has lower variance and under what conditions the other forest might win.

Exercise (Research)

Problem

Random forests cannot extrapolate: a forest trained on targets in $[0, 100]$ can never predict 150 because leaf predictions are averages of training targets. Propose a modification that enables extrapolation, and discuss what theoretical properties it might lose.

References

Canonical:

  • Breiman, "Random Forests" (2001). The original paper
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 15

Current:

  • Scornet, Biau, Vert, "Consistency of Random Forests" (Annals of Statistics, 2015)

  • Strobl et al., "Bias in Random Forest Variable Importance Measures" (2007)

  • Biau & Scornet, "A Random Forest Guided Tour" (TEST, 2016)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

The natural next steps from random forests:

  • Gradient boosting: the complementary ensemble approach that reduces bias rather than variance
  • Cross-validation theory: alternative ways to estimate generalization error

Last reviewed: April 2026
