
ML Methods

Random Forests

Random forests combine bagging with random feature subsampling to decorrelate trees, reducing ensemble variance beyond what pure bagging achieves. This section covers out-of-bag estimation, variable importance, consistency theory, and practical strengths and weaknesses.


Why This Matters

[Figure: training data (n samples, d features) is bootstrapped into five samples; each tree splits on a random feature subset (e.g., Tree 1 uses f1, f3, f7) and casts a vote; the majority vote, "cat" (3/5), is the forest prediction.]

Random forests are one of the most reliable ensemble methods in practice. On tabular data, they consistently perform well with minimal tuning. They handle mixed feature types, are robust to outliers, require few hyperparameters, and provide built-in variable importance measures. Understanding why they work, not just that they work, requires understanding how feature subsampling decorrelates decision trees and why this reduces ensemble variance.

Mental Model

A single deep decision tree memorizes noise (high variance, low bias). Bagging reduces variance by averaging many trees trained on bootstrap samples. But bagged trees are correlated: if one feature is strongly predictive, all trees split on it first, producing similar trees. Feature subsampling forces each tree to ignore most features at each split, creating diverse trees that make different errors. The errors cancel when you average.

Random forests = bagging + feature subsampling. Bagging gives you diverse training sets. Feature subsampling gives you diverse trees.

The Random Forest Algorithm

Definition

Random Forest

Given training data $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$:

  1. For $b = 1, \ldots, B$:
    • Draw a bootstrap sample $S_b$ of size $n$ with replacement
    • Grow a tree $T_b$ on $S_b$, where at each node:
      • Select $m$ features uniformly at random from the $d$ available features
      • Find the best split among only those $m$ features
      • Split and recurse until a stopping condition is met
  2. Predict by averaging (regression) or majority vote (classification):

$$\hat{f}_{\text{RF}}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$$

Typical choices: $m = \lfloor\sqrt{d}\rfloor$ for classification, $m = \lfloor d/3 \rfloor$ for regression.
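As a concrete illustration, the algorithm above maps directly onto scikit-learn's `RandomForestClassifier`. The dataset and hyperparameter values here are illustrative, not prescriptive:

```python
# Sketch of the algorithm above with scikit-learn; dataset and
# hyperparameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # B: number of trees
    max_features="sqrt",  # m = floor(sqrt(d)) features tried per split
    bootstrap=True,       # each tree trains on a bootstrap sample
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))  # majority vote over the 100 trees
```

Each fitted tree is accessible via `forest.estimators_`, which makes the "ensemble of independently trained trees" structure explicit.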

Why Feature Subsampling Works

Proposition

Random Forest Variance Reduction

Statement

The variance of the random forest prediction is:

$$\text{Var}(\hat{f}_{\text{RF}}) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$$

where $\rho$ is the average pairwise correlation between individual tree predictions and $\sigma^2$ is the average variance of a single tree.

Feature subsampling reduces $\rho$ compared to pure bagging. As $B \to \infty$, the second term vanishes and the ensemble variance converges to $\rho\sigma^2$.

Intuition

Pure bagging produces trees that all split on the same strong features, making $\rho$ large. Feature subsampling forces each tree to explore different features, decorrelating their predictions. This reduces $\rho$ at the cost of slightly increasing individual tree variance $\sigma^2$ (since each tree uses a suboptimal subset of features). But the net effect is positive: the reduction in $\rho$ more than compensates for the increase in $\sigma^2$.

When $m = 1$ (extreme subsampling), trees are nearly uncorrelated but each tree is noisy. When $m = d$ (no subsampling), you get pure bagging with higher correlation. The optimal $m$ balances decorrelation against individual tree quality.
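The trade-off can be made concrete by plugging illustrative numbers into the variance formula. The $\rho$ values below are assumptions chosen for the example, not correlations measured from real forests:

```python
# Illustrative numbers for Var = rho*sigma^2 + (1 - rho)*sigma^2 / B.
# The rho values are assumptions for the example, not measured quantities.
def ensemble_variance(rho, sigma2, B):
    return rho * sigma2 + (1 - rho) * sigma2 / B

sigma2 = 1.0
for rho in (0.7, 0.3):          # 0.7 ~ pure bagging, 0.3 ~ with subsampling
    for B in (10, 100, 10_000):
        print(f"rho={rho}, B={B}: Var={ensemble_variance(rho, sigma2, B):.4f}")
# The variance floor is rho*sigma^2: past a few hundred trees,
# only lowering rho (via smaller m) helps further.
```

Note how the $B = 100$ and $B = 10{,}000$ rows are nearly identical for a fixed $\rho$, while changing $\rho$ moves the floor itself.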

Proof Sketch

Let $T_1(x), \ldots, T_B(x)$ be the tree predictions. Each has variance $\sigma^2$ and $\text{Cov}(T_b, T_{b'}) = \rho \sigma^2$ for $b \neq b'$.

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2$$

Rearranging: $\rho\sigma^2 + \frac{(1-\rho)\sigma^2}{B}$.

Why It Matters

This formula is the theoretical foundation of random forests. It explains every design choice: (1) grow deep trees to keep bias low and $\sigma^2$ manageable, (2) use feature subsampling to reduce $\rho$, (3) use many trees to shrink the $(1-\rho)/B$ term. It also explains why random forests cannot overfit by adding more trees: increasing $B$ only reduces the second term, never increases variance.

Failure Mode

The formula assumes identical variances and uniform pairwise correlation. In practice, some trees are better than others, and correlations vary. If a single feature dominates even after subsampling ($m$ is too large), $\rho$ stays high and the forest gains little from averaging. Also, feature subsampling does not reduce bias; if the relevant features are rarely selected, it increases bias.

Out-of-Bag Estimation

Definition

Out-of-Bag (OOB) Error

Each bootstrap sample includes roughly 63.2% of the original training data. The remaining 36.8% is out-of-bag (OOB) for that tree.

For each training point $x_i$, collect predictions only from trees where $x_i$ was OOB (not in the bootstrap sample). Average these predictions to form the OOB prediction $\hat{y}_i^{\text{OOB}}$. The OOB error is:

$$\text{OOB Error} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i^{\text{OOB}})$$

where $L$ is the loss function (0-1 loss for classification, squared error for regression).

The OOB error is an approximately unbiased estimate of the test error. It works like cross-validation for free: each point is evaluated only by trees that never saw it during training. This eliminates the need for a separate validation set, which is especially valuable when data is scarce.
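A minimal sketch of OOB estimation with scikit-learn on synthetic data; `oob_score=True` performs exactly this per-point aggregation during `fit` (the attribute reports accuracy, so the OOB error is one minus it):

```python
# Sketch of OOB estimation with scikit-learn; oob_score=True aggregates,
# for each point, only the trees whose bootstrap sample excluded it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")  # OOB error = 1 - oob_score_

# Fraction of points inside a bootstrap sample: 1 - (1 - 1/n)^n -> 1 - 1/e
n = 1000
print(f"in-bag fraction: {1 - (1 - 1/n) ** n:.3f}")  # ~0.632
```

The second print recovers the 63.2%/36.8% split quoted above from first principles.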

Variable Importance

Permutation Importance

Definition

Permutation Importance

To measure the importance of feature $j$:

  1. Compute the OOB error of the forest: $\text{Err}^{\text{OOB}}$
  2. For each tree $T_b$, permute the values of feature $j$ in its OOB data
  3. Compute the OOB error with the permuted feature: $\text{Err}_j^{\text{perm}}$
  4. Importance of feature $j$ is:

$$\text{VI}_j = \frac{1}{B} \sum_{b=1}^{B} \left(\text{Err}_{j,b}^{\text{perm}} - \text{Err}_b^{\text{OOB}}\right)$$

If feature $j$ is important, permuting it destroys its predictive signal and increases error substantially.

Permutation importance measures how much the model relies on each feature. It is model-agnostic (can be applied to any model) and respects feature interactions.
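A short sketch using scikit-learn's model-agnostic `permutation_importance` helper. Note one practical deviation from the definition above: it permutes features on a held-out split rather than on per-tree OOB data, which is a common stand-in:

```python
# Sketch of permutation importance via scikit-learn's model-agnostic helper.
# A held-out split stands in for the per-tree OOB data of the definition.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

result = permutation_importance(forest, X_te, y_te, n_repeats=10,
                                random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```

Because the helper only needs `predict`, the same call works for any fitted model, which is the model-agnostic property claimed above.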

Impurity-Based Importance

Definition

Impurity-Based Importance (MDI)

The Mean Decrease in Impurity for feature $j$ sums the impurity reduction from all splits on feature $j$ across all trees:

$$\text{MDI}_j = \frac{1}{B} \sum_{b=1}^{B} \sum_{\text{node } t \in T_b \text{ splitting on } j} p(t) \cdot \Delta I(t)$$

where $p(t)$ is the proportion of samples reaching node $t$ and $\Delta I(t)$ is the impurity decrease at that split.

Impurity importance is fast (computed during training) but biased: it favors high-cardinality features and features with many possible split points. A random feature with many unique values will show nonzero impurity importance simply because it can create arbitrary splits. Permutation importance does not have this bias.
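The cardinality bias is easy to reproduce on synthetic data (the setup below is illustrative): an ID-like feature of pure noise with many unique values still collects nonzero MDI importance.

```python
# Reproducing the cardinality bias: an ID-like noise feature with n unique
# values still collects nonzero impurity (MDI) importance. Synthetic setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)                  # genuinely predictive feature
noise_id = rng.permutation(n).astype(float)  # pure noise, n unique values
X = np.column_stack([signal, noise_id])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
print("MDI importances [signal, noise_id]:", forest.feature_importances_)
# The noise feature shows clearly nonzero MDI despite carrying no signal.
```

With its many unique values, the noise column can always manufacture splits that look locally informative, which is exactly the bias described above.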

Consistency Results

Theorem

Random Forest Consistency

Statement

Under regularity conditions, random forests are consistent: as the sample size $n \to \infty$, the random forest prediction converges in probability to the true regression function:

$$\hat{f}_{\text{RF}}(x) \xrightarrow{P} f(x) = \mathbb{E}[Y \mid X = x]$$

This holds when (1) each tree is grown with sufficient depth (increasing with $n$) so that bias vanishes, and (2) the subsampling rate ensures each tree sees enough data to control variance.

Intuition

Consistency means the forest gets arbitrarily good with enough data. Deep trees have low bias (they can approximate any function), and averaging many decorrelated trees controls variance. As $n$ grows, each tree becomes more accurate and the average converges to the truth.

Proof Sketch

The proof by Scornet, Biau, and Vert (2015) for additive models proceeds by: (1) showing that each tree partition refines sufficiently as $n$ grows, so the bias of each tree vanishes; (2) using the forest averaging to control variance via the decorrelation bound; (3) combining bias and variance convergence. Earlier results by Biau (2012) proved consistency for simplified random forests with purely random splits.

Why It Matters

Consistency is the minimum theoretical requirement for any learning method. It guarantees that with enough data, the method converges to the truth. Not all popular methods are proven consistent; the fact that random forests have consistency guarantees (under conditions) gives them a theoretical foundation beyond their empirical success.

Failure Mode

Consistency is an asymptotic result. It says nothing about the rate of convergence. In high dimensions, random forests can converge slowly because the trees partition a high-dimensional space. The minimax rate for Lipschitz functions is $O(n^{-2/(2+d)})$, which is slow for large $d$. In practice, random forests often beat this rate because real data has lower effective dimensionality.

Strengths and Weaknesses

Strengths:

  • Robust to outliers and noise: averaging many trees smooths out noise
  • Handles mixed feature types (numeric, categorical) without preprocessing
  • Few hyperparameters: $m$ (features per split), $B$ (number of trees), and minimum leaf size are the main knobs
  • Built-in variable importance and OOB error estimation
  • Embarrassingly parallel: each tree is trained independently
  • Hard to overfit by adding more trees (variance only decreases with $B$)

Weaknesses:

  • Slow prediction: making a prediction requires traversing $B$ trees
  • Limited extrapolation: trees can only predict within the range of training targets (a forest trained on values in $[0, 100]$ cannot predict 150)
  • Memory-heavy: storing $B$ full trees can require significant memory
  • Not competitive with gradient boosting on many tabular benchmarks (boosting reduces bias more effectively)
  • No built-in uncertainty quantification beyond OOB variance
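The extrapolation limit in particular is easy to demonstrate on a synthetic one-feature regression (illustrative setup): leaf predictions are averages of training targets, so the forest's output is capped by the training range.

```python
# The extrapolation limit: leaf predictions are averages of training targets,
# so a forest trained on y in [0, 100] predicts inside that range everywhere.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 1))
y = X.ravel()                                # identity target, y in [0, 100]

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
pred = forest.predict([[150.0]])[0]
print(f"prediction at x=150: {pred:.1f}")    # capped near max(y), never 150
```

A linear model would extrapolate to 150 here; the forest flatlines at the edge of the training range.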

Common Confusions

Watch Out

Random forests CAN overfit in some sense

Adding more trees does not cause overfitting (variance only decreases). But growing trees too deep with too few samples per leaf can overfit individual trees. The ensemble averages out some of this overfitting, but not all of it. With very noisy data and very deep trees, the irreducible correlation term $\rho\sigma^2$ can be large.

Watch Out

Feature importance does not imply causation

A feature can have high importance because it is correlated with a causal feature, not because it is causal itself. Permutation importance measures predictive relevance in the model, not causal effect. Two highly correlated features may split importance between them, making both appear less important than they are.
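The importance-splitting effect can be sketched on synthetic data (illustrative setup): adding a near-copy of a predictive feature spreads the impurity importance across both copies, making each look weaker than the feature is on its own.

```python
# Importance splitting between correlated features, on synthetic data:
# a near-duplicate of a predictive feature shares the MDI importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
f = rng.normal(size=n)
y = (f > 0).astype(int)

imp_alone = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    f.reshape(-1, 1), y).feature_importances_
X_dup = np.column_stack([f, f + 0.01 * rng.normal(size=n)])  # near-copy
imp_dup = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_dup, y).feature_importances_

print("alone:     ", imp_alone)   # all importance on the single feature
print("duplicated:", imp_dup)     # roughly shared between the two copies
```

Neither copy's score reflects what dropping both features would cost, which is why importance rankings on correlated inputs need care.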

Watch Out

OOB error is not the same as test error on truly new data

OOB error is a valid estimate of generalization error under the assumption that future data comes from the same distribution. If the test distribution differs (covariate shift, temporal drift), OOB error can be optimistic.

Summary

  • Random forest = bagging + feature subsampling at each split
  • Feature subsampling reduces the pairwise correlation $\rho$ between trees, which is the key to variance reduction beyond pure bagging
  • Ensemble variance = $\rho\sigma^2 + (1-\rho)\sigma^2/B$; decorrelation ($\rho$ small) matters more than adding trees ($B$ large)
  • OOB error provides a free estimate of test error
  • Permutation importance avoids the cardinality bias that makes impurity importance favor high-cardinality features
  • Random forests are consistent: they converge to the true function as $n \to \infty$ under regularity conditions
  • Strengths: robust, few hyperparameters, parallel. Weaknesses: slow prediction, no extrapolation, outperformed by boosting on many tasks

Exercises

Exercise (Core)

Problem

A dataset has $d = 100$ features. You build a random forest for classification with the default $m = \lfloor\sqrt{d}\rfloor = 10$. At each split, each tree considers only 10 out of 100 features. If one feature is very strong, what is the probability it is available at a given split? How does this create diversity?

Exercise (Advanced)

Problem

You have two random forests: Forest A uses $m = 1$ (extremely random trees) and Forest B uses $m = d$ (pure bagging, no feature subsampling). Both use $B = 500$ trees. Using the variance formula $\text{Var} = \rho\sigma^2 + (1-\rho)\sigma^2/B$, explain which forest has lower variance and under what conditions the other forest might win.

Exercise (Research)

Problem

Random forests cannot extrapolate: a forest trained on targets in $[0, 100]$ can never predict 150 because leaf predictions are averages of training targets. Propose a modification that enables extrapolation, and discuss what theoretical properties it might lose.

References

Canonical:

  • Breiman, "Random Forests" (2001). The original paper
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 15

Current:

  • Scornet, Biau, Vert, "Consistency of Random Forests" (Annals of Statistics, 2015)

  • Strobl et al., "Bias in Random Forest Variable Importance Measures" (2007)

  • Biau & Scornet, "A Random Forest Guided Tour" (TEST, 2016)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

The natural next steps from random forests:

  • Gradient boosting: the complementary ensemble approach that reduces bias rather than variance
  • Cross-validation theory: alternative ways to estimate generalization error

Last reviewed: April 2026
