
Gradient Boosting

Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth.


Why This Matters

[Figure: gradient boosting as sequential residual fitting. Starting from $F_0(x) = \text{mean}(y) = 5.0$ for a target $y = 8.0$, tree $h_1$ fits the residual $r_1 = y - F_0$ (+2.5), giving $F_1$ with prediction 7.5; tree $h_2$ fits $r_2 = y - F_1$ (+0.4), giving $F_2$ with prediction 7.9. The residual error shrinks with each boosting round: 3.0, 0.5, 0.1.]

Gradient-boosted decision trees are the dominant method for tabular data. XGBoost and LightGBM win most Kaggle competitions involving structured data, and they power production systems at scale. Understanding gradient boosting as functional gradient descent -- fitting each tree to the negative gradient of the loss -- gives you the theoretical foundation to understand shrinkage, regularization, and the differences between implementations.

More broadly, boosting is the complement of bagging. Bagging (random forests) reduces variance by averaging independent high-variance learners. Boosting reduces bias by sequentially correcting the errors of weak learners. Understanding this bias-variance tradeoff is essential for choosing and tuning ensemble methods.

Mental Model

Imagine you have a bad predictor $F_0(x)$. Its errors (residuals) form a pattern. You fit a small tree to those residuals. Adding that tree to $F_0$ partially corrects the errors. The remaining residuals form a new pattern. You fit another tree. Repeat. Each round reduces the bias of the ensemble by targeting whatever structure the current model is missing.

The key insight: this process is gradient descent in function space. The residuals are the negative gradient of the loss with respect to the current prediction. Each tree is one gradient step.

Formal Setup

Let $\{(x_i, y_i)\}_{i=1}^n$ be training data. We build an additive model:

$$F_M(x) = F_0(x) + \sum_{m=1}^{M} \eta \, h_m(x)$$

where $F_0$ is an initial prediction (often the mean of $y$), each $h_m$ is a decision tree, and $\eta \in (0, 1]$ is the learning rate (shrinkage).

Definition

Pseudo-Residuals

At round $m$, the pseudo-residual for sample $i$ is the negative gradient of the loss with respect to the current prediction:

$$r_i^{(m)} = -\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}$$

The new tree $h_m$ is fit to these pseudo-residuals. For squared loss $L(y, F) = \frac{1}{2}(y - F)^2$, the pseudo-residual is simply $r_i = y_i - F_{m-1}(x_i)$, the ordinary residual.
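To make this concrete, here is a minimal from-scratch sketch of the squared-loss case, using scikit-learn's DecisionTreeRegressor as the weak learner (the function names and hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, eta=0.1, max_depth=3):
    """Gradient boosting for squared loss: each tree fits the current residuals."""
    f0 = y.mean()                          # F_0: constant initial prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # pseudo-residuals = negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # project the gradient onto the tree class
        pred += eta * tree.predict(X)      # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, eta=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += eta * tree.predict(X)
    return pred
```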

Main Theorems

Theorem

Gradient Boosting as Functional Gradient Descent

Statement

The gradient boosting update

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$

where $h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n (r_i^{(m)} - h(x_i))^2$, is a steepest descent step in the function space $L^2$, where the descent direction is projected onto the hypothesis class $\mathcal{H}$.

Intuition

In ordinary gradient descent, you compute the gradient of a loss with respect to parameters and take a step. In gradient boosting, the "parameters" are the function values $F(x_1), \ldots, F(x_n)$. The gradient of the total loss with respect to these function values is the vector of pseudo-residuals. Fitting a tree to the pseudo-residuals is projecting this gradient onto the space of trees -- the closest tree to the gradient direction. Adding it with step size $\eta$ is a gradient step in function space.

Proof Sketch

The empirical loss is $\mathcal{L}(F) = \sum_{i=1}^n L(y_i, F(x_i))$. The functional gradient is $\nabla_F \mathcal{L} = (\partial L / \partial F(x_i))_{i=1}^n$. The negative gradient is the vector of pseudo-residuals $(r_1, \ldots, r_n)$. We cannot step in an arbitrary direction in function space (infinite-dimensional), so we project onto $\mathcal{H}$ by finding $h_m = \arg\min_h \sum_i (r_i - h(x_i))^2$. Then $F_m = F_{m-1} + \eta h_m$ is a projected gradient descent step.

Why It Matters

This viewpoint unifies all boosting variants. Change the loss function and you get different pseudo-residuals: squared loss gives ordinary residuals, logistic loss gives probability-weighted residuals, quantile loss gives sign-based residuals. The framework extends to any differentiable loss.
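As a sketch of that plug-and-play structure, the pseudo-residual functions for these three losses look like this (with $y \in \{-1, +1\}$ for the logistic case; each return line is just the negative derivative of the stated loss):

```python
import numpy as np

def residual_squared(y, F):
    # L = 1/2 (y - F)^2  =>  -dL/dF = y - F (the ordinary residual)
    return y - F

def residual_logistic(y, F):
    # L = log(1 + exp(-y F)), y in {-1, +1}  =>  -dL/dF = y / (1 + exp(y F))
    return y / (1.0 + np.exp(y * F))

def residual_quantile(y, F, tau=0.5):
    # Pinball loss at quantile tau  =>  -dL/dF = tau - 1{y < F}, a sign-based residual
    return tau - (y < F).astype(float)
```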

Failure Mode

The projection onto $\mathcal{H}$ is only approximate. If the tree class cannot represent the true gradient direction, each step is biased. Also, functional gradient descent has no convergence rate guarantees comparable to those for parametric gradient descent on convex objectives, because the loss surface in function space is typically non-convex.

AdaBoost as a Special Case

Proposition

AdaBoost Minimizes Exponential Loss

Statement

The AdaBoost algorithm -- which reweights training samples, fits a weak classifier to the reweighted data, and combines classifiers with data-dependent weights -- is equivalent to gradient boosting with the exponential loss $L(y, F) = e^{-yF}$.

Intuition

The exponential loss penalizes misclassifications exponentially. Its negative gradient with respect to $F$ is $y e^{-yF}$, which upweights samples that the current ensemble classifies incorrectly (large $e^{-yF}$). This is exactly the reweighting scheme of AdaBoost: misclassified points get higher weight in the next round.

Proof Sketch

The AdaBoost sample weight at round $m$ is $w_i^{(m)} \propto e^{-y_i F_{m-1}(x_i)}$. The pseudo-residual for exponential loss is $r_i = y_i e^{-y_i F_{m-1}(x_i)}$. Fitting a classifier to minimize weighted classification error with weights $w_i$ is equivalent to fitting to the pseudo-residuals. The optimal step size for exponential loss gives the AdaBoost classifier weight $\alpha_m = \frac{1}{2}\log\frac{1 - \epsilon_m}{\epsilon_m}$ where $\epsilon_m$ is the weighted error rate.
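A minimal sketch of the classical loop makes the correspondence visible: the weight update below maintains $w_i \propto e^{-y_i F_m(x_i)}$, and $\alpha_m$ is the optimal step size (decision stumps via scikit-learn; labels assumed in $\{-1, +1\}$; the clipping of $\epsilon_m$ is a numerical safeguard added here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Classical AdaBoost for labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # w_i proportional to exp(-y_i F(x_i))
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)  # classifier weight alpha_m
        w *= np.exp(-alpha * y * pred)         # upweight misclassified points
        w /= w.sum()                           # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas
```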

Why It Matters

This connection demystifies AdaBoost. It is not a separate algorithm with its own theory -- it is gradient boosting with a particular loss function. This also explains why AdaBoost is sensitive to outliers: the exponential loss grows without bound for misclassified points, giving outliers enormous influence.

Failure Mode

Exponential loss is not robust to label noise. A single mislabeled point with large margin generates a huge loss that dominates the gradient. In practice, logistic loss (log-loss) is preferred because it grows linearly (not exponentially) for large negative margins.

Shrinkage and the Learning Rate

The learning rate $\eta$ (also called shrinkage) controls how much each tree contributes. With $\eta = 1$, each tree fully corrects in its direction. With $\eta \ll 1$, each tree makes a small correction, and many more trees are needed.

Why does small $\eta$ help? It acts as a form of regularization:

  1. Smoother approximation: small steps explore more of function space, avoiding committing too strongly to early trees
  2. Better generalization: empirically, $\eta \in [0.01, 0.1]$ with many rounds consistently outperforms $\eta = 1$ with few rounds
  3. Analogy to gradient descent: in parametric optimization, small learning rates with more iterations often find better solutions

The standard practice is to set $\eta$ small (0.01 to 0.1) and use early stopping on a validation set to choose the number of rounds $M$.
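With scikit-learn's GradientBoostingRegressor, for example, that recipe looks roughly like the following (the dataset and all hyperparameter values are illustrative, not tuned recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    learning_rate=0.05,       # small eta (shrinkage)
    n_estimators=2000,        # generous upper bound on rounds M
    max_depth=3,              # shallow trees
    validation_fraction=0.1,  # held-out split used for early stopping
    n_iter_no_change=50,      # stop after 50 rounds without improvement
)
model.fit(X, y)
print(model.n_estimators_)    # rounds actually used before stopping
```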

XGBoost: Second-Order Boosting

XGBoost improves on standard gradient boosting by using a second-order Taylor expansion of the loss:

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^n \left[ g_i h_m(x_i) + \frac{1}{2} H_i h_m(x_i)^2 \right] + \Omega(h_m)$$

where $g_i = \partial L / \partial F_{m-1}(x_i)$ is the gradient, $H_i = \partial^2 L / \partial F_{m-1}(x_i)^2$ is the Hessian (second derivative), and $\Omega(h_m) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$ is a regularization term ($T$ = number of leaves, $w_j$ = leaf weight).
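Minimizing this quadratic objective leaf by leaf explains where the split gain below comes from: for a fixed tree structure, setting the derivative with respect to each leaf weight to zero gives

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} H_i + \lambda}, \qquad \mathcal{L}^{(m)}(w^*) = -\frac{1}{2}\sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} H_i + \lambda} + \gamma T$$

where $I_j$ is the set of samples in leaf $j$. Each term of this optimal objective is the per-leaf "score" that appears in the gain formula below.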

Definition

XGBoost Split Gain

For a candidate split dividing leaf node samples into left ($I_L$) and right ($I_R$) subsets, the split gain is:

$$\text{Gain} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} H_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} H_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} H_i + \lambda}\right] - \gamma$$

A split is made only if $\text{Gain} > 0$. The regularization parameters $\lambda$ and $\gamma$ control overfitting: $\lambda$ shrinks leaf weights, $\gamma$ penalizes adding new leaves.

XGBoost also introduces several system-level optimizations:

  • Pre-sorted splits: sort feature values to find optimal thresholds in $O(n \log n)$ per feature
  • Histogram-based splits: bucket continuous features into discrete bins, reducing split finding to $O(n)$
  • Column subsampling: randomly sample features at each tree or each split level (like random forests), reducing overfitting and computation
  • Sparsity-aware splitting: handle missing values natively by learning a default direction at each split

LightGBM: Leaf-Wise Growth

LightGBM differs from XGBoost primarily in its tree-growing strategy:

  • XGBoost (default): level-wise (breadth-first) growth. All leaves at the same depth are split before moving deeper.
  • LightGBM: leaf-wise (best-first) growth. At each step, split the leaf with the largest gain, regardless of depth.

Leaf-wise growth produces asymmetric trees that can achieve lower training loss with fewer leaves. However, it can overfit more easily on small datasets because it creates deep, narrow trees. LightGBM mitigates this with a max_depth parameter.
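In LightGBM's scikit-learn API, the leaf-wise budget is set by num_leaves, with max_depth as the guardrail mentioned above (the values here are illustrative starting points, not tuned recommendations):

```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    num_leaves=31,       # total leaves per tree: the leaf-wise complexity budget
    max_depth=8,         # depth cap to prevent deep, narrow trees on small data
    learning_rate=0.05,
    n_estimators=1000,
)
```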

Definition

Gradient-based One-Side Sampling (GOSS)

GOSS is a LightGBM technique for faster training. It keeps all instances with large gradients (they contribute most to the information gain) and randomly samples a fraction of instances with small gradients. The sampled small-gradient instances are upweighted to maintain unbiased gradient estimates. This reduces training time while preserving accuracy.
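A numpy sketch of the selection step, following the description in the LightGBM paper (top-rate a and small-gradient sampling rate b as in the paper; this illustrates the idea rather than reproducing LightGBM's internal code):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Keep top a-fraction by |gradient|; sample b-fraction of the rest, upweighted."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))  # indices by |gradient|, descending
    top_k = int(a * n)
    large = order[:top_k]                   # keep all large-gradient instances
    small = rng.choice(order[top_k:], size=int(b * n), replace=False)
    idx = np.concatenate([large, small])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b           # upweight sampled small-gradient instances
    return idx, weights                     # apply weights when computing split gains
```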

Why Boosting Reduces Bias

The fundamental asymmetry between bagging and boosting:

  • Bagging (random forests) averages independent copies of a high-variance base learner. The bias of the average equals the bias of a single learner (averaging does not reduce bias). Variance decreases as $O(1/B)$.

  • Boosting starts with a high-bias base learner (shallow tree) and sequentially fits corrections. Each round reduces the residual error -- the remaining bias -- by fitting a new learner to whatever structure the current ensemble misses. The bias decreases monotonically with the number of rounds.

This is why random forests use deep trees (low bias, high variance -- let averaging fix the variance) while gradient boosting uses shallow trees (high bias, low variance -- let sequential correction fix the bias).

Canonical Examples

Example

Gradient boosting update for squared loss

With squared loss $L(y, F) = \frac{1}{2}(y - F)^2$:

  • Pseudo-residuals: $r_i = -(-(y_i - F_{m-1}(x_i))) = y_i - F_{m-1}(x_i)$
  • These are just the ordinary residuals
  • Fitting tree $h_m$ to residuals, then updating $F_m = F_{m-1} + \eta h_m$

After $M$ rounds with learning rate $\eta = 0.1$, each tree corrects roughly 10% of the remaining residual. With 100 trees, the ensemble has made 100 small gradient steps toward the minimum of the squared loss in function space.

Example

XGBoost split scoring

Consider a leaf with 4 samples. Gradients: $g = (-0.5, 0.3, -0.8, 0.6)$. Hessians: $H = (0.25, 0.21, 0.16, 0.24)$. Set $\lambda = 1$, $\gamma = 0$.

Unsplit leaf weight: $w^* = -\sum g_i / (\sum H_i + \lambda) = -(-0.4)/(0.86 + 1) = 0.215$.

Consider splitting into left (samples 1,3) and right (samples 2,4):

  • Left: $G_L = -1.3$, $H_L = 0.41$, score $= 1.3^2/(0.41 + 1) = 1.199$
  • Right: $G_R = 0.9$, $H_R = 0.45$, score $= 0.9^2/(0.45 + 1) = 0.559$
  • Before: score $= 0.4^2/(0.86 + 1) = 0.086$

Gain $= 0.5(1.199 + 0.559 - 0.086) = 0.836 > 0$, so the split is beneficial.
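A few lines of Python reproduce these numbers from the gain formula above (a direct transcription of the formula, not XGBoost's implementation):

```python
import numpy as np

def score(g, h, lam=1.0):
    """One term of the XGBoost gain: (sum g)^2 / (sum H + lambda)."""
    return g.sum() ** 2 / (h.sum() + lam)

g = np.array([-0.5, 0.3, -0.8, 0.6])
h = np.array([0.25, 0.21, 0.16, 0.24])
left, right = [0, 2], [1, 3]   # samples 1,3 and samples 2,4 (0-indexed)

gain = 0.5 * (score(g[left], h[left]) + score(g[right], h[right]) - score(g, h))
print(round(gain, 3))          # 0.836 (gamma = 0 here, so nothing is subtracted)
```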

Common Confusions

Watch Out

Boosting does not always overfit with more rounds

A common belief is that more boosting rounds always lead to overfitting. With proper shrinkage (small $\eta$), boosting can run for thousands of rounds without significant overfitting. Early stopping on a validation set is the standard practice, but the point at which test error starts increasing depends heavily on the learning rate, tree depth, and regularization. AdaBoost with exponential loss is more prone to overfitting than gradient boosting with log-loss and regularization.

Watch Out

XGBoost is not a structurally different algorithm

XGBoost is gradient boosting with a second-order Taylor approximation, a regularized objective, and efficient systems engineering. The conceptual framework is the same: fit trees to (weighted) pseudo-residuals sequentially. The second-order information (Hessians) gives better split quality and leaf weights, analogous to how Newton's method improves over gradient descent by using curvature information.

Summary

  • Gradient boosting is functional gradient descent: each tree fits the negative gradient of the loss
  • For squared loss, pseudo-residuals are ordinary residuals $y_i - F_{m-1}(x_i)$
  • AdaBoost is gradient boosting with exponential loss
  • Shrinkage ($\eta \ll 1$) regularizes; use many rounds with early stopping
  • XGBoost uses second-order Taylor expansion and regularized split scoring
  • LightGBM uses leaf-wise growth and GOSS for efficiency
  • Boosting reduces bias (unlike bagging which reduces variance)

Exercises

Exercise (Core)

Problem

Derive the gradient boosting update for squared loss. Starting from $L(y, F) = \frac{1}{2}(y - F)^2$, compute the pseudo-residual $r_i = -\partial L / \partial F$ and write the update $F_m = F_{m-1} + \eta h_m$. What does $h_m$ fit to?

Exercise (Advanced)

Problem

Compute the pseudo-residuals for logistic loss $L(y, F) = \log(1 + e^{-yF})$ where $y \in \{-1, +1\}$. Compare with the AdaBoost reweighting scheme using exponential loss.

References

Canonical:

  • Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001) -- the foundational paper
  • Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning" (1997) -- AdaBoost
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 10

Current:

  • Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (2016)
  • Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (2017)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14
