
Gradient Boosting

Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth.


Why This Matters

[Figure: gradient boosting as sequential residual fitting. Starting from $F_0(x) = \text{mean}(y) = 5.0$ for a target $y = 8.0$, tree $h_1$ fits the residual $r_1 = y - F_0$ (+2.5), giving $F_1$ with prediction 7.5; tree $h_2$ fits $r_2 = y - F_1$ (+0.4), giving $F_2$ with prediction 7.9. The residual error shrinks with each boosting round: 3.0, 0.5, 0.1.]

Gradient-boosted decision trees are the dominant method for tabular data. XGBoost and LightGBM win most Kaggle competitions involving structured data, and they power production systems at scale. Understanding gradient boosting as functional gradient descent -- fitting each tree to the negative gradient of the loss -- gives you the theoretical foundation to understand shrinkage, regularization, and the differences between implementations.

More broadly, boosting is the complement of bagging. Bagging (random forests) reduces variance by averaging independent high-variance learners. Boosting reduces bias by sequentially correcting the errors of weak learners. Understanding this bias-variance tradeoff is essential for choosing and tuning ensemble methods.

Mental Model

Imagine you have a bad predictor $F_0(x)$. Its errors (residuals) form a pattern. You fit a small tree to those residuals. Adding that tree to $F_0$ partially corrects the errors. The remaining residuals form a new pattern. You fit another tree. Repeat. Each round reduces the bias of the ensemble by targeting whatever structure the current model is missing.

The key insight: this process is gradient descent in function space. The residuals are the negative gradient of the loss with respect to the current prediction. Each tree is one gradient step.

Formal Setup

Let $\{(x_i, y_i)\}_{i=1}^n$ be training data. We build an additive model:

$$F_M(x) = F_0(x) + \sum_{m=1}^{M} \eta \, h_m(x)$$

where $F_0$ is an initial prediction (often the mean of $y$), each $h_m$ is a decision tree, and $\eta \in (0, 1]$ is the learning rate (shrinkage).

Definition

Pseudo-Residuals

At round $m$, the pseudo-residual for sample $i$ is the negative gradient of the loss with respect to the current prediction:

$$r_i^{(m)} = -\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}$$

The new tree $h_m$ is fit to these pseudo-residuals. For squared loss $L(y, F) = \frac{1}{2}(y - F)^2$, the pseudo-residual is simply $r_i = y_i - F_{m-1}(x_i)$, the ordinary residual.
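To make this concrete, here is a minimal from-scratch sketch of the squared-loss case, using scikit-learn's DecisionTreeRegressor as the weak learner (the function names and hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, eta=0.1, max_depth=3):
    """Gradient boosting for squared loss: each tree fits the current residuals."""
    f0 = y.mean()                          # F_0: constant initial prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # pseudo-residuals = negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # project the gradient onto the tree class
        pred += eta * tree.predict(X)      # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, eta=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += eta * tree.predict(X)
    return pred
```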

Main Theorems

Theorem

Gradient Boosting as Functional Gradient Descent

Statement

The gradient boosting update

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$

where $h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n (r_i^{(m)} - h(x_i))^2$, is a steepest descent step in the function space $L^2$, where the descent direction is projected onto the hypothesis class $\mathcal{H}$.

Intuition

In ordinary gradient descent, you compute the gradient of a loss with respect to parameters and take a step. In gradient boosting, the "parameters" are the function values $F(x_1), \ldots, F(x_n)$. The gradient of the total loss with respect to these function values is the vector of pseudo-residuals. Fitting a tree to the pseudo-residuals is projecting this gradient onto the space of trees -- the closest tree to the gradient direction. Adding it with step size $\eta$ is a gradient step in function space.

Proof Sketch

The empirical loss is $\mathcal{L}(F) = \sum_{i=1}^n L(y_i, F(x_i))$. The functional gradient is $\nabla_F \mathcal{L} = (\partial L / \partial F(x_i))_{i=1}^n$. The negative gradient is the vector of pseudo-residuals $(r_1, \ldots, r_n)$. We cannot step in an arbitrary direction in function space (infinite-dimensional), so we project onto $\mathcal{H}$ by finding $h_m = \arg\min_h \sum_i (r_i - h(x_i))^2$. Then $F_m = F_{m-1} + \eta h_m$ is a projected gradient descent step.

Why It Matters

This viewpoint unifies all boosting variants. Change the loss function and you get different pseudo-residuals: squared loss gives ordinary residuals, logistic loss gives probability-weighted residuals, quantile loss gives sign-based residuals. The framework extends to any differentiable loss.
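As a sketch of that plug-and-play structure, the pseudo-residual functions for these three losses look like this (with $y \in \{-1, +1\}$ for the logistic case; each return line is just the negative derivative of the stated loss):

```python
import numpy as np

def residual_squared(y, F):
    # L = 1/2 (y - F)^2  =>  -dL/dF = y - F (the ordinary residual)
    return y - F

def residual_logistic(y, F):
    # L = log(1 + exp(-y F)), y in {-1, +1}  =>  -dL/dF = y / (1 + exp(y F))
    return y / (1.0 + np.exp(y * F))

def residual_quantile(y, F, tau=0.5):
    # Pinball loss at quantile tau  =>  -dL/dF = tau - 1{y < F}, a sign-based residual
    return tau - (y < F).astype(float)
```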

Failure Mode

The projection onto $\mathcal{H}$ is only approximate. If the tree class cannot represent the true gradient direction, each step is biased. Also, functional gradient descent has no convergence rate guarantees comparable to those for parametric gradient descent on convex objectives, because the loss surface in function space is typically non-convex.

AdaBoost as a Special Case

Proposition

AdaBoost Minimizes Exponential Loss

Statement

The AdaBoost algorithm -- which reweights training samples, fits a weak classifier to the reweighted data, and combines classifiers with data-dependent weights -- is equivalent to gradient boosting with the exponential loss $L(y, F) = e^{-yF}$.

Intuition

The exponential loss penalizes misclassifications exponentially. Its negative gradient with respect to $F$ is $y e^{-yF}$, which upweights samples that the current ensemble classifies incorrectly (large $e^{-yF}$). This is exactly the reweighting scheme of AdaBoost: misclassified points get higher weight in the next round.

Proof Sketch

The AdaBoost sample weight at round $m$ is $w_i^{(m)} \propto e^{-y_i F_{m-1}(x_i)}$. The pseudo-residual for exponential loss is $r_i = y_i e^{-y_i F_{m-1}(x_i)}$. Fitting a classifier to minimize weighted classification error with weights $w_i$ is equivalent to fitting to the pseudo-residuals. The optimal step size for exponential loss gives the AdaBoost classifier weight $\alpha_m = \frac{1}{2}\log\frac{1 - \epsilon_m}{\epsilon_m}$ where $\epsilon_m$ is the weighted error rate.
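A minimal sketch of the classical loop makes the correspondence visible: the weight update below maintains $w_i \propto e^{-y_i F_m(x_i)}$, and $\alpha_m$ is the optimal step size (decision stumps via scikit-learn; labels assumed in $\{-1, +1\}$; the clipping of $\epsilon_m$ is a numerical safeguard added here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Classical AdaBoost for labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # w_i proportional to exp(-y_i F(x_i))
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)  # classifier weight alpha_m
        w *= np.exp(-alpha * y * pred)         # upweight misclassified points
        w /= w.sum()                           # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas
```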

Why It Matters

This connection demystifies AdaBoost. It is not a separate algorithm with its own theory -- it is gradient boosting with a particular loss function. This also explains why AdaBoost is sensitive to outliers: the exponential loss grows without bound for misclassified points, giving outliers enormous influence.

Failure Mode

Exponential loss is not robust to label noise. A single mislabeled point with large margin generates a huge loss that dominates the gradient. In practice, logistic loss (log-loss) is preferred because it grows linearly (not exponentially) for large negative margins.

Shrinkage and the Learning Rate

The learning rate $\eta$ (also called shrinkage) controls how much each tree contributes. With $\eta = 1$, each tree fully corrects in its direction. With $\eta \ll 1$, each tree makes a small correction, and many more trees are needed.

Why does small $\eta$ help? It acts as a form of regularization:

  1. Smoother approximation: small steps explore more of function space, avoiding committing too strongly to early trees
  2. Better generalization: empirically, $\eta \in [0.01, 0.1]$ with many rounds consistently outperforms $\eta = 1$ with few rounds
  3. Analogy to gradient descent: in parametric optimization, small learning rates with more iterations often find better solutions

The standard practice is to set $\eta$ small (0.01 to 0.1) and use early stopping on a validation set to choose the number of rounds $M$.
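With scikit-learn's GradientBoostingRegressor, for example, that recipe looks roughly like the following (the dataset and all hyperparameter values are illustrative, not tuned recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    learning_rate=0.05,       # small eta (shrinkage)
    n_estimators=2000,        # generous upper bound on rounds M
    max_depth=3,              # shallow trees
    validation_fraction=0.1,  # held-out split used for early stopping
    n_iter_no_change=50,      # stop after 50 rounds without improvement
)
model.fit(X, y)
print(model.n_estimators_)    # rounds actually used before stopping
```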

XGBoost: Second-Order Boosting

XGBoost improves on standard gradient boosting by using a second-order Taylor expansion of the loss:

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^n \left[ g_i h_m(x_i) + \frac{1}{2} H_i h_m(x_i)^2 \right] + \Omega(h_m)$$

where $g_i = \partial L / \partial F_{m-1}(x_i)$ is the gradient, $H_i = \partial^2 L / \partial F_{m-1}(x_i)^2$ is the Hessian (second derivative), and $\Omega(h_m) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$ is a regularization term ($T$ = number of leaves, $w_j$ = leaf weight).
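Minimizing this quadratic objective leaf by leaf explains where the split gain below comes from: for a fixed tree structure, setting the derivative with respect to each leaf weight to zero gives

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} H_i + \lambda}, \qquad \mathcal{L}^{(m)}(w^*) = -\frac{1}{2}\sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} H_i + \lambda} + \gamma T$$

where $I_j$ is the set of samples in leaf $j$. Each term of this optimal objective is the per-leaf "score" that appears in the gain formula below.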

Definition

XGBoost Split Gain

For a candidate split dividing leaf node samples into left ($I_L$) and right ($I_R$) subsets, the split gain is:

$$\text{Gain} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} H_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} H_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} H_i + \lambda}\right] - \gamma$$

A split is made only if $\text{Gain} > 0$. The regularization parameters $\lambda$ and $\gamma$ control overfitting: $\lambda$ shrinks leaf weights, $\gamma$ penalizes adding new leaves.

XGBoost also introduces several system-level optimizations:

  • Pre-sorted splits: sort feature values to find optimal thresholds in $O(n \log n)$ per feature
  • Histogram-based splits: bucket continuous features into discrete bins, reducing split finding to $O(n)$
  • Column subsampling: randomly sample features at each tree or each split level (like random forests), reducing overfitting and computation
  • Sparsity-aware splitting: handle missing values natively by learning a default direction at each split

LightGBM: Leaf-Wise Growth

LightGBM differs from XGBoost primarily in its tree-growing strategy:

  • XGBoost (default): level-wise (breadth-first) growth. All leaves at the same depth are split before moving deeper.
  • LightGBM: leaf-wise (best-first) growth. At each step, split the leaf with the largest gain, regardless of depth.

Leaf-wise growth produces asymmetric trees that can achieve lower training loss with fewer leaves. However, it can overfit more easily on small datasets because it creates deep, narrow trees. LightGBM mitigates this with a max_depth parameter.
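In LightGBM's scikit-learn API, the leaf-wise budget is set by num_leaves, with max_depth as the guardrail mentioned above (the values here are illustrative starting points, not tuned recommendations):

```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    num_leaves=31,       # total leaves per tree: the leaf-wise complexity budget
    max_depth=8,         # depth cap to prevent deep, narrow trees on small data
    learning_rate=0.05,
    n_estimators=1000,
)
```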

Definition

Gradient-based One-Side Sampling (GOSS)

GOSS is a LightGBM technique for faster training. It keeps all instances with large gradients (they contribute most to the information gain) and randomly samples a fraction of instances with small gradients. The sampled small-gradient instances are upweighted to maintain unbiased gradient estimates. This reduces training time while preserving accuracy.
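A numpy sketch of the selection step, following the description in the LightGBM paper (top-rate a and small-gradient sampling rate b as in the paper; this illustrates the idea rather than reproducing LightGBM's internal code):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Keep top a-fraction by |gradient|; sample b-fraction of the rest, upweighted."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))  # indices by |gradient|, descending
    top_k = int(a * n)
    large = order[:top_k]                   # keep all large-gradient instances
    small = rng.choice(order[top_k:], size=int(b * n), replace=False)
    idx = np.concatenate([large, small])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b           # upweight sampled small-gradient instances
    return idx, weights                     # apply weights when computing split gains
```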

Why Boosting Reduces Bias

The fundamental asymmetry between bagging and boosting:

  • Bagging (random forests) averages independent copies of a high-variance base learner. The bias of the average equals the bias of a single learner (averaging does not reduce bias). Variance decreases as $O(1/B)$.

  • Boosting starts with a high-bias base learner (shallow tree) and sequentially fits corrections. Each round reduces the residual error -- the remaining bias -- by fitting a new learner to whatever structure the current ensemble misses. The bias decreases monotonically with the number of rounds.

This is why random forests use deep trees (low bias, high variance -- let averaging fix the variance) while gradient boosting uses shallow trees (high bias, low variance -- let sequential correction fix the bias).

Canonical Examples

Example

Gradient boosting update for squared loss

With squared loss $L(y, F) = \frac{1}{2}(y - F)^2$:

  • Pseudo-residuals: $r_i = -(-(y_i - F_{m-1}(x_i))) = y_i - F_{m-1}(x_i)$
  • These are just the ordinary residuals
  • Fitting tree $h_m$ to residuals, then updating $F_m = F_{m-1} + \eta h_m$

After $M$ rounds with learning rate $\eta = 0.1$, each tree corrects roughly 10% of the remaining residual. With 100 trees, the ensemble has made 100 small gradient steps toward the minimum of the squared loss in function space.

Example

XGBoost split scoring

Consider a leaf with 4 samples. Gradients: $g = (-0.5, 0.3, -0.8, 0.6)$. Hessians: $H = (0.25, 0.21, 0.16, 0.24)$. Set $\lambda = 1$, $\gamma = 0$.

Unsplit leaf weight: $w^* = -\sum g_i / (\sum H_i + \lambda) = -(-0.4)/(0.86 + 1) = 0.215$.

Consider splitting into left (samples 1,3) and right (samples 2,4):

  • Left: $G_L = -1.3$, $H_L = 0.41$, score $= 1.3^2/(0.41 + 1) = 1.199$
  • Right: $G_R = 0.9$, $H_R = 0.45$, score $= 0.9^2/(0.45 + 1) = 0.559$
  • Before: score $= 0.4^2/(0.86 + 1) = 0.086$

Gain $= 0.5(1.199 + 0.559 - 0.086) = 0.836 > 0$, so the split is beneficial.
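A few lines of Python reproduce these numbers from the gain formula above (a direct transcription of the formula, not XGBoost's implementation):

```python
import numpy as np

def score(g, h, lam=1.0):
    """One term of the XGBoost gain: (sum g)^2 / (sum H + lambda)."""
    return g.sum() ** 2 / (h.sum() + lam)

g = np.array([-0.5, 0.3, -0.8, 0.6])
h = np.array([0.25, 0.21, 0.16, 0.24])
left, right = [0, 2], [1, 3]   # samples 1,3 and samples 2,4 (0-indexed)

gain = 0.5 * (score(g[left], h[left]) + score(g[right], h[right]) - score(g, h))
print(round(gain, 3))          # 0.836 (gamma = 0 here, so nothing is subtracted)
```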

Common Confusions

Watch Out

Boosting does not always overfit with more rounds

A common belief is that more boosting rounds always lead to overfitting. With proper shrinkage (small $\eta$), boosting can run for thousands of rounds without significant overfitting. Early stopping on a validation set is the standard practice, but the point at which test error starts increasing depends heavily on the learning rate, tree depth, and regularization. AdaBoost with exponential loss is more prone to overfitting than gradient boosting with log-loss and regularization.

Watch Out

XGBoost is not a structurally different algorithm

XGBoost is gradient boosting with a second-order Taylor approximation, a regularized objective, and efficient systems engineering. The conceptual framework is the same: fit trees to (weighted) pseudo-residuals sequentially. The second-order information (Hessians) gives better split quality and leaf weights, analogous to how Newton's method improves over gradient descent by using curvature information.

Summary

  • Gradient boosting is functional gradient descent: each tree fits the negative gradient of the loss
  • For squared loss, pseudo-residuals are ordinary residuals $y_i - F_{m-1}(x_i)$
  • AdaBoost is gradient boosting with exponential loss
  • Shrinkage ($\eta \ll 1$) regularizes; use many rounds with early stopping
  • XGBoost uses second-order Taylor expansion and regularized split scoring
  • LightGBM uses leaf-wise growth and GOSS for efficiency
  • Boosting reduces bias (unlike bagging which reduces variance)

Exercises

Exercise (Core)

Problem

Derive the gradient boosting update for squared loss. Starting from $L(y, F) = \frac{1}{2}(y - F)^2$, compute the pseudo-residual $r_i = -\partial L / \partial F$ and write the update $F_m = F_{m-1} + \eta h_m$. What does $h_m$ fit to?

Exercise (Advanced)

Problem

Compute the pseudo-residuals for logistic loss $L(y, F) = \log(1 + e^{-yF})$ where $y \in \{-1, +1\}$. Compare with the AdaBoost reweighting scheme using exponential loss.

References

Canonical:

  • Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001) -- the foundational paper
  • Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning" (1997) -- AdaBoost
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 10

Current:

  • Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (2016)
  • Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (2017)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14
