ML Methods
Gradient Boosting
Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth.
Why This Matters
Gradient-boosted decision trees are the dominant method for tabular data. XGBoost and LightGBM win most Kaggle competitions involving structured data, and they power production systems at scale. Understanding gradient boosting as functional gradient descent -- fitting each tree to the negative gradient of the loss -- gives you the theoretical foundation to understand shrinkage, regularization, and the differences between implementations.
More broadly, boosting is the complement of bagging. Bagging (random forests) reduces variance by averaging independent high-variance learners. Boosting reduces bias by sequentially correcting the errors of weak learners. Understanding this bias-variance tradeoff is essential for choosing and tuning ensemble methods.
Mental Model
Imagine you have a crude initial predictor $F_0$. Its errors (residuals) form a pattern. You fit a small tree to those residuals. Adding that tree to $F_0$ partially corrects the errors. The remaining residuals form a new pattern. You fit another tree. Repeat. Each round reduces the bias of the ensemble by targeting whatever structure the current model is missing.
The key insight: this process is gradient descent in function space. The residuals are the negative gradient of the loss with respect to the current prediction. Each tree is one gradient step.
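The residual-correction loop can be sketched end to end in a few lines of NumPy. This is a minimal illustration for squared loss with a hand-rolled decision stump as the weak learner; all names, data, and hyperparameters here are illustrative, not from any library:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to the residuals r by scanning
    every threshold between consecutive sorted feature values."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        t = 0.5 * (xs[i] + xs[i - 1])
        left, right = rs[:i].mean(), rs[i:].mean()
        sse = ((rs[:i] - left) ** 2).sum() + ((rs[i:] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    _, t, left, right = best
    return lambda q, t=t, lo=left, hi=right: np.where(q < t, lo, hi)

def boost(x, y, n_rounds=50, nu=0.1):
    """Gradient boosting for squared loss: each stump is fit to the current
    residuals (the negative gradient), then added with step size nu."""
    F = np.full_like(y, y.mean())   # F_0: constant initial prediction
    for _ in range(n_rounds):
        r = y - F                   # pseudo-residuals = ordinary residuals
        h = fit_stump(x, r)
        F = F + nu * h(x)           # one gradient step in function space
    return F

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)
mse_before = ((y - y.mean()) ** 2).mean()
mse_after = ((y - boost(x, y)) ** 2).mean()
```

Each pass through the loop is exactly the mental model above: compute residuals, fit a small tree to them, add a scaled copy, repeat.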
Formal Setup
Let $\{(x_i, y_i)\}_{i=1}^{n}$ be training data. We build an additive model:
$$F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x)$$
where $F_0$ is an initial prediction (often the mean of $y$), each $h_m$ is a decision tree, and $\nu$ is the learning rate (shrinkage).
Pseudo-Residuals
At round $m$, the pseudo-residual for sample $i$ is the negative gradient of the loss with respect to the current prediction:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$
The new tree $h_m$ is fit to these pseudo-residuals. For squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, the pseudo-residual is simply $y_i - F_{m-1}(x_i)$, the ordinary residual.
Main Theorems
Gradient Boosting as Functional Gradient Descent
Statement
The gradient boosting update
$$F_m = F_{m-1} + \nu\, h_m, \qquad h_m = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} \big(r_{im} - h(x_i)\big)^2,$$
where $r_{im}$ is the pseudo-residual, is a steepest descent step in the function space of pointwise predictions $(F(x_1), \dots, F(x_n))$, where the descent direction is projected onto the hypothesis class $\mathcal{H}$ (e.g., depth-limited trees).
Intuition
In ordinary gradient descent, you compute the gradient of a loss with respect to parameters and take a step. In gradient boosting, the "parameters" are the function values $F(x_1), \dots, F(x_n)$. The gradient of the total loss with respect to these function values is the vector of pseudo-residuals. Fitting a tree to the pseudo-residuals is projecting this gradient onto the space of trees -- the closest tree to the gradient direction. Adding it with step size $\nu$ is a gradient step in function space.
Proof Sketch
The empirical loss is $J(F) = \sum_{i=1}^{n} L(y_i, F(x_i))$. The functional gradient is the vector $\big(\partial L(y_1, F(x_1))/\partial F(x_1), \dots, \partial L(y_n, F(x_n))/\partial F(x_n)\big)$. The negative gradient is the vector of pseudo-residuals $(r_{1m}, \dots, r_{nm})$. We cannot step in an arbitrary direction in function space (infinite-dimensional), so we project onto $\mathcal{H}$ by finding $h_m = \arg\min_{h \in \mathcal{H}} \sum_i (r_{im} - h(x_i))^2$. Then $F_m = F_{m-1} + \nu h_m$ is a projected gradient descent step.
Why It Matters
This viewpoint unifies all boosting variants. Change the loss function and you get different pseudo-residuals: squared loss gives ordinary residuals, logistic loss gives probability-weighted residuals, quantile loss gives sign-based residuals. The framework extends to any differentiable loss.
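In code, changing the loss changes nothing but the pseudo-residual formula. A hedged sketch (the function name and conventions below are my own, not from any library; logistic assumes labels $y \in \{-1, +1\}$, quantile uses the median, $\tau = 0.5$):

```python
import numpy as np

def pseudo_residuals(y, F, loss):
    """Negative gradient of the per-sample loss w.r.t. the prediction F."""
    if loss == "squared":       # L = 0.5*(y - F)^2      -> r = y - F
        return y - F
    if loss == "logistic":      # L = log(1 + e^(-y*F))  -> r = y / (1 + e^(y*F))
        return y / (1.0 + np.exp(y * F))
    if loss == "quantile":      # median pinball loss    -> r = +/- 1/2 (sign-based)
        return np.where(y > F, 0.5, -0.5)
    raise ValueError(f"unknown loss: {loss}")

y = np.array([1.0, -1.0])
F = np.array([2.0, 0.5])
r_squared = pseudo_residuals(y, F, "squared")    # ordinary residuals: [-1.0, -1.5]
r_logistic = pseudo_residuals(y, F, "logistic")  # probability-weighted residuals
```

The boosting loop itself is unchanged: fit the next tree to whatever `pseudo_residuals` returns.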
Failure Mode
The projection onto is only approximate. If the tree class cannot represent the true gradient direction, each step is biased. Also, functional gradient descent has no convergence rate guarantees comparable to those for parametric gradient descent on convex objectives, because the loss surface in function space is typically non-convex.
AdaBoost as a Special Case
AdaBoost Minimizes Exponential Loss
Statement
The AdaBoost algorithm -- which reweights training samples, fits a weak classifier to the reweighted data, and combines classifiers with data-dependent weights -- is equivalent to gradient boosting with the exponential loss $L(y, F) = e^{-yF}$ for labels $y \in \{-1, +1\}$.
Intuition
The exponential loss penalizes misclassifications exponentially. Its negative gradient with respect to $F(x_i)$ is $y_i e^{-y_i F(x_i)}$, which upweights samples that the current ensemble classifies incorrectly (large $e^{-y_i F(x_i)}$). This is exactly the reweighting scheme of AdaBoost: misclassified points get higher weight in the next round.
Proof Sketch
The AdaBoost sample weight at round $m$ is $w_i^{(m)} \propto e^{-y_i F_{m-1}(x_i)}$. The pseudo-residual for exponential loss is $r_{im} = y_i e^{-y_i F_{m-1}(x_i)} \propto y_i w_i^{(m)}$. Fitting a classifier to minimize weighted classification error with weights $w_i^{(m)}$ is equivalent to fitting to the pseudo-residuals. The optimal step size for exponential loss gives the AdaBoost classifier weight $\alpha_m = \tfrac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}$, where $\epsilon_m$ is the weighted error rate.
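The equivalence can be checked numerically. In the sketch below (a from-scratch AdaBoost with decision stumps on a 1-D toy problem; all details are assumed, not taken from a reference implementation), the sample weights are literally recomputed as normalized $e^{-y_i F(x_i)}$, and the classifier weights use $\alpha_m = \tfrac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}$:

```python
import numpy as np

def adaboost(X, y, n_rounds=20):
    """AdaBoost with threshold stumps on one feature. The weight update is
    written directly as the exponential-loss gradient magnitude exp(-y*F)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    F = np.zeros(n)
    for _ in range(n_rounds):
        # best weighted stump: threshold t, polarity s -> predict s*sign(x - t)
        best = None
        for t in np.unique(X):
            for s in (1, -1):
                pred = s * np.sign(X - t)
                pred[pred == 0] = s
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, pred)
        err, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # AdaBoost classifier weight
        F = F + alpha * pred
        w = np.exp(-y * F)                      # reweighting = exp-loss gradients
        w = w / w.sum()
    return np.sign(F)

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, 1, 1])
pred = adaboost(X, y)
```

No single stump classifies this interval-shaped pattern, but a few boosted stumps do; the reweighting line is the only place the exponential loss appears.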
Why It Matters
This connection demystifies AdaBoost. It is not a separate algorithm with its own theory -- it is gradient boosting with a particular loss function. This also explains why AdaBoost is sensitive to outliers: the exponential loss grows without bound for misclassified points, giving outliers enormous influence.
Failure Mode
Exponential loss is not robust to label noise. A single mislabeled point with large margin generates a huge loss that dominates the gradient. In practice, logistic loss (log-loss) is preferred because it grows linearly (not exponentially) for large negative margins.
Shrinkage and the Learning Rate
The learning rate $\nu$ (also called shrinkage) controls how much each tree contributes. With $\nu = 1$, each tree fully corrects in its direction. With $\nu \ll 1$, each tree makes a small correction, and many more trees are needed.
Why does small $\nu$ help? It acts as a form of regularization:
- Smoother approximation: small steps explore more of function space, avoiding committing too strongly to early trees
- Better generalization: empirically, small $\nu$ with many rounds consistently outperforms $\nu = 1$ with few rounds
- Analogy to gradient descent: in parametric optimization, small learning rates with more iterations often find better solutions
The standard practice is to set $\nu$ small (0.01 to 0.1) and use early stopping on a validation set to choose the number of rounds $M$.
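A minimal early-stopping experiment with a hand-rolled stump booster (all names, data, and settings here are illustrative): track validation MSE each round and stop once it has not improved for `patience` rounds.

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree fit to residuals r (illustrative weak learner)."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        t = 0.5 * (xs[i] + xs[i - 1])
        left, right = rs[:i].mean(), rs[i:].mean()
        sse = ((rs[:i] - left) ** 2).sum() + ((rs[i:] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    _, t, left, right = best
    return lambda q, t=t, lo=left, hi=right: np.where(q < t, lo, hi)

def boost_early_stop(x, y, xv, yv, nu, max_rounds=300, patience=20):
    """Boost on (x, y); monitor validation MSE on (xv, yv); stop once it has
    not improved for `patience` rounds. Returns (best round, best val MSE)."""
    F, Fv = np.full_like(y, y.mean()), np.full_like(yv, y.mean())
    best_mse, best_m = ((yv - Fv) ** 2).mean(), 0
    for m in range(1, max_rounds + 1):
        h = fit_stump(x, y - F)
        F, Fv = F + nu * h(x), Fv + nu * h(xv)
        mse = ((yv - Fv) ** 2).mean()
        if mse < best_mse:
            best_mse, best_m = mse, m
        elif m - best_m >= patience:
            break
    return best_m, best_mse

rng = np.random.default_rng(1)
x, xv = rng.uniform(-3, 3, 200), rng.uniform(-3, 3, 150)
y = np.sin(x) + 0.2 * rng.normal(size=200)
yv = np.sin(xv) + 0.2 * rng.normal(size=150)
m_small, mse_small = boost_early_stop(x, y, xv, yv, nu=0.1)
m_big, mse_big = boost_early_stop(x, y, xv, yv, nu=1.0)
```

Typically the $\nu = 0.1$ run stops at a much later round and reaches a validation MSE at least as good as the $\nu = 1$ run, which is the empirical pattern described above.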
XGBoost: Second-Order Boosting
XGBoost improves on standard gradient boosting by using a second-order Taylor expansion of the loss. Writing the new tree as $f_m$ (to avoid clashing with the Hessians $h_i$):
$$\mathcal{L}^{(m)} \approx \sum_{i=1}^{n} \left[ g_i f_m(x_i) + \tfrac{1}{2} h_i f_m(x_i)^2 \right] + \Omega(f_m), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,$$
where $g_i = \partial_F L(y_i, F_{m-1}(x_i))$ is the gradient, $h_i = \partial_F^2 L(y_i, F_{m-1}(x_i))$ is the Hessian (second derivative), and $\Omega$ is a regularization term ($T$ = number of leaves, $w_j$ = leaf weight).
XGBoost Split Gain
For a candidate split dividing a leaf node's samples into left ($I_L$) and right ($I_R$) subsets, with $G_L = \sum_{i \in I_L} g_i$ and $H_L = \sum_{i \in I_L} h_i$ (similarly $G_R$, $H_R$), the split gain is:
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
A split is made only if $\text{Gain} > 0$. The regularization parameters $\lambda$ and $\gamma$ control overfitting: $\lambda$ shrinks leaf weights, $\gamma$ penalizes adding new leaves.
XGBoost also introduces several system-level optimizations:
- Sorted-based splits: pre-sort each feature once so the optimal threshold is found with a single linear scan per feature
- Histogram-based splits: bucket continuous features into discrete bins, reducing split finding to a scan over the bins rather than over all sorted values
- Column subsampling: randomly sample features at each tree or each split level (like random forests), reducing overfitting and computation
- Sparsity-aware splitting: handle missing values natively by learning a default direction at each split
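The histogram idea can be sketched for a single feature. Everything below (function name, quantile binning, $\lambda$) is an assumed illustration of the technique, not XGBoost or LightGBM internals:

```python
import numpy as np

def histogram_best_split(x, g, h, n_bins=16, lam=1.0):
    """Histogram split finding for one feature: bucket values into quantile
    bins, accumulate per-bin gradient/Hessian sums, then scan bin boundaries.
    The scan covers n_bins candidates instead of all n sorted values."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)                      # bin index per sample
    G = np.bincount(bins, weights=g, minlength=n_bins)
    H = np.bincount(bins, weights=h, minlength=n_bins)
    G_tot, H_tot = G.sum(), H.sum()
    best_gain, best_boundary = 0.0, None
    GL = HL = 0.0
    for b in range(n_bins - 1):                       # boundary after bin b
        GL, HL = GL + G[b], HL + H[b]
        GR, HR = G_tot - GL, H_tot - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G_tot**2 / (H_tot + lam))
        if gain > best_gain:
            best_gain, best_boundary = gain, b
    return best_gain, best_boundary

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 500)
g = np.where(x < 0.5, -1.0, 1.0) + 0.1 * rng.normal(size=500)  # sign flips at 0.5
h = np.ones(500)
gain, boundary = histogram_best_split(x, g, h)  # boundary lands near x = 0.5
```

Because the gradient sign flips at $x = 0.5$, the best boundary sits near the middle bin, and the gain is large.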
LightGBM: Leaf-Wise Growth
LightGBM differs from XGBoost primarily in its tree-growing strategy:
- XGBoost (default): level-wise (breadth-first) growth. All leaves at the same depth are split before moving deeper.
- LightGBM: leaf-wise (best-first) growth. At each step, split the leaf with the largest gain, regardless of depth.
Leaf-wise growth produces asymmetric trees that can achieve lower training loss with fewer leaves. However, it can overfit more easily on small datasets because it creates deep, narrow trees. LightGBM mitigates this with a max_depth parameter.
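Best-first growth is naturally expressed with a priority queue over candidate splits. The following is an illustrative sketch (exact single-feature split search with the second-order gain formula, assumed hyperparameters), not LightGBM's actual implementation:

```python
import heapq
import numpy as np

def grow_leafwise(x, g, h, max_leaves=4, lam=1.0):
    """Leaf-wise (best-first) growth on one feature: keep a max-heap of
    candidate splits and always split the leaf with the largest gain next,
    regardless of its depth."""

    def best_split(idx):
        # exact scan over thresholds within this leaf; returns (gain, threshold)
        best = (0.0, None)
        xs = np.sort(np.unique(x[idx]))
        G, H = g[idx].sum(), h[idx].sum()
        for t in xs[:-1]:
            mask = x[idx] <= t
            GL, HL = g[idx][mask].sum(), h[idx][mask].sum()
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                          - G**2 / (H + lam))
            if gain > best[0]:
                best = (gain, t)
        return best

    leaves = [np.arange(len(x))]
    heap = []
    gain, t = best_split(leaves[0])
    if t is not None:
        heapq.heappush(heap, (-gain, 0, t))        # max-heap via negated gain
    while heap and len(leaves) < max_leaves:
        _, i, t = heapq.heappop(heap)
        idx = leaves[i]
        leaves[i] = idx[x[idx] <= t]               # leaf i becomes its left child
        leaves.append(idx[x[idx] > t])             # right child gets a new slot
        for j in (i, len(leaves) - 1):
            gain, tj = best_split(leaves[j])
            if tj is not None:
                heapq.heappush(heap, (-gain, j, tj))
    return leaves

x = np.linspace(0.0, 1.0, 32)
g = np.where(x < 0.25, -2.0, np.where(x < 0.5, 1.5, np.where(x < 0.75, -1.0, 2.5)))
h = np.ones(32)
leaves = grow_leafwise(x, g, h, max_leaves=4)  # recovers the four constant segments
```

A level-wise grower would split every leaf at each depth; the heap version instead spends its leaf budget wherever the gain is currently largest.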
Gradient-based One-Side Sampling (GOSS)
GOSS is a LightGBM technique for faster training. It keeps all instances with large gradients (they contribute most to the information gain) and randomly samples a fraction of instances with small gradients. The sampled small-gradient instances are upweighted to maintain unbiased gradient estimates. This reduces training time while preserving accuracy.
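The sampling step can be sketched directly; the function name, rates, and data below are illustrative assumptions, not LightGBM's API:

```python
import numpy as np

def goss_sample(g, top_rate=0.2, other_rate=0.1, rng=None):
    """GOSS: keep the top_rate fraction of samples by |gradient|, uniformly
    sample other_rate of the remainder, and upweight the sampled remainder by
    (1 - top_rate) / other_rate to keep gradient sums approximately unbiased."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(g)
    n_top, n_other = int(top_rate * n), int(other_rate * n)
    order = np.argsort(-np.abs(g))               # descending by |gradient|
    top, rest = order[:n_top], order[n_top:]
    sampled = rng.choice(rest, size=n_other, replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - top_rate) / other_rate   # compensate subsampling
    return idx, weights

rng = np.random.default_rng(3)
g = rng.normal(size=1000)
idx, w = goss_sample(g, rng=rng)
estimate = (w * g[idx]).sum()   # approximates g.sum() in expectation
```

Split gains are then computed on the reduced set using the weighted gradient sums, so large-gradient (poorly fit) samples are never dropped.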
Why Boosting Reduces Bias
The fundamental asymmetry between bagging and boosting:
- Bagging (random forests) averages independent copies of a high-variance base learner. The bias of the average equals the bias of a single learner (averaging does not reduce bias). Variance decreases as $1/B$ in the number of learners $B$ (for independent learners).
- Boosting starts with a high-bias base learner (shallow tree) and sequentially fits corrections. Each round reduces the residual error -- the remaining bias -- by fitting a new learner to whatever structure the current ensemble misses. The bias decreases monotonically with the number of rounds.
This is why random forests use deep trees (low bias, high variance -- let averaging fix the variance) while gradient boosting uses shallow trees (high bias, low variance -- let sequential correction fix the bias).
Canonical Examples
Gradient boosting update for squared loss
With squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$:
- Pseudo-residuals: $r_{im} = y_i - F_{m-1}(x_i)$
- These are just the ordinary residuals
- Fit tree $h_m$ to the residuals, then update $F_m = F_{m-1} + \nu h_m$
After $M = 100$ rounds with learning rate $\nu = 0.1$, each tree corrects 10% of the remaining residual in its fitted direction. With 100 trees, the ensemble has made 100 small gradient steps toward the minimum of the squared loss in function space.
XGBoost split scoring
Consider a leaf with 4 samples and (illustrative values) gradients $g = (-1.0, 0.5, -0.8, 1.2)$, Hessians $h = (1, 1, 1, 1)$, with $\lambda = 1$, $\gamma = 0$.
Unsplit leaf: $G = -0.1$, $H = 4$, weight $w^* = -G/(H + \lambda) = 0.02$.
Consider splitting into left (samples 1, 3) and right (samples 2, 4):
- Left: $G_L = -1.8$, $H_L = 2$, score $G_L^2/(H_L + \lambda) = 1.08$
- Right: $G_R = 1.7$, $H_R = 2$, score $G_R^2/(H_R + \lambda) \approx 0.963$
- Before: score $G^2/(H + \lambda) = 0.002$
Gain $= \tfrac{1}{2}(1.08 + 0.963 - 0.002) - \gamma \approx 1.02 > 0$, so the split is beneficial.
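The split-gain arithmetic can be checked with a few lines. The helper names `leaf_weight` and `split_gain` are my own, implementing the formulas from this section, and the numeric values are illustrative:

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal leaf weight from the second-order objective: w* = -G / (H + lam)."""
    return -G / (H + lam)

def split_gain(gl, hl, gr, hr, lam=1.0, gamma=0.0):
    """Improvement in the regularized second-order objective from splitting."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Illustrative leaf: 4 samples, unit Hessians, lambda = 1, gamma = 0
g = np.array([-1.0, 0.5, -0.8, 1.2])
h = np.ones(4)
GL, HL = g[[0, 2]].sum(), h[[0, 2]].sum()   # left = samples 1, 3
GR, HR = g[[1, 3]].sum(), h[[1, 3]].sum()   # right = samples 2, 4
gain = split_gain(GL, HL, GR, HR)           # ~1.02 > 0: split is beneficial
```

The split separates negative-gradient samples from positive-gradient ones, which is exactly when the gain formula rewards splitting.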
Common Confusions
Boosting does not always overfit with more rounds
A common belief is that more boosting rounds always lead to overfitting. With proper shrinkage (small $\nu$), boosting can run for thousands of rounds without significant overfitting. Early stopping on a validation set is the standard practice, but the point at which test error starts increasing depends heavily on the learning rate, tree depth, and regularization. AdaBoost with exponential loss is more prone to overfitting than gradient boosting with log-loss and regularization.
XGBoost is not a structurally different algorithm
XGBoost is gradient boosting with a second-order Taylor approximation, a regularized objective, and efficient systems engineering. The conceptual framework is the same: fit trees to (weighted) pseudo-residuals sequentially. The second-order information (Hessians) gives better split quality and leaf weights, analogous to how Newton's method improves over gradient descent by using curvature information.
Summary
- Gradient boosting is functional gradient descent: each tree fits the negative gradient of the loss
- For squared loss, pseudo-residuals are ordinary residuals
- AdaBoost is gradient boosting with exponential loss
- Shrinkage ($\nu < 1$) regularizes; use many rounds with early stopping
- XGBoost uses second-order Taylor expansion and regularized split scoring
- LightGBM uses leaf-wise growth and GOSS for efficiency
- Boosting reduces bias (unlike bagging which reduces variance)
Exercises
Problem
Derive the gradient boosting update for squared loss. Starting from $L(y, F) = \tfrac{1}{2}(y - F)^2$, compute the pseudo-residual $r_{im}$ and write the update $F_m = F_{m-1} + \nu h_m$. What does $h_m$ fit to?
Problem
Compute the pseudo-residuals for logistic loss $L(y, F) = \log(1 + e^{-yF})$, where $y \in \{-1, +1\}$. Compare with the AdaBoost reweighting scheme using exponential loss $e^{-yF}$.
References
Canonical:
- Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001) -- the foundational paper
- Freund & Schapire, "A Decision-Theoretic Generalization of On-Line Learning" (1997) -- AdaBoost
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapters 10-12
Current:
- Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (2016)
- Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (2017)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14
Next Topics
Building on gradient boosting:
- Regularization theory: the general framework for controlling overfitting, of which shrinkage is one instance
- Feedforward networks and backpropagation: a different approach to sequential function composition
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Ensemble Methods Theory (Layer 2)
- Feature Importance and Interpretability (Layer 2)
- XGBoost (Layer 2)