
ML Methods

XGBoost

XGBoost as second-order gradient boosting: Taylor expansion of the loss, regularized objective, optimal leaf weights, split gain formula, and the system optimizations that made it dominant on tabular data.



Why This Matters

XGBoost is the most successful algorithm for structured (tabular) data. It dominated Kaggle competitions for years and remains a production standard. Understanding XGBoost means understanding three things:

  1. The math: second-order Taylor expansion of the loss gives better splits and leaf weights than first-order gradient boosting
  2. The regularization: explicit L1 and L2 penalties on the tree structure control overfitting
  3. The engineering: histogram-based binning, column subsampling, cache-aware access, and out-of-core computation make it fast at scale

XGBoost is not a new algorithm. It is gradient boosting done carefully, with second-order information and principled regularization. Understanding the math explains why it works better; understanding the systems explains why it runs faster.

Mental Model

Standard gradient boosting fits each tree to the negative gradient (first-order information only). This is like using gradient descent. XGBoost fits each tree using both the gradient and the Hessian (second-order information). This is like using Newton's method. The Hessian tells you how much to trust the gradient: where the loss is sharply curved, take smaller steps; where it is flat, take larger steps.
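The analogy can be made concrete in one dimension. A minimal pure-Python sketch (function names are mine, for illustration): for a quadratic loss with curvature $c$, the Newton step $-g/H$ lands on the minimum in one move, while a fixed-step gradient step under- or overshoots depending on the curvature.

```python
# Minimize L(F) = 0.5 * c * (F - y)^2, so g = c * (F - y) and H = c.
def newton_step(F, y, c):
    g, H = c * (F - y), c
    return F - g / H          # Newton: scale the gradient by 1/curvature

def grad_step(F, y, c, lr=1.0):
    g = c * (F - y)
    return F - lr * g         # gradient descent: fixed step size

y = 2.0
for c in (0.5, 4.0):          # flat loss vs sharply curved loss
    # Newton reaches y exactly; the fixed-step gradient update does not.
    print(c, newton_step(0.0, y, c), grad_step(0.0, y, c))
```

With `c = 0.5` the gradient step undershoots (reaches 1.0) and with `c = 4.0` it overshoots (reaches 8.0); the Newton step reaches the minimum at 2.0 in both cases, because the Hessian tells it how far to trust the gradient.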

On top of this, XGBoost adds an explicit penalty on the number of leaves and the magnitude of leaf weights, which directly controls model complexity.

Formal Setup

At round $m$, we add a tree $h_m$ to the ensemble $F_{m-1}$. The objective is:

$$\mathcal{L}^{(m)} = \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + h_m(x_i)) + \Omega(h_m)$$

Definition

Second-Order Taylor Approximation

Expanding $L(y_i, F_{m-1}(x_i) + h_m(x_i))$ around $F_{m-1}(x_i)$ to second order:

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^n \left[L(y_i, F_{m-1}(x_i)) + g_i h_m(x_i) + \frac{1}{2}H_i h_m(x_i)^2\right] + \Omega(h_m)$$

where:

  • $g_i = \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}$ is the gradient (first derivative)
  • $H_i = \frac{\partial^2 L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)^2}$ is the Hessian (second derivative)

Dropping the constant term $L(y_i, F_{m-1}(x_i))$, we optimize:

$$\tilde{\mathcal{L}}^{(m)} = \sum_{i=1}^n \left[g_i h_m(x_i) + \frac{1}{2}H_i h_m(x_i)^2\right] + \Omega(h_m)$$

Definition

Regularized Tree Complexity

XGBoost defines tree complexity as:

$$\Omega(h) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$$

where $T$ is the number of leaf nodes and $w_j$ is the weight (prediction value) of leaf $j$. The parameter $\gamma$ penalizes the number of leaves (tree size) and $\lambda$ penalizes the magnitude of leaf weights (analogous to L2 regularization on parameters).

An optional L1 penalty $\alpha \sum_j |w_j|$ can be added for sparsity in leaf weights.
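The complexity penalty is easy to compute directly from a tree's leaf weights. A minimal pure-Python sketch (the function name and plain-list tree representation are mine):

```python
def tree_complexity(leaf_weights, gamma, lam, alpha=0.0):
    """Omega(h) = gamma*T + 0.5*lambda*sum(w_j^2) + alpha*sum(|w_j|)."""
    T = len(leaf_weights)                                  # number of leaves
    l2 = 0.5 * lam * sum(w * w for w in leaf_weights)      # L2 on leaf weights
    l1 = alpha * sum(abs(w) for w in leaf_weights)         # optional L1 term
    return gamma * T + l2 + l1

# Three leaves with weights 0.5, -0.2, 0.1 under gamma=0.5, lambda=1:
# 0.5*3 + 0.5*(0.25 + 0.04 + 0.01) = 1.5 + 0.15 = 1.65
print(tree_complexity([0.5, -0.2, 0.1], gamma=0.5, lam=1.0))
```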

Main Theorems

Proposition

Optimal Leaf Weight

Statement

For a tree with fixed structure, if leaf $j$ contains the set of samples $I_j$, the optimal leaf weight minimizing $\tilde{\mathcal{L}}$ is:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} H_i + \lambda}$$

The corresponding minimum objective value is:

$$\tilde{\mathcal{L}}^* = -\frac{1}{2}\sum_{j=1}^T \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} H_i + \lambda} + \gamma T$$

Intuition

The optimal leaf weight is a ratio: the sum of gradients (what direction to push) divided by the sum of Hessians plus regularization (how much to trust that direction). Large Hessians mean the loss is sharply curved, so the optimal step is small. The regularization $\lambda$ further shrinks the weight, preventing any leaf from making too large a prediction.

Compare with Newton's method update: $\Delta = -g/H$. The leaf weight is exactly a regularized Newton step.
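The closed-form weight and per-leaf objective translate directly into code. A pure-Python sketch (function names are mine):

```python
def leaf_weight(grads, hess, lam):
    """Optimal leaf weight: w* = -G / (H + lambda)."""
    G, H = sum(grads), sum(hess)
    return -G / (H + lam)

def leaf_objective(grads, hess, lam):
    """Per-leaf contribution to the minimized objective: -G^2 / (2(H + lambda))."""
    G, H = sum(grads), sum(hess)
    return -G * G / (2 * (H + lam))

# Example: six samples, lambda = 1. G = -0.1, H = 1.2,
# so w* = 0.1 / 2.2 and the leaf objective is -0.01 / 4.4.
g = [-0.8, 0.5, -0.3, 0.7, -0.6, 0.4]
h = [0.2] * 6
print(leaf_weight(g, h, lam=1.0))
```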

Proof Sketch

Group samples by their assigned leaf. For leaf $j$, the contribution to the objective is:

$$\sum_{i \in I_j} \left[g_i w_j + \frac{1}{2}H_i w_j^2\right] + \frac{1}{2}\lambda w_j^2 = G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} H_i$. This is a quadratic in $w_j$ with minimum at $w_j^* = -G_j/(H_j + \lambda)$. Substituting back, the minimum value is $-G_j^2/(2(H_j + \lambda))$. Summing over leaves gives the result.

Why It Matters

This closed-form formula means we can evaluate any candidate tree structure instantly. We do not need iterative optimization for leaf weights. This makes tree-building a pure combinatorial search over structures, evaluated by the objective formula.

Failure Mode

The formula assumes the quadratic Taylor approximation is accurate. For losses with rapidly changing curvature, the approximation can be poor. Also, if $H_j + \lambda$ is near zero (a flat loss region with small regularization), the leaf weight blows up. The $\lambda$ parameter prevents this, but setting $\lambda$ too small in regions of flat loss can cause instability.

Proposition

XGBoost Split Gain

Statement

The reduction in objective from splitting a leaf node $I$ into left ($I_L$) and right ($I_R$) children is:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma$$

where $G_L = \sum_{i \in I_L} g_i$ and $H_L = \sum_{i \in I_L} H_i$ (and similarly for the right child and for the parent). A split is made only if $\text{Gain} > 0$.

Intuition

The gain measures how much the two child nodes can fit the loss better than the single parent node. The first three terms compare the objective of two leaves versus one. The $\gamma$ term is the cost of adding a split (complexity penalty). If the improvement does not justify the added complexity, the split is rejected. This is built-in pruning.

Proof Sketch

From the optimal objective formula, the unsplit leaf has objective $-G^2/(2(H + \lambda))$. After splitting, the left child contributes $-G_L^2/(2(H_L + \lambda))$ and the right child $-G_R^2/(2(H_R + \lambda))$. The number of leaves increases by 1 (cost $\gamma$). The gain is the decrease in objective, (before) minus (after): $\frac{1}{2}\left[G_L^2/(H_L + \lambda) + G_R^2/(H_R + \lambda) - G^2/(H + \lambda)\right] - \gamma$.

Why It Matters

This formula is what XGBoost evaluates millions of times during training: for every feature, for every candidate threshold, compute the gain. The feature/threshold pair with the highest gain wins. This is why efficient gain computation (via histograms) is critical for speed.
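The exact greedy search this describes can be sketched in a few lines: sort samples by one feature, sweep left to right while accumulating prefix sums of $g$ and $H$, and evaluate the gain at every candidate threshold. A pure-Python sketch (names are mine; assumes distinct feature values and positive $H + \lambda$ denominators):

```python
def best_split(x, g, h, lam, gamma):
    """Best (gain, threshold) for one feature via a sorted prefix-sum sweep."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    parent = G * G / (H + lam)
    GL = HL = 0.0
    best_gain, best_thr = float("-inf"), None
    for pos in range(len(order) - 1):      # threshold between pos and pos+1
        i = order[pos]
        GL += g[i]
        HL += h[i]
        GR, HR = G - GL, H - HL            # right stats by subtraction
        gain = 0.5 * (GL * GL / (HL + lam)
                      + GR * GR / (HR + lam) - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thr = (x[i] + x[order[pos + 1]]) / 2
    return best_gain, best_thr

# Gradients change sign between x=2 and x=3, so the best threshold is 2.5.
print(best_split([1, 2, 3, 4], [-1, -1, 1, 1], [1, 1, 1, 1], 0.0, 0.0))
```

The subtraction trick (`GR = G - GL`) is why one left-to-right pass suffices: each candidate split is scored in O(1) once the totals are known.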

Failure Mode

The gain formula can favor splits that are good for the training data but poor for generalization, especially when $\lambda$ and $\gamma$ are too small. The parameters $\lambda$ (leaf weight shrinkage) and $\gamma$ (minimum split gain) are the primary regularization knobs and must be tuned via cross-validation.

System Optimizations

The algorithmic contribution of XGBoost is the second-order method. But the practical dominance came equally from engineering:

Histogram-based binning. Instead of evaluating every unique feature value as a split threshold, bucket continuous features into $B$ bins (typically 256). Split finding per feature then costs $O(n)$ to build the histogram plus $O(B)$ to scan candidate thresholds, instead of $O(n \log n)$ for sorting-based exact search. LightGBM further accelerated this approach.
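The binning idea can be sketched as follows (pure Python, equal-width bins for simplicity; real implementations choose bin edges with quantile sketches and then run the split sweep over the per-bin $(G, H)$ accumulators):

```python
def build_histogram(x, g, h, n_bins):
    """Accumulate per-bin gradient and Hessian sums for one feature."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0      # guard against constant features
    Gb, Hb = [0.0] * n_bins, [0.0] * n_bins
    for xi, gi, hi_ in zip(x, g, h):
        b = min(int((xi - lo) / width), n_bins - 1)   # clamp the max value
        Gb[b] += gi
        Hb[b] += hi_
    return Gb, Hb

# Four samples collapse into two bins; the split sweep now scans 2 candidates
# (bin boundaries) instead of 3 (gaps between sorted sample values).
print(build_histogram([0.0, 0.1, 0.9, 1.0], [1, 1, 2, 2], [1, 1, 1, 1], 2))
```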

Column subsampling. At each tree or split level, randomly sample a fraction of features. This reduces computation, decorrelates trees (improving generalization), and is directly analogous to the random feature selection in random forests.

Sparsity-aware splitting. XGBoost handles missing values natively by learning a default branch direction (left or right) at each split. During training, it tries both directions and picks the one with higher gain. This avoids imputation and works naturally with sparse data.
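The default-direction choice reduces to evaluating the gain twice, once with the missing-value statistics routed left and once routed right, and keeping the better option. A pure-Python sketch (names are mine):

```python
def gain(GL, HL, GR, HR, lam, gamma):
    """Split gain from aggregated left/right gradient and Hessian sums."""
    G, H = GL + GR, HL + HR
    return 0.5 * (GL * GL / (HL + lam)
                  + GR * GR / (HR + lam)
                  - G * G / (H + lam)) - gamma

def default_direction(GL, HL, GR, HR, Gmiss, Hmiss, lam, gamma):
    """Route missing-value stats left vs right; keep the higher-gain option."""
    left = gain(GL + Gmiss, HL + Hmiss, GR, HR, lam, gamma)
    right = gain(GL, HL, GR + Gmiss, HR + Hmiss, lam, gamma)
    return ("left", left) if left >= right else ("right", right)

# Missing samples with negative gradient sum fit better with the
# negative-gradient (left) child, so "left" becomes the default branch.
print(default_direction(-1.7, 0.6, 1.6, 0.6, -0.5, 0.2, 1.0, 0.5))
```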

Out-of-core computation. For datasets that do not fit in memory, XGBoost uses block-based storage and prefetching to stream data from disk. The histogram structure enables block-wise aggregation.

Cache-aware access. Gradient and Hessian statistics are stored in contiguous memory aligned to cache lines. The histogram accumulation loop is designed for sequential memory access, minimizing cache misses.

Canonical Examples

Example

Gradients and Hessians for common losses

Squared loss $L = \frac{1}{2}(y - F)^2$: $g_i = F_i - y_i$, $H_i = 1$. All Hessians are 1, so the second-order method reduces to first-order (Newton's method equals gradient descent for quadratics).

Logistic loss $L = -[y\log p + (1-y)\log(1-p)]$ where $p = \sigma(F)$: $g_i = p_i - y_i$, $H_i = p_i(1 - p_i)$. Hessians peak at $p = 0.5$ (uncertain predictions) and shrink toward zero near $p = 0$ or $p = 1$ (confident predictions), so the curvature in the denominator is driven by the uncertain samples. This non-constant curvature is where second-order information helps most.
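Both gradient/Hessian pairs can be written down directly (pure-Python sketch; function names are mine):

```python
import math

def squared_loss_gh(y, F):
    """L = 0.5*(y - F)^2  ->  g = F - y,  H = 1."""
    return F - y, 1.0

def logistic_loss_gh(y, F):
    """L = -[y*log(p) + (1-y)*log(1-p)], p = sigmoid(F)
       ->  g = p - y,  H = p*(1 - p)."""
    p = 1.0 / (1.0 + math.exp(-F))
    return p - y, p * (1.0 - p)

# At F = 0 the model predicts p = 0.5; for a positive label (y = 1):
print(logistic_loss_gh(1, 0.0))   # g = -0.5, H = 0.25 (maximum curvature)
```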

Example

Split gain calculation

A leaf contains 6 samples. Gradients: $g = (-0.8, 0.5, -0.3, 0.7, -0.6, 0.4)$. Hessians: $H = (0.2, 0.2, 0.2, 0.2, 0.2, 0.2)$. Parameters: $\lambda = 1$, $\gamma = 0.5$.

Parent: $G = -0.1$, $H = 1.2$, score $= 0.1^2/(1.2 + 1) = 0.0045$.

Split samples 1,3,5 (left) vs 2,4,6 (right):

  • Left: $G_L = -1.7$, $H_L = 0.6$, score $= 1.7^2/1.6 = 1.806$
  • Right: $G_R = 1.6$, $H_R = 0.6$, score $= 1.6^2/1.6 = 1.6$

Gain $= 0.5(1.806 + 1.6 - 0.0045) - 0.5 = 1.701 - 0.5 = 1.201$. Since Gain $> 0$, the split is beneficial.
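This split-gain calculation can be reproduced in a few lines (pure-Python sketch):

```python
lam, gamma = 1.0, 0.5
g = [-0.8, 0.5, -0.3, 0.7, -0.6, 0.4]
h = [0.2] * 6

def score(G, H):
    """Leaf score G^2 / (H + lambda) used inside the gain formula."""
    return G * G / (H + lam)

parent = score(sum(g), sum(h))                          # ~0.0045
left = score(g[0] + g[2] + g[4], h[0] + h[2] + h[4])    # samples 1, 3, 5
right = score(g[1] + g[3] + g[5], h[1] + h[3] + h[5])   # samples 2, 4, 6
gain = 0.5 * (left + right - parent) - gamma
print(round(gain, 3))   # → 1.201
```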

Common Confusions

Watch Out

XGBoost is not a new algorithm

XGBoost is gradient boosting with (1) a second-order Taylor approximation, (2) explicit tree regularization, and (3) efficient engineering. The conceptual framework is identical to Friedman's gradient boosting. The second-order information is analogous to Newton's method vs gradient descent: same direction, better step size.

Watch Out

More trees is not always better

With a small learning rate, XGBoost can benefit from thousands of trees. But at some point, overfitting begins. Early stopping on a validation set is standard practice. The optimal number of trees depends on the learning rate, max depth, and regularization parameters. There is no universal rule.

Summary

  • XGBoost uses a second-order Taylor expansion: both gradients $g_i$ and Hessians $H_i$ of the loss
  • Optimal leaf weight: $w_j^* = -G_j/(H_j + \lambda)$ (a regularized Newton step)
  • Split gain: $\frac{1}{2}[G_L^2/(H_L+\lambda) + G_R^2/(H_R+\lambda) - G^2/(H+\lambda)] - \gamma$
  • Regularization: $\gamma$ penalizes the number of leaves, $\lambda$ penalizes leaf weight magnitude
  • Histogram binning replaces the $O(n\log n)$ sorted split search with $O(n)$ histogram construction plus an $O(B)$ scan over bins
  • Column subsampling decorrelates trees and reduces computation
  • Missing values are handled by learning a default split direction

Exercises

ExerciseCore

Problem

For logistic loss with $y \in \{0, 1\}$ and prediction $p = \sigma(F)$, derive the gradient $g$ and Hessian $H$ with respect to $F$. What is the optimal leaf weight for a leaf containing 10 samples with predicted probabilities all equal to 0.3 and true labels split 7 negatives and 3 positives? Use $\lambda = 1$.

ExerciseAdvanced

Problem

Show that for squared loss $L = \frac{1}{2}(y - F)^2$, the XGBoost split gain formula reduces to a form that depends only on the means and counts of the target variable in the left and right children. How does this relate to the variance reduction criterion used in CART?

References

Canonical:

  • Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (KDD 2016)
  • Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001)
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 10

Current:

  • Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (2017)

  • Prokhorenkova et al., "CatBoost: unbiased boosting with categorical features" (2018)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14


Last reviewed: April 2026
