ML Methods
XGBoost
XGBoost as second-order gradient boosting: Taylor expansion of the loss, regularized objective, optimal leaf weights, split gain formula, and the system optimizations that made it dominant on tabular data.
Why This Matters
XGBoost is the most successful algorithm for structured (tabular) data. It dominated Kaggle competitions for years and remains a production standard. Understanding XGBoost means understanding three things:
- The math: second-order Taylor expansion of the loss gives better splits and leaf weights than first-order gradient boosting
- The regularization: explicit L1 and L2 penalties on the tree structure control overfitting
- The engineering: histogram-based binning, column subsampling, cache-aware access, and out-of-core computation make it fast at scale
XGBoost is not a new algorithm. It is gradient boosting done carefully, with second-order information and principled regularization. Understanding the math explains why it works better; understanding the systems explains why it runs faster.
Mental Model
Standard gradient boosting fits each tree to the negative gradient (first-order information only). This is like using gradient descent. XGBoost fits each tree using both the gradient and the Hessian (second-order information). This is like using Newton's method. The Hessian tells you how much to trust the gradient: where the loss is sharply curved, take smaller steps; where it is flat, take larger steps.
On top of this, XGBoost adds an explicit penalty on the number of leaves and the magnitude of leaf weights, which directly controls model complexity.
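The Newton-vs-gradient-descent analogy can be made concrete on a one-dimensional quadratic loss. This is an illustrative sketch (the function names are mine, not XGBoost code):

```python
def grad(w, a, b):
    """Gradient of f(w) = 0.5*a*w^2 + b*w, whose minimum is at w = -b/a."""
    return a * w + b

def newton_step(w, a, b):
    """Newton: divide the gradient by the curvature (Hessian = a)."""
    return w - grad(w, a, b) / a

def gradient_step(w, a, b, lr=0.05):
    """Plain gradient descent: fixed learning rate, curvature ignored."""
    return w - lr * grad(w, a, b)

# Newton lands on the minimum of a quadratic in a single step,
# from any starting point; gradient descent only inches toward it.
w_newton = newton_step(0.0, a=10.0, b=3.0)    # -0.3, the exact minimum
w_gd = gradient_step(0.0, a=10.0, b=3.0)      # about -0.15, partway there
```

Because a second-order Taylor expansion makes each boosting round a quadratic problem, XGBoost's leaf weights are exactly this kind of curvature-scaled step.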
Formal Setup
At round $t$, we add a tree $f_t$ to the ensemble: $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$. The objective is:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
Second-Order Taylor Approximation
Expanding around $\hat{y}_i^{(t-1)}$ to second order:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[\ell\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2\Big] + \Omega(f_t)$$

where:
- $g_i = \partial_{\hat{y}^{(t-1)}}\, \ell(y_i, \hat{y}^{(t-1)})$ is the gradient (first derivative)
- $h_i = \partial^2_{\hat{y}^{(t-1)}}\, \ell(y_i, \hat{y}^{(t-1)})$ is the Hessian (second derivative)

Dropping the constant term $\ell(y_i, \hat{y}_i^{(t-1)})$, we optimize:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2\Big] + \Omega(f_t)$$
Regularized Tree Complexity
XGBoost defines tree complexity as:

$$\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

where $T$ is the number of leaf nodes and $w_j$ is the weight (prediction value) of leaf $j$. The parameter $\gamma$ penalizes the number of leaves (tree size) and $\lambda$ penalizes the magnitude of leaf weights (analogous to L2 regularization on parameters).

An optional L1 penalty $\alpha \sum_{j=1}^{T} |w_j|$ can be added for sparsity in leaf weights.
Main Theorems
Optimal Leaf Weight
Statement
For a tree with fixed structure, if leaf $j$ contains the set of samples $I_j$, the optimal leaf weight minimizing $\tilde{\mathcal{L}}^{(t)}$ is:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

The corresponding minimum objective value is:

$$\tilde{\mathcal{L}}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
Intuition
The optimal leaf weight is a ratio: the sum of gradients (what direction to push) divided by the sum of Hessians plus regularization (how much to trust that direction). Large Hessians mean the loss is sharply curved, so the optimal step is small. The regularization further shrinks the weight, preventing any leaf from making too large a prediction.
Compare with Newton's method update: $\Delta w = -g/h$. The leaf weight is exactly a regularized Newton step: $-G_j/(H_j + \lambda)$ instead of $-G_j/H_j$.
Proof Sketch
Group samples by their assigned leaf. For leaf $j$, the contribution to the objective is:

$$G_j w_j + \tfrac{1}{2}(H_j + \lambda) w_j^2 + \gamma$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. This is a quadratic in $w_j$ with minimum at $w_j^* = -G_j/(H_j + \lambda)$. Substituting back: minimum value $-\tfrac{1}{2}\, G_j^2/(H_j + \lambda)$, plus the $\gamma$ term. Summing over leaves gives the result.
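The closed form is easy to compute from summed statistics; a minimal sketch (helper names are mine, not XGBoost's API), including a brute-force check that the formula really minimizes the per-leaf quadratic:

```python
def leaf_weight(grads, hess, lam=1.0):
    """Optimal leaf weight: w* = -G / (H + lambda)."""
    G, H = sum(grads), sum(hess)
    return -G / (H + lam)

def leaf_objective(grads, hess, lam=1.0):
    """Minimum objective contribution: -G^2 / (2 * (H + lambda))."""
    G, H = sum(grads), sum(hess)
    return -G * G / (2.0 * (H + lam))

# Sanity check: the closed form matches a brute-force scan of the
# per-leaf quadratic G*w + 0.5*(H + lam)*w^2.
grads, hess, lam = [0.5, -1.2, 0.3], [0.25, 0.21, 0.25], 1.0
G, H = sum(grads), sum(hess)
candidates = [i / 1000.0 for i in range(-2000, 2001)]
w_brute = min(candidates, key=lambda w: G * w + 0.5 * (H + lam) * w * w)
assert abs(w_brute - leaf_weight(grads, hess, lam)) < 1e-3
```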
Why It Matters
This closed-form formula means we can evaluate any candidate tree structure instantly. We do not need iterative optimization for leaf weights. This makes tree-building a pure combinatorial search over structures, evaluated by the objective formula.
Failure Mode
The formula assumes the quadratic Taylor approximation is accurate. For losses with rapidly changing curvature, the approximation can be poor. Also, if $H_j + \lambda$ is near zero (flat loss region with small regularization), the leaf weight blows up. The $\lambda$ parameter prevents this, but setting $\lambda$ too small in regions of flat loss can cause instability.
XGBoost Split Gain
Statement
The reduction in objective from splitting a leaf node into left and right children is:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

where $G_L = \sum_{i \in I_L} g_i$, $H_L = \sum_{i \in I_L} h_i$ (and similarly for $G_R$, $H_R$, and the parent). A split is made only if $\text{Gain} > 0$.
Intuition
The gain measures how much the two child nodes can fit the loss better than the single parent node. The first three terms compare the objective of two leaves versus one. The $\gamma$ term is the cost of adding a split (complexity penalty). If the improvement does not justify the added complexity, the split is rejected. This is built-in pruning.
Proof Sketch
From the optimal objective formula: the unsplit leaf has objective $-\tfrac{1}{2}(G_L + G_R)^2/(H_L + H_R + \lambda)$. After splitting: left contributes $-\tfrac{1}{2}\, G_L^2/(H_L + \lambda)$ and right contributes $-\tfrac{1}{2}\, G_R^2/(H_R + \lambda)$. The number of leaves increases by 1 (cost $\gamma$). The gain is the decrease in objective: (before) $-$ (after) $-\ \gamma$, which is the stated formula.
Why It Matters
This formula is what XGBoost evaluates millions of times during training: for every feature, for every candidate threshold, compute the gain. The feature/threshold pair with the highest gain wins. This is why efficient gain computation (via histograms) is critical for speed.
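The exact greedy search the formula drives can be sketched in a few lines: sort by feature value, then scan prefix sums of $g$ and $h$, evaluating the gain at each boundary (function names are illustrative, not the library's internals):

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain = 1/2 [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split finding on one feature: sort, then scan prefix sums."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for rank in range(len(order) - 1):
        i = order[rank]
        GL += g[i]
        HL += h[i]
        gain = split_gain(GL, HL, G - GL, H - HL, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_thr = 0.5 * (x[i] + x[order[rank + 1]])  # midpoint threshold
    return best_gain, best_thr
```

Run over every feature, the pair with the highest gain wins; the histogram trick below replaces the per-sample scan with a per-bin scan.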
Failure Mode
The gain formula can favor splits that are good for the training data but poor for generalization, especially when $\gamma$ and $\lambda$ are too small. The parameters $\lambda$ (leaf weight shrinkage) and $\gamma$ (minimum split gain) are the primary regularization knobs and must be tuned via cross-validation.
System Optimizations
The algorithmic contribution of XGBoost is the second-order method. But the practical dominance came equally from engineering:
Histogram-based binning. Instead of evaluating every unique feature value as a split threshold, bucket continuous features into $k$ bins (typically 256). Split finding becomes $O(k)$ per feature instead of $O(n)$. LightGBM further accelerated this approach.
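A minimal sketch of histogram split finding, assuming samples have already been mapped to bin indices (the binning step itself is omitted, and the function name is illustrative):

```python
def histogram_split(bin_ids, g, h, n_bins, lam=1.0, gamma=0.0):
    """Histogram split finding: O(n) accumulation, then an O(k) scan over bins."""
    Gb, Hb = [0.0] * n_bins, [0.0] * n_bins
    for b, gi, hi in zip(bin_ids, g, h):   # one pass over samples
        Gb[b] += gi
        Hb[b] += hi
    G, H = sum(Gb), sum(Hb)
    GL = HL = 0.0
    best_gain, best_bin = 0.0, None
    for b in range(n_bins - 1):            # scan k-1 bin boundaries
        GL += Gb[b]
        HL += Hb[b]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam)
                      - G * G / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_bin = gain, b  # samples with bin <= b go left
    return best_gain, best_bin
```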
Column subsampling. At each tree or split level, randomly sample a fraction of features. This reduces computation, decorrelates trees (improving generalization), and is directly analogous to the random feature selection in random forests.
Sparsity-aware splitting. XGBoost handles missing values natively by learning a default branch direction (left or right) at each split. During training, it tries both directions and picks the one with higher gain. This avoids imputation and works naturally with sparse data.
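The default-direction trick can be sketched by evaluating the gain twice, once with the missing-value statistics added to each side (an illustrative helper, not the library's internals):

```python
def gain_with_default(GL, HL, GR, HR, Gm, Hm, lam=1.0, gamma=0.0):
    """Sparsity-aware gain: route rows with missing values left or right,
    whichever yields the higher gain; that side becomes the default branch."""
    def gain(gl, hl, gr, hr):
        G, H = gl + gr, hl + hr
        return 0.5 * (gl * gl / (hl + lam) + gr * gr / (hr + lam)
                      - G * G / (H + lam)) - gamma
    left = gain(GL + Gm, HL + Hm, GR, HR)    # missing rows sent left
    right = gain(GL, HL, GR + Gm, HR + Hm)   # missing rows sent right
    return (left, "left") if left >= right else (right, "right")
```

At prediction time, rows with a missing value for the split feature simply follow the learned default direction.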
Out-of-core computation. For datasets that do not fit in memory, XGBoost uses block-based storage and prefetching to stream data from disk. The histogram structure enables block-wise aggregation.
Cache-aware access. Gradient and Hessian statistics are stored in contiguous memory aligned to cache lines. The histogram accumulation loop is designed for sequential memory access, minimizing cache misses.
Canonical Examples
Gradients and Hessians for common losses
Squared loss $\ell(y, \hat{y}) = \tfrac{1}{2}(y - \hat{y})^2$: $g_i = \hat{y}_i - y_i$, $h_i = 1$. All Hessians are 1, so the second-order method reduces to first-order (Newton's method equals gradient descent for quadratics).
Logistic loss $\ell(y, \hat{y}) = -[y \log p + (1 - y)\log(1 - p)]$ where $p = \sigma(\hat{y})$: $g_i = p_i - y_i$, $h_i = p_i(1 - p_i)$. Hessians are small near $p \approx 0$ or $p \approx 1$ (confident predictions), so the leaf weight is larger where predictions are uncertain. This is where second-order information helps most.
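These gradient/Hessian pairs translate directly to code; a small sketch with illustrative function names:

```python
import math

def squared_loss_gh(y, y_hat):
    """l = 0.5 * (y - y_hat)^2  ->  g = y_hat - y,  h = 1."""
    return y_hat - y, 1.0

def logistic_loss_gh(y, margin):
    """l = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(margin)
    ->  g = p - y,  h = p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)
```

These are the same per-sample statistics a custom objective supplies to a gradient-boosting library: a (gradient, Hessian) pair per row.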
Split gain calculation
A leaf contains 6 samples. Gradients: $g = (-1, 2, -1, 2, -1, 2)$. Hessians: $h = (1, 1, 1, 1, 1, 1)$. Parameters: $\lambda = 1$, $\gamma = 0.5$.
Parent: $G = 3$, $H = 6$, score $G^2/(H + \lambda) = 9/7 \approx 1.29$.
Split samples 1,3,5 (left) vs 2,4,6 (right):
- Left: $G_L = -3$, $H_L = 3$, score $9/4 = 2.25$
- Right: $G_R = 6$, $H_R = 3$, score $36/4 = 9$
Gain $= \tfrac{1}{2}(2.25 + 9 - 1.29) - 0.5 \approx 4.48$. Since Gain $> 0$, the split is beneficial.
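A few lines verify a split-gain computation of this shape. The numbers here are illustrative (alternating gradients, unit Hessians, $\lambda = 1$, $\gamma = 0.5$, samples 1,3,5 vs 2,4,6):

```python
g = [-1.0, 2.0, -1.0, 2.0, -1.0, 2.0]    # gradients for samples 1..6
h = [1.0] * 6                             # unit Hessians (e.g. squared loss)
lam, gamma = 1.0, 0.5

def score(idx):
    """G^2 / (H + lambda) for the given sample indices."""
    G = sum(g[i] for i in idx)
    H = sum(h[i] for i in idx)
    return G * G / (H + lam)

left, right = [0, 2, 4], [1, 3, 5]        # samples 1,3,5 vs 2,4,6
gain = 0.5 * (score(left) + score(right) - score(range(6))) - gamma
print(round(gain, 4))                     # prints 4.4821
```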
Common Confusions
XGBoost is not a new algorithm
XGBoost is gradient boosting with (1) a second-order Taylor approximation, (2) explicit tree regularization, and (3) efficient engineering. The conceptual framework is identical to Friedman's gradient boosting. The second-order information is analogous to Newton's method vs gradient descent: same direction, better step size.
More trees is not always better
With a small learning rate, XGBoost can benefit from thousands of trees. But at some point, overfitting begins. Early stopping on a validation set is standard practice. The optimal number of trees depends on the learning rate, max depth, and regularization parameters. There is no universal rule.
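Early stopping is a generic loop; a library-agnostic sketch with placeholder callables (not the xgboost API):

```python
def train_with_early_stopping(boost_round, eval_round,
                              max_rounds=1000, patience=50):
    """Stop when validation loss has not improved for `patience` rounds.
    `boost_round` fits one more tree; `eval_round` returns validation loss.
    Both are placeholder callables, not a real boosting implementation."""
    best_loss, best_round = float("inf"), 0
    for t in range(1, max_rounds + 1):
        boost_round()
        loss = eval_round()
        if loss < best_loss:
            best_loss, best_round = loss, t
        elif t - best_round >= patience:
            break                         # no improvement for `patience` rounds
    return best_round, best_loss

# Toy validation curve: improves until round 30, then overfits.
curve = iter(abs(t - 30) / 30 + 0.1 for t in range(1, 1001))
best_t, best_l = train_with_early_stopping(lambda: None, lambda: next(curve),
                                           patience=10)
```

Real implementations expose the same idea as a parameter (stop after a fixed number of rounds without validation improvement) and report the best iteration.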
Summary
- XGBoost uses second-order Taylor expansion: both gradients and Hessians of the loss
- Optimal leaf weight: $w_j^* = -G_j/(H_j + \lambda)$ (regularized Newton step)
- Split gain: $\tfrac{1}{2}\big[G_L^2/(H_L + \lambda) + G_R^2/(H_R + \lambda) - (G_L + G_R)^2/(H_L + H_R + \lambda)\big] - \gamma$
- Regularization: $\gamma$ penalizes number of leaves, $\lambda$ penalizes leaf weight magnitude
- Histogram binning reduces split search from $O(n)$ to $O(k)$ per feature
- Column subsampling decorrelates trees and reduces computation
- Missing values are handled by learning a default split direction
Exercises
Problem
For logistic loss with $y \in \{0, 1\}$ and raw prediction (log-odds) $\hat{y}$, derive the gradient and Hessian with respect to $\hat{y}$. What is the optimal leaf weight for a leaf containing 10 samples with predicted probabilities all equal to 0.3 and true labels split 7 negatives and 3 positives? Use $\lambda = 1$.
Problem
Show that for squared loss $\ell(y, \hat{y}) = \tfrac{1}{2}(y - \hat{y})^2$, the XGBoost split gain formula reduces to a form that depends only on the means and counts of the target variable in the left and right children. How does this relate to the variance reduction criterion used in CART?
References
Canonical:
- Chen & Guestrin, "XGBoost: A Scalable Tree Boosting System" (KDD 2016)
- Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001)
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning (2009), Chapter 10
Current:
- Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (2017)
- Prokhorenkova et al., "CatBoost: unbiased boosting with categorical features" (2018)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
Building on XGBoost:
- Regularization theory: understanding L1, L2, and structural penalties as a unified framework
- Cross-validation theory: principled selection of hyperparameters like learning rate, max depth, and regularization strength
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Gradient Boosting (Layer 2)
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rⁿ (Layer 0A)
- Matrix Operations and Properties (Layer 0A)