ML Methods
Generalized Additive Models
GAMs: y = alpha + sum f_j(x_j) where each f_j is a smooth function. Interpretable nonlinear regression with backfitting, P-splines, and partial effect plots.
Why This Matters
Linear regression assumes the relationship between each predictor and the response is linear: y = beta_0 + beta_1 x_1 + ... + beta_p x_p. This is often wrong. The effect of age on income is not linear. The effect of temperature on crop yield is not linear.
Generalized additive models (GAMs) relax the linearity assumption while retaining the additive structure:

y = alpha + f_1(x_1) + f_2(x_2) + ... + f_p(x_p)

Each f_j is a smooth, potentially nonlinear function of one predictor. The additive structure means there are no interactions: each predictor's effect does not depend on the values of the others. This makes GAMs interpretable: you can plot each f_j to see exactly how predictor x_j affects the response.
GAMs sit between linear regression (too rigid) and black-box models like random forests or neural networks (too opaque). When you need interpretability and suspect nonlinear effects, GAMs are the right tool.
Mental Model
Think of a GAM as a linear model where each coefficient beta_j is replaced by a smooth curve f_j. Instead of asking "does a one-unit increase in x_j increase y by beta_j?", you ask "what is the shape of the relationship between x_j and y?". The curve f_j answers this question directly.
Formal Setup
Generalized Additive Model
A GAM models the conditional mean of the response y given predictors x_1, ..., x_p as:

E[y | x_1, ..., x_p] = alpha + f_1(x_1) + f_2(x_2) + ... + f_p(x_p)

where alpha is the intercept and each f_j is a smooth function estimated from the data. For identifiability, we require sum_i f_j(x_ij) = 0 for each j; otherwise a constant could be shifted freely between the intercept and any f_j without changing the fit.
Smoothing Spline
A smoothing spline for f_j minimizes the penalized sum of squares:

sum_i (r_ij - f_j(x_ij))^2 + lambda_j * integral f_j''(t)^2 dt

where r_ij = y_i - alpha - sum_{k != j} f_k(x_ik) is the partial residual (response minus the contributions of all other functions), and lambda_j >= 0 controls smoothness. When lambda_j = 0, f_j interpolates the data. As lambda_j -> infinity, f_j converges to a straight line (the linear regression solution). The minimizer is a natural cubic spline with knots at the data points.
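A discrete analogue makes this criterion concrete. The sketch below (the helper `whittaker_smooth` is illustrative, not a real GAM library's solver) replaces the integral penalty on f'' with a sum of squared second differences of the fitted values, assuming equally spaced points:

```python
import numpy as np

def whittaker_smooth(r, lam):
    """Discrete analogue of a smoothing spline: minimize
    ||r - f||^2 + lam * ||D2 f||^2 over fitted values f, where D2
    takes second differences (assumes equally spaced points).
    Closed form: f = (I + lam * D2'D2)^{-1} r."""
    n = len(r)
    D2 = np.diff(np.eye(n), n=2, axis=0)   # (n-2, n) second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, r)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 50)   # noisy partial residuals

f_rough  = whittaker_smooth(r, lam=0.0)   # lam = 0: reproduces the data exactly
f_smooth = whittaker_smooth(r, lam=1e8)   # lam -> infinity: approaches a straight line

print(np.allclose(f_rough, r))                       # True
print(np.abs(np.diff(f_smooth, n=2)).max() < 1e-3)   # True: near-zero curvature
```

Both limits of lambda are visible: no penalty reproduces the residuals, and a huge penalty forces the curvature (the second differences) to essentially zero, leaving a straight line.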
P-Splines (Penalized B-Splines)
P-splines represent f_j as a linear combination of B-spline basis functions, f_j(x) = sum_{k=1}^K beta_k B_k(x), with a penalty on the differences of adjacent coefficients:

sum_i (r_ij - sum_k beta_k B_k(x_ij))^2 + lambda * sum_k (Delta^2 beta_k)^2

where Delta^2 beta_k = beta_k - 2 beta_{k-1} + beta_{k-2} is the second-order difference operator. P-splines are computationally cheaper than smoothing splines (a fixed number K of basis functions rather than a knot at every data point) and produce nearly identical fits. They are the standard implementation choice.
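A minimal P-spline fit can be sketched with a hand-rolled Cox-de Boor basis; the helpers `bspline_basis` and `pspline_fit` are illustrative, and production implementations (e.g., mgcv) handle knot placement and lambda selection far more carefully:

```python
import numpy as np

def bspline_basis(x, n_bases=20, degree=3):
    """Cubic B-spline design matrix via the Cox-de Boor recursion,
    using equally spaced knots padded beyond the data range."""
    lo, hi = x.min(), x.max() + 1e-9   # nudge so x.max() falls inside the last interval
    n_inner = n_bases - degree + 1
    step = (hi - lo) / (n_inner - 1)
    knots = lo + step * np.arange(-degree, n_inner + degree)
    # degree-0 basis: indicator of each knot interval
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, degree + 1):     # Cox-de Boor recursion
        left = (x[:, None] - knots[None, :-(d + 1)]) / (knots[d:-1] - knots[:-(d + 1)])
        right = (knots[None, d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def pspline_fit(x, y, lam=1.0, n_bases=20):
    """P-spline fit: B-spline basis plus a penalty on second differences
    of adjacent coefficients. Solves (B'B + lam * D2'D2) beta = B'y."""
    B = bspline_basis(x, n_bases)
    D2 = np.diff(np.eye(B.shape[1]), n=2, axis=0)   # difference operator on coefficients
    beta = np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ y)
    return B @ beta

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
fit = pspline_fit(x, y, lam=1.0)
print(np.mean((fit - np.sin(2 * np.pi * x)) ** 2))   # small: the smooth is recovered
```

With 20 basis functions and lambda = 1 the fit tracks the underlying sine; increasing lambda flattens it toward a straight line, exactly as in the smoothing-spline limit.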
The Backfitting Algorithm
GAMs are fitted by backfitting: iteratively smoothing partial residuals.
- Initialize: alpha = mean(y), f_j = 0 for all j
- For each j = 1, ..., p:
- Compute partial residuals: r_ij = y_i - alpha - sum_{k != j} f_k(x_ik)
- Smooth: f_j <- S_j(r_j), smoothing the partial residuals against x_j
- Center: f_j <- f_j - mean_i f_j(x_ij)
- Repeat until convergence (changes in all f_j are below tolerance)
Each step isolates the effect of one predictor by removing the estimated effects of all others, then smooths the residuals against that predictor.
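The loop above can be sketched in numpy. The helper names (`smooth`, `backfit`) are illustrative, and a simple second-difference penalty in sorted order stands in for a true spline smoother S_j:

```python
import numpy as np

def smooth(x, r, lam=10.0):
    """Stand-in for S_j: smooth partial residuals r against x by
    penalizing second differences of the fit in x-sorted order
    (treats sorted points as roughly equally spaced)."""
    order = np.argsort(x)
    n = len(x)
    D2 = np.diff(np.eye(n), n=2, axis=0)
    f_sorted = np.linalg.solve(np.eye(n) + lam * D2.T @ D2, r[order])
    f = np.empty(n)
    f[order] = f_sorted        # return to original data order
    return f

def backfit(X, y, lam=10.0, n_iter=50, tol=1e-6):
    """Backfitting: cycle over predictors, smoothing each predictor's
    partial residuals and re-centering, until the fits stabilize."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))       # column j holds f_j(x_ij)
    for _ in range(n_iter):
        F_old = F.copy()
        for j in range(p):
            r = y - alpha - F.sum(axis=1) + F[:, j]   # partial residuals
            F[:, j] = smooth(X[:, j], r, lam)
            F[:, j] -= F[:, j].mean()                 # centering constraint
        if np.max(np.abs(F - F_old)) < tol:
            break
    return alpha, F

rng = np.random.default_rng(2)
n = 300
X = rng.uniform(-1, 1, (n, 2))
y = 2.0 + np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, n)
alpha, F = backfit(X, y)
print(np.corrcoef(F[:, 0], np.sin(np.pi * X[:, 0]))[0, 1])   # close to 1: f_1 recovers the sine
```

With independent predictors the loop converges in a handful of iterations, and each fitted column of F recovers the shape of the corresponding additive component up to centering.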
Main Theorems
Convergence of the Backfitting Algorithm
Statement
If each smoother S_j is a symmetric positive semi-definite linear operator with eigenvalues in [0, 1], and the pairwise products S_j S_k have spectral radius less than 1 for j != k, then the backfitting algorithm converges to the unique minimizer of the penalized least squares criterion:

sum_i (y_i - alpha - sum_j f_j(x_ij))^2 + sum_j lambda_j * integral f_j''(t)^2 dt

Convergence is geometric with rate bounded by the largest spectral radius of S_j S_k over all pairs j != k.
Intuition
Backfitting is a block coordinate descent algorithm applied to the penalized least squares objective. Each step minimizes over one f_j while holding the others fixed. For convex objectives with mild coupling between blocks (the spectral radius condition), block coordinate descent converges. The spectral radius of S_j S_k measures how strongly the smoothers for different predictors interfere with each other.
Why It Matters
This theorem guarantees that the backfitting algorithm finds the global optimum of the GAM objective. Without this guarantee, the iterative fitting procedure might oscillate or converge to a suboptimal solution. The result also gives a convergence rate, telling you how many iterations to expect.
Failure Mode
Convergence can be slow when predictors are highly correlated. If corr(x_1, x_2) is close to 1, the smoothers S_1 and S_2 have nearly overlapping column spaces, and the spectral radius of S_1 S_2 approaches 1. In the extreme (collinear predictors), the partial effects f_1 and f_2 are not identifiable: shifting any common component from f_1 to f_2 preserves the fit.
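This failure mode can be checked numerically. Under the illustrative assumption that each S_j is a centered second-difference smoother acting in the sort order of its predictor (the helpers `smoother_matrix` and `product_radius` are mine, not from a library), the spectral radius of S_1 S_2 grows toward 1 as the correlation between x_1 and x_2 increases:

```python
import numpy as np

def smoother_matrix(x, lam=10.0):
    """Centered penalized smoother matrix for predictor x, acting on
    vectors in the original data order."""
    n = len(x)
    order = np.argsort(x)
    D2 = np.diff(np.eye(n), n=2, axis=0)
    S_sorted = np.linalg.inv(np.eye(n) + lam * D2.T @ D2)
    S = np.zeros((n, n))
    S[np.ix_(order, order)] = S_sorted      # smoother in data order
    C = np.eye(n) - np.ones((n, n)) / n     # centering projection
    return C @ S

def product_radius(rho, n=150, seed=0):
    """Spectral radius of S_1 S_2 for two predictors with correlation rho."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    S1, S2 = smoother_matrix(x1), smoother_matrix(x2)
    return np.abs(np.linalg.eigvals(S1 @ S2)).max()

r_lo = product_radius(0.2)    # weak correlation: fast backfitting
r_hi = product_radius(0.95)   # strong correlation: radius near 1, slow backfitting
print(round(r_lo, 2), round(r_hi, 2))
```

The geometric convergence rate of backfitting is bounded by this radius, so a radius near 1 translates directly into the many-iteration behavior described above.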
Interpretability
The primary advantage of GAMs over black-box models is interpretability. For each predictor x_j, plot f_j(x_j) vs. x_j to visualize the partial effect. These plots show:
- The direction and magnitude of each predictor's effect
- Nonlinearities (thresholds, plateaus, U-shaped effects)
- Confidence intervals around the estimated curve
This is more informative than a single coefficient beta_j. A coefficient tells you "more x means more y." A partial effect plot tells you "more x means more y up to a threshold of 50, after which the effect plateaus."
When GAMs Beat Black-Box Models
GAMs outperform or match black-box models when:
- The true relationship is additive. If there are no interactions, GAMs estimate each f_j with full flexibility and have no interaction terms to overfit.
- Interpretability is required. In regulated domains (healthcare, finance), you must explain predictions. GAM partial effect plots satisfy this requirement.
- The number of predictors is moderate. With a moderate number of predictors (up to a few dozen), GAMs work well. With many more, the additive structure may be too restrictive and fitting is slow.
- The sample size is moderate. At moderate sample sizes (thousands of observations rather than millions), GAMs are often competitive with gradient boosting because they have lower model complexity.
GAMs lose to black-box models when interactions are important (the additive assumption is wrong) or when the number of predictors is very large.
GAMs cannot capture interactions
The standard GAM has no interaction terms. If the true relationship includes a joint term such as f_12(x_1, x_2), the GAM will miss it. Some extensions (GA2M, tensor product interactions) add selected interaction terms, but at the cost of interpretability. If you suspect interactions, consider boosted trees or use domain knowledge to add specific interaction terms.
More knots does not mean more flexibility
In P-splines, adding more B-spline basis functions does not automatically increase overfitting, because the penalty controls smoothness. You can use 50 basis functions per predictor and still get a smooth fit if lambda is large. Smoothness is controlled by the penalty, not by the number of basis functions.
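A small numerical check of this claim, simplified to a linear (hat-function) B-spline basis; the helpers `hat_basis` and `edf` are illustrative. The effective degrees of freedom, the trace of the hat matrix, equals the basis size when lambda = 0 but collapses to about 2 (a straight line) when lambda is large, regardless of how many basis functions are used:

```python
import numpy as np

def hat_basis(x, K):
    """Linear B-spline (hat function) basis with K equally spaced knots."""
    t = np.linspace(x.min(), x.max(), K)
    step = t[1] - t[0]
    return np.maximum(0.0, 1 - np.abs(x[:, None] - t[None, :]) / step)

def edf(x, lam, K):
    """Effective degrees of freedom of the penalized fit: trace of the
    hat matrix B (B'B + lam * D2'D2)^{-1} B'."""
    B = hat_basis(x, K)
    D2 = np.diff(np.eye(K), n=2, axis=0)   # second differences of coefficients
    H = B @ np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T)
    return np.trace(H)

x = np.linspace(0, 1, 400)
for K in (10, 50):
    print(K, round(edf(x, 0.0, K), 1), round(edf(x, 1e8, K), 1))
# Unpenalized edf equals K, but with a large penalty edf collapses toward 2
# (a straight line, the null space of the difference penalty) for any K.
```

Flexibility is governed by lambda through the effective degrees of freedom, which is why generous basis sizes are safe in practice.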
Summary
- GAMs: y = alpha + sum_j f_j(x_j), where each f_j is a smooth function estimated from data
- Backfitting iteratively smooths partial residuals; converges under mild conditions
- P-splines are the standard implementation: B-spline basis with a difference penalty
- The penalty parameter lambda_j controls the bias-variance tradeoff for each f_j
- Partial effect plots provide interpretable visualization of each predictor's influence
- GAMs cannot capture interactions; the additive assumption is both their strength (interpretability) and weakness (expressiveness)
Exercises
Problem
Write down the GAM for modeling blood pressure as a function of age, BMI, and sodium intake. What does the constraint sum_i f_j(x_ij) = 0 mean in this context?
Problem
You fit a GAM with two predictors and notice the backfitting algorithm takes 200 iterations to converge. The two predictors are highly correlated. Explain the connection between the correlation and the slow convergence, and propose a fix.
Problem
GAMs assume no interactions: the model contains no joint terms of the form f_jk(x_j, x_k). Propose a method to test whether this additive assumption is violated, given a fitted GAM and a dataset.
References
Canonical:
- Hastie & Tibshirani, Generalized Additive Models (Chapman and Hall, 1990)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 9
Current:
- Wood, Generalized Additive Models: An Introduction with R (2nd edition, 2017)
- Lou et al., "Intelligible Models for Classification and Regression" (KDD 2012)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Next Topics
The natural next steps from GAMs:
- MARS: multivariate adaptive regression splines, an alternative approach to nonlinear additive modeling using hinge functions
- Bias-variance tradeoff: the formal framework for understanding the smoothing parameter lambda in GAMs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in R^n (Layer 0A)