
ML Methods

Generalized Additive Models

GAMs: y = alpha + sum f_j(x_j) where each f_j is a smooth function. Interpretable nonlinear regression with backfitting, P-splines, and partial effect plots.


Prerequisites

Why This Matters

Linear regression assumes the relationship between each predictor and the response is linear: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. This is often wrong. The effect of age on income is not linear. The effect of temperature on crop yield is not linear.

Generalized additive models (GAMs) relax the linearity assumption while retaining the additive structure:

$$y = \alpha + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p) + \epsilon$$

Each $f_j$ is a smooth, potentially nonlinear function of one predictor. The additive structure means there are no interactions: each predictor contributes a separate term. This makes GAMs interpretable: you can plot each $f_j$ to see exactly how predictor $j$ affects the response.

GAMs sit between linear regression (too rigid) and black-box models like random forests or neural networks (too opaque). When you need interpretability and suspect nonlinear effects, GAMs are the right tool.

Mental Model

Think of a GAM as a linear model in which each linear term $\beta_j x_j$ is replaced by a smooth curve $f_j(x_j)$. Instead of asking "does a one-unit increase in $x_j$ increase $y$ by $\beta_j$?", you ask "what is the shape of the relationship between $x_j$ and $y$?". The curve $f_j$ answers this question directly.

Formal Setup

Definition

Generalized Additive Model

A GAM models the conditional mean of $Y$ given predictors $x_1, \ldots, x_p$ as:

$$\mathbb{E}[Y \mid x_1, \ldots, x_p] = \alpha + \sum_{j=1}^{p} f_j(x_j)$$

where $\alpha$ is the intercept, and each $f_j: \mathbb{R} \to \mathbb{R}$ is a smooth function estimated from the data. For identifiability, we require $\mathbb{E}[f_j(X_j)] = 0$ for each $j$.

Definition

Smoothing Spline

A smoothing spline for $f_j$ minimizes the penalized sum of squares:

$$\min_{f_j} \sum_{i=1}^{n} \left(r_{ij} - f_j(x_{ij})\right)^2 + \lambda_j \int f_j''(t)^2 \, dt$$

where $r_{ij}$ is the partial residual (response minus contributions of all other functions), and $\lambda_j \geq 0$ controls smoothness. When $\lambda_j = 0$, $f_j$ interpolates the data. As $\lambda_j \to \infty$, $f_j$ converges to a straight line (the linear regression solution). The solution is a natural cubic spline with knots at the data points.
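The two limits can be checked numerically. Below is a minimal sketch under simplifying assumptions: instead of an exact natural cubic spline, it uses a discrete analogue (a penalized fit on a grid with a second-difference penalty, Whittaker-style), whose penalty null space is likewise the set of straight lines.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

# Second-difference matrix: (D2 @ f)[i] = f[i+2] - 2 f[i+1] + f[i].
D2 = np.diff(np.eye(n), n=2, axis=0)

def smooth(y, lam):
    # Minimize ||y - f||^2 + lam * ||D2 f||^2  =>  f = (I + lam * D2'D2)^{-1} y
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

f_interp = smooth(y, 0.0)    # lam = 0: reproduces the noisy data exactly
f_line = smooth(y, 1e10)     # huge lam: pushed into the null space of D2 (straight lines)

# The heavy-penalty fit matches the ordinary least-squares line.
X = np.column_stack([np.ones(n), x])
ols_line = X @ np.linalg.lstsq(X, y, rcond=None)[0]
```

The null space of the penalty determines the $\lambda \to \infty$ limit here, mirroring the role of $\int f''^2$ for smoothing splines.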

Definition

P-Splines (Penalized B-Splines)

P-splines represent $f_j$ as a linear combination of B-spline basis functions with a penalty on the differences of adjacent coefficients:

$$f_j(x) = \sum_{k=1}^{K} \beta_{jk} B_k(x), \quad \text{penalty} = \lambda_j \sum_{k=3}^{K} (\Delta^2 \beta_{jk})^2$$

where $\Delta^2$ is the second-order difference operator. P-splines are computationally cheaper than smoothing splines (a fixed number of basis functions $K$ rather than $n$ knots) and produce nearly identical fits. They are the standard implementation choice.
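A compact sketch of a one-dimensional P-spline fit, assuming SciPy is available; the knot layout, basis size `K`, and `lam` value are illustrative choices, not canonical ones:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

# Cubic B-spline basis on equally spaced knots (Eilers-Marx setup).
deg, nseg = 3, 10
K = nseg + deg                                   # number of basis functions
h = 1.0 / nseg
t = np.linspace(-deg * h, 1 + deg * h, nseg + 2 * deg + 1)  # extended knot vector
# Evaluate each basis function by giving BSpline a one-hot coefficient vector.
B = np.column_stack([BSpline(t, np.eye(K)[k], deg)(x) for k in range(K)])

# Second-order difference penalty on adjacent coefficients.
D2 = np.diff(np.eye(K), n=2, axis=0)
lam = 1.0
beta = np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ y)
f_hat = B @ beta                                 # fitted smooth component
```

The fit is an ordinary penalized least-squares solve once the basis matrix `B` is built; only `lam` needs tuning (e.g. by cross-validation).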

The Backfitting Algorithm

GAMs are fitted by backfitting: iteratively smoothing partial residuals.

  1. Initialize: $\hat{\alpha} = \bar{y}$, $\hat{f}_j = 0$ for all $j$
  2. For each $j = 1, \ldots, p$:
    • Compute partial residuals: $r_{ij} = y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik})$
    • Smooth: $\hat{f}_j \leftarrow \text{Smooth}(r_{ij} \text{ vs } x_{ij})$
    • Center: $\hat{f}_j \leftarrow \hat{f}_j - \bar{f}_j$
  3. Repeat until convergence (changes in $\hat{f}_j$ are below tolerance)

Each step isolates the effect of one predictor by removing the estimated effects of all others, then smooths the residuals against that predictor.
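The loop above can be sketched directly. In this illustration a simple cubic-polynomial least-squares fit stands in for the spline smoother (any linear smoother fits the same template), and the data-generating model is invented for the example:

```python
import numpy as np

def poly_smooth(r, x, degree=3):
    # Least-squares polynomial fit: a simple linear smoother standing in
    # for the spline smoother used in practice.
    return np.polyval(np.polyfit(x, r, degree), x)

def backfit(X, y, n_iter=50, tol=1e-6):
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # Partial residuals: remove the intercept and all other components.
            r = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = poly_smooth(r, X[:, j])
            f[:, j] -= f[:, j].mean()      # centering for identifiability
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = 2.0 + np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)
alpha, f = backfit(X, y)   # f[:, 0] tracks sin(pi*x1), f[:, 1] tracks x2^2 (centered)
```

With independent predictors the inner loop converges in a handful of sweeps; correlated predictors slow it down, as discussed below.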

Main Theorems

Theorem

Convergence of the Backfitting Algorithm

Statement

If each smoother $S_j$ is a symmetric positive semi-definite linear operator with eigenvalues in $[0, 1]$, and the pairwise products $S_j S_k$ have spectral radius less than 1 for $j \neq k$, then the backfitting algorithm converges to the unique minimizer of the penalized least squares criterion:

$$\min_{f_1, \ldots, f_p} \sum_{i=1}^{n} \left(y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij})\right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t)^2 \, dt$$

Convergence is geometric, with rate bounded by the largest spectral radius of $S_j S_k$ over all pairs $j \neq k$.

Intuition

Backfitting is a block coordinate descent algorithm applied to the penalized least squares objective. Each step minimizes over one $f_j$ while holding the others fixed. For convex objectives with mild coupling between blocks (the spectral radius condition), block coordinate descent converges. The spectral radius measures how strongly the smoothers for different predictors interfere with each other.

Why It Matters

This theorem guarantees that the backfitting algorithm finds the global optimum of the GAM objective. Without this guarantee, the iterative fitting procedure might oscillate or converge to a suboptimal solution. The result also gives a convergence rate, telling you how many iterations to expect.

Failure Mode

Convergence can be slow when predictors are highly correlated. If $\text{cor}(x_j, x_k) \approx 1$, the smoothers $S_j$ and $S_k$ have nearly overlapping column spaces, and the spectral radius of $S_j S_k$ approaches 1. In the extreme (collinear predictors), the partial effects $f_j$ and $f_k$ are not identifiable: shifting any component from $f_j$ to $f_k$ preserves the fit.
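A small numeric illustration of this effect, using the simplest possible linear smoothers as a deliberately stripped-down stand-in for spline smoothers: with rank-one projections onto each centered predictor, the spectral radius of $S_j S_k$ is exactly the squared sample correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

def spectral_radius_of_product(rho):
    # Two predictors with population correlation rho; each "smoother" is the
    # projection onto that centered predictor (a rank-one linear smoother).
    z = rng.normal(size=(n, 2))
    x1 = z[:, 0]
    x2 = rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1]
    u = x1 - x1.mean(); u /= np.linalg.norm(u)
    v = (x2 - x2.mean()); v /= np.linalg.norm(v)
    S1 = np.outer(u, u)
    S2 = np.outer(v, v)
    # S1 @ S2 is rank one; its nonzero eigenvalue is (u.v)^2 = cor(x1, x2)^2.
    return np.max(np.abs(np.linalg.eigvals(S1 @ S2)))

for rho in (0.3, 0.95):
    print(rho, spectral_radius_of_product(rho))   # radius grows with correlation
```

Since the per-cycle error contraction factor is this spectral radius, a correlation of 0.95 implies contraction only by a factor of roughly $0.95^2 \approx 0.9$ per sweep, which is why such fits need many iterations.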

Interpretability

The primary advantage of GAMs over black-box models is interpretability. For each predictor $x_j$, plot $\hat{f}_j(x_j)$ against $x_j$ to visualize the partial effect. These plots show:

  • The direction and magnitude of each predictor's effect
  • Nonlinearities (thresholds, plateaus, U-shaped effects)
  • Confidence intervals around the estimated curve

This is more informative than a single coefficient $\beta_j$. A coefficient tells you "more $x_j$ means more $y$." A partial effect plot tells you "more $x_j$ means more $y$ up to a threshold of 50, after which the effect plateaus."

When GAMs Beat Black-Box Models

GAMs outperform or match black-box models when:

  1. The true relationship is additive. If there are no interactions, GAMs estimate each $f_j$ with full flexibility and have no interaction terms to overfit.
  2. Interpretability is required. In regulated domains (healthcare, finance), you must explain predictions. GAM partial effect plots satisfy this requirement.
  3. The number of predictors is moderate. With $p < 100$ predictors, GAMs work well. With $p > 1000$, the additive structure may be too restrictive and fitting is slow.
  4. The sample size is moderate. With $n < 10{,}000$, GAMs are often competitive with gradient boosting because they have lower model complexity.

GAMs lose to black-box models when interactions are important (the additive assumption is wrong) or when the number of predictors is very large.

Watch Out

GAMs cannot capture interactions

The standard GAM $\alpha + \sum f_j(x_j)$ has no interaction terms. If the true relationship includes a term $f_{jk}(x_j, x_k)$, the GAM will miss it. Some extensions (GA2M, tensor product interactions) add selected interaction terms, but at the cost of interpretability. If you suspect interactions, consider boosted trees or use domain knowledge to add specific interaction terms.

Watch Out

More basis functions does not mean overfitting

In P-splines, adding more B-spline basis functions does not automatically increase overfitting, because the penalty $\lambda_j$ controls smoothness. You can use 50 basis functions per predictor and still get a smooth fit if $\lambda_j$ is large. Smoothness is controlled by the penalty, not the number of basis functions.
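One way to see this is through the effective degrees of freedom, the trace of the hat matrix $H(\lambda) = B(B^\top B + \lambda D^\top D)^{-1}B^\top$. The sketch below uses a crude truncated-power basis as a hypothetical stand-in for a B-spline basis; the point is only that the trace shrinks as $\lambda$ grows, regardless of how large $K$ is.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 200, 50
x = np.sort(rng.uniform(0, 1, n))

# Crude basis: truncated-power (hinge) functions at K interior knots.
knots = np.linspace(0, 1, K + 2)[1:-1]
B = np.maximum(x[:, None] - knots[None, :], 0.0)
D = np.diff(np.eye(K), n=2, axis=0)     # second-difference penalty matrix

def edf(lam):
    # Effective degrees of freedom: trace of the penalized hat matrix.
    # (A tiny ridge guards against numerical singularity of B'B.)
    H = B @ np.linalg.solve(B.T @ B + lam * D.T @ D + 1e-10 * np.eye(K), B.T)
    return np.trace(H)

for lam in (1e-4, 1e0, 1e4):
    print(lam, edf(lam))    # effective df shrinks monotonically as lam grows
```

Even with 50 basis functions, a large penalty drives the effective degrees of freedom toward the dimension of the penalty's null space (here, 2: straight-line coefficient sequences).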

Summary

  • GAMs: $y = \alpha + \sum_j f_j(x_j)$, where each $f_j$ is a smooth function estimated from data
  • Backfitting iteratively smooths partial residuals; it converges under mild conditions
  • P-splines are the standard implementation: a B-spline basis with a difference penalty
  • The penalty parameter $\lambda_j$ controls the bias-variance tradeoff for each $f_j$
  • Partial effect plots of $\hat{f}_j(x_j)$ provide an interpretable visualization of each predictor's influence
  • GAMs cannot capture interactions; the additive assumption is both their strength (interpretability) and their weakness (expressiveness)

Exercises

ExerciseCore

Problem

Write down the GAM for modeling blood pressure as a function of age, BMI, and sodium intake. What does the constraint $\mathbb{E}[f_j(X_j)] = 0$ mean in this context?

ExerciseAdvanced

Problem

You fit a GAM with two predictors and notice the backfitting algorithm takes 200 iterations to converge. The two predictors have correlation $r = 0.95$. Explain the connection between the correlation and the slow convergence, and propose a fix.

ExerciseResearch

Problem

GAMs assume no interactions: $\mathbb{E}[Y \mid x] = \alpha + \sum f_j(x_j)$. Propose a method to test whether this additive assumption is violated, given a fitted GAM and a dataset.

References

Canonical:

  • Hastie & Tibshirani, Generalized Additive Models (Chapman and Hall, 1990)
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 9

Current:

  • Wood, Generalized Additive Models: An Introduction with R (2nd edition, 2017)
  • Lou et al., "Intelligible Models for Classification and Regression" (KDD 2012)
  • Bishop, Pattern Recognition and Machine Learning (2006)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Next Topics

The natural next steps from GAMs:

  • MARS: multivariate adaptive regression splines, an alternative approach to nonlinear additive modeling using hinge functions
  • Bias-variance tradeoff: the formal framework for understanding the smoothing parameter $\lambda$ in GAMs

Last reviewed: April 2026
