ML Methods
Generalized Additive Models
GAMs: y = alpha + sum f_j(x_j) where each f_j is a smooth function. Interpretable nonlinear regression with backfitting, P-splines, and partial effect plots.
Why This Matters
Linear regression assumes the relationship between each predictor and the response is linear: y = beta_0 + beta_1 x_1 + ... + beta_p x_p. This is often wrong. The effect of age on income is not linear. The effect of temperature on crop yield is not linear.
Generalized additive models (GAMs) relax the linearity assumption while retaining the additive structure:

y = alpha + f_1(x_1) + f_2(x_2) + ... + f_p(x_p)

Each f_j is a smooth, potentially nonlinear function of one predictor. The additive structure means there are no interactions: each predictor's effect does not depend on the values of the others. This makes GAMs interpretable: you can plot each f_j to see exactly how predictor x_j affects the response.
GAMs sit between linear regression (too rigid) and black-box models like random forests or neural networks (too opaque). When you need interpretability and suspect nonlinear effects, GAMs are the right tool.
Mental Model
Think of a GAM as a linear model where each coefficient beta_j is replaced by a smooth curve f_j. Instead of asking "does a one-unit increase in x_j increase y by beta_j?", you ask "what is the shape of the relationship between x_j and y?". The curve f_j answers this question directly.
Formal Setup
Generalized Additive Model
A GAM models the conditional mean of the response y given predictors x_1, ..., x_p as:

E[y | x_1, ..., x_p] = alpha + f_1(x_1) + f_2(x_2) + ... + f_p(x_p)

where alpha is the intercept and each f_j is a smooth function estimated from the data. For identifiability, we require sum_i f_j(x_ij) = 0 for each j; otherwise a constant could be shifted freely between the intercept and any f_j without changing the fit.
Smoothing Spline
A smoothing spline for f_j minimizes the penalized sum of squares:

sum_i (r_ij - f_j(x_ij))^2 + lambda_j * integral f_j''(t)^2 dt

where r_ij = y_i - alpha - sum_{k != j} f_k(x_ik) is the partial residual (response minus the contributions of all other functions), and lambda_j >= 0 controls smoothness. When lambda_j = 0, f_j interpolates the data. As lambda_j -> infinity, f_j converges to a straight line (the linear regression solution). The minimizer is a natural cubic spline with knots at the data points.
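A discrete analogue makes this criterion concrete. The sketch below (the helper `whittaker_smooth` is illustrative, not a real GAM library's solver) replaces the integral penalty on f'' with a sum of squared second differences of the fitted values, assuming equally spaced points:

```python
import numpy as np

def whittaker_smooth(r, lam):
    """Discrete analogue of a smoothing spline: minimize
    ||r - f||^2 + lam * ||D2 f||^2 over fitted values f, where D2
    takes second differences (assumes equally spaced points).
    Closed form: f = (I + lam * D2'D2)^{-1} r."""
    n = len(r)
    D2 = np.diff(np.eye(n), n=2, axis=0)   # (n-2, n) second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, r)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 50)   # noisy partial residuals

f_rough  = whittaker_smooth(r, lam=0.0)   # lam = 0: reproduces the data exactly
f_smooth = whittaker_smooth(r, lam=1e8)   # lam -> infinity: approaches a straight line

print(np.allclose(f_rough, r))                       # True
print(np.abs(np.diff(f_smooth, n=2)).max() < 1e-3)   # True: near-zero curvature
```

Both limits of lambda are visible: no penalty reproduces the residuals, and a huge penalty forces the curvature (the second differences) to essentially zero, leaving a straight line.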
P-Splines (Penalized B-Splines)
P-splines represent f_j as a linear combination of B-spline basis functions, f_j(x) = sum_{k=1}^K beta_k B_k(x), with a penalty on the differences of adjacent coefficients:

sum_i (r_ij - sum_k beta_k B_k(x_ij))^2 + lambda * sum_k (Delta^2 beta_k)^2

where Delta^2 beta_k = beta_k - 2 beta_{k-1} + beta_{k-2} is the second-order difference operator. P-splines are computationally cheaper than smoothing splines (a fixed number K of basis functions rather than a knot at every data point) and produce nearly identical fits. They are the standard implementation choice.
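A minimal P-spline fit can be sketched with a hand-rolled Cox-de Boor basis; the helpers `bspline_basis` and `pspline_fit` are illustrative, and production implementations (e.g., mgcv) handle knot placement and lambda selection far more carefully:

```python
import numpy as np

def bspline_basis(x, n_bases=20, degree=3):
    """Cubic B-spline design matrix via the Cox-de Boor recursion,
    using equally spaced knots padded beyond the data range."""
    lo, hi = x.min(), x.max() + 1e-9   # nudge so x.max() falls inside the last interval
    n_inner = n_bases - degree + 1
    step = (hi - lo) / (n_inner - 1)
    knots = lo + step * np.arange(-degree, n_inner + degree)
    # degree-0 basis: indicator of each knot interval
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, degree + 1):     # Cox-de Boor recursion
        left = (x[:, None] - knots[None, :-(d + 1)]) / (knots[d:-1] - knots[:-(d + 1)])
        right = (knots[None, d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def pspline_fit(x, y, lam=1.0, n_bases=20):
    """P-spline fit: B-spline basis plus a penalty on second differences
    of adjacent coefficients. Solves (B'B + lam * D2'D2) beta = B'y."""
    B = bspline_basis(x, n_bases)
    D2 = np.diff(np.eye(B.shape[1]), n=2, axis=0)   # difference operator on coefficients
    beta = np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ y)
    return B @ beta

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
fit = pspline_fit(x, y, lam=1.0)
print(np.mean((fit - np.sin(2 * np.pi * x)) ** 2))   # small: the smooth is recovered
```

With 20 basis functions and lambda = 1 the fit tracks the underlying sine; increasing lambda flattens it toward a straight line, exactly as in the smoothing-spline limit.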
The Backfitting Algorithm
GAMs are fitted by backfitting: iteratively smoothing partial residuals.
- Initialize: alpha = mean(y), f_j = 0 for all j
- For each j = 1, ..., p:
- Compute partial residuals: r_ij = y_i - alpha - sum_{k != j} f_k(x_ik)
- Smooth: f_j <- S_j(r_j), smoothing the partial residuals against x_j
- Center: f_j <- f_j - mean_i f_j(x_ij)
- Repeat until convergence (changes in all f_j are below tolerance)
Each step isolates the effect of one predictor by removing the estimated effects of all others, then smooths the residuals against that predictor.
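The loop above can be sketched in numpy. The helper names (`smooth`, `backfit`) are illustrative, and a simple second-difference penalty in sorted order stands in for a true spline smoother S_j:

```python
import numpy as np

def smooth(x, r, lam=10.0):
    """Stand-in for S_j: smooth partial residuals r against x by
    penalizing second differences of the fit in x-sorted order
    (treats sorted points as roughly equally spaced)."""
    order = np.argsort(x)
    n = len(x)
    D2 = np.diff(np.eye(n), n=2, axis=0)
    f_sorted = np.linalg.solve(np.eye(n) + lam * D2.T @ D2, r[order])
    f = np.empty(n)
    f[order] = f_sorted        # return to original data order
    return f

def backfit(X, y, lam=10.0, n_iter=50, tol=1e-6):
    """Backfitting: cycle over predictors, smoothing each predictor's
    partial residuals and re-centering, until the fits stabilize."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))       # column j holds f_j(x_ij)
    for _ in range(n_iter):
        F_old = F.copy()
        for j in range(p):
            r = y - alpha - F.sum(axis=1) + F[:, j]   # partial residuals
            F[:, j] = smooth(X[:, j], r, lam)
            F[:, j] -= F[:, j].mean()                 # centering constraint
        if np.max(np.abs(F - F_old)) < tol:
            break
    return alpha, F

rng = np.random.default_rng(2)
n = 300
X = rng.uniform(-1, 1, (n, 2))
y = 2.0 + np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, n)
alpha, F = backfit(X, y)
print(np.corrcoef(F[:, 0], np.sin(np.pi * X[:, 0]))[0, 1])   # close to 1: f_1 recovers the sine
```

With independent predictors the loop converges in a handful of iterations, and each fitted column of F recovers the shape of the corresponding additive component up to centering.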
Main Theorems
Convergence of the Backfitting Algorithm
Statement
If each smoother S_j is a symmetric positive semi-definite linear operator with eigenvalues in [0, 1], and the pairwise products S_j S_k have spectral radius less than 1 for j != k, then the backfitting algorithm converges to the unique minimizer of the penalized least squares criterion:

sum_i (y_i - alpha - sum_j f_j(x_ij))^2 + sum_j lambda_j * integral f_j''(t)^2 dt

Convergence is geometric with rate bounded by the largest spectral radius of S_j S_k over all pairs j != k.
Intuition
Backfitting is a block coordinate descent algorithm applied to the penalized least squares objective. Each step minimizes over one f_j while holding the others fixed. For convex objectives with mild coupling between blocks (the spectral radius condition), block coordinate descent converges. The spectral radius of S_j S_k measures how strongly the smoothers for different predictors interfere with each other.
Why It Matters
This theorem guarantees that the backfitting algorithm finds the global optimum of the GAM objective. Without this guarantee, the iterative fitting procedure might oscillate or converge to a suboptimal solution. The result also gives a convergence rate, telling you how many iterations to expect.
Failure Mode
Convergence can be slow when predictors are highly correlated. If corr(x_1, x_2) is close to 1, the smoothers S_1 and S_2 have nearly overlapping column spaces, and the spectral radius of S_1 S_2 approaches 1. In the extreme (collinear predictors), the partial effects f_1 and f_2 are not identifiable: shifting any common component from f_1 to f_2 preserves the fit.
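This failure mode can be checked numerically. Under the illustrative assumption that each S_j is a centered second-difference smoother acting in the sort order of its predictor (the helpers `smoother_matrix` and `product_radius` are mine, not from a library), the spectral radius of S_1 S_2 grows toward 1 as the correlation between x_1 and x_2 increases:

```python
import numpy as np

def smoother_matrix(x, lam=10.0):
    """Centered penalized smoother matrix for predictor x, acting on
    vectors in the original data order."""
    n = len(x)
    order = np.argsort(x)
    D2 = np.diff(np.eye(n), n=2, axis=0)
    S_sorted = np.linalg.inv(np.eye(n) + lam * D2.T @ D2)
    S = np.zeros((n, n))
    S[np.ix_(order, order)] = S_sorted      # smoother in data order
    C = np.eye(n) - np.ones((n, n)) / n     # centering projection
    return C @ S

def product_radius(rho, n=150, seed=0):
    """Spectral radius of S_1 S_2 for two predictors with correlation rho."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    S1, S2 = smoother_matrix(x1), smoother_matrix(x2)
    return np.abs(np.linalg.eigvals(S1 @ S2)).max()

r_lo = product_radius(0.2)    # weak correlation: fast backfitting
r_hi = product_radius(0.95)   # strong correlation: radius near 1, slow backfitting
print(round(r_lo, 2), round(r_hi, 2))
```

The geometric convergence rate of backfitting is bounded by this radius, so a radius near 1 translates directly into the many-iteration behavior described above.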
Interpretability
The primary advantage of GAMs over black-box models is interpretability. For each predictor x_j, plot f_j(x_j) vs. x_j to visualize the partial effect. These plots show:
- The direction and magnitude of each predictor's effect
- Nonlinearities (thresholds, plateaus, U-shaped effects)
- Confidence intervals around the estimated curve
This is more informative than a single coefficient beta_j. A coefficient tells you "more x means more y." A partial effect plot tells you "more x means more y up to a threshold of 50, after which the effect plateaus."
When GAMs Beat Black-Box Models
GAMs outperform or match black-box models when:
- The true relationship is additive. If there are no interactions, GAMs estimate each f_j with full flexibility and have no interaction terms to overfit.
- Interpretability is required. In regulated domains (healthcare, finance), you must explain predictions. GAM partial effect plots satisfy this requirement.
- The number of predictors is moderate. With a moderate number of predictors (up to a few dozen), GAMs work well. With many more, the additive structure may be too restrictive and fitting is slow.
- The sample size is moderate. At moderate sample sizes (thousands of observations rather than millions), GAMs are often competitive with gradient boosting because they have lower model complexity.
GAMs lose to black-box models when interactions are important (the additive assumption is wrong) or when the number of predictors is very large.
GAMs cannot capture interactions
The standard GAM has no interaction terms. If the true relationship includes a joint term such as f_12(x_1, x_2), the GAM will miss it. Some extensions (GA2M, tensor product interactions) add selected interaction terms, but at the cost of interpretability. If you suspect interactions, consider boosted trees or use domain knowledge to add specific interaction terms.
More knots does not mean more flexibility
In P-splines, adding more B-spline basis functions does not automatically increase overfitting, because the penalty controls smoothness. You can use 50 basis functions per predictor and still get a smooth fit if lambda is large. Smoothness is controlled by the penalty, not by the number of basis functions.
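A small numerical check of this claim, simplified to a linear (hat-function) B-spline basis; the helpers `hat_basis` and `edf` are illustrative. The effective degrees of freedom, the trace of the hat matrix, equals the basis size when lambda = 0 but collapses to about 2 (a straight line) when lambda is large, regardless of how many basis functions are used:

```python
import numpy as np

def hat_basis(x, K):
    """Linear B-spline (hat function) basis with K equally spaced knots."""
    t = np.linspace(x.min(), x.max(), K)
    step = t[1] - t[0]
    return np.maximum(0.0, 1 - np.abs(x[:, None] - t[None, :]) / step)

def edf(x, lam, K):
    """Effective degrees of freedom of the penalized fit: trace of the
    hat matrix B (B'B + lam * D2'D2)^{-1} B'."""
    B = hat_basis(x, K)
    D2 = np.diff(np.eye(K), n=2, axis=0)   # second differences of coefficients
    H = B @ np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T)
    return np.trace(H)

x = np.linspace(0, 1, 400)
for K in (10, 50):
    print(K, round(edf(x, 0.0, K), 1), round(edf(x, 1e8, K), 1))
# Unpenalized edf equals K, but with a large penalty edf collapses toward 2
# (a straight line, the null space of the difference penalty) for any K.
```

Flexibility is governed by lambda through the effective degrees of freedom, which is why generous basis sizes are safe in practice.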
Summary
- GAMs: y = alpha + sum_j f_j(x_j), where each f_j is a smooth function estimated from data
- Backfitting iteratively smooths partial residuals; converges under mild conditions
- P-splines are the standard implementation: B-spline basis with a difference penalty
- The penalty parameter lambda_j controls the bias-variance tradeoff for each f_j
- Partial effect plots provide interpretable visualization of each predictor's influence
- GAMs cannot capture interactions; the additive assumption is both their strength (interpretability) and weakness (expressiveness)
Exercises
Problem
Write down the GAM for modeling blood pressure as a function of age, BMI, and sodium intake. What does the constraint sum_i f_j(x_ij) = 0 mean in this context?
Problem
You fit a GAM with two predictors and notice the backfitting algorithm takes 200 iterations to converge. The two predictors are highly correlated. Explain the connection between the correlation and the slow convergence, and propose a fix.
Problem
GAMs assume no interactions: the model contains no joint terms of the form f_jk(x_j, x_k). Propose a method to test whether this additive assumption is violated, given a fitted GAM and a dataset.
References
Canonical:
- Hastie & Tibshirani, Generalized Additive Models (Chapman and Hall, 1990)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 9
Current:
- Wood, Generalized Additive Models: An Introduction with R (2nd edition, 2017)
- Lou et al., "Intelligible Models for Classification and Regression" (KDD 2012)
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 1-28
Next Topics
The natural next steps from GAMs:
- MARS: multivariate adaptive regression splines, an alternative approach to nonlinear additive modeling using hinge functions
- Bias-variance tradeoff: the formal framework for understanding the smoothing parameter lambda in GAMs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in R^n (Layer 0A)