MARS (Multivariate Adaptive Regression Splines)

MARS: automatically discover nonlinear relationships using piecewise linear hinge functions, forward-backward selection, and a direct connection to ReLU networks.


Why This Matters

Linear regression is limited to linear relationships. Polynomial regression can capture curvature but suffers from global effects: a high-degree polynomial oscillates everywhere to fit a local pattern. MARS (Friedman, 1991) solves this by building a model from piecewise linear basis functions that adapt to local structure in the data.

MARS is notable for two reasons. First, it automatically discovers where the breakpoints (knots) should go, selecting both the variables and the split points from data. Second, its basis functions are hinge functions: $\max(0, x - t)$ and $\max(0, t - x)$. These are exactly the ReLU activation function and its mirror image. A MARS model is, in a precise sense, a single-layer ReLU network with automatically selected features.

Mental Model

Think of MARS as building a piecewise linear approximation to an unknown function. At each step, the algorithm asks: "Where in the input space would adding a new breakpoint most reduce the error?" It places a hinge at that point, creating a "kink" in the fit. The forward pass greedily adds hinge functions. The backward pass prunes unnecessary ones.

Formal Setup

Definition

Hinge Function (Basis Function)

The MARS basis functions are hinge functions (also called truncated linear functions):

$$h_+(x, t) = \max(0, x - t), \qquad h_-(x, t) = \max(0, t - x)$$

where $t$ is the knot location. $h_+$ is zero for $x \leq t$ and linear for $x > t$; $h_-$ is linear for $x < t$ and zero for $x \geq t$. Each hinge function is piecewise linear with a single breakpoint at $t$.
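
Both hinges can be written in a couple of lines of NumPy. A minimal sketch; the names `hinge_pos` and `hinge_neg` are mine, not a standard API:

```python
import numpy as np

def hinge_pos(x, t):
    """h_+(x, t) = max(0, x - t): zero for x <= t, linear beyond the knot."""
    return np.maximum(0.0, x - t)

def hinge_neg(x, t):
    """h_-(x, t) = max(0, t - x): linear below the knot, zero for x >= t."""
    return np.maximum(0.0, t - x)

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
print(hinge_pos(x, 1.0))  # [0. 0. 0. 1. 2.]
print(hinge_neg(x, 1.0))  # [2. 1. 0. 0. 0.]
```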

Definition

MARS Model

A MARS model has the form:

$$\hat{y} = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x)$$

where each basis function $B_m$ is a product of one or more hinge functions:

$$B_m(x) = \prod_{k=1}^{K_m} h_{s_{mk}}(x_{v(m,k)}, t_{mk})$$

Here $s_{mk} \in \{+, -\}$ is the sign, $v(m,k)$ is the variable index, and $t_{mk}$ is the knot. When $K_m = 1$, the basis function captures a main effect. When $K_m \geq 2$, it captures an interaction between multiple variables.
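
Given fitted coefficients, evaluating this model form is direct. The sketch below represents each term as a coefficient plus a list of `(sign, variable_index, knot)` triples; that representation is my own, chosen for clarity, not taken from any MARS library:

```python
import numpy as np

def basis(X, factors):
    """B_m(x) = prod_k h_{s_k}(x_{v_k}, t_k).
    factors: list of (sign, variable_index, knot) triples; an empty list
    gives the constant 1 (useful when growing interactions)."""
    out = np.ones(X.shape[0])
    for sign, v, t in factors:
        z = X[:, v] - t if sign == '+' else t - X[:, v]
        out *= np.maximum(0.0, z)
    return out

def mars_predict(X, beta0, terms):
    """yhat = beta0 + sum_m beta_m * B_m(x); terms: list of (coef, factors)."""
    yhat = np.full(X.shape[0], beta0)
    for coef, factors in terms:
        yhat += coef * basis(X, factors)
    return yhat

X = np.array([[2.0, 5.0],
              [0.5, 1.0]])
terms = [(1.5, [('+', 0, 1.0)]),                 # main effect: K_m = 1
         (2.0, [('+', 0, 1.0), ('-', 1, 3.0)])]  # interaction: K_m = 2
# Row 1: 10 + 1.5*max(0, 2-1) + 2*max(0, 2-1)*max(0, 3-5) = 11.5
# Row 2: both hinges on x0 are inactive (x0 = 0.5 < 1), so yhat = 10
print(mars_predict(X, 10.0, terms))
```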

The MARS Algorithm

Forward Pass

Start with only the intercept $\beta_0$. At each step, consider adding a pair of hinge functions for every variable $j$ and every observed value $x_{ij}$ as a candidate knot:

$$B_{M+1}(x) = B_\ell(x) \cdot h_+(x_j, t), \qquad B_{M+2}(x) = B_\ell(x) \cdot h_-(x_j, t)$$

where $B_\ell$ is an existing basis function (allowing interactions). Choose the pair that most reduces the residual sum of squares. Continue until a maximum number of terms $M_{\max}$ is reached or the improvement falls below a threshold.
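
One step of this candidate search can be sketched in NumPy. This toy version (my own, not Friedman's reference implementation) searches knots for a single variable against the intercept-only model; the full algorithm loops the same idea over all variables and all existing basis functions:

```python
import numpy as np

def best_hinge_pair(x, y):
    """Try every observed value as a knot t, fit least squares on the
    columns [1, h_+(x, t), h_-(x, t)], and keep the knot with the
    lowest residual sum of squares."""
    best_rss, best_t = np.inf, None
    for t in np.unique(x):
        B = np.column_stack([np.ones_like(x),
                             np.maximum(0.0, x - t),
                             np.maximum(0.0, t - x)])
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        rss = np.sum((y - B @ coef) ** 2)
        if rss < best_rss:
            best_rss, best_t = rss, t
    return best_rss, best_t

# Synthetic data with a true kink at x = 4:
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x > 4, 2 * (x - 4), 0.0) + rng.normal(0, 0.1, 200)
rss, t = best_hinge_pair(x, y)
print(f"selected knot near {t:.2f}")  # close to the true kink at 4
```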

Backward Pass (Pruning)

The forward pass intentionally overfits. The backward pass removes terms one at a time, deleting the term whose removal causes the smallest increase in residual sum of squares. Use generalized cross-validation (GCV) to select the optimal model size:

$$\text{GCV}(M) = \frac{\text{RSS}(M)/n}{(1 - C(M)/n)^2}$$

where $C(M) = M + d \cdot M$ accounts for the effective number of parameters ($d$ penalizes the knot-selection process; typically $d = 3$).
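
The selection rule is small enough to write out directly. The RSS values below are hypothetical, chosen only to illustrate that GCV can prefer a smaller model than raw RSS would:

```python
def gcv(rss, n, M, d=3.0):
    """GCV score with the simplified effective-parameter count
    C(M) = M + d*M used above; d ~ 3 charges extra for the
    data-driven knot search."""
    c = M + d * M
    return (rss / n) / (1.0 - c / n) ** 2

# Backward-pass model selection sketch: given the RSS remaining after
# pruning to each candidate size (hypothetical numbers), pick the size
# with the smallest GCV rather than the smallest RSS.
n = 500
rss_by_size = {20: 40.0, 10: 44.0, 5: 60.0}
scores = {M: gcv(rss, n, M) for M, rss in rss_by_size.items()}
best_M = min(scores, key=scores.get)
print(best_M)  # 10: the denominator penalty outweighs the small RSS gain at M = 20
```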

Main Theorems

Proposition

MARS Approximation Rate

Statement

For a Lipschitz continuous function $f: [0,1]^p \to \mathbb{R}$ with constant $L$, a MARS model with $M$ basis functions (each using at most $q$ hinge-function factors) achieves approximation error:

$$\|f - \hat{f}_M\|_\infty \leq C \cdot L \cdot M^{-1/p}$$

for a constant $C$ depending on $p$ and $q$. The rate $M^{-1/p}$ exhibits the curse of dimensionality: in high dimensions, many more basis functions are needed for the same accuracy.

Intuition

Each hinge function partitions one dimension at one point. With $M$ hinge functions, the input space is divided into roughly $M$ regions, each with a linear fit. The approximation error in each region scales with the region diameter times the Lipschitz constant. In $p$ dimensions, the region diameter after $M$ partitions scales as $M^{-1/p}$.

Why It Matters

This rate is the same as for any piecewise linear approximation in $p$ dimensions: MARS achieves the optimal rate for its function class. The curse of dimensionality in the exponent $1/p$ means MARS works best in low to moderate dimensions ($p \leq 20$). For high-dimensional data, other methods (random forests, gradient boosting) are typically preferred.
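
To put numbers on the rate, fix a budget of $M = 1000$ basis functions and compare the guaranteed error factor $M^{-1/p}$ across dimensions:

```python
# Error factor M^(-1/p) for a fixed budget of M = 1000 basis functions.
M = 1000
for p in (1, 2, 5, 10):
    print(f"p = {p:2d}: error factor {M ** (-1.0 / p):.4f}")
# In p = 1 the bound shrinks a thousandfold; in p = 10, the same
# thousand terms buy only about a factor of two (1000^(-1/10) ~ 0.50).
```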

Failure Mode

The approximation rate assumes optimally placed knots. In practice, the forward pass uses a greedy algorithm that may not find optimal knot locations. The Lipschitz assumption also excludes functions with discontinuities, where piecewise linear models must place many knots near the discontinuity.

Connection to ReLU Networks

The hinge function $\max(0, x - t)$ is a shifted ReLU. A MARS model with first-order terms (no interactions) is:

$$\hat{y} = \beta_0 + \sum_{m=1}^{M} \beta_m \max(0, x_{v(m)} - t_m)$$

This is exactly a single-hidden-layer ReLU network with $M$ neurons, where each neuron acts on a single input variable with a fixed bias $-t_m$. The difference: MARS selects variables and knots greedily with a backward pruning step, while neural networks optimize all weights jointly via gradient descent.
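
The equivalence is easy to verify numerically. Below, a hypothetical first-order MARS fit and its translation into a ReLU layer with one-hot input weights (a toy construction of my own) produce identical predictions:

```python
import numpy as np

def mars_first_order(x, beta0, betas, vs, ts):
    """yhat = beta0 + sum_m beta_m * max(0, x[v(m)] - t_m)."""
    return beta0 + sum(b * max(0.0, x[v] - t) for b, v, t in zip(betas, vs, ts))

def relu_layer(x, W, b, w_out, b_out):
    """One hidden ReLU layer: b_out + w_out . relu(W x + b)."""
    return b_out + w_out @ np.maximum(0.0, W @ x + b)

x = np.array([1.5, -0.3, 2.0])
betas, vs, ts = [2.0, -1.0], [0, 2], [1.0, 0.5]  # hypothetical MARS terms

W = np.zeros((2, 3))
W[0, 0] = 1.0   # neuron 1 reads only x[0]
W[1, 2] = 1.0   # neuron 2 reads only x[2]
b = -np.array(ts)  # neuron biases are -t_m

same = np.isclose(mars_first_order(x, 3.0, betas, vs, ts),
                  relu_layer(x, W, b, np.array(betas), 3.0))
print(same)  # True
```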

With interaction terms ($K_m \geq 2$), MARS builds products of ReLUs. This is equivalent to a specific architecture of multiplicative interactions that standard feed-forward networks do not use.

MARS in Practice

Strengths:

  • Automatic variable selection (unimportant variables get no hinge functions)
  • Automatic nonlinearity detection (the algorithm only adds knots where needed)
  • Handles interactions if allowed, up to a specified maximum order
  • Fast to fit (the greedy search costs $O(n M_{\max} p)$ per forward step)

Weaknesses:

  • Greedy search may miss good knot placements
  • Limited to piecewise linear fits (cannot capture smooth curvature as well as splines or GAMs)
  • The curse of dimensionality limits scalability to moderate $p$

Watch Out

MARS is not just linear regression with manual breakpoints

MARS automatically selects where to place breakpoints and which variables to use. You do not specify knot locations. The algorithm finds them by searching over all candidate locations (all observed data values for each variable). This is the "adaptive" in the name.

Watch Out

MARS interactions are not the same as polynomial interactions

A MARS interaction $\max(0, x_1 - t_1) \cdot \max(0, x_2 - t_2)$ is locally multiplicative: it is nonzero only in the quadrant where $x_1 > t_1$ and $x_2 > t_2$. A polynomial interaction $x_1 \cdot x_2$ is global. MARS interactions are more localized and less prone to extrapolation artifacts.
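
A quick numerical check of this locality, with both knots at zero for simplicity:

```python
import numpy as np

def mars_interaction(x1, x2, t1=0.0, t2=0.0):
    """max(0, x1 - t1) * max(0, x2 - t2): a MARS two-way interaction term."""
    return np.maximum(0.0, x1 - t1) * np.maximum(0.0, x2 - t2)

print(mars_interaction(2.0, 3.0))   # 6.0: inside the quadrant it matches x1*x2
print(mars_interaction(-2.0, 3.0))  # 0.0: the polynomial x1*x2 would give -6
```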

Summary

  • MARS builds models from the hinge functions $\max(0, x - t)$ and $\max(0, t - x)$
  • The forward pass greedily adds basis functions; the backward pass prunes via GCV
  • Hinge functions are shifted ReLUs; a MARS model is a single-layer ReLU network with selected features
  • Products of hinge functions capture interactions
  • The approximation rate $M^{-1/p}$ exhibits the curse of dimensionality
  • Best suited for low to moderate dimensional data ($p \leq 20$)

Exercises

Exercise (Core)

Problem

Write out the MARS basis functions for a model with two hinge functions: $\max(0, \text{age} - 40)$ and $\max(0, 40 - \text{age})$. Sketch the combined fit $\beta_0 + \beta_1 \max(0, \text{age} - 40) + \beta_2 \max(0, 40 - \text{age})$ with $\beta_0 = 50$, $\beta_1 = 2$, $\beta_2 = -1$.

Exercise (Advanced)

Problem

A MARS model with $M = 20$ basis functions in $p = 5$ dimensions uses the GCV criterion $\text{GCV} = \text{RSS}/(n(1 - C(M)/n)^2)$ with $C(M) = M + 3M = 4M$. With $n = 500$, compare the effective penalty for model sizes $M = 5$ and $M = 20$.

References

Canonical:

  • Friedman, "Multivariate Adaptive Regression Splines" (Annals of Statistics, 1991)
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 9.4

Current:

  • Milborrow, earth: Multivariate Adaptive Regression Splines (R package documentation, 2024)
  • Bishop, Pattern Recognition and Machine Learning (2006)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Last reviewed: April 2026
