MARS (Multivariate Adaptive Regression Splines)

MARS: automatically discover nonlinear relationships using piecewise linear hinge functions, forward-backward selection, and a direct connection to ReLU networks.


Why This Matters

Linear regression is limited to linear relationships. Polynomial regression can capture curvature but suffers from global effects: a high-degree polynomial oscillates everywhere to fit a local pattern. MARS (Friedman, 1991) solves this by building a model from piecewise linear basis functions that adapt to local structure in the data.

MARS is notable for two reasons. First, it automatically discovers where the breakpoints (knots) should go, selecting both the variables and the split points from data. Second, its basis functions are hinge functions: $\max(0, x - t)$ and $\max(0, t - x)$. These are exactly the ReLU activation function and its mirror image. A MARS model is, in a precise sense, a single-layer ReLU network with automatically selected features.

Mental Model

Think of MARS as building a piecewise linear approximation to an unknown function. At each step, the algorithm asks: "Where in the input space would adding a new breakpoint most reduce the error?" It places a hinge at that point, creating a "kink" in the fit. The forward pass greedily adds hinge functions. The backward pass prunes unnecessary ones.

Formal Setup

Definition

Hinge Function (Basis Function)

The MARS basis functions are hinge functions (also called truncated linear functions):

$$h_+(x, t) = \max(0, x - t), \qquad h_-(x, t) = \max(0, t - x)$$

where $t$ is the knot location. $h_+$ is zero for $x \leq t$ and linear for $x > t$; $h_-$ is linear for $x < t$ and zero for $x \geq t$. Each hinge function is piecewise linear with a single breakpoint at $t$.
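
Both hinges can be written in a couple of lines of NumPy. A minimal sketch; the names `hinge_pos` and `hinge_neg` are mine, not a standard API:

```python
import numpy as np

def hinge_pos(x, t):
    """h_+(x, t) = max(0, x - t): zero for x <= t, linear beyond the knot."""
    return np.maximum(0.0, x - t)

def hinge_neg(x, t):
    """h_-(x, t) = max(0, t - x): linear below the knot, zero for x >= t."""
    return np.maximum(0.0, t - x)

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
print(hinge_pos(x, 1.0))  # [0. 0. 0. 1. 2.]
print(hinge_neg(x, 1.0))  # [2. 1. 0. 0. 0.]
```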

Definition

MARS Model

A MARS model has the form:

$$\hat{y} = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x)$$

where each basis function $B_m$ is a product of one or more hinge functions:

$$B_m(x) = \prod_{k=1}^{K_m} h_{s_{mk}}(x_{v(m,k)}, t_{mk})$$

Here $s_{mk} \in \{+, -\}$ is the sign, $v(m,k)$ is the variable index, and $t_{mk}$ is the knot. When $K_m = 1$, the basis function captures a main effect. When $K_m \geq 2$, it captures an interaction between multiple variables.
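
Given fitted coefficients, evaluating this model form is direct. The sketch below represents each term as a coefficient plus a list of `(sign, variable_index, knot)` triples; that representation is my own, chosen for clarity, not taken from any MARS library:

```python
import numpy as np

def basis(X, factors):
    """B_m(x) = prod_k h_{s_k}(x_{v_k}, t_k).
    factors: list of (sign, variable_index, knot) triples; an empty list
    gives the constant 1 (useful when growing interactions)."""
    out = np.ones(X.shape[0])
    for sign, v, t in factors:
        z = X[:, v] - t if sign == '+' else t - X[:, v]
        out *= np.maximum(0.0, z)
    return out

def mars_predict(X, beta0, terms):
    """yhat = beta0 + sum_m beta_m * B_m(x); terms: list of (coef, factors)."""
    yhat = np.full(X.shape[0], beta0)
    for coef, factors in terms:
        yhat += coef * basis(X, factors)
    return yhat

X = np.array([[2.0, 5.0],
              [0.5, 1.0]])
terms = [(1.5, [('+', 0, 1.0)]),                 # main effect: K_m = 1
         (2.0, [('+', 0, 1.0), ('-', 1, 3.0)])]  # interaction: K_m = 2
# Row 1: 10 + 1.5*max(0, 2-1) + 2*max(0, 2-1)*max(0, 3-5) = 11.5
# Row 2: both hinges on x0 are inactive (x0 = 0.5 < 1), so yhat = 10
print(mars_predict(X, 10.0, terms))
```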

The MARS Algorithm

Forward Pass

Start with only the intercept $\beta_0$. At each step, consider adding a pair of hinge functions for every variable $j$ and every observed value $x_{ij}$ as a candidate knot:

$$B_{M+1}(x) = B_\ell(x) \cdot h_+(x_j, t), \qquad B_{M+2}(x) = B_\ell(x) \cdot h_-(x_j, t)$$

where $B_\ell$ is an existing basis function (allowing interactions). Choose the pair that most reduces the residual sum of squares. Continue until a maximum number of terms $M_{\max}$ is reached or the improvement falls below a threshold.
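
One step of this candidate search can be sketched in NumPy. This toy version (my own, not Friedman's reference implementation) searches knots for a single variable against the intercept-only model; the full algorithm loops the same idea over all variables and all existing basis functions:

```python
import numpy as np

def best_hinge_pair(x, y):
    """Try every observed value as a knot t, fit least squares on the
    columns [1, h_+(x, t), h_-(x, t)], and keep the knot with the
    lowest residual sum of squares."""
    best_rss, best_t = np.inf, None
    for t in np.unique(x):
        B = np.column_stack([np.ones_like(x),
                             np.maximum(0.0, x - t),
                             np.maximum(0.0, t - x)])
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        rss = np.sum((y - B @ coef) ** 2)
        if rss < best_rss:
            best_rss, best_t = rss, t
    return best_rss, best_t

# Synthetic data with a true kink at x = 4:
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x > 4, 2 * (x - 4), 0.0) + rng.normal(0, 0.1, 200)
rss, t = best_hinge_pair(x, y)
print(f"selected knot near {t:.2f}")  # close to the true kink at 4
```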

Backward Pass (Pruning)

The forward pass intentionally overfits. The backward pass removes terms one at a time, deleting the term whose removal causes the smallest increase in residual sum of squares. Use generalized cross-validation (GCV) to select the optimal model size:

$$\text{GCV}(M) = \frac{\text{RSS}(M)/n}{(1 - C(M)/n)^2}$$

where $C(M) = M + d \cdot M$ accounts for the effective number of parameters ($d$ penalizes the knot-selection process; typically $d = 3$).
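
The selection rule is small enough to write out directly. The RSS values below are hypothetical, chosen only to illustrate that GCV can prefer a smaller model than raw RSS would:

```python
def gcv(rss, n, M, d=3.0):
    """GCV score with the simplified effective-parameter count
    C(M) = M + d*M used above; d ~ 3 charges extra for the
    data-driven knot search."""
    c = M + d * M
    return (rss / n) / (1.0 - c / n) ** 2

# Backward-pass model selection sketch: given the RSS remaining after
# pruning to each candidate size (hypothetical numbers), pick the size
# with the smallest GCV rather than the smallest RSS.
n = 500
rss_by_size = {20: 40.0, 10: 44.0, 5: 60.0}
scores = {M: gcv(rss, n, M) for M, rss in rss_by_size.items()}
best_M = min(scores, key=scores.get)
print(best_M)  # 10: the denominator penalty outweighs the small RSS gain at M = 20
```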

Main Theorems

Proposition

MARS Approximation Rate

Statement

For a Lipschitz continuous function $f: [0,1]^p \to \mathbb{R}$ with constant $L$, a MARS model with $M$ basis functions (each using at most $q$ hinge-function factors) achieves approximation error:

$$\|f - \hat{f}_M\|_\infty \leq C \cdot L \cdot M^{-1/p}$$

for a constant $C$ depending on $p$ and $q$. The rate $M^{-1/p}$ exhibits the curse of dimensionality: in high dimensions, many more basis functions are needed for the same accuracy.

Intuition

Each hinge function partitions one dimension at one point. With $M$ hinge functions, the input space is divided into roughly $M$ regions, each with a linear fit. The approximation error in each region scales with the region diameter times the Lipschitz constant. In $p$ dimensions, the region diameter after $M$ partitions scales as $M^{-1/p}$.

Why It Matters

This rate is the same as for any piecewise linear approximation in $p$ dimensions: MARS achieves the optimal rate for its function class. The curse of dimensionality in the exponent $1/p$ means MARS works best in low to moderate dimensions ($p \leq 20$). For high-dimensional data, other methods (random forests, gradient boosting) are typically preferred.
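
To put numbers on the rate, fix a budget of $M = 1000$ basis functions and compare the guaranteed error factor $M^{-1/p}$ across dimensions:

```python
# Error factor M^(-1/p) for a fixed budget of M = 1000 basis functions.
M = 1000
for p in (1, 2, 5, 10):
    print(f"p = {p:2d}: error factor {M ** (-1.0 / p):.4f}")
# In p = 1 the bound shrinks a thousandfold; in p = 10, the same
# thousand terms buy only about a factor of two (1000^(-1/10) ~ 0.50).
```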

Failure Mode

The approximation rate assumes optimally placed knots. In practice, the forward pass uses a greedy algorithm that may not find optimal knot locations. The Lipschitz assumption also excludes functions with discontinuities, where piecewise linear models must place many knots near the discontinuity.

Connection to ReLU Networks

The hinge function $\max(0, x - t)$ is a shifted ReLU. A MARS model with first-order terms (no interactions) is:

$$\hat{y} = \beta_0 + \sum_{m=1}^{M} \beta_m \max(0, x_{v(m)} - t_m)$$

This is exactly a single-hidden-layer ReLU network with $M$ neurons, where each neuron acts on a single input variable with a fixed bias $-t_m$. The difference: MARS selects variables and knots greedily with a backward pruning step, while neural networks optimize all weights jointly via gradient descent.
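
The equivalence is easy to verify numerically. Below, a hypothetical first-order MARS fit and its translation into a ReLU layer with one-hot input weights (a toy construction of my own) produce identical predictions:

```python
import numpy as np

def mars_first_order(x, beta0, betas, vs, ts):
    """yhat = beta0 + sum_m beta_m * max(0, x[v(m)] - t_m)."""
    return beta0 + sum(b * max(0.0, x[v] - t) for b, v, t in zip(betas, vs, ts))

def relu_layer(x, W, b, w_out, b_out):
    """One hidden ReLU layer: b_out + w_out . relu(W x + b)."""
    return b_out + w_out @ np.maximum(0.0, W @ x + b)

x = np.array([1.5, -0.3, 2.0])
betas, vs, ts = [2.0, -1.0], [0, 2], [1.0, 0.5]  # hypothetical MARS terms

W = np.zeros((2, 3))
W[0, 0] = 1.0   # neuron 1 reads only x[0]
W[1, 2] = 1.0   # neuron 2 reads only x[2]
b = -np.array(ts)  # neuron biases are -t_m

same = np.isclose(mars_first_order(x, 3.0, betas, vs, ts),
                  relu_layer(x, W, b, np.array(betas), 3.0))
print(same)  # True
```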

With interaction terms ($K_m \geq 2$), MARS builds products of ReLUs. This is equivalent to a specific architecture of multiplicative interactions that standard feed-forward networks do not use.

MARS in Practice

Strengths:

  • Automatic variable selection (unimportant variables get no hinge functions)
  • Automatic nonlinearity detection (the algorithm only adds knots where needed)
  • Handles interactions if allowed, up to a specified maximum order
  • Fast to fit (the greedy search costs $O(n M_{\max} p)$ per forward step)

Weaknesses:

  • Greedy search may miss good knot placements
  • Limited to piecewise linear fits (cannot capture smooth curvature as well as splines or GAMs)
  • The curse of dimensionality limits scalability to moderate $p$

Watch Out

MARS is not just linear regression with manual breakpoints

MARS automatically selects where to place breakpoints and which variables to use. You do not specify knot locations. The algorithm finds them by searching over all candidate locations (all observed data values for each variable). This is the "adaptive" in the name.

Watch Out

MARS interactions are not the same as polynomial interactions

A MARS interaction $\max(0, x_1 - t_1) \cdot \max(0, x_2 - t_2)$ is locally multiplicative: it is nonzero only in the quadrant where $x_1 > t_1$ and $x_2 > t_2$. A polynomial interaction $x_1 \cdot x_2$ is global. MARS interactions are more localized and less prone to extrapolation artifacts.
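
A quick numerical check of this locality, with both knots at zero for simplicity:

```python
import numpy as np

def mars_interaction(x1, x2, t1=0.0, t2=0.0):
    """max(0, x1 - t1) * max(0, x2 - t2): a MARS two-way interaction term."""
    return np.maximum(0.0, x1 - t1) * np.maximum(0.0, x2 - t2)

print(mars_interaction(2.0, 3.0))   # 6.0: inside the quadrant it matches x1*x2
print(mars_interaction(-2.0, 3.0))  # 0.0: the polynomial x1*x2 would give -6
```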

Summary

  • MARS builds models from the hinge functions $\max(0, x - t)$ and $\max(0, t - x)$
  • The forward pass greedily adds basis functions; the backward pass prunes via GCV
  • Hinge functions are shifted ReLUs; a MARS model is a single-layer ReLU network with selected features
  • Products of hinge functions capture interactions
  • The approximation rate $M^{-1/p}$ exhibits the curse of dimensionality
  • Best suited for low to moderate dimensional data ($p \leq 20$)

Exercises

Exercise (Core)

Problem

Write out the MARS basis functions for a model with two hinge functions: $\max(0, \text{age} - 40)$ and $\max(0, 40 - \text{age})$. Sketch the combined fit $\beta_0 + \beta_1 \max(0, \text{age} - 40) + \beta_2 \max(0, 40 - \text{age})$ with $\beta_0 = 50$, $\beta_1 = 2$, $\beta_2 = -1$.

Exercise (Advanced)

Problem

A MARS model with $M = 20$ basis functions in $p = 5$ dimensions uses the GCV criterion $\text{GCV} = \text{RSS}/(n(1 - C(M)/n)^2)$ with $C(M) = M + 3M = 4M$. With $n = 500$, compare the effective penalty for model sizes $M = 5$ and $M = 20$.

References

Canonical:

  • Friedman, "Multivariate Adaptive Regression Splines" (Annals of Statistics, 1991)
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 9.4

Current:

  • Milborrow, earth: Multivariate Adaptive Regression Splines (R package documentation, 2024)
  • Bishop, Pattern Recognition and Machine Learning (2006)
  • Murphy, Machine Learning: A Probabilistic Perspective (2012)

Last reviewed: April 2026
