ML Methods
MARS (Multivariate Adaptive Regression Splines)
MARS: automatically discover nonlinear relationships using piecewise linear hinge functions, forward-backward selection, and a direct connection to ReLU networks.
Why This Matters
Linear regression is limited to linear relationships. Polynomial regression can capture curvature but suffers from global effects: a high-degree polynomial oscillates everywhere to fit a local pattern. MARS (Friedman, 1991) solves this by building a model from piecewise linear basis functions that adapt to local structure in the data.
MARS is notable for two reasons. First, it automatically discovers where the breakpoints (knots) should go, selecting both the variables and the split points from data. Second, its basis functions are hinge functions: $(x - t)_+ = \max(0, x - t)$ and $(t - x)_+ = \max(0, t - x)$. These are exactly the ReLU activation function and its mirror. A MARS model is, in a precise sense, a single-layer ReLU network with automatically selected features.
Mental Model
Think of MARS as building a piecewise linear approximation to an unknown function. At each step, the algorithm asks: "Where in the input space would adding a new breakpoint most reduce the error?" It places a hinge at that point, creating a "kink" in the fit. The forward pass greedily adds hinge functions. The backward pass prunes unnecessary ones.
Formal Setup
Hinge Function (Basis Function)
The MARS basis functions are hinge functions (also called truncated linear functions):

$$h^+(x) = (x - t)_+ = \max(0,\, x - t), \qquad h^-(x) = (t - x)_+ = \max(0,\, t - x)$$

where $t$ is the knot location. $(x - t)_+$ is zero for $x \le t$ and linear for $x > t$; $(t - x)_+$ is linear for $x < t$ and zero for $x \ge t$. Each hinge function is piecewise linear with a single breakpoint at $t$.
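The hinge pair is simple enough to write directly; a minimal sketch in Python (function names are mine):

```python
def hinge_pos(x, t):
    """(x - t)_+ : zero for x <= t, linear with slope +1 for x > t."""
    return max(0.0, x - t)

def hinge_neg(x, t):
    """(t - x)_+ : linear with slope -1 for x < t, zero for x >= t."""
    return max(0.0, t - x)

# A knot at t = 2 splits the line into two linear pieces.
print(hinge_pos(3.0, 2.0))  # 1.0  (right of the knot)
print(hinge_pos(1.0, 2.0))  # 0.0  (left of the knot)
print(hinge_neg(1.0, 2.0))  # 1.0  (mirror piece)
```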
MARS Model
A MARS model has the form:

$$\hat{f}(x) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x)$$

where each basis function $B_m$ is a product of one or more hinge functions:

$$B_m(x) = \prod_{k=1}^{K_m} \big[\, s_{k,m}\,(x_{v(k,m)} - t_{k,m}) \,\big]_+$$

Here $s_{k,m} \in \{-1, +1\}$ is the sign, $v(k,m)$ is the variable index, and $t_{k,m}$ is the knot. When $K_m = 1$, the basis function captures a main effect. When $K_m > 1$, it captures an interaction between multiple variables.
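The product form can be made concrete with a short sketch (the (sign, variable, knot) encoding and the example values are illustrative choices, not part of the original algorithm description):

```python
def basis(x, terms):
    """Evaluate B_m(x) = prod_k [s_k * (x[v_k] - t_k)]_+ .
    terms: list of (sign, var_index, knot) triples."""
    out = 1.0
    for s, v, t in terms:
        out *= max(0.0, s * (x[v] - t))
    return out

# Main effect (K_m = 1): a single hinge on variable 0.
main = [(+1, 0, 0.5)]
# Interaction (K_m = 2): product of hinges on variables 0 and 1.
inter = [(+1, 0, 0.5), (-1, 1, 1.0)]

x = [1.5, 0.0]
print(basis(x, main))   # (1.5 - 0.5)_+ = 1.0
print(basis(x, inter))  # (1.5 - 0.5)_+ * (1.0 - 0.0)_+ = 1.0
```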
The MARS Algorithm
Forward Pass
Start with only the intercept $B_0(x) = 1$. At each step, consider adding a pair of hinge functions for every variable $x_j$ and every observed value $x_{ij}$ as a candidate knot $t$:

$$B_\ell(x)\,(x_j - t)_+ \quad \text{and} \quad B_\ell(x)\,(t - x_j)_+$$

where $B_\ell$ is an existing basis function (allowing interactions). Choose the pair that most reduces the residual sum of squares. Continue until a maximum number of terms is reached or the improvement falls below a threshold.
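One forward step can be illustrated with a toy one-dimensional search: for each candidate knot, fit $\beta_0 + \beta_1 (x - t)_+ + \beta_2 (t - x)_+$ by least squares and keep the knot with the smallest RSS. This is my own simplified sketch (no interactions, no fast update formulas from Friedman's paper):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rss_for_knot(xs, ys, t):
    """RSS of the least-squares fit b0 + b1*(x-t)_+ + b2*(t-x)_+ ."""
    rows = [[1.0, max(0.0, x - t), max(0.0, t - x)] for x in xs]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    b = solve(XtX, Xty)
    return sum((y - sum(bi * ri for bi, ri in zip(b, r))) ** 2
               for r, y in zip(rows, ys))

# Data from a V-shaped function: the best knot should land on the kink at x = 3.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [abs(x - 3) for x in xs]
# Candidate knots are interior observed values (endpoints make the design singular).
best = min(xs[1:-1], key=lambda t: rss_for_knot(xs, ys, t))
print(best)  # 3
```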
Backward Pass (Pruning)
The forward pass intentionally overfits. The backward pass removes terms one at a time, deleting the term whose removal causes the smallest increase in residual sum of squares. Use generalized cross-validation (GCV) to select the optimal model size:

$$\mathrm{GCV}(M) = \frac{\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{f}_M(x_i)\big)^2}{\big(1 - C(M)/n\big)^2}$$

where $C(M)$ accounts for the effective number of parameters: $C(M) = M + c \cdot (\text{number of knots})$, and $c$ penalizes the knot selection process (typically $c = 2$ to $3$).
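The GCV score is a one-line computation; a minimal sketch, assuming the $C(M) = M + cK$ penalty form above:

```python
def gcv(rss, n, num_terms, num_knots, c=3.0):
    """Generalized cross-validation score.
    C(M) = num_terms + c * num_knots is the effective parameter count;
    c ~ 2-3 charges extra for the data-driven knot search."""
    C = num_terms + c * num_knots
    return (rss / n) / (1.0 - C / n) ** 2

# With the same training RSS, the larger model gets a worse (larger) GCV score,
# so the backward pass prefers the smaller one.
print(gcv(rss=10.0, n=100, num_terms=5, num_knots=2) <
      gcv(rss=10.0, n=100, num_terms=9, num_knots=4))  # True
```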
Main Theorems
MARS Approximation Rate
Statement
For a Lipschitz continuous function $f: [0,1]^d \to \mathbb{R}$ with constant $L$, a MARS model with $M$ basis functions (each using at most $K$ hinge function factors) achieves approximation error:

$$\sup_{x \in [0,1]^d} \big|f(x) - \hat{f}_M(x)\big| \le C \, L \, M^{-1/d}$$

for a constant $C$ depending on $d$ and $K$. The rate $M^{-1/d}$ exhibits the curse of dimensionality: in high dimensions, many more basis functions are needed for the same accuracy.
Intuition
Each hinge function partitions one dimension at one point. With $M$ hinge functions, the input space is divided into roughly $M$ regions, each with a linear fit. The approximation error in each region scales with the region diameter times the Lipschitz constant $L$. In $d$ dimensions, the region diameter after $M$ partitions scales as $M^{-1/d}$.
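To make the rate concrete: inverting $M^{-1/d} = \varepsilon$ gives $M = \varepsilon^{-d}$, so the number of basis functions needed for a fixed target error grows exponentially in $d$. A quick numeric check:

```python
# Basis functions needed for target error eps when error ~ M^(-1/d):
# solving M^(-1/d) = eps gives M = eps^(-d).
eps = 0.1
for d in (1, 2, 5, 10):
    print(d, round(eps ** -d))
# d = 1 needs ~10 terms, d = 10 needs ~10 billion.
```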
Why It Matters
This rate is the same as for any piecewise linear approximation in $d$ dimensions: MARS achieves the optimal rate for its function class. The curse of dimensionality in the exponent $-1/d$ means MARS works best in low to moderate dimensions. For high-dimensional data, other methods (random forests, gradient boosting) are typically preferred.
Failure Mode
The approximation rate assumes optimally placed knots. In practice, the forward pass uses a greedy algorithm that may not find optimal knot locations. The Lipschitz assumption also excludes functions with discontinuities, where piecewise linear models must place many knots near the discontinuity.
Connection to ReLU Networks
The hinge function $(x - t)_+$ is a shifted ReLU: $(x - t)_+ = \mathrm{ReLU}(x - t)$. A MARS model with first-order terms (no interactions) is:

$$\hat{f}(x) = \beta_0 + \sum_{m=1}^{M} \beta_m \,\big[\, s_m (x_{v(m)} - t_m) \,\big]_+$$

This is exactly a single hidden layer ReLU network with $M$ neurons, where each neuron acts on a single input variable with weight $s_m$ and a fixed bias $-s_m t_m$. The difference: MARS selects variables and knots greedily with a backward pruning step, while neural networks optimize weights via gradient descent.
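The correspondence can be checked numerically: each first-order MARS term maps to a ReLU unit with input weight $s_m$ and bias $-s_m t_m$. The fitted coefficients below are made-up illustrative values, not a real fit:

```python
def relu(z):
    return max(0.0, z)

# Hypothetical fitted MARS terms: (coef, sign, var_index, knot).
terms = [(2.0, +1, 0, 0.5), (-1.5, -1, 1, 1.0)]
beta0 = 0.3

def mars(x):
    # beta_0 + sum_m beta_m * [s_m * (x_{v(m)} - t_m)]_+
    return beta0 + sum(c * max(0.0, s * (x[v] - t)) for c, s, v, t in terms)

def relu_net(x):
    # One hidden unit per term: input weight w = s, bias b = -s * t,
    # output weight c, output bias beta0.
    return beta0 + sum(c * relu(s * x[v] + (-s * t)) for c, s, v, t in terms)

x = [1.2, 0.4]
print(mars(x) == relu_net(x))  # True
```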
With interaction terms ($K_m > 1$), MARS builds products of ReLUs. This is equivalent to a specific architecture of multiplicative interactions that standard feed-forward networks do not use.
MARS in Practice
Strengths:
- Automatic variable selection (unimportant variables get no hinge functions)
- Automatic nonlinearity detection (the algorithm only adds knots where needed)
- Handles interactions if allowed, up to a specified maximum order
- Fast to fit (each greedy forward step is a cheap least-squares search over candidate variables and knots)
Weaknesses:
- Greedy search may miss good knot placements
- Limited to piecewise linear fits (cannot capture smooth curvature as well as splines or GAMs)
- The curse of dimensionality limits scalability to moderate dimension $d$
MARS is not just linear regression with manual breakpoints
MARS automatically selects where to place breakpoints and which variables to use. You do not specify knot locations. The algorithm finds them by searching over all candidate locations (all observed data values for each variable). This is the "adaptive" in the name.
MARS interactions are not the same as polynomial interactions
A MARS interaction $(x_1 - t_1)_+ (x_2 - t_2)_+$ is locally multiplicative: it is nonzero only in the quadrant where $x_1 > t_1$ and $x_2 > t_2$. A polynomial interaction $x_1 x_2$ is global. MARS interactions are more localized and less prone to extrapolation artifacts.
Summary
- MARS builds models from hinge functions $(x - t)_+$ and $(t - x)_+$
- Forward pass greedily adds basis functions; backward pass prunes via GCV
- Hinge functions are shifted ReLUs; a first-order MARS model is a single-layer ReLU network with selected features
- Products of hinge functions capture interactions
- The approximation rate $O(M^{-1/d})$ exhibits the curse of dimensionality
- Best suited for low to moderate dimensional data
Exercises
Problem
Write out the MARS basis expansion for a model with two hinge functions, $(x - t)_+$ and $(t - x)_+$. Sketch the combined fit $\hat{f}(x) = \beta_0 + \beta_1 (x - t)_+ + \beta_2 (t - x)_+$ for values of $\beta_0$, $\beta_1$, $\beta_2$, and the knot $t$ of your choosing.
Problem
A MARS model with $M$ basis functions in $d$ dimensions uses the GCV criterion with penalty $C(M) = M + cK$, where $K$ is the number of knots. Taking $c = 3$, compare the effective penalty for a small and a large model size of your choosing.
References
Canonical:
- Friedman, "Multivariate Adaptive Regression Splines" (Annals of Statistics, 1991)
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 9.4
Current:
- Milborrow, Earth: Multivariate Adaptive Regression Splines (R package documentation, 2024)
- Bishop, Pattern Recognition and Machine Learning (2006)
- Murphy, Machine Learning: A Probabilistic Perspective (2012)
Next Topics
The natural next steps from MARS:
- Generalized additive models: smooth nonlinear regression using splines rather than hinge functions
- Feedforward networks and backpropagation: the connection between MARS hinge functions and ReLU activations in neural networks
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)