
Optimization Function Classes

Smoothing Splines

Solve a roughness-penalized least squares problem: minimize the residual sum of squares plus the integrated squared second derivative. The minimizer is a natural cubic spline with knots at the data points, the smoothing parameter has closed-form selection criteria via effective degrees of freedom and GCV, and the estimator lives in a reproducing kernel Hilbert space.


Why This Matters

A spline interpolates the data with a piecewise polynomial. A smoothing spline takes the same idea and adds a roughness penalty: instead of passing through every observation, the estimator minimizes $$\hat{f}_\lambda = \arg\min_{f \in W^2_2} \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \int (f''(x))^2 \, dx$$ over the second-order Sobolev space $W^2_2$. The two terms trade off: fit the data ($\lambda \to 0$ gives the interpolating spline) versus avoid oscillation ($\lambda \to \infty$ forces $f$ to be affine). The smoothing parameter $\lambda$ controls the bias-variance tradeoff, and closed-form leave-one-out and generalized cross-validation criteria give clean data-driven choices.

The reason this earns its own page on a site that already covers ridge regression and kernels and RKHS: smoothing splines are the function-space analogue of ridge. The ridge solution is a vector that minimizes squared loss plus an $L_2$ penalty on coefficients. The smoothing spline solution is a function that minimizes squared loss plus an $L_2$ penalty on the second derivative. The solution has finite dimension by the representer theorem; the dimension equals the sample size; the linear solver is banded with bandwidth 4. Everything that ridge gets to do, a smoothing spline does in function space.

ESL 2nd ed. §5.4 (pp. 151-153) develops the natural-cubic-spline characterization. The RKHS view is §5.8 (pp. 167-173). Green and Silverman (1994) is the canonical monograph.

Quick Version

Objective: $\sum_i (Y_i - f(X_i))^2 + \lambda \int (f''(x))^2 \, dx$ over $f \in W^2_2$
Solution: natural cubic spline with knots at the $X_i$
Hat matrix: $\boldsymbol{S}_\lambda = (\boldsymbol{I} + \lambda \boldsymbol{K})^{-1}$ for a penalty matrix $\boldsymbol{K}$ with banded factors
Effective DF: $\mathrm{df}(\lambda) = \mathrm{tr}(\boldsymbol{S}_\lambda)$
$\lambda \to 0$: interpolating natural cubic spline; $\mathrm{df} = n$
$\lambda \to \infty$: least-squares affine fit; $\mathrm{df} = 2$
GCV: $\mathrm{GCV}(\lambda) = \frac{n^{-1} \sum_i (Y_i - \hat{f}_\lambda(X_i))^2}{(1 - n^{-1} \mathrm{tr}(\boldsymbol{S}_\lambda))^2}$
Reinsch solver: banded $O(n)$ algorithm to compute $\hat{f}_\lambda$

The "natural" in "natural cubic spline" means linearity outside the data range: f(x)=0f''(x) = 0 for x<X(1)x < X_{(1)} and x>X(n)x > X_{(n)}. This is forced by the roughness penalty.

Formal Setup

Definition

Sobolev Space $W^2_2([a, b])$

$W^2_2([a, b])$ is the space of functions $f: [a, b] \to \mathbb{R}$ that are absolutely continuous with absolutely continuous derivative $f'$ and $\int_a^b (f''(x))^2 \, dx < \infty$. The penalty $\int (f'')^2$ is finite precisely on this space; the integral is the squared $L_2$ norm of $f''$.

Definition

Smoothing Spline

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$ with distinct $X_i \in [a, b]$ and $\lambda > 0$, the smoothing spline of order 2 (cubic) is the minimizer of $$\mathcal{L}_\lambda(f) = \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \int_a^b (f''(x))^2 \, dx$$ over $f \in W^2_2([a, b])$. The minimizer exists and is unique; it is a natural cubic spline with knots at the distinct $X_i$.

The same idea generalizes: penalize $\int (f^{(m)})^2$ instead, and the minimizer is a natural spline of order $2m$ (so $m = 2$ gives cubic; $m = 1$ gives linear). The cubic case is the universal default.

The Natural Cubic Spline Solution

Theorem

Solution to the Penalized Least Squares

Statement

The minimizer $\hat{f}_\lambda$ of $\mathcal{L}_\lambda$ over $W^2_2([a, b])$ is a natural cubic spline with knots at the distinct $X_i$. In particular, $\hat{f}_\lambda$ is a piecewise cubic polynomial that is $C^2$ at each knot and affine on the two outer intervals $(-\infty, X_{(1)})$ and $(X_{(n)}, \infty)$.

Writing $\boldsymbol{f} = (f(X_1), \ldots, f(X_n))^\top$ and parametrizing the natural cubic spline by its values at the knots, the fitted vector is $\hat{\boldsymbol{f}}_\lambda = \boldsymbol{S}_\lambda \boldsymbol{Y}$, where $\boldsymbol{S}_\lambda = (\boldsymbol{I}_n + \lambda \boldsymbol{K})^{-1}$ and $\boldsymbol{K}$ is the symmetric positive semidefinite penalty matrix induced by the integrated squared second derivative on the natural-cubic-spline basis; its factored form $\boldsymbol{K} = \boldsymbol{R}^\top \boldsymbol{Q}^{-1} \boldsymbol{R}$ is made explicit in the Reinsch derivation below.

Intuition

Fix the values a candidate minimizer takes at the knots. Among all $W^2_2$ functions with those knot values, the natural cubic spline interpolant uniquely minimizes $\int (f'')^2$, so any other candidate can be swapped for that spline without changing the residual term while strictly reducing the penalty. The extremals of the roughness penalty are exactly cubic polynomials between knots and affine in the outer intervals (any extra curvature in the outer regions adds penalty without improving the data fit).

The computation is sparse even though $\boldsymbol{K} = \boldsymbol{R}^\top \boldsymbol{Q}^{-1} \boldsymbol{R}$ itself is dense: the factors $\boldsymbol{R}$ and $\boldsymbol{Q}$ are banded, so the linear system $(\boldsymbol{I} + \lambda \boldsymbol{K}) \boldsymbol{f} = \boldsymbol{Y}$ solves in $O(n)$ by the Reinsch algorithm, which works with the banded factors directly.

Why It Matters

This is the result that makes the infinite-dimensional optimization practical. Without it, "minimize over $W^2_2$" would require some kind of discretization. With it, the discrete representation at the knots is exact: $n$ degrees of freedom, banded solver, no approximation. The analogous fact for kernel ridge regression is the representer theorem; in both cases the function-space optimization collapses to a finite-dimensional problem of size $n$.

Failure Mode

Two failure modes. (i) Repeated $X_i$: the natural-cubic-spline knot set is determined by the distinct $X_i$ only. Replicates contribute to the residual sum of squares at the same knot. The estimator is well-defined, but pre-processing into the mean response per unique $X_i$ with counts as weights is cleaner, as in the sketch below. (ii) Very small $\lambda$: $\boldsymbol{S}_\lambda$ approaches the identity, $\hat{f}_\lambda$ approaches the interpolating spline, and the variance blows up. The estimator is consistent only when $\lambda$ scales correctly with $n$, namely $\lambda \asymp n^{-2m/(2m+1)}$ for an order-$m$ penalty ($n^{-4/5}$ in the cubic case).
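A minimal numpy sketch of the replicate preprocessing; the helper name collapse_replicates is ours. It collapses repeated $X_i$ to unique knots, per-knot mean responses, and counts to use as weights.

```python
import numpy as np

def collapse_replicates(x, y):
    """Collapse repeated x values to (unique x, mean y, counts as weights).

    Fitting the weighted criterion sum_j w_j (ybar_j - f(x_j))^2 + penalty
    on the collapsed data reproduces the original objective up to an
    additive constant that does not depend on f.
    """
    xu, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    ybar = np.bincount(inverse, weights=y) / counts
    return xu, ybar, counts
```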

Optional Proof: Reinsch algorithm and the banded structure

ESL 2nd ed. Appendix to Ch 5 (pp. 189-191) and Green and Silverman (1994) §2.3 work this out.

Let the distinct knots be $\tau_1 < \tau_2 < \cdots < \tau_n$. Define the forward differences $h_i = \tau_{i+1} - \tau_i$ for $i = 1, \ldots, n - 1$. Build the banded $(n - 2) \times n$ matrix $\boldsymbol{R}$ (three nonzeros per row) with entries $$R_{i, i} = 1 / h_i, \quad R_{i, i+1} = -(1/h_i + 1/h_{i+1}), \quad R_{i, i+2} = 1 / h_{i+1}$$ and the tridiagonal $(n - 2) \times (n - 2)$ matrix $\boldsymbol{Q}$ with entries $$Q_{i,i} = \tfrac{1}{3}(h_i + h_{i+1}), \quad Q_{i, i+1} = Q_{i+1, i} = \tfrac{1}{6} h_{i+1}.$$ Then the penalty matrix on the values $\boldsymbol{f}$ at the knots is $\boldsymbol{K} = \boldsymbol{R}^\top \boldsymbol{Q}^{-1} \boldsymbol{R}$.

Solving $(\boldsymbol{I} + \lambda \boldsymbol{K}) \boldsymbol{f} = \boldsymbol{Y}$ naively costs $O(n^3)$. The Reinsch trick: introduce $\boldsymbol{\gamma} = \boldsymbol{Q}^{-1} \boldsymbol{R} \boldsymbol{f}$ (the second derivatives at the interior knots). Solve the joint system $$\begin{pmatrix} \boldsymbol{I} & \lambda \boldsymbol{R}^\top \\ \boldsymbol{R} & -\boldsymbol{Q} \end{pmatrix} \begin{pmatrix} \boldsymbol{f} \\ \boldsymbol{\gamma}\end{pmatrix} = \begin{pmatrix} \boldsymbol{Y} \\ \boldsymbol{0} \end{pmatrix}.$$ With the unknowns interleaved by knot, this system is banded and solves in $O(n)$.

Equivalently, $\boldsymbol{f} = \boldsymbol{Y} - \lambda \boldsymbol{R}^\top \boldsymbol{\gamma}$ where $\boldsymbol{\gamma}$ solves $(\boldsymbol{Q} + \lambda \boldsymbol{R} \boldsymbol{R}^\top) \boldsymbol{\gamma} = \boldsymbol{R} \boldsymbol{Y}$, a banded $(n - 2) \times (n - 2)$ system. The constant in the $O(n)$ is small enough that smoothing splines scale to $n = 10^6$ on a laptop.
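A numpy sketch of exactly this computation; the function names (make_RQ, reinsch_fit) are ours, and the solve is dense for readability where a production implementation would use a banded Cholesky to get the advertised $O(n)$.

```python
import numpy as np

def make_RQ(tau):
    """Banded penalty factors for natural-cubic-spline knots tau (sorted, distinct).

    R is (n-2) x n with three nonzeros per row, Q is tridiagonal
    (n-2) x (n-2); the penalty matrix is K = R^T Q^{-1} R.
    """
    n = len(tau)
    h = np.diff(tau)                        # h_i = tau_{i+1} - tau_i
    R = np.zeros((n - 2, n))
    Q = np.zeros((n - 2, n - 2))
    for i in range(n - 2):
        R[i, i] = 1.0 / h[i]
        R[i, i + 1] = -(1.0 / h[i] + 1.0 / h[i + 1])
        R[i, i + 2] = 1.0 / h[i + 1]
        Q[i, i] = (h[i] + h[i + 1]) / 3.0
        if i + 1 < n - 2:
            Q[i, i + 1] = Q[i + 1, i] = h[i + 1] / 6.0
    return R, Q

def reinsch_fit(tau, y, lam):
    """Fitted smoothing-spline values at the knots.

    Solves (Q + lam R R^T) gamma = R y, then f = y - lam R^T gamma.
    np.linalg.solve is dense for clarity; the matrix is banded, so a
    banded Cholesky gives the O(n) cost quoted in the text.
    """
    R, Q = make_RQ(tau)
    gamma = np.linalg.solve(Q + lam * (R @ R.T), R @ y)
    return y - lam * (R.T @ gamma)

# Example: tau = np.linspace(0, 1, 200); fhat = reinsch_fit(tau, y, 1e-5)
```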

Degrees of Freedom and Bandwidth Selection

A linear smoother $\hat{\boldsymbol{f}} = \boldsymbol{S} \boldsymbol{Y}$ has effective degrees of freedom equal to $\mathrm{tr}(\boldsymbol{S})$. For the smoothing spline, $\mathrm{tr}(\boldsymbol{S}_\lambda)$ runs from $2$ (as $\lambda \to \infty$, the fit is an affine line in two parameters) to $n$ (as $\lambda \to 0$, the fit interpolates and uses all $n$ degrees of freedom). The mapping is monotone, so $\lambda$ can be selected by choosing a target $\mathrm{df}$ value, as in the sketch below.
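A hedged sketch of selecting $\lambda$ for a target $\mathrm{df}$, reusing make_RQ from the Reinsch sketch above; the dense inverse is $O(n^3)$ and is for exposition only, and the bisection bracket is a heuristic.

```python
import numpy as np
from scipy.optimize import brentq

def smoother_df(tau, lam):
    """Effective degrees of freedom tr(S_lambda); dense O(n^3), exposition only."""
    R, Q = make_RQ(tau)                     # from the Reinsch sketch above
    K = R.T @ np.linalg.solve(Q, R)         # penalty matrix (dense)
    S = np.linalg.inv(np.eye(len(tau)) + lam * K)
    return np.trace(S)

def lambda_for_df(tau, target_df):
    """Invert the monotone map lambda -> df by bisection on log10(lambda)."""
    g = lambda loglam: smoother_df(tau, 10.0 ** loglam) - target_df
    return 10.0 ** brentq(g, -12.0, 8.0)    # heuristic bracket; widen if needed
```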

Generalized cross-validation (Craven and Wahba, 1979). Define $$\mathrm{GCV}(\lambda) = \frac{n^{-1} \sum_i (Y_i - \hat{f}_\lambda(X_i))^2}{(1 - n^{-1} \mathrm{tr}(\boldsymbol{S}_\lambda))^2}.$$ This approximates the leave-one-out cross-validated MSE by replacing each $S_{ii}$ in the denominator with $\mathrm{tr}(\boldsymbol{S}_\lambda)/n$. GCV is computationally cheap because each evaluation costs only one linear solve and one trace evaluation. ESL 2nd ed. §5.5 (pp. 156-161) treats GCV in detail.

LOO closed form. For any linear smoother, $$\widehat{\mathrm{LOO}}(\lambda) = \frac{1}{n} \sum_{i=1}^n \left(\frac{Y_i - \hat{f}_\lambda(X_i)}{1 - S_{ii}(\lambda)}\right)^2.$$ The diagonal entries $S_{ii}(\lambda)$ are available in $O(n)$ from the Reinsch decomposition. LOO is approximately unbiased for prediction error but a little noisier than GCV. Both criteria are sketched in code below.
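Both criteria in a few lines, again reusing make_RQ and dense algebra for clarity (no $O(n)$ claim here); the helper name gcv_and_loo and the grid are ours.

```python
import numpy as np

def gcv_and_loo(tau, y, lam):
    """GCV and closed-form LOO for one lambda (dense algebra, exposition only)."""
    R, Q = make_RQ(tau)                     # from the Reinsch sketch above
    K = R.T @ np.linalg.solve(Q, R)
    S = np.linalg.inv(np.eye(len(tau)) + lam * K)
    n = len(y)
    resid = y - S @ y
    gcv = np.mean(resid ** 2) / (1.0 - np.trace(S) / n) ** 2
    loo = np.mean((resid / (1.0 - np.diag(S))) ** 2)
    return gcv, loo

# Grid search over lambda:
# lam_hat = min(np.logspace(-8, 2, 60), key=lambda l: gcv_and_loo(tau, y, l)[0])
```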

Marginal likelihood / REML. Treat the spline as the posterior mean under a Gaussian process prior with smoothness controlled by $\lambda$. The marginal likelihood, maximized over $\lambda$, gives a Bayesian selector. Wahba's Spline Models for Observational Data (1990) Ch 1-4 develops this view.

The RKHS Interpretation

Optional Deeper Detail: Smoothing spline as kernel ridge regression on a Sobolev RKHS

ESL 2nd ed. §5.8 (pp. 167-173) develops this. The second-order Sobolev space $W^2_2([a, b])$ is a reproducing kernel Hilbert space when equipped with the inner product $$\langle f, g \rangle = f(a) g(a) + f'(a) g'(a) + \int_a^b f''(x) g''(x) \, dx.$$ The reproducing kernel is $$k(x, x') = 1 + (x - a)(x' - a) + \int_a^b (x - u)_+ (x' - u)_+ \, du,$$ where $(t)_+ = \max(t, 0)$. The two-dimensional "polynomial" part $1 + (x - a)(x' - a)$ corresponds to the unpenalized null space of the penalty $\int (f'')^2$. The integral part gives the genuine kernel.

The smoothing-spline objective is the kernel ridge regression problem $\sum_i (Y_i - f(X_i))^2 + \lambda \|f\|_K^2$ on this RKHS, with $\|f\|_K^2$ excluding the affine null space. The representer theorem gives $\hat{f}_\lambda(x) = \sum_i \alpha_i k(x, X_i) + \beta_0 + \beta_1 x$. The $\alpha_i$ solve a kernel-ridge system. The connection makes smoothing splines a special case of kernel methods: same RKHS theory, same representer theorem, same finite-dimensional reduction.
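A sketch of this representer-theorem fit, assuming $a \le \min_i X_i$. The integral part of the kernel has the closed form $\int_a^b (x-u)_+(x'-u)_+\,du = s^2(3t-s)/6$ with $s = \min(x,x') - a$ and $t = \max(x,x') - a$, and the unpenalized affine part is handled by the standard bordered system; the function names are ours.

```python
import numpy as np

def sobolev_kernel(x, xp, a=0.0):
    """Integral part of the Sobolev kernel: int (x-u)_+ (x'-u)_+ du over [a, b].

    Closed form s^2 (3t - s) / 6 with s = min(x,x') - a, t = max(x,x') - a,
    valid when b >= max of the inputs.
    """
    s = np.minimum(x, xp) - a
    t = np.maximum(x, xp) - a
    return s ** 2 * (3.0 * t - s) / 6.0

def spline_krr_fit(x, y, lam, a=0.0):
    """Smoothing spline as kernel ridge with an unpenalized affine part.

    Solves the bordered system [[K + lam I, T], [T^T, 0]] [alpha; beta] = [y; 0],
    where T = [1, x] spans the null space of the penalty.
    """
    n = len(x)
    K = sobolev_kernel(x[:, None], x[None, :], a)
    T = np.column_stack([np.ones(n), x])
    A = np.block([[K + lam * np.eye(n), T],
                  [T.T, np.zeros((2, 2))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(2)]))
    alpha, beta = sol[:n], sol[n:]
    def predict(xnew):                      # xnew: 1-D numpy array
        Knew = sobolev_kernel(xnew[:, None], x[None, :], a)
        return Knew @ alpha + beta[0] + beta[1] * xnew
    return predict
```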

The view extends naturally to higher dimensions: replace $\int (f'')^2$ with $\int (\nabla^2 f)^2$ and the resulting RKHS gives thin-plate splines instead.

Implementation Notes

The standard implementation is the Reinsch algorithm: $O(n)$ for one $\lambda$ and $O(n)$ per additional value, total $O(nL)$ for a grid of $L$ values of $\lambda$. The smoother matrix $\boldsymbol{S}_\lambda$ is dense; never form it explicitly. The trace $\mathrm{tr}(\boldsymbol{S}_\lambda)$ is computed from the $LDL^\top$ factorization of the banded system without forming $\boldsymbol{S}_\lambda$.
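The exact $O(n)$ trace reads the needed entries of the inverse off the banded factors, which is beyond this sketch. As a simpler illustration of the "never form $\boldsymbol{S}_\lambda$" rule, a Hutchinson estimator needs only matrix-free solves, each one a Reinsch solve; the helper name and probe count are ours.

```python
import numpy as np

def trace_S_hutchinson(solve_smoother, n, n_probes=30, seed=0):
    """Estimate tr(S_lambda) without forming S_lambda.

    solve_smoother(v) must return S_lambda v = (I + lam K)^{-1} v, e.g. one
    Reinsch solve (reinsch_fit above with y = v).  Rademacher probes satisfy
    E[z^T S z] = tr(S); the exact O(n) route instead reads the required
    inverse entries off the banded LDL^T factors.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        total += z @ solve_smoother(z)
    return total / n_probes

# Usage: trace_S_hutchinson(lambda v: reinsch_fit(tau, v, lam), len(tau))
```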

In R, smooth.spline() is the standard interface; in Python, scipy.interpolate.UnivariateSpline and the pygam package both expose smoothing splines. The default in smooth.spline is GCV-selected $\lambda$.
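A minimal scipy usage sketch on synthetic data. Note that UnivariateSpline parametrizes smoothness by a residual budget s (a bound on the weighted residual sum of squares) rather than by $\lambda$ directly, so s plays the role of the smoothing knob; the numbers here are illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.normal(size=200)

# Larger s means a smoother fit; n * noise_variance is a common start.
spl = UnivariateSpline(x, y, k=3, s=200 * 0.2 ** 2)
fhat = spl(x)
n_knots = len(spl.get_knots())   # fewer retained knots => smoother fit
```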

For multivariate inputs the natural univariate-spline structure breaks; the right generalization is thin-plate splines in dimension $d = 2$ and tensor-product splines for $d \geq 3$.

Canonical Example

Example

Smoothing a temperature time series

A weather station records daily mean temperature on a regular grid for $n = 365$ days. The signal has a clear annual cycle plus weather noise. Fit smoothing splines at three smoothing levels.

• $\mathrm{df} = 8$ (large $\lambda$): clean smooth annual cycle; weather noise removed
• $\mathrm{df} = 25$ (medium $\lambda$): annual cycle plus larger weather events
• $\mathrm{df} = 100$ (small $\lambda$): follows daily fluctuations; the fit chases the noise

GCV selects $\mathrm{df}$ around $12$ to $15$ for typical mid-latitude data. The bandwidth-selection question is whether you want the climate signal (low $\mathrm{df}$) or the weather signal (high $\mathrm{df}$); GCV picks the MSE-optimal point, which is climate plus major weather events.

The Reinsch solver returns the fit in roughly $0.005$ seconds for $n = 365$ on a laptop. The same fit by naively forming $\boldsymbol{S}_\lambda = (\boldsymbol{I} + \lambda \boldsymbol{K})^{-1}$ costs roughly $0.05$ seconds; the difference matters at $n = 10^5$ and above. A simulation of this example is sketched below.
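A sketch that simulates this example with illustrative numbers (signal amplitude, noise scale, and the smoothing construction are ours), reusing reinsch_fit and lambda_for_df from the sketches above.

```python
import numpy as np

rng = np.random.default_rng(1)
day = np.arange(365, dtype=float)

# Annual cycle plus short-range correlated "weather" noise (illustrative).
signal = 10.0 + 8.0 * np.sin(2.0 * np.pi * (day - 30.0) / 365.0)
weather = np.convolve(rng.normal(0.0, 2.0, 365 + 4), np.ones(5) / 5.0,
                      mode="valid")       # length 365 moving-average noise
temp = signal + weather

# Fit at the three smoothing levels from the table above.
fits = {}
for target_df in (8, 25, 100):
    lam = lambda_for_df(day, target_df)   # from the df sketch above
    fits[target_df] = reinsch_fit(day, temp, lam)
```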

Common Confusions

Watch Out

A smoothing spline is not the same as a regression spline

A regression spline fits an ordinary least squares regression on a fixed basis of splines with a chosen number of knots (typically far fewer than $n$). A smoothing spline puts a knot at every distinct $X_i$ and controls flexibility through the roughness penalty. The two are related (smoothing splines are penalized regression on the maximal natural-cubic-spline basis) but the practical workflow is different: regression splines tune the knot count; smoothing splines tune $\lambda$.

Watch Out

The smoothing parameter and the bandwidth are not the same object

$\lambda$ in a smoothing spline and $h$ in a kernel method both control the bias-variance tradeoff, but they parametrize different things. $\lambda$ trades off data fit against integrated curvature globally; $h$ controls the width of the local averaging window. The two are related asymptotically (Silverman, 1984: a smoothing spline acts like a kernel smoother with a varying bandwidth proportional to $\lambda^{1/4} f(x)^{-1/4}$, where $f$ here denotes the design density) but they are not interchangeable.
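One way to see Silverman's point numerically: the $i$-th row of $\boldsymbol{S}_\lambda$ gives the weights behind the fitted value at $X_i$, and that bump widens as $\lambda$ grows. A dense sketch reusing make_RQ from the Reinsch code above; the helper name is ours.

```python
import numpy as np

def equivalent_kernel_row(tau, lam, i):
    """Row i of S_lambda: the weights w with fhat(tau_i) = sum_j w_j Y_j.

    Dense for exposition; the row is a bump centered near tau_i whose
    width grows with lambda, matching the equivalent-kernel picture.
    """
    R, Q = make_RQ(tau)                  # from the Reinsch sketch above
    K = R.T @ np.linalg.solve(Q, R)
    S = np.linalg.inv(np.eye(len(tau)) + lam * K)
    return S[i]
```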

Watch Out

Cubic refers to the order of the polynomial, not the order of the derivative

"Cubic smoothing spline" = piecewise cubic polynomial with a penalty on the integrated squared second derivative. The penalty order m=2m = 2 and the polynomial order 2m=42m = 4 (cubic = degree 3 = order 4) are linked but the terminology varies between texts. ESL uses "order two" to mean a penalty on the second derivative; some other texts use "order two" to mean order-two polynomial.

Exercises

ExerciseCore

Problem

Verify that the minimizer of $\int (f'')^2$ over functions in $W^2_2$ with prescribed values at two points is the linear function interpolating those values. Hence argue informally why the smoothing-spline solution is affine outside the range of the data.

ExerciseAdvanced

Problem

Show that the effective degrees of freedom $\mathrm{tr}(\boldsymbol{S}_\lambda)$ is a strictly decreasing function of $\lambda \in (0, \infty)$, ranging from $n$ at $\lambda = 0^+$ to $2$ at $\lambda = \infty$.

ExerciseResearch

Problem

The smoothing spline has MSE rate $O(n^{-4/5})$ for $f \in W^2_2$ with optimal $\lambda \asymp n^{-4/5}$. State the corresponding rate for an $m$-th-order spline (penalty $\int (f^{(m)})^2$) applied to a target $f \in W^m_2$, and argue why the optimal $\lambda$ scales as $n^{-2m/(2m+1)}$.

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 5 "Basis Expansions and Regularization": §5.4 "Smoothing Splines" (pp. 151-156), §5.5 "Automatic Selection of the Smoothing Parameters" (pp. 156-161), §5.8 "Regularization and Reproducing Kernel Hilbert Spaces" (pp. 167-173), Appendix on computations (pp. 189-191).
  • Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall. The canonical monograph; covers smoothing splines, P-splines, the GLM extension, and the Bayesian view.
  • Wahba, G. (1990). Spline Models for Observational Data. SIAM. The RKHS / Bayesian foundation, including thin-plate splines and partial splines.

Foundational:

  • Reinsch, C. H. (1967). "Smoothing by Spline Functions." Numerische Mathematik 10, 177-183. The original $O(n)$ banded-solver algorithm.
  • Craven, P. and Wahba, G. (1979). "Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation." Numerische Mathematik 31, 377-403. Introduces GCV.
  • Silverman, B. W. (1984). "Spline Smoothing: The Equivalent Variable Kernel Method." Annals of Statistics 12(3), 898-916. Equivalence with a kernel smoother of varying bandwidth.

Asymptotics:

  • Stone, C. J. (1982). "Optimal Global Rates of Convergence for Nonparametric Regression." Annals of Statistics 10(4), 1040-1053.
  • van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge. Smoothing splines as an empirical-process problem.

Next Topics

  • B-splines: the numerically stable basis for cubic splines and the implementation default.
  • Thin-plate splines: the multivariate generalization with a Laplacian-based roughness penalty.
  • Ridge resolvents: the analytic structure of $\boldsymbol{S}_\lambda$ as $\lambda$ varies; same machinery for finite-dimensional ridge.
  • Generalized additive models: smoothing splines as the per-coordinate building block.
  • Local polynomial regression: the local-window alternative; Silverman's equivalent-kernel theorem connects them.

Last reviewed: May 13, 2026
