
Optimization Function Classes

Smoothing Splines

Solve a roughness-penalized least squares problem: minimize the residual sum of squares plus the integrated squared second derivative. The minimizer is a natural cubic spline with knots at the data points, the smoothing parameter has closed-form selection criteria via effective degrees of freedom and GCV, and the estimator lives in a reproducing kernel Hilbert space.


Why This Matters

A spline interpolates the data with a piecewise polynomial. A smoothing spline takes the same idea and adds a roughness penalty: instead of passing through every observation, the estimator minimizes $$\hat{f}_\lambda = \arg\min_{f \in W^2_2} \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \int (f''(x))^2 \, dx$$ over the second-order Sobolev space $W^2_2$. The two terms trade off: fit the data ($\lambda \to 0$ gives the interpolating spline) versus avoid oscillation ($\lambda \to \infty$ forces $f$ to be affine). The smoothing parameter $\lambda$ controls the bias-variance tradeoff, and closed-form leave-one-out and generalized cross-validation criteria give clean data-driven choices.

The reason this earns its own page on a site that already covers ridge regression and kernels and RKHS: smoothing splines are the function-space analogue of ridge. The ridge solution is a vector that minimizes squared loss plus an $L_2$ penalty on coefficients. The smoothing spline solution is a function that minimizes squared loss plus an $L_2$ penalty on the second derivative. The solution has finite dimension by the representer theorem; the dimension equals the sample size; the linear solver is banded with bandwidth 4. Everything that ridge gets to do, a smoothing spline does in function space.

ESL 2nd ed. §5.4 (pp. 151-153) develops the natural-cubic-spline characterization. The RKHS view is §5.8 (pp. 167-173). Green and Silverman (1994) is the canonical monograph.

Quick Version

Objective: $\sum_i (Y_i - f(X_i))^2 + \lambda \int (f''(x))^2 \, dx$ over $f \in W^2_2$
Solution: natural cubic spline with knots at the $X_i$
Hat matrix: $\boldsymbol{S}_\lambda = (\boldsymbol{I} + \lambda \boldsymbol{K})^{-1}$ for a penalty matrix $\boldsymbol{K}$ with banded factors
Effective DF: $\mathrm{df}(\lambda) = \mathrm{tr}(\boldsymbol{S}_\lambda)$
$\lambda \to 0$: interpolating natural cubic spline; $\mathrm{df} = n$
$\lambda \to \infty$: least-squares affine fit; $\mathrm{df} = 2$
GCV: $\mathrm{GCV}(\lambda) = \frac{n^{-1} \sum_i (Y_i - \hat{f}_\lambda(X_i))^2}{(1 - n^{-1} \mathrm{tr}(\boldsymbol{S}_\lambda))^2}$
Reinsch solver: banded $O(n)$ algorithm to compute $\hat{f}_\lambda$

The "natural" in "natural cubic spline" means linearity outside the data range: f(x)=0f''(x) = 0 for x<X(1)x < X_{(1)} and x>X(n)x > X_{(n)}. This is forced by the roughness penalty.

Formal Setup

Definition

Sobolev Space $W^2_2([a, b])$

$W^2_2([a, b])$ is the space of functions $f: [a, b] \to \mathbb{R}$ that are absolutely continuous with absolutely continuous derivative $f'$ and $\int_a^b (f''(x))^2 \, dx < \infty$. The penalty $\int (f'')^2$ is finite precisely on this space; the integral is the squared $L_2$ norm of $f''$.

Definition

Smoothing Spline

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$ with distinct $X_i \in [a, b]$ and $\lambda > 0$, the smoothing spline of order 2 (cubic) is the minimizer of $$\mathcal{L}_\lambda(f) = \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \int_a^b (f''(x))^2 \, dx$$ over $f \in W^2_2([a, b])$. The minimizer exists and is unique; it is a natural cubic spline with knots at the distinct $X_i$.

The same idea generalizes: penalize $\int (f^{(m)})^2$ instead, and the minimizer is a natural spline of order $2m$ (so $m = 2$ gives cubic; $m = 1$ gives linear). The cubic case is the universal default.

The Natural Cubic Spline Solution

Theorem

Solution to the Penalized Least Squares

Statement

The minimizer $\hat{f}_\lambda$ of $\mathcal{L}_\lambda$ over $W^2_2([a, b])$ is a natural cubic spline with knots at the distinct $X_i$. In particular, $\hat{f}_\lambda$ is a piecewise cubic polynomial that is $C^2$ at each knot and affine on the two outer intervals $(-\infty, X_{(1)})$ and $(X_{(n)}, \infty)$.

Writing $\boldsymbol{f} = (f(X_1), \ldots, f(X_n))^\top$ and parametrizing the natural cubic spline by its values at the knots, the fitted vector is $\hat{\boldsymbol{f}}_\lambda = \boldsymbol{S}_\lambda \boldsymbol{Y}$, where $\boldsymbol{S}_\lambda = (\boldsymbol{I}_n + \lambda \boldsymbol{K})^{-1}$ and $\boldsymbol{K}$ is the symmetric positive semidefinite penalty matrix induced by the integrated squared second derivative on the natural-cubic-spline basis; its factored form $\boldsymbol{K} = \boldsymbol{R}^\top \boldsymbol{Q}^{-1} \boldsymbol{R}$ is made explicit in the Reinsch derivation below.

Intuition

Fix the values a candidate minimizer takes at the knots. Among all $W^2_2$ functions with those knot values, the natural cubic spline interpolant uniquely minimizes $\int (f'')^2$, so any other candidate can be swapped for that spline without changing the residual term while strictly reducing the penalty. The extremals of the roughness penalty are exactly cubic polynomials between knots and affine in the outer intervals (any extra curvature in the outer regions adds penalty without improving the data fit).

The computation is sparse even though $\boldsymbol{K} = \boldsymbol{R}^\top \boldsymbol{Q}^{-1} \boldsymbol{R}$ itself is dense: the factors $\boldsymbol{R}$ and $\boldsymbol{Q}$ are banded, so the linear system $(\boldsymbol{I} + \lambda \boldsymbol{K}) \boldsymbol{f} = \boldsymbol{Y}$ solves in $O(n)$ by the Reinsch algorithm, which works with the banded factors directly.

Why It Matters

This is the result that makes the infinite-dimensional optimization practical. Without it, "minimize over $W^2_2$" would require some kind of discretization. With it, the discrete representation at the knots is exact: $n$ degrees of freedom, banded solver, no approximation. The analogous fact for kernel ridge regression is the representer theorem; in both cases the function-space optimization collapses to a finite-dimensional problem of size $n$.

Failure Mode

Two failure modes. (i) Repeated $X_i$: the natural-cubic-spline knot set is determined by the distinct $X_i$ only. Replicates contribute to the residual sum of squares at the same knot. The estimator is well-defined, but pre-processing into the mean response per unique $X_i$ with counts as weights is cleaner, as in the sketch below. (ii) Very small $\lambda$: $\boldsymbol{S}_\lambda$ approaches the identity, $\hat{f}_\lambda$ approaches the interpolating spline, and the variance blows up. The estimator is consistent only when $\lambda$ scales correctly with $n$, namely $\lambda \asymp n^{-2m/(2m+1)}$ for an order-$m$ penalty ($n^{-4/5}$ in the cubic case).
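A minimal numpy sketch of the replicate preprocessing; the helper name collapse_replicates is ours. It collapses repeated $X_i$ to unique knots, per-knot mean responses, and counts to use as weights.

```python
import numpy as np

def collapse_replicates(x, y):
    """Collapse repeated x values to (unique x, mean y, counts as weights).

    Fitting the weighted criterion sum_j w_j (ybar_j - f(x_j))^2 + penalty
    on the collapsed data reproduces the original objective up to an
    additive constant that does not depend on f.
    """
    xu, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    ybar = np.bincount(inverse, weights=y) / counts
    return xu, ybar, counts
```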

Optional Proof: Reinsch algorithm and the banded structure

ESL 2nd ed. Appendix to Ch 5 (pp. 189-191) and Green and Silverman (1994) §2.3 work this out.

Let the distinct knots be $\tau_1 < \tau_2 < \cdots < \tau_n$. Define the forward differences $h_i = \tau_{i+1} - \tau_i$ for $i = 1, \ldots, n - 1$. Build the banded $(n - 2) \times n$ matrix $\boldsymbol{R}$ (three nonzeros per row) with entries $$R_{i, i} = 1 / h_i, \quad R_{i, i+1} = -(1/h_i + 1/h_{i+1}), \quad R_{i, i+2} = 1 / h_{i+1}$$ and the tridiagonal $(n - 2) \times (n - 2)$ matrix $\boldsymbol{Q}$ with entries $$Q_{i,i} = \tfrac{1}{3}(h_i + h_{i+1}), \quad Q_{i, i+1} = Q_{i+1, i} = \tfrac{1}{6} h_{i+1}.$$ Then the penalty matrix on the values $\boldsymbol{f}$ at the knots is $\boldsymbol{K} = \boldsymbol{R}^\top \boldsymbol{Q}^{-1} \boldsymbol{R}$.

Solving $(\boldsymbol{I} + \lambda \boldsymbol{K}) \boldsymbol{f} = \boldsymbol{Y}$ naively costs $O(n^3)$. The Reinsch trick: introduce $\boldsymbol{\gamma} = \boldsymbol{Q}^{-1} \boldsymbol{R} \boldsymbol{f}$ (the second derivatives at the interior knots). Solve the joint system $$\begin{pmatrix} \boldsymbol{I} & \lambda \boldsymbol{R}^\top \\ \boldsymbol{R} & -\boldsymbol{Q} \end{pmatrix} \begin{pmatrix} \boldsymbol{f} \\ \boldsymbol{\gamma}\end{pmatrix} = \begin{pmatrix} \boldsymbol{Y} \\ \boldsymbol{0} \end{pmatrix}.$$ With the unknowns interleaved by knot, this system is banded and solves in $O(n)$.

Equivalently, $\boldsymbol{f} = \boldsymbol{Y} - \lambda \boldsymbol{R}^\top \boldsymbol{\gamma}$ where $\boldsymbol{\gamma}$ solves $(\boldsymbol{Q} + \lambda \boldsymbol{R} \boldsymbol{R}^\top) \boldsymbol{\gamma} = \boldsymbol{R} \boldsymbol{Y}$, a banded $(n - 2) \times (n - 2)$ system. The constant in the $O(n)$ is small enough that smoothing splines scale to $n = 10^6$ on a laptop.
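A numpy sketch of exactly this computation; the function names (make_RQ, reinsch_fit) are ours, and the solve is dense for readability where a production implementation would use a banded Cholesky to get the advertised $O(n)$.

```python
import numpy as np

def make_RQ(tau):
    """Banded penalty factors for natural-cubic-spline knots tau (sorted, distinct).

    R is (n-2) x n with three nonzeros per row, Q is tridiagonal
    (n-2) x (n-2); the penalty matrix is K = R^T Q^{-1} R.
    """
    n = len(tau)
    h = np.diff(tau)                        # h_i = tau_{i+1} - tau_i
    R = np.zeros((n - 2, n))
    Q = np.zeros((n - 2, n - 2))
    for i in range(n - 2):
        R[i, i] = 1.0 / h[i]
        R[i, i + 1] = -(1.0 / h[i] + 1.0 / h[i + 1])
        R[i, i + 2] = 1.0 / h[i + 1]
        Q[i, i] = (h[i] + h[i + 1]) / 3.0
        if i + 1 < n - 2:
            Q[i, i + 1] = Q[i + 1, i] = h[i + 1] / 6.0
    return R, Q

def reinsch_fit(tau, y, lam):
    """Fitted smoothing-spline values at the knots.

    Solves (Q + lam R R^T) gamma = R y, then f = y - lam R^T gamma.
    np.linalg.solve is dense for clarity; the matrix is banded, so a
    banded Cholesky gives the O(n) cost quoted in the text.
    """
    R, Q = make_RQ(tau)
    gamma = np.linalg.solve(Q + lam * (R @ R.T), R @ y)
    return y - lam * (R.T @ gamma)

# Example: tau = np.linspace(0, 1, 200); fhat = reinsch_fit(tau, y, 1e-5)
```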

Degrees of Freedom and Bandwidth Selection

A linear smoother $\hat{\boldsymbol{f}} = \boldsymbol{S} \boldsymbol{Y}$ has effective degrees of freedom equal to $\mathrm{tr}(\boldsymbol{S})$. For the smoothing spline, $\mathrm{tr}(\boldsymbol{S}_\lambda)$ runs from $2$ (as $\lambda \to \infty$, the fit is an affine line in two parameters) to $n$ (as $\lambda \to 0$, the fit interpolates and uses all $n$ degrees of freedom). The mapping is monotone, so $\lambda$ can be selected by choosing a target $\mathrm{df}$ value, as in the sketch below.
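A hedged sketch of selecting $\lambda$ for a target $\mathrm{df}$, reusing make_RQ from the Reinsch sketch above; the dense inverse is $O(n^3)$ and is for exposition only, and the bisection bracket is a heuristic.

```python
import numpy as np
from scipy.optimize import brentq

def smoother_df(tau, lam):
    """Effective degrees of freedom tr(S_lambda); dense O(n^3), exposition only."""
    R, Q = make_RQ(tau)                     # from the Reinsch sketch above
    K = R.T @ np.linalg.solve(Q, R)         # penalty matrix (dense)
    S = np.linalg.inv(np.eye(len(tau)) + lam * K)
    return np.trace(S)

def lambda_for_df(tau, target_df):
    """Invert the monotone map lambda -> df by bisection on log10(lambda)."""
    g = lambda loglam: smoother_df(tau, 10.0 ** loglam) - target_df
    return 10.0 ** brentq(g, -12.0, 8.0)    # heuristic bracket; widen if needed
```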

Generalized cross-validation (Craven and Wahba, 1979). Define $$\mathrm{GCV}(\lambda) = \frac{n^{-1} \sum_i (Y_i - \hat{f}_\lambda(X_i))^2}{(1 - n^{-1} \mathrm{tr}(\boldsymbol{S}_\lambda))^2}.$$ This approximates the leave-one-out cross-validated MSE by replacing each $S_{ii}$ in the denominator with $\mathrm{tr}(\boldsymbol{S}_\lambda)/n$. GCV is computationally cheap because each evaluation costs only one linear solve and one trace evaluation. ESL 2nd ed. §5.5 (pp. 156-161) treats GCV in detail.

LOO closed form. For any linear smoother, $$\widehat{\mathrm{LOO}}(\lambda) = \frac{1}{n} \sum_{i=1}^n \left(\frac{Y_i - \hat{f}_\lambda(X_i)}{1 - S_{ii}(\lambda)}\right)^2.$$ The diagonal entries $S_{ii}(\lambda)$ are available in $O(n)$ from the Reinsch decomposition. LOO is approximately unbiased for prediction error but a little noisier than GCV. Both criteria are sketched in code below.
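Both criteria in a few lines, again reusing make_RQ and dense algebra for clarity (no $O(n)$ claim here); the helper name gcv_and_loo and the grid are ours.

```python
import numpy as np

def gcv_and_loo(tau, y, lam):
    """GCV and closed-form LOO for one lambda (dense algebra, exposition only)."""
    R, Q = make_RQ(tau)                     # from the Reinsch sketch above
    K = R.T @ np.linalg.solve(Q, R)
    S = np.linalg.inv(np.eye(len(tau)) + lam * K)
    n = len(y)
    resid = y - S @ y
    gcv = np.mean(resid ** 2) / (1.0 - np.trace(S) / n) ** 2
    loo = np.mean((resid / (1.0 - np.diag(S))) ** 2)
    return gcv, loo

# Grid search over lambda:
# lam_hat = min(np.logspace(-8, 2, 60), key=lambda l: gcv_and_loo(tau, y, l)[0])
```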

Marginal likelihood / REML. Treat the spline as the posterior mean under a Gaussian process prior with smoothness controlled by $\lambda$. The marginal likelihood, maximized over $\lambda$, gives a Bayesian selector. Wahba's Spline Models for Observational Data (1990) Ch 1-4 develops this view.

The RKHS Interpretation

Optional Deeper Detail: Smoothing spline as kernel ridge regression on a Sobolev RKHS

ESL 2nd ed. §5.8 (pp. 167-173) develops this. The second-order Sobolev space $W^2_2([a, b])$ is a reproducing kernel Hilbert space when equipped with the inner product $$\langle f, g \rangle = f(a) g(a) + f'(a) g'(a) + \int_a^b f''(x) g''(x) \, dx.$$ The reproducing kernel is $$k(x, x') = 1 + (x - a)(x' - a) + \int_a^b (x - u)_+ (x' - u)_+ \, du,$$ where $(t)_+ = \max(t, 0)$. The two-dimensional "polynomial" part $1 + (x - a)(x' - a)$ corresponds to the unpenalized null space of the penalty $\int (f'')^2$. The integral part gives the genuine kernel.

The smoothing-spline objective is the kernel ridge regression problem $\sum_i (Y_i - f(X_i))^2 + \lambda \|f\|_K^2$ on this RKHS, with $\|f\|_K^2$ excluding the affine null space. The representer theorem gives $\hat{f}_\lambda(x) = \sum_i \alpha_i k(x, X_i) + \beta_0 + \beta_1 x$. The $\alpha_i$ solve a kernel-ridge system. The connection makes smoothing splines a special case of kernel methods: same RKHS theory, same representer theorem, same finite-dimensional reduction.
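A sketch of this representer-theorem fit, assuming $a \le \min_i X_i$. The integral part of the kernel has the closed form $\int_a^b (x-u)_+(x'-u)_+\,du = s^2(3t-s)/6$ with $s = \min(x,x') - a$ and $t = \max(x,x') - a$, and the unpenalized affine part is handled by the standard bordered system; the function names are ours.

```python
import numpy as np

def sobolev_kernel(x, xp, a=0.0):
    """Integral part of the Sobolev kernel: int (x-u)_+ (x'-u)_+ du over [a, b].

    Closed form s^2 (3t - s) / 6 with s = min(x,x') - a, t = max(x,x') - a,
    valid when b >= max of the inputs.
    """
    s = np.minimum(x, xp) - a
    t = np.maximum(x, xp) - a
    return s ** 2 * (3.0 * t - s) / 6.0

def spline_krr_fit(x, y, lam, a=0.0):
    """Smoothing spline as kernel ridge with an unpenalized affine part.

    Solves the bordered system [[K + lam I, T], [T^T, 0]] [alpha; beta] = [y; 0],
    where T = [1, x] spans the null space of the penalty.
    """
    n = len(x)
    K = sobolev_kernel(x[:, None], x[None, :], a)
    T = np.column_stack([np.ones(n), x])
    A = np.block([[K + lam * np.eye(n), T],
                  [T.T, np.zeros((2, 2))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(2)]))
    alpha, beta = sol[:n], sol[n:]
    def predict(xnew):                      # xnew: 1-D numpy array
        Knew = sobolev_kernel(xnew[:, None], x[None, :], a)
        return Knew @ alpha + beta[0] + beta[1] * xnew
    return predict
```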

The view extends naturally to higher dimensions: replace $\int (f'')^2$ with $\int (\nabla^2 f)^2$ and the resulting RKHS gives thin-plate splines instead.

Implementation Notes

The standard implementation is the Reinsch algorithm: $O(n)$ for one $\lambda$ and $O(n)$ per additional value, total $O(nL)$ for a grid of $L$ values of $\lambda$. The smoother matrix $\boldsymbol{S}_\lambda$ is dense; never form it explicitly. The trace $\mathrm{tr}(\boldsymbol{S}_\lambda)$ is computed from the $LDL^\top$ factorization of the banded system without forming $\boldsymbol{S}_\lambda$.
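The exact $O(n)$ trace reads the needed entries of the inverse off the banded factors, which is beyond this sketch. As a simpler illustration of the "never form $\boldsymbol{S}_\lambda$" rule, a Hutchinson estimator needs only matrix-free solves, each one a Reinsch solve; the helper name and probe count are ours.

```python
import numpy as np

def trace_S_hutchinson(solve_smoother, n, n_probes=30, seed=0):
    """Estimate tr(S_lambda) without forming S_lambda.

    solve_smoother(v) must return S_lambda v = (I + lam K)^{-1} v, e.g. one
    Reinsch solve (reinsch_fit above with y = v).  Rademacher probes satisfy
    E[z^T S z] = tr(S); the exact O(n) route instead reads the required
    inverse entries off the banded LDL^T factors.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        total += z @ solve_smoother(z)
    return total / n_probes

# Usage: trace_S_hutchinson(lambda v: reinsch_fit(tau, v, lam), len(tau))
```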

In R, smooth.spline() is the standard interface; in Python, scipy.interpolate.UnivariateSpline and the pygam package both expose smoothing splines. The default in smooth.spline is GCV-selected $\lambda$.
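A minimal scipy usage sketch on synthetic data. Note that UnivariateSpline parametrizes smoothness by a residual budget s (a bound on the weighted residual sum of squares) rather than by $\lambda$ directly, so s plays the role of the smoothing knob; the numbers here are illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.normal(size=200)

# Larger s means a smoother fit; n * noise_variance is a common start.
spl = UnivariateSpline(x, y, k=3, s=200 * 0.2 ** 2)
fhat = spl(x)
n_knots = len(spl.get_knots())   # fewer retained knots => smoother fit
```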

For multivariate inputs the natural univariate-spline structure breaks; the right generalization is thin-plate splines in dimension $d = 2$ and tensor-product splines for $d \geq 3$.

Canonical Example

Example

Smoothing a temperature time series

A weather station records daily mean temperature on a regular grid for $n = 365$ days. The signal has a clear annual cycle plus weather noise. Fit smoothing splines at three smoothing levels.

• $\mathrm{df} = 8$ (large $\lambda$): clean smooth annual cycle; weather noise removed
• $\mathrm{df} = 25$ (medium $\lambda$): annual cycle plus larger weather events
• $\mathrm{df} = 100$ (small $\lambda$): follows daily fluctuations; the fit chases the noise

GCV selects $\mathrm{df}$ around $12$ to $15$ for typical mid-latitude data. The bandwidth-selection question is whether you want the climate signal (low $\mathrm{df}$) or the weather signal (high $\mathrm{df}$); GCV picks the MSE-optimal point, which is climate plus major weather events.

The Reinsch solver returns the fit in roughly $0.005$ seconds for $n = 365$ on a laptop. The same fit by naively forming $\boldsymbol{S}_\lambda = (\boldsymbol{I} + \lambda \boldsymbol{K})^{-1}$ costs roughly $0.05$ seconds; the difference matters at $n = 10^5$ and above. A simulation of this example is sketched below.
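A sketch that simulates this example with illustrative numbers (signal amplitude, noise scale, and the smoothing construction are ours), reusing reinsch_fit and lambda_for_df from the sketches above.

```python
import numpy as np

rng = np.random.default_rng(1)
day = np.arange(365, dtype=float)

# Annual cycle plus short-range correlated "weather" noise (illustrative).
signal = 10.0 + 8.0 * np.sin(2.0 * np.pi * (day - 30.0) / 365.0)
weather = np.convolve(rng.normal(0.0, 2.0, 365 + 4), np.ones(5) / 5.0,
                      mode="valid")       # length 365 moving-average noise
temp = signal + weather

# Fit at the three smoothing levels from the table above.
fits = {}
for target_df in (8, 25, 100):
    lam = lambda_for_df(day, target_df)   # from the df sketch above
    fits[target_df] = reinsch_fit(day, temp, lam)
```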

Common Confusions

Watch Out

A smoothing spline is not the same as a regression spline

A regression spline fits an ordinary least squares regression on a fixed basis of splines with a chosen number of knots (typically far fewer than $n$). A smoothing spline puts a knot at every distinct $X_i$ and controls flexibility through the roughness penalty. The two are related (smoothing splines are penalized regression on the maximal natural-cubic-spline basis) but the practical workflow is different: regression splines tune the knot count; smoothing splines tune $\lambda$.

Watch Out

The smoothing parameter and the bandwidth are not the same object

$\lambda$ in a smoothing spline and $h$ in a kernel method both control the bias-variance tradeoff, but they parametrize different things. $\lambda$ trades off data fit against integrated curvature globally; $h$ controls the width of the local averaging window. The two are related asymptotically (Silverman, 1984: a smoothing spline acts like a kernel smoother with a varying bandwidth proportional to $\lambda^{1/4} f(x)^{-1/4}$, where $f$ here denotes the design density) but they are not interchangeable.
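One way to see Silverman's point numerically: the $i$-th row of $\boldsymbol{S}_\lambda$ gives the weights behind the fitted value at $X_i$, and that bump widens as $\lambda$ grows. A dense sketch reusing make_RQ from the Reinsch code above; the helper name is ours.

```python
import numpy as np

def equivalent_kernel_row(tau, lam, i):
    """Row i of S_lambda: the weights w with fhat(tau_i) = sum_j w_j Y_j.

    Dense for exposition; the row is a bump centered near tau_i whose
    width grows with lambda, matching the equivalent-kernel picture.
    """
    R, Q = make_RQ(tau)                  # from the Reinsch sketch above
    K = R.T @ np.linalg.solve(Q, R)
    S = np.linalg.inv(np.eye(len(tau)) + lam * K)
    return S[i]
```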

Watch Out

Cubic refers to the order of the polynomial, not the order of the derivative

"Cubic smoothing spline" = piecewise cubic polynomial with a penalty on the integrated squared second derivative. The penalty order m=2m = 2 and the polynomial order 2m=42m = 4 (cubic = degree 3 = order 4) are linked but the terminology varies between texts. ESL uses "order two" to mean a penalty on the second derivative; some other texts use "order two" to mean order-two polynomial.

Exercises

ExerciseCore

Problem

Verify that the minimizer of $\int (f'')^2$ over functions in $W^2_2$ with prescribed values at two points is the linear function interpolating those values. Hence argue informally why the smoothing-spline solution is affine outside the range of the data.

ExerciseAdvanced

Problem

Show that the effective degrees of freedom $\mathrm{tr}(\boldsymbol{S}_\lambda)$ is a strictly decreasing function of $\lambda \in (0, \infty)$, ranging from $n$ at $\lambda = 0^+$ to $2$ at $\lambda = \infty$.

ExerciseResearch

Problem

The smoothing spline has MSE rate $O(n^{-4/5})$ for $f \in W^2_2$ with optimal $\lambda \asymp n^{-4/5}$. State the corresponding rate for an $m$-th-order spline (penalty $\int (f^{(m)})^2$) applied to a target $f \in W^m_2$, and argue why the optimal $\lambda$ scales as $n^{-2m/(2m+1)}$.

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 5 "Basis Expansions and Regularization": §5.4 "Smoothing Splines" (pp. 151-156), §5.5 "Automatic Selection of the Smoothing Parameters" (pp. 156-161), §5.8 "Regularization and Reproducing Kernel Hilbert Spaces" (pp. 167-173), Appendix on computations (pp. 189-191).
  • Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall. The canonical monograph; covers smoothing splines, P-splines, the GLM extension, and the Bayesian view.
  • Wahba, G. (1990). Spline Models for Observational Data. SIAM. The RKHS / Bayesian foundation, including thin-plate splines and partial splines.

Foundational:

  • Reinsch, C. H. (1967). "Smoothing by Spline Functions." Numerische Mathematik 10, 177-183. The original $O(n)$ banded-solver algorithm.
  • Craven, P. and Wahba, G. (1979). "Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation." Numerische Mathematik 31, 377-403. Introduces GCV.
  • Silverman, B. W. (1984). "Spline Smoothing: The Equivalent Variable Kernel Method." Annals of Statistics 12(3), 898-916. Equivalence with a kernel smoother of varying bandwidth.

Asymptotics:

  • Stone, C. J. (1982). "Optimal Global Rates of Convergence for Nonparametric Regression." Annals of Statistics 10(4), 1040-1053.
  • van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge. Smoothing splines as an empirical-process problem.

Next Topics

  • B-splines: the numerically stable basis for cubic splines and the implementation default.
  • Thin-plate splines: the multivariate generalization with a Laplacian-based roughness penalty.
  • Ridge resolvents: the analytic structure of $\boldsymbol{S}_\lambda$ as $\lambda$ varies; same machinery for finite-dimensional ridge.
  • Generalized additive models: smoothing splines as the per-coordinate building block.
  • Local polynomial regression: the local-window alternative; Silverman's equivalent-kernel theorem connects them.

Last reviewed: May 13, 2026
