
ML Methods

Logspline Density Estimation

Model the log-density as a spline, then normalize to get a smooth, positive density estimate. Connection to exponential families, knot selection by BIC, and flexible nonparametric density estimation.


Why This Matters

Kernel density estimation (KDE) is the standard nonparametric density estimator, but it has limitations: bandwidth selection is tricky, boundary effects are problematic, and the resulting density can be wiggly with many bumps that reflect noise rather than structure.

Logspline density estimation offers an alternative: model $\log f(x)$ as a spline and exponentiate to get a density. Fitting is done via maximum likelihood estimation. The result is always positive (because $e^{\text{anything}} > 0$), always smooth (because splines are smooth), and the complexity is controlled by the number and placement of knots rather than a bandwidth parameter.

Formal Setup

Definition

Logspline Density

A logspline density has the form:

f(x;θ)=exp(j=1JθjBj(x))exp(j=1JθjBj(x))dxf(x; \theta) = \frac{\exp\left(\sum_{j=1}^{J} \theta_j B_j(x)\right)}{\int \exp\left(\sum_{j=1}^{J} \theta_j B_j(x)\right) dx}

where $B_1(x), \ldots, B_J(x)$ are B-spline basis functions (typically cubic) with knots $t_1, \ldots, t_K$, and $\theta = (\theta_1, \ldots, \theta_J)$ are coefficients estimated from data.

The denominator ensures $\int f(x; \theta)\, dx = 1$. This integral is computed numerically (it has no closed form in general).
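Since the normalizing integral has no closed form, it is evaluated by quadrature in practice. A minimal sketch in Python (the knot vector and coefficients below are made up for illustration, not fitted to data):

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import trapezoid

# Illustrative cubic B-spline on [0, 1]: a clamped knot vector with one
# interior knot gives J = 5 basis functions. Coefficients are arbitrary.
degree = 3
knots = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1], dtype=float)
theta = np.array([0.2, -0.5, 1.0, 0.3, -0.2])  # J = len(knots) - degree - 1

spline = BSpline(knots, theta, degree)  # x -> sum_j theta_j B_j(x)

# Normalizing constant Z(theta), computed by quadrature on a fine grid.
grid = np.linspace(0.0, 1.0, 2001)
Z = trapezoid(np.exp(spline(grid)), grid)

def density(x):
    """Normalized logspline density f(x; theta) = exp(spline(x)) / Z."""
    return np.exp(spline(x)) / Z

# By construction the density integrates to 1 (up to quadrature error).
print(trapezoid(density(grid), grid))
```

Note that changing any coordinate of `theta` changes $Z(\theta)$, so the constant must be recomputed whenever the coefficients move.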

Connection to Exponential Families

Proposition

Logsplines as Exponential Families

Statement

For a fixed set of knots, the logspline family $\{f(x;\theta) : \theta \in \mathbb{R}^J\}$ is a $J$-parameter exponential family with sufficient statistics $T(x) = (B_1(x), \ldots, B_J(x))$ and natural parameters $\theta = (\theta_1, \ldots, \theta_J)$.

Intuition

The log-density is linear in $\theta$: $\log f(x; \theta) = \sum_j \theta_j B_j(x) - \log Z(\theta)$. This is exactly the canonical form of an exponential family, where the log-partition function is $\log Z(\theta) = \log \int \exp(\sum_j \theta_j B_j(x))\, dx$.

Proof Sketch

Write $f(x;\theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$ with $h(x) = 1$, $T(x) = (B_1(x), \ldots, B_J(x))$, and $A(\theta) = \log \int \exp(\theta^\top T(x))\, dx$. This matches the exponential family definition.

Why It Matters

Exponential family properties give us computational advantages: the log-likelihood is concave in $\theta$ (so the MLE has a unique global maximum), the MLE satisfies the moment-matching conditions $\hat{\mathbb{E}}[B_j(X)] = \mathbb{E}_{\hat{\theta}}[B_j(X)]$ (sample averages of the basis functions equal their expectations under the fitted model), and standard exponential family theory provides asymptotic normality of $\hat{\theta}$.

Failure Mode

The exponential family structure holds only for fixed knots. When knots are selected from the data (as in practice), the overall procedure is no longer a pure exponential family MLE. The knot selection step is a model selection step that falls outside the exponential family framework.

Estimation

Given data $x_1, \ldots, x_n$, the log-likelihood is:

(θ)=i=1nj=1JθjBj(xi)nlogZ(θ)\ell(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{J} \theta_j B_j(x_i) - n \log Z(\theta)

Since the logspline family with fixed knots is an exponential family, $\ell(\theta)$ is concave in $\theta$. Standard Newton-Raphson converges to the global maximum.

The gradient and Hessian involve expectations under the current model: the gradient contains the terms $\mathbb{E}_\theta[B_j(X)]$ and the Hessian the covariances $\mathrm{Cov}_\theta(B_j(X), B_k(X))$, both computed by numerical integration.
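The fitting step can be sketched as follows, assuming simulated data on $[0, 1]$ and a hand-picked knot vector; a generic quasi-Newton optimizer on the (convex) negative log-likelihood stands in for the Newton-Raphson iteration described above:

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import trapezoid
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=200)  # simulated data on [0, 1]

degree = 3
knots = np.array([0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1], dtype=float)
J = len(knots) - degree - 1  # number of basis functions (here 7)
grid = np.linspace(0.0, 1.0, 1001)

def basis(pts):
    # Evaluating a BSpline with an identity coefficient matrix yields
    # the basis matrix: row i holds (B_1(pts_i), ..., B_J(pts_i)).
    return BSpline(knots, np.eye(J), degree)(pts)

B_data, B_grid = basis(x), basis(grid)

def neg_loglik(theta):
    # -l(theta) = -( sum_i theta . B(x_i) - n log Z(theta) ),
    # with Z(theta) recomputed by quadrature at every evaluation.
    log_z = np.log(trapezoid(np.exp(B_grid @ theta), grid))
    return -(B_data @ theta).sum() + len(x) * log_z

# Concave log-likelihood: a local optimizer reaches the global MLE.
fit = minimize(neg_loglik, np.zeros(J), method="BFGS")
theta_hat = fit.x

# Normalized fitted density on the grid; integrates to 1 by construction.
log_g = B_grid @ theta_hat
f_hat = np.exp(log_g) / trapezoid(np.exp(log_g), grid)
print(trapezoid(f_hat, grid))
```

This is a sketch, not the algorithm of the logspline R package, which additionally supplies analytic gradients and Hessians and performs data-driven knot selection.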

Knot Selection

The number and placement of knots controls model complexity:

  • Too few knots: the density is too smooth, missing important features (underfitting)
  • Too many knots: the density is too flexible, fitting noise (overfitting)

Theorem

Logspline Consistency

Statement

Assume the true density $f_0$ is positive and $r$ times continuously differentiable on a compact interval. The logspline MLE $\hat{f}_n$ converges to $f_0$ in Kullback-Leibler divergence: $D_{\text{KL}}(f_0 \| \hat{f}_n) \to 0$ as $n \to \infty$. With $K$ knots, the rate is $O(K^{-2r} + K/n)$.

Intuition

The bias term $K^{-2r}$ decreases as more knots capture finer detail. The variance term $K/n$ increases with more knots because more parameters must be estimated. The optimal $K$ balances these two terms.
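Setting the two terms equal (treating $K$ as continuous) makes the balance explicit:

```latex
K^{-2r} \asymp \frac{K}{n}
\;\Longrightarrow\;
K^{2r+1} \asymp n
\;\Longrightarrow\;
K^\ast \asymp n^{1/(2r+1)},
\qquad
O\!\big((K^\ast)^{-2r} + K^\ast/n\big) = O\!\big(n^{-2r/(2r+1)}\big)
```

This is the familiar nonparametric rate for estimating an $r$-smooth function.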

Proof Sketch

The approximation error is controlled by spline approximation theory: a spline with $K$ knots approximates a $C^r$ function to accuracy $O(K^{-r})$ in sup-norm, which contributes $O(K^{-2r})$ to the KL divergence (KL is quadratic in small perturbations of the log-density). The statistical estimation error for a $J$-parameter exponential family is $O(J/n) = O(K/n)$ by standard MLE theory. Combining gives the stated rate.

Why It Matters

This result justifies using logsplines for density estimation: with the right number of knots, the estimate converges at a near-optimal nonparametric rate. BIC-based knot selection achieves the optimal balance automatically.

Failure Mode

If the true density has unbounded support (e.g., Gaussian), the compact support assumption is violated. In practice, truncation to the data range is used, which introduces edge effects. If the true density has zeros (regions where $f_0(x) = 0$), the logspline model, which satisfies $f(x; \theta) > 0$ everywhere, cannot represent this exactly.

BIC-based knot selection: start with a minimal number of knots, then iteratively add knots at the locations that most decrease $\mathrm{BIC} = -2\ell(\hat{\theta}) + J \log n$. Stop when adding a knot no longer decreases BIC. Optionally, delete knots that do not contribute.
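A small helper makes the criterion concrete (the log-likelihoods below are hypothetical, standing in for two fits of the same data with different knot sets):

```python
import numpy as np

def bic(loglik, num_basis, n):
    """BIC = -2 * l(theta_hat) + J * log(n); smaller is better."""
    return -2.0 * loglik + num_basis * np.log(n)

# Hypothetical comparison at n = 500: accept the extra knots only if
# the log-likelihood gain outweighs the J * log(n) penalty.
small = bic(-320.0, 6, 500)   # fewer basis functions
large = bic(-310.0, 8, 500)   # two more basis functions
print(small, large)
# Here the 10-unit log-likelihood gain (worth 20 in -2*loglik) beats
# the added penalty 2 * log(500) ~ 12.4, so BIC prefers the larger model.
```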

Common Confusions

Watch Out

Logsplines are not log-transformed KDE

A logspline models $\log f(x)$ as a spline and optimizes the likelihood. Applying a kernel density estimator to $\log(x_i)$ and then transforming back is a completely different procedure that does not produce a logspline estimate.

Watch Out

The normalizing constant matters

Unlike regression splines, where the scale of the fitted function is free, in density estimation the function must integrate to 1. The normalizing constant $Z(\theta)$ depends on $\theta$ and must be recomputed at each optimization step. This makes logspline fitting more expensive than ordinary spline fitting.

Summary

  • Logspline: $f(x) \propto \exp(\text{spline}(x))$, always positive and smooth
  • Fixed knots give an exponential family: concave log-likelihood, unique MLE
  • Knot selection by BIC balances bias (too few knots) and variance (too many)
  • Convergence rate $O(K^{-2r} + K/n)$, minimized by choosing $K$ to balance the two terms
  • More structured than KDE but requires numerical integration for the normalizing constant

Exercises

ExerciseCore

Problem

A logspline model with $J = 5$ basis functions is fit to $n = 200$ observations. Write the BIC formula for this model and compute it given log-likelihood $\ell(\hat{\theta}) = -280$.

ExerciseAdvanced

Problem

Why is the log-likelihood of the logspline model concave in $\theta$ for fixed knots? State the property of exponential families that guarantees this.

References

Canonical:

  • Kooperberg & Stone, Logspline Density Estimation for Censored Data (1992)
  • Stone, Hansen, Kooperberg, Truong, Polynomial Splines and Their Tensor Products in Extended Linear Modeling (1997)

Current:

  • Kooperberg, logspline R package documentation
  • Silverman, Density Estimation for Statistics and Data Analysis (1986), Chapter 4 (context for nonparametric density estimation)
  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
