

Kernel Density Estimation

Smooth a sample into a density by placing a small bump on each point. The Rosenblatt-Parzen estimator, its bias-variance decomposition, the optimal bandwidth scaling, why MISE converges at rate n^{-4/5} in one dimension, and why the curse of dimensionality wrecks it in high dimensions.

Advanced · Tier 1 · Stable · Supporting · ~60 min
For: ML, Stats, Research

Why This Matters

A histogram lumps observations into bins of fixed width. The result is piecewise constant, depends on bin placement, and loses information at the bin boundaries. Kernel density estimation replaces the hard bin with a smooth bump centred on each observation. The estimator becomes a continuous function, its bias and variance have a clean decomposition, and the optimal bandwidth has an explicit form in the sample size and the smoothness of the target density.

The reason it earns its own page on a site that already covers k-NN and Nadaraya-Watson regression: KDE is the cleanest illustration of the bias-variance tradeoff in nonparametrics. Bias grows with the bandwidth h (the bump becomes too wide and smooths real structure away). Variance shrinks with h (more observations contribute to each evaluation point). Their sum is minimized at a unique h^\star \propto n^{-1/5}, and the resulting mean squared error decays at the slow nonparametric rate n^{-4/5}, not the parametric n^{-1}. The slow rate is not a defect of the estimator. It is a property of the problem.

The estimator also makes the curse of dimensionality concrete. In dimension d the optimal rate degrades to n^{-4/(4+d)}. By d = 10, ignoring constants, a million samples buy roughly the same MISE as about 140 samples in one dimension. Anyone who has tried to estimate a multivariate density from finite data has run into this wall; KDE is where the wall is easiest to see analytically.

Quick Version

  • Estimator: \hat{f}_h(x) = \dfrac{1}{nh} \sum_{i=1}^n K\!\left(\dfrac{x - X_i}{h}\right)
  • Kernel K: symmetric, nonnegative, \int K = 1, \int u K(u)\,du = 0, \sigma_K^2 = \int u^2 K(u)\,du
  • Pointwise bias: \tfrac{1}{2} \sigma_K^2 \, f''(x) \, h^2 + o(h^2)
  • Pointwise variance: \dfrac{R(K)}{nh} f(x) + o((nh)^{-1}), where R(K) = \int K^2
  • Optimal h: h^\star = \left(\dfrac{R(K) \int f}{n \sigma_K^4 \int (f'')^2}\right)^{1/5} \propto n^{-1/5}
  • Optimal MISE: C(K, f) \cdot n^{-4/5} in 1D
  • Rate in \mathbb{R}^d: n^{-4/(4+d)} for twice-differentiable f

The Gaussian kernel K(u) = (2\pi)^{-1/2} e^{-u^2/2} is the textbook default. The Epanechnikov kernel K(u) = \tfrac{3}{4}(1 - u^2)\mathbf{1}_{|u|\leq 1} is asymptotically optimal among nonnegative kernels in the MISE sense (it minimizes the constant C(K, f)). The choice rarely matters in practice; the bandwidth does.

Formal Setup

Definition

Rosenblatt-Parzen Kernel Density Estimator

Given an iid sample X_1, \ldots, X_n from an unknown density f on \mathbb{R}, a symmetric nonnegative kernel K: \mathbb{R} \to \mathbb{R}_{\geq 0} with \int K = 1, and a bandwidth h > 0, define

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).

The estimator is itself a probability density: it is nonnegative whenever K is, and it integrates to 1 by the substitution u = (x - X_i)/h.

The bandwidth h controls the bump width. Small h gives a spiky estimator that follows the sample closely; large h gives a smooth estimator that washes out local structure.
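Below is a minimal NumPy sketch of the estimator, just to make the bandwidth effect concrete; the function names and the toy sample are illustrative, not taken from any particular library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, sample, h, kernel=gaussian_kernel):
    """Rosenblatt-Parzen estimator f_hat_h evaluated at each point of x_grid."""
    u = (x_grid[:, None] - sample[None, :]) / h   # (m, n) matrix of scaled distances
    return kernel(u).sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)                     # X_1, ..., X_n iid N(0, 1)
grid = np.linspace(-4, 4, 401)
f_spiky = kde(grid, sample, h=0.05)               # follows the sample too closely
f_smooth = kde(grid, sample, h=1.50)              # washes out local structure
f_mid = kde(grid, sample, h=0.37)                 # near Silverman's rule for n = 200
print(np.trapz(f_mid, grid))                      # integrates to roughly 1
```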

Definition

Mean Integrated Squared Error (MISE)

The integrated risk of f^h\hat{f}_h is

\mathrm{MISE}(h) = \mathbb{E}\!\left[\int (\hat{f}_h(x) - f(x))^2\,dx\right] = \int \mathrm{Bias}^2(\hat{f}_h(x))\,dx + \int \mathrm{Var}(\hat{f}_h(x))\,dx.

Bandwidth selection minimizes MISE over h. The minimizer balances bias (which grows in h) against variance (which shrinks in h).

Pointwise Bias-Variance Decomposition

Theorem

Asymptotic MISE for KDE

Statement

Under the kernel assumptions above, assume in addition that f is twice continuously differentiable with \int (f'')^2 < \infty, that h \to 0, and that nh \to \infty. Then the pointwise bias and variance of \hat{f}_h(x) admit the expansions

\mathbb{E}[\hat{f}_h(x)] - f(x) = \tfrac{1}{2} \sigma_K^2 \, f''(x) \, h^2 + o(h^2),

\mathrm{Var}(\hat{f}_h(x)) = \frac{R(K)}{nh} f(x) + o\!\left(\frac{1}{nh}\right),

where R(K) = \int K^2 and \sigma_K^2 = \int u^2 K(u)\,du. Integrating yields

\mathrm{MISE}(h) = \tfrac{1}{4} \sigma_K^4 h^4 \int (f'')^2 + \frac{R(K)}{nh} + o\!\left(h^4 + (nh)^{-1}\right).

Minimizing over h gives h^\star \propto n^{-1/5} and \mathrm{MISE}(h^\star) = C(K, f)\, n^{-4/5}.

Intuition

The bias term has f'' in it because the bump averages the density over a window. Where f is concave (a peak) the average pulls down. Where f is convex (a valley) the average pulls up. The second derivative measures that local curvature, and the squared bandwidth h^2 measures how wide the averaging window is.

The variance term has f(x) in it because variance in a sample average scales inversely with effective sample size, and the effective sample size at x is roughly n h f(x) (the number of observations falling inside a window of width h centred at x, which is n times the probability mass in that window).

Optimal bandwidth balances the h^4 bias term against the (nh)^{-1} variance term. Equate the orders to get h \propto n^{-1/5}.
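A quick Monte Carlo sanity check of the expansion, under the assumption of a standard normal target (so f and f'' are available in closed form); the constants and helper names below are ad hoc.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, h, x0, reps = 2000, 0.30, 0.0, 2000     # sample size, bandwidth, evaluation point, replications

def kde_at(x, sample, h):
    """Gaussian-kernel KDE evaluated at a single point x."""
    return norm.pdf((x - sample) / h).sum() / (len(sample) * h)

estimates = np.array([kde_at(x0, rng.normal(size=n), h) for _ in range(reps)])

# Theory: bias ~ (1/2) * sigma_K^2 * f''(x0) * h^2 with sigma_K^2 = 1 for the Gaussian kernel,
#         variance ~ R(K) * f(x0) / (n * h) with R(K) = 1 / (2 * sqrt(pi)).
f_x0 = norm.pdf(x0)
f2_x0 = (x0**2 - 1.0) * norm.pdf(x0)       # f''(x) = (x^2 - 1) * phi(x) for the standard normal
print("empirical bias:", estimates.mean() - f_x0, " predicted:", 0.5 * f2_x0 * h**2)
print("empirical var: ", estimates.var(), " predicted:", f_x0 / (2 * np.sqrt(np.pi) * n * h))
```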

Why It Matters

The n^{-4/5} rate is the canonical nonparametric rate for twice-differentiable targets. It is slower than the parametric n^{-1} rate because we are estimating an infinite-dimensional object. Tsybakov (2009) shows that this rate is minimax-optimal over the Hölder class with smoothness \beta = 2: no estimator does better in the worst case.

Failure Mode

Three regimes break the result. (i) h does not satisfy h \to 0 and nh \to \infty; without this the estimator is inconsistent. (ii) f is not twice differentiable (jump discontinuities, atoms, fractal support); the bias then picks up a different rate. (iii) Boundaries. At the boundary of the support, the kernel mass that spills outside cannot be cancelled, and the estimator picks up an O(h) bias rather than O(h^2). Boundary corrections (reflection, boundary kernels, local polynomial fitting) restore the interior rate.

Optional Proof: Derivation of the bias-variance expansion

This follows ESL 2nd ed. §6.6.1 (pp. 208-209) and Wasserman 2006 §6.2. Write K_h(u) = h^{-1} K(u/h). Then \hat{f}_h(x) = n^{-1} \sum_i K_h(x - X_i) is an average of n iid terms.

Bias. Using the change of variables u = (x - y)/h,

\mathbb{E}[\hat{f}_h(x)] = \int K_h(x - y) f(y)\,dy = \int K(u) f(x - hu)\,du.

Expand f(x - hu) to second order in h:

f(x - hu) = f(x) - hu f'(x) + \tfrac{1}{2} h^2 u^2 f''(x) + o(h^2).

Integrate against K. The constant term gives f(x). The linear term vanishes because \int u K(u)\,du = 0 by symmetry. The quadratic term contributes \tfrac{1}{2} h^2 \sigma_K^2 f''(x).

Variance. By independence,

\mathrm{Var}(\hat{f}_h(x)) = \frac{1}{n}\mathrm{Var}(K_h(x - X_1)).

Write the variance as \mathbb{E}[K_h(x - X_1)^2] minus the squared mean; the squared-mean term is O(1) and hence lower order. Compute

\mathbb{E}[K_h(x - X_1)^2] = \int K_h(x - y)^2 f(y)\,dy = \frac{1}{h} \int K(u)^2 f(x - hu)\,du = \frac{R(K) f(x)}{h} + O(h).

Divide by n.

MISE. Square the bias and integrate; integrate the variance over x; add. Optimize the resulting h^4 + (nh)^{-1} expression by setting the derivative in h to zero, getting h^\star = \left(R(K) \int f \, / \, n \sigma_K^4 \int (f'')^2\right)^{1/5} (with \int f = 1 since f is a density).

Bandwidth Selection in Practice

Three families of methods.

Rule of thumb (Silverman, 1986). For a Gaussian target with variance \hat{\sigma}^2 and a Gaussian kernel, h_{\mathrm{Silv}} = 1.06\, \hat{\sigma}\, n^{-1/5}. Cheap, surprisingly competitive for unimodal smooth densities, badly biased for multimodal or heavy-tailed densities.

Plug-in (Sheather and Jones, 1991). Estimate \int (f'')^2 from a pilot bandwidth and substitute into the asymptotically optimal formula. This is the recommended selector in most modern references and is available in standard statistical packages (in R, bw.SJ; note that density() still defaults to the Silverman-type rule nrd0 for backward compatibility).

Cross-validation. Leave-one-out cross-validation selects the h that minimizes

\widehat{\mathrm{MISE}}_{\mathrm{LOO}}(h) = \int \hat{f}_h^2 - \frac{2}{n} \sum_{i=1}^n \hat{f}_{h, -i}(X_i),

where \hat{f}_{h, -i} omits observation i. The criterion is an unbiased estimate of MISE up to the additive constant \int f^2, but the selected h is high-variance in practice and fluctuates substantially across resamples. A code sketch of two of these selectors follows.
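Here is a sketch, in NumPy/SciPy, of Silverman's rule and the leave-one-out CV criterion, assuming a Gaussian kernel throughout (which makes \int \hat{f}_h^2 available in closed form); the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def silverman_bandwidth(x):
    """Normal-reference rule h = 1.06 * sigma_hat * n^(-1/5)."""
    return 1.06 * x.std(ddof=1) * len(x) ** (-0.2)

def loo_cv_criterion(h, x):
    """Estimate of MISE(h) up to the additive constant int f^2, for a Gaussian kernel."""
    n = len(x)
    d = x[:, None] - x[None, :]                            # pairwise differences
    # int f_hat_h^2 has a closed form: the Gaussian kernel convolved with itself.
    int_fhat_sq = norm.pdf(d, scale=h * np.sqrt(2)).sum() / n**2
    # Leave-one-out density at each X_i: drop the diagonal (self) terms.
    k = norm.pdf(d, scale=h)                               # K_h(X_i - X_j), the 1/h is included
    np.fill_diagonal(k, 0.0)
    loo = k.sum(axis=1) / (n - 1)
    return int_fhat_sq - 2.0 * loo.mean()

rng = np.random.default_rng(2)
x = rng.normal(size=200)
h_silv = silverman_bandwidth(x)
h_cv = minimize_scalar(loo_cv_criterion, bounds=(0.05, 1.5), args=(x,), method="bounded").x
print(f"Silverman: {h_silv:.3f}   LOO-CV: {h_cv:.3f}")
```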

Example

Picking a bandwidth for a univariate sample

A sample of n = 200 from a \mathcal{N}(0, 1) density. Compute three bandwidths.

  • Silverman's rule, \hat{\sigma} = 1: h = 1.06 \cdot 1 \cdot 200^{-1/5} \approx 0.366
  • Plug-in (Sheather-Jones): typically h between 0.32 and 0.36
  • Leave-one-out CV: high variance; values from 0.25 to 0.45 across resamples

For this target Silverman's rule is essentially the asymptotically optimal bandwidth: the rule is derived by plugging a Gaussian reference density into the optimal formula, and the target here is Gaussian. For a mixture of \mathcal{N}(0, 1) and \mathcal{N}(3, 0.25), Silverman over-smooths; the plug-in is more reliable.
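For comparison with off-the-shelf tooling, a small usage sketch with scipy.stats.gaussian_kde. Note that its bw_method argument is a scaling factor multiplied by the sample standard deviation, not the bandwidth h itself, and its default is Scott's factor n^{-1/(d+4)} rather than Silverman's 1.06 constant.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
x = rng.normal(size=200)

kde_scott = gaussian_kde(x)                                    # default: Scott's factor
kde_silverman = gaussian_kde(x, bw_method="silverman")
kde_manual = gaussian_kde(x, bw_method=0.366 / x.std(ddof=1))  # target h of about 0.366

grid = np.linspace(-4, 4, 201)
density = kde_manual(grid)                                     # evaluate the estimate on a grid
print(kde_scott.factor, kde_silverman.factor)                  # bandwidth factors actually used
```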

The Curse of Dimensionality

In \mathbb{R}^d with a product kernel K(u) = \prod_j K_1(u_j) and a single bandwidth h, the asymptotic MISE becomes

\mathrm{MISE}(h) = O(h^4) + O((nh^d)^{-1}).

The variance term scales as (nh^d)^{-1} because the effective sample size in a d-dimensional ball of radius h is proportional to nh^d. Optimizing gives h^\star \propto n^{-1/(4+d)} and

\mathrm{MISE}(h^\star) \propto n^{-4/(4+d)}.

The rate degrades sharply with d. For d = 1 the rate is n^{-4/5}, matching the univariate result. For d = 10 the rate is n^{-4/14} \approx n^{-0.286}. Ignoring constants, reaching MISE = 0.01 in 1D takes roughly n \approx 300; in 10D it takes roughly n \approx 10^7; by d = 30 the requirement passes 10^{17}. The curse is real and is not patched by better kernels.
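The arithmetic behind those sample-size figures, ignoring constants, is a one-line computation:

```python
# Required n for MISE = eps at the rate n^{-4/(4+d)}, ignoring constants: n = eps^{-(4+d)/4}.
eps = 0.01
for d in (1, 2, 5, 10, 20, 30):
    print(f"d = {d:2d}   n ~ {eps ** (-(4 + d) / 4):.2e}")
```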

In practice this means KDE is essentially useless past d \approx 5 to 7 without additional structure (sparsity, low intrinsic dimension, parametric components). The same wall hits k-NN density estimation, local polynomial regression, and any other purely local estimator.

Implementation Notes

Computed naively, the estimator costs O(n) per evaluation point, so evaluating \hat{f}_h on a grid of m points costs O(mn). Two accelerations matter.

FFT-based evaluation. Bin the data on a regular grid and convolve the bin counts with the kernel via the FFT. The cost is O(n + N \log N) for N grid points, so the smoothing step no longer scales with n. This is what R's density() does. The binning step introduces a small discretization error, but the result is visually identical for any reasonable grid resolution.
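A sketch of the binned-FFT idea in NumPy, assuming simple (rather than linear) binning and a Gaussian kernel; this is illustrative, not R's exact implementation.

```python
import numpy as np

def binned_fft_kde(sample, h, grid_min, grid_max, n_grid=1024):
    """Approximate a Gaussian-kernel KDE by binning the data and convolving via FFT."""
    edges = np.linspace(grid_min, grid_max, n_grid + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    counts, _ = np.histogram(sample, bins=edges)           # O(n) binning pass
    dx = centers[1] - centers[0]
    # Kernel sampled on the same spacing, centred at index n_grid // 2, then rolled so that
    # index 0 holds K_h(0); wrap-around is negligible if the grid extends a few bandwidths
    # past the data.
    half = n_grid // 2
    u = (np.arange(n_grid) - half) * dx
    kernel = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    smoothed = np.fft.irfft(np.fft.rfft(counts) * np.fft.rfft(np.roll(kernel, -half)), n=n_grid)
    return centers, smoothed / len(sample)

rng = np.random.default_rng(4)
x = rng.normal(size=100_000)
grid, f_hat = binned_fft_kde(x, h=0.1, grid_min=-5.0, grid_max=5.0)
print(np.trapz(f_hat, grid))                               # roughly 1
```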

Tree-based approximation (dual tree). For large n and irregular evaluation points, ball-tree or kd-tree algorithms drop the per-query cost to roughly O(\log n) in moderate dimension. Gray and Moore (2001) and the mlpack implementation cover the details.

Boundary corrections are the other practical issue. The standard fix at a boundary x = 0 is to replace the kernel with a boundary kernel K_b(u) satisfying \int_0^\infty K_b = 1 and \int_0^\infty u K_b = 0. Local linear (local polynomial) smoothing (see local polynomial regression) does this automatically and is the cleaner option.
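The crudest fix, reflection across a known boundary, is easy to sketch; assume here that the support is [0, \infty) and the kernel is Gaussian. Reflection removes the order-one deficit at the boundary (without it, the naive estimator at x = 0 targets roughly f(0)/2), though it fully recovers the O(h^2) rate only when f'(0) = 0, which is one reason boundary kernels or local linear fitting are preferred.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde_reflected(x_grid, sample, h, boundary=0.0):
    """KDE for data on [boundary, inf): fold back kernel mass that spills past the boundary."""
    n = len(sample)
    u_direct = (x_grid[:, None] - sample[None, :]) / h
    u_mirror = (x_grid[:, None] - (2 * boundary - sample[None, :])) / h
    f = (gaussian_kernel(u_direct) + gaussian_kernel(u_mirror)).sum(axis=1) / (n * h)
    return np.where(x_grid >= boundary, f, 0.0)

rng = np.random.default_rng(5)
x = rng.exponential(size=500)                   # true density exp(-x) on [0, inf), f(0) = 1
grid = np.linspace(0, 5, 201)
h = 0.25
naive = gaussian_kernel((grid[:, None] - x[None, :]) / h).sum(axis=1) / (len(x) * h)
fixed = kde_reflected(grid, x, h=h)
print(naive[0], fixed[0])                       # naive is near 0.5; reflected is much closer to 1
```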

Connection to Naive Bayes and Classification

ESL 2nd ed. §6.6.2 and §6.6.3 (pp. 210-211) cover the use of KDE inside naive Bayes for classification. The naive Bayes classifier estimates the class-conditional density f_k(x) for each class k as a product of marginals \prod_j \hat{f}_{k,j}(x_j), where each \hat{f}_{k,j} is a univariate KDE. The product assumption is the "naive" part. It sidesteps the curse of dimensionality at the cost of a modelling assumption that is wrong in general but often empirically useful. The mismatch is the source of every cautionary remark about naive Bayes in the textbook.
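A compact sketch of that construction: one Gaussian-kernel KDE per class and feature, with Silverman bandwidths. The class interface and the bandwidth choice are illustrative assumptions, not ESL's prescription.

```python
import numpy as np
from scipy.stats import norm

class KDENaiveBayes:
    """Naive Bayes with one univariate Gaussian-kernel KDE per class and feature."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.samples_, self.bandwidths_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            self.samples_[c] = Xc
            # Silverman's rule per feature (an illustrative default).
            self.bandwidths_[c] = 1.06 * Xc.std(axis=0, ddof=1) * len(Xc) ** (-0.2)
        return self

    def _log_class_density(self, X, c):
        Xc, h = self.samples_[c], self.bandwidths_[c]
        logp = 0.0
        for j in range(X.shape[1]):
            # Univariate KDE of feature j for class c, evaluated at every query point.
            k = norm.pdf((X[:, j, None] - Xc[None, :, j]) / h[j])
            logp = logp + np.log(k.mean(axis=1) / h[j] + 1e-300)
        return logp

    def predict(self, X):
        scores = np.stack([np.log(self.priors_[c]) + self._log_class_density(X, c)
                           for c in self.classes_], axis=1)
        return self.classes_[scores.argmax(axis=1)]

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(2.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)
print((KDENaiveBayes().fit(X, y).predict(X) == y).mean())   # training accuracy on a toy problem
```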

Canonical Examples

Example

Estimating a bimodal density

Sample 400 points from the mixture \tfrac{1}{2}\mathcal{N}(-1.5, 0.5) + \tfrac{1}{2}\mathcal{N}(1.5, 0.5).

  • h = 0.05: spiky; each sample point visible as a bump
  • h = 0.30: two clean modes, near-optimal
  • h = 1.00: single mode; the two components blurred together

The plug-in bandwidth selects roughly h \approx 0.27. The Silverman rule of thumb selects h \approx 0.45 because \hat{\sigma} \approx 1.5 counts the inter-mode spread; the rule treats the sample as unimodal and over-smooths.

Example

A heavy-tailed target

Sample from a Cauchy distribution, n = 1000. The Cauchy has no finite variance, so moment-based rules of thumb break: \hat{\sigma} is dominated by a few extreme observations and the resulting bandwidth over-smooths the centre. A single global h is also a poor compromise here, because the tails are sparse relative to the peak; a bandwidth sized for the body leaves ragged, spurious bumps around isolated tail observations. Local-bandwidth methods (varying h with x) help; uniform-bandwidth KDE does not.
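One local-bandwidth variant is the Abramson square-root law, h_i = h_0 \sqrt{g / \tilde{f}(X_i)} with \tilde{f} a pilot estimate and g its geometric mean over the sample; the sketch below is an illustrative implementation of that rule, not a reference one.

```python
import numpy as np
from scipy.stats import norm

def adaptive_kde(x_grid, sample, h0):
    """Abramson-style adaptive KDE: wider kernels where a pilot estimate says f is small."""
    # Pilot: fixed-bandwidth Gaussian KDE evaluated at the data points themselves.
    pilot = norm.pdf((sample[:, None] - sample[None, :]) / h0).mean(axis=1) / h0
    g = np.exp(np.mean(np.log(pilot)))          # geometric-mean normalization
    h_i = h0 * np.sqrt(g / pilot)               # per-observation bandwidths
    u = (x_grid[:, None] - sample[None, :]) / h_i[None, :]
    return (norm.pdf(u) / h_i[None, :]).mean(axis=1)

rng = np.random.default_rng(7)
x = rng.standard_cauchy(size=1000)
grid = np.linspace(-15, 15, 601)
f_hat = adaptive_kde(grid, x, h0=0.4)
f_true = 1.0 / (np.pi * (1.0 + grid**2))
print(np.abs(f_hat - f_true).max())             # rough check against the true Cauchy density
```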

Common Confusions

Watch Out

The kernel choice barely matters; the bandwidth matters a lot

The kernel choice changes the constant C(K, f) in MISE by at most 10-15% across the standard nonnegative kernels (Epanechnikov is at the lower edge; Gaussian and tricube are within a few percent). Changing the bandwidth by a factor of two typically changes MISE by 50-100%. Spend the effort on the bandwidth, not the kernel.

Watch Out

KDE is not the same as a smoothed histogram

A smoothed histogram first bins the data, then smooths the bin heights. KDE places a kernel directly on each observation. The two coincide only in the limit of zero bin width. With finite bins the smoothed-histogram estimator has additional discretization bias; the KDE does not.

Watch Out

A negative-lobe kernel can have lower bias but is not a density

Higher-order kernels (those with \int u^j K(u)\,du = 0 for j = 1, 2, \ldots, r-1) have bias of order h^r instead of h^2. The MISE rate improves to n^{-2r/(2r+d)}. The catch: such kernels are negative on part of their support, so \hat{f}_h can be negative and is not a density. For pure density estimation this disqualifies them; for purposes where you only need a smooth estimate of f (such as a plug-in quantity in a downstream computation), they are useful.
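A small illustration, using the standard fourth-order Gaussian kernel K(u) = \tfrac{1}{2}(3 - u^2)\phi(u) as the concrete choice (an assumption here; any kernel with vanishing second moment behaves similarly): the estimate can dip below zero while still integrating to one.

```python
import numpy as np

def phi(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_order4(u):
    """Fourth-order Gaussian kernel: integrates to 1, zero second moment, negative for |u| > sqrt(3)."""
    return 0.5 * (3.0 - u**2) * phi(u)

def kde(x_grid, sample, h, kernel):
    u = (x_grid[:, None] - sample[None, :]) / h
    return kernel(u).sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(8)
x = rng.normal(size=300)
grid = np.linspace(-5, 5, 501)
f_hat = kde(grid, x, h=0.6, kernel=kernel_order4)
print(f_hat.min())                  # typically negative somewhere: not a density
print(np.trapz(f_hat, grid))        # still integrates to roughly 1
```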

Exercises

ExerciseCore

Problem

Let K(u) = \tfrac{1}{2}\mathbf{1}_{[-1, 1]}(u) (the rectangular kernel) and let X_1, \ldots, X_n be iid \mathrm{Uniform}([0, 1]). Compute the bias of \hat{f}_h(x) at an interior point x \in (h, 1 - h) and at the boundary point x = 0. What rate in h does each have?

ExerciseAdvanced

Problem

For the Gaussian kernel K(u) = \phi(u) = (2\pi)^{-1/2} e^{-u^2/2}, compute R(K) = \int K^2 and \sigma_K^2 = \int u^2 K(u)\,du. Then derive Silverman's rule h_{\mathrm{Silv}} = 1.06\, \hat{\sigma}\, n^{-1/5} by plugging the Gaussian reference f = \phi_\sigma into the asymptotically optimal formula.

ExerciseResearch

Problem

A density f on [0, 1] has a known jump discontinuity at x_0 \in (0, 1). The standard KDE picks up an O(h) bias at x_0. Propose a modification that recovers the O(h^2) interior rate. Discuss the sample-size cost of locating the jump from data when x_0 is unknown.

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.6 "Kernel Density Estimation and Classification" (pp. 208-211). Quick textbook treatment with the classification framing.
  • Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 6 "Density Estimation". The canonical American-statistics graduate treatment; includes plug-in bandwidth and bootstrap variance.
  • Tsybakov. Introduction to Nonparametric Estimation. Springer (2009). Ch 1 (kernel estimators), Ch 2 (lower bounds). The reference for minimax rates and the Hölder-class formulation.

Foundational:

  • Rosenblatt, M. (1956). "Remarks on Some Nonparametric Estimates of a Density Function." Annals of Mathematical Statistics 27(3), 832-837. Original construction.
  • Parzen, E. (1962). "On Estimation of a Probability Density Function and Mode." Annals of Mathematical Statistics 33(3), 1065-1076. Asymptotic analysis.
  • Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall. The early monograph; source of the rule of thumb.

Bandwidth selection:

  • Sheather, S. J. and Jones, M. C. (1991). "A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation." Journal of the Royal Statistical Society B 53(3), 683-690. The plug-in default in modern statistical software.

Curse of dimensionality:

  • Stone, C. J. (1980). "Optimal Rates of Convergence for Nonparametric Estimators." Annals of Statistics 8(6), 1348-1360. Establishes the n^{-4/(4+d)} minimax rate.


Last reviewed: May 13, 2026
