Learning Theory
Kernel Density Estimation
Smooth a sample into a density by placing a small bump on each point. The Rosenblatt-Parzen estimator, its bias-variance decomposition, the optimal bandwidth scaling, why MISE converges at rate n^{-4/5} in one dimension, and why the curse of dimensionality wrecks it in high dimensions.
Prerequisites
Why This Matters
A histogram lumps observations into bins of fixed width. The result is piecewise constant, depends on bin placement, and loses information at the bin boundaries. Kernel density estimation replaces the hard bin with a smooth bump centred on each observation. The estimator becomes a continuous function, its bias and variance have a clean decomposition, and the optimal bandwidth has an explicit form in the sample size and the smoothness of the target density.
The reason it earns its own page on a site that already covers k-NN and Nadaraya-Watson regression: KDE is the cleanest illustration of the bias-variance tradeoff in nonparametrics. Bias grows with bandwidth $h$ (the bump becomes too wide and smooths real structure away). Variance shrinks with $nh$ (more observations contribute to each evaluation point). The sum of squared bias and variance is minimized at a unique $h^* \asymp n^{-1/5}$, and the resulting mean squared error decays at the slow nonparametric rate $n^{-4/5}$, not the parametric $n^{-1}$. The slow rate is not a defect of the estimator. It is a property of the problem.
The estimator also makes the curse of dimensionality concrete. In dimension $d$ the optimal rate degrades to $n^{-4/(4+d)}$. By $d = 10$, a million samples buy roughly the same MISE as about 140 samples in one dimension (equate the rates: $10^{6 \cdot 4/14} = m^{4/5}$ gives $m \approx 140$). Anyone who has tried to estimate a multivariate density from finite data has run into this wall; KDE is where the wall is easiest to see analytically.
Quick Version
| Object | Form |
|---|---|
| Estimator | $\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right)$ |
| Kernel | symmetric, nonnegative, $\int K(u)\,du = 1$, $\int u\,K(u)\,du = 0$, $\mu_2(K) = \int u^2 K(u)\,du < \infty$ |
| Pointwise bias | $\frac{h^2}{2}\,\mu_2(K)\, f''(x) + o(h^2)$ |
| Pointwise variance | $\frac{R(K)\, f(x)}{nh} + o\!\left(\frac{1}{nh}\right)$, where $R(K) = \int K(u)^2\,du$ |
| Optimal $h$ | $h^* = \left(\frac{R(K)}{\mu_2(K)^2\, R(f'')\, n}\right)^{1/5} \asymp n^{-1/5}$ |
| Optimal MISE | $O(n^{-4/5})$ in 1D |
| Rate in $\mathbb{R}^d$ | $O\!\left(n^{-4/(4+d)}\right)$ for twice-differentiable $f$ |
The Gaussian kernel is the textbook default. The Epanechnikov kernel is asymptotically optimal among nonnegative kernels in the MISE sense (it minimizes the kernel-dependent constant $\bigl(\mu_2(K)^2 R(K)^4\bigr)^{1/5}$ in the optimal MISE). The choice rarely matters in practice; the bandwidth does.
Formal Setup
Rosenblatt-Parzen Kernel Density Estimator
Given an iid sample $X_1, \dots, X_n$ from an unknown density $f$ on $\mathbb{R}$, a symmetric nonnegative kernel $K$ with $\int K(u)\,du = 1$, and a bandwidth $h > 0$, define
$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).$$
The estimator is itself a probability density. It is nonnegative whenever $K$ is, and integrates to one by the substitution $u = (x - X_i)/h$.
The bandwidth $h$ controls the bump width. Small $h$ gives a spiky estimator that follows the sample closely; large $h$ gives a smooth estimator that washes out local structure.
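A minimal sketch of the estimator as written above, assuming a Gaussian kernel; the function names (`gauss_kernel`, `kde`) are illustrative, not from any library.

```python
import numpy as np

def gauss_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, sample, h):
    """Evaluate f_hat_h at x_eval: (1 / (n h)) * sum_i K((x - X_i) / h)."""
    x_eval = np.asarray(x_eval, dtype=float)
    sample = np.asarray(sample, dtype=float)
    u = (x_eval[:, None] - sample[None, :]) / h     # shape (m, n)
    return gauss_kernel(u).mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 201)
dens = kde(grid, data, h=0.3)
print(np.sum(dens) * (grid[1] - grid[0]))           # ~1: the estimate integrates to one
```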
Mean Integrated Squared Error (MISE)
The integrated risk of $\hat f_h$ is
$$\mathrm{MISE}(h) = \mathbb{E}\int \bigl(\hat f_h(x) - f(x)\bigr)^2\,dx.$$
Bandwidth selection minimizes MISE over $h$. The minimizer balances bias (which grows in $h$) against variance (which shrinks in $h$).
Pointwise Bias-Variance Decomposition
Asymptotic MISE for KDE
Statement
Under the assumptions above, the pointwise bias and variance of $\hat f_h$ admit the expansion
$$\mathrm{Bias}\bigl[\hat f_h(x)\bigr] = \frac{h^2}{2}\,\mu_2(K)\, f''(x) + o(h^2), \qquad \mathrm{Var}\bigl[\hat f_h(x)\bigr] = \frac{R(K)\, f(x)}{nh} + o\!\left(\frac{1}{nh}\right),$$
where $\mu_2(K) = \int u^2 K(u)\,du$ and $R(K) = \int K(u)^2\,du$. Integrating yields
$$\mathrm{AMISE}(h) = \frac{h^4}{4}\,\mu_2(K)^2\, R(f'') + \frac{R(K)}{nh}.$$
Minimizing over $h$ gives
$$h^* = \left(\frac{R(K)}{\mu_2(K)^2\, R(f'')\, n}\right)^{1/5} \quad\text{and}\quad \mathrm{AMISE}(h^*) = \frac{5}{4}\bigl(\mu_2(K)^2 R(K)^4\bigr)^{1/5} R(f'')^{1/5}\, n^{-4/5}.$$
Intuition
The bias term has $f''(x)$ in it because the bump averages the density over a window. Where $f$ is concave (a peak) the average pulls the estimate down. Where $f$ is convex (a valley) the average pulls it up. The second derivative measures that local curvature, and $h^2$ measures how wide the averaging window is.
The variance term has $1/(nh)$ in it because variance in a sample average scales inversely with effective sample size, and the effective sample size at $x$ is roughly $nh\,f(x)$ (the number of observations falling inside a window of width $h$ centred at $x$, which is $n$ times the probability mass in that window).
Optimal bandwidth balances the squared bias $h^4$ against the variance $1/(nh)$. Equate the orders to get $h^* \asymp n^{-1/5}$.
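To see the constants in action, the snippet below plugs the Gaussian-kernel and Gaussian-reference values of $R(K)$, $\mu_2(K)$, and $R(f'')$ into the formula for $h^*$; the helper name is illustrative. The result reproduces the familiar $1.06\,\sigma\, n^{-1/5}$.

```python
import numpy as np

def amise_optimal_h(n, sigma=1.0):
    """h* = (R(K) / (mu2^2 * R(f'') * n))^(1/5) for Gaussian kernel and N(0, sigma^2) target."""
    R_K  = 1.0 / (2.0 * np.sqrt(np.pi))                # int K^2 for the Gaussian kernel
    mu2  = 1.0                                          # int u^2 K(u) du for the Gaussian kernel
    R_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)      # int (f'')^2 for N(0, sigma^2)
    return (R_K / (mu2**2 * R_f2 * n)) ** 0.2

for n in (100, 1_000, 10_000):
    print(n, amise_optimal_h(n))    # constant is (4/3)^{1/5} ~ 1.059, rate n^{-1/5}
```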
Why It Matters
The rate $n^{-4/5}$ is the canonical nonparametric rate for twice-differentiable targets. It is slower than the parametric rate $n^{-1}$ because we are estimating an infinite-dimensional object. Tsybakov (2009) shows that this rate is minimax-optimal over the Hölder class with smoothness $\beta = 2$: no estimator does better in the worst case.
Failure Mode
Three regimes break the result. (i) $K$ does not satisfy $\int K(u)\,du = 1$ and $\int u\,K(u)\,du = 0$. Without this the estimator is inconsistent. (ii) $f$ is not twice differentiable (jump discontinuities, mixed atoms, fractal support). The bias picks up a different rate. (iii) Boundaries. At the boundary of the support, the kernel mass falling outside cannot be cancelled and the estimator picks up a bias of order $h$ rather than $h^2$. Boundary corrections (reflection, boundary kernels, local polynomial fitting) restore the interior rate.
Optional Proof: Derivation of the bias-variance expansion
This follows ESL 2nd ed. §6.6.1 (pp. 208-209) and Wasserman 2006 §6.2. Write $\hat f_h(x) = \frac{1}{n}\sum_{i=1}^n Z_i$ with $Z_i = \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right)$. Then $\hat f_h(x)$ is an average of iid random variables.
Bias. Using the change of variables $u = (x - t)/h$,
$$\mathbb{E}\bigl[\hat f_h(x)\bigr] = \int \frac{1}{h} K\!\left(\frac{x - t}{h}\right) f(t)\,dt = \int K(u)\, f(x - hu)\,du.$$
Expand $f(x - hu)$ to second order in $hu$:
$$f(x - hu) = f(x) - hu\, f'(x) + \frac{h^2 u^2}{2} f''(x) + o(h^2).$$
Integrate against $K$. The constant term gives $f(x)$. The linear term vanishes because $\int u\, K(u)\,du = 0$ by symmetry. The quadratic term contributes $\frac{h^2}{2}\mu_2(K) f''(x)$.
Variance. By independence,
$$\mathrm{Var}\bigl[\hat f_h(x)\bigr] = \frac{1}{n}\,\mathrm{Var}\!\left[\frac{1}{h} K\!\left(\frac{x - X_1}{h}\right)\right].$$
The dominant term is the second moment $\mathbb{E}\!\left[\frac{1}{h^2} K^2\!\left(\frac{x - X_1}{h}\right)\right]$; the squared mean it subtracts is of lower order. Compute
$$\mathbb{E}\!\left[\frac{1}{h^2} K^2\!\left(\frac{x - X_1}{h}\right)\right] = \frac{1}{h}\int K(u)^2 f(x - hu)\,du = \frac{R(K)\, f(x)}{h} + o\!\left(\frac{1}{h}\right).$$
Divide by $n$.
MISE. Square the bias and integrate; integrate the variance over $x$; add. Optimize the resulting
$$\mathrm{AMISE}(h) = \frac{h^4}{4}\,\mu_2(K)^2\, R(f'') + \frac{R(K)}{nh}$$
by setting the derivative to zero, getting $h^* = \left(\frac{R(K)}{\mu_2(K)^2 R(f'')\, n}\right)^{1/5} \asymp n^{-1/5}$.
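A rough Monte Carlo check of the rate, assuming a standard normal target and the Gaussian-reference bandwidth; the grid, replication count, and sample sizes below are illustrative choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-5, 5, 401)
dx = grid[1] - grid[0]
true_f = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)

def kde(grid, sample, h):
    u = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

for n in (100, 400, 1600, 6400):
    h = 1.06 * n ** (-0.2)                        # Gaussian-reference bandwidth, sigma = 1
    ise = [np.sum((kde(grid, rng.normal(size=n), h) - true_f) ** 2) * dx
           for _ in range(50)]
    print(n, np.mean(ise))                        # should drop by roughly 4^{0.8} ~ 3x per 4x n
```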
Bandwidth Selection in Practice
Three families of methods.
Rule of thumb (Silverman, 1986). For a Gaussian target with variance $\sigma^2$ and a Gaussian kernel, $h = 1.06\,\hat\sigma\, n^{-1/5}$. Cheap, surprisingly competitive for unimodal smooth densities, badly biased for multimodal or heavy-tailed densities.
Plug-in (Sheather and Jones, 1991). Estimate $R(f'')$ from a pilot bandwidth and substitute into the asymptotic-optimal formula. This is the standard recommendation in most statistical packages (R's density() accepts it via bw = "SJ", although its historical default remains a Silverman-type rule).
Cross-validation. Leave-one-out least-squares cross-validation selects the $h$ that minimizes
$$\mathrm{CV}(h) = \int \hat f_h(x)^2\,dx - \frac{2}{n}\sum_{i=1}^n \hat f_{h,-i}(X_i),$$
where $\hat f_{h,-i}$ omits observation $i$. The CV criterion is unbiased for MISE up to a constant that does not depend on $h$, but it is high-variance in practice; the selected $h$ fluctuates substantially across resamples.
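A sketch of the least-squares CV criterion above, assuming a Gaussian kernel and a brute-force grid over $h$; the function names are illustrative and the $O(n^2)$ pairwise computation is only meant for small samples.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def lscv(sample, h, grid):
    """CV(h) = int f_hat^2 dx - (2/n) * sum_i f_hat_{-i}(X_i)."""
    n = sample.size
    dx = grid[1] - grid[0]
    f_hat = gauss((grid[:, None] - sample[None, :]) / h).mean(axis=1) / h
    sq_term = np.sum(f_hat**2) * dx
    K = gauss((sample[:, None] - sample[None, :]) / h) / h   # pairwise kernel matrix
    np.fill_diagonal(K, 0.0)
    loo_term = K.sum(axis=1) / (n - 1)                       # f_hat_{-i}(X_i) for each i
    return sq_term - 2.0 * loo_term.mean()

rng = np.random.default_rng(2)
x = rng.normal(size=300)
grid = np.linspace(-5, 5, 401)
hs = np.linspace(0.05, 1.0, 40)
scores = [lscv(x, h, grid) for h in hs]
print("CV-selected h:", hs[int(np.argmin(scores))])
```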
Picking a bandwidth for a univariate sample
A sample of size $n$ from a Gaussian target. Compute three bandwidths.
| Method | Bandwidth |
|---|---|
| Silverman's rule | $\hat h = 1.06\,\hat\sigma\, n^{-1/5}$ |
| Plug-in (Sheather-Jones) | typically close to the Silverman value for this target |
| Leave-one-out CV | high variance; the selected $\hat h$ fluctuates across resamples |
For this target, Silverman's rule coincides with the asymptotic-optimal bandwidth (the target is Gaussian; the rule is derived from a Gaussian reference). For a mixture, Silverman over-smooths; plug-in is more reliable.
The Curse of Dimensionality
In $\mathbb{R}^d$ with a product kernel and a single bandwidth $h$, the asymptotic MISE becomes
$$\mathrm{AMISE}(h) = \frac{h^4}{4}\,\mu_2(K)^2 \int \bigl(\nabla^2 f(x)\bigr)^2\,dx + \frac{R(K)^d}{n h^d}.$$
The variance term scales as $1/(nh^d)$ because the effective sample size in a $d$-dimensional ball of radius $h$ is proportional to $nh^d$. Optimizing gives $h^* \asymp n^{-1/(d+4)}$ and
$$\mathrm{AMISE}(h^*) = O\!\left(n^{-4/(d+4)}\right).$$
The rate degrades sharply with $d$. For $d = 1$ the rate is $n^{-4/5}$, matching the univariate result. For $d = 10$ the rate is $n^{-2/7}$. To reach a given MISE level $\varepsilon$, the required sample size scales like $\varepsilon^{-5/4}$ in 1D but like $\varepsilon^{-7/2}$ in 10D. The curse is real and is not patched by better kernels.
In practice this means KDE is essentially useless past a few dimensions without additional structure (sparsity, low intrinsic dimension, parametric components). The same wall hits k-NN density estimation, local polynomial regression, and any other purely local estimator.
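The arithmetic behind the wall, as a sketch: ignoring constants, the snippet equates the $d$-dimensional rate $n^{-4/(4+d)}$ to the 1D rate at a reference sample size of 10,000 and prints the sample size a $d$-dimensional problem would need to match it.

```python
# Sample size in d dimensions needed to match the MISE of n = 10,000 in one dimension,
# comparing the rates n^{-4/(4+d)} and ignoring all constants.
for d in (1, 2, 3, 5, 10):
    rate = 4.0 / (4.0 + d)
    n_equiv = 10_000 ** ((4.0 / 5.0) / rate)
    print(f"d={d:2d}  rate n^-{rate:.3f}  n needed ~ {n_equiv:.2e}")
```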
Implementation Notes
The estimator computed naively is $O(n)$ per evaluation point, so evaluating on a grid of $m$ points costs $O(nm)$. Two accelerations matter.
FFT-based evaluation. Bin the data on a fine grid and convolve the bin counts with the kernel via FFT. Cost $O(n + m\log m)$ for $m$ grid points, essentially independent of $n$ once the data are binned. This is what R's density() does. The binning step introduces a small discretization error, but the result is visually identical for any reasonable grid resolution.
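A sketch of the binned, FFT-accelerated evaluation under the assumptions above (Gaussian kernel truncated at four bandwidths, uniform grid); the function `binned_kde` is illustrative, not R's actual implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def binned_kde(sample, h, lo, hi, m=1024):
    counts, edges = np.histogram(sample, bins=m, range=(lo, hi))
    dx = edges[1] - edges[0]
    centers = 0.5 * (edges[:-1] + edges[1:])
    k = int(np.ceil(4 * h / dx))                  # truncate the kernel at 4 bandwidths
    t = np.arange(-k, k + 1) * dx                 # symmetric, odd-length kernel grid
    kern = np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2 * np.pi))
    dens = fftconvolve(counts / sample.size, kern, mode="same")
    return centers, dens

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
grid, dens = binned_kde(x, h=0.2, lo=-5.0, hi=5.0)
print(np.sum(dens) * (grid[1] - grid[0]))         # ~1: mass preserved up to binning error
```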
Tree-based approximation (dual tree). For higher dimensions and irregular evaluation grids, ball-tree or kd-tree algorithms drop the per-query cost to roughly $O(\log n)$ in moderate dimension. Gray and Moore (2001) and the mlpack implementation cover the details.
Boundary corrections are the other practical issue. The standard fix at a boundary is to replace the kernel with a boundary kernel whose zeroth moment over the truncated support is one and whose first moment there is zero. Linear local polynomial smoothing (see local polynomial regression) does this automatically and is the cleaner option.
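A sketch of the simplest boundary correction, reflection, assuming the support is $[0, \infty)$ and a Gaussian kernel; the example target (an exponential density) and all names are illustrative.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(grid, sample, h):
    u = (grid[:, None] - sample[None, :]) / h
    return gauss(u).mean(axis=1) / h

def reflected_kde(grid, sample, h):
    augmented = np.concatenate([sample, -sample])   # mirror the sample across the boundary at 0
    return 2.0 * kde(grid, augmented, h)            # renormalize back to the half-line

rng = np.random.default_rng(4)
x = rng.exponential(size=2000)                      # true density e^{-x} on [0, inf), f(0) = 1
grid = np.linspace(0, 4, 201)
print("plain KDE at 0:    ", kde(grid, x, 0.2)[0])            # roughly half the true value
print("reflected KDE at 0:", reflected_kde(grid, x, 0.2)[0])  # much closer to 1 (about 0.86 in expectation)
```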
Connection to Naive Bayes and Classification
ESL 2nd ed. §6.6.2 and §6.6.3 (pp. 210-211) cover the use of KDE inside naive Bayes for classification. The naive Bayes classifier estimates $p(x \mid y = k)$ for each class $k$ as the product of marginal class-conditional densities $\prod_j \hat f_{jk}(x_j)$, where each $\hat f_{jk}$ is a univariate KDE. The product assumption is the "naive" part. It sidesteps the curse of dimensionality at the cost of a modelling assumption that is wrong in general but often empirically useful. The mismatch is the source of every cautionary remark about naive Bayes in the textbook.
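A sketch of the construction, assuming Gaussian-kernel marginal KDEs and a shared bandwidth; the class and its methods are illustrative scaffolding, not ESL's code or any library API.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde_logpdf(x, sample, h):
    """Log of a univariate KDE at the points in x, with a small floor for numerical stability."""
    u = (np.asarray(x, dtype=float)[:, None] - sample[None, :]) / h
    return np.log(gauss(u).mean(axis=1) / h + 1e-300)

class KDENaiveBayes:
    """Class-conditional density = product of per-feature univariate KDEs (the 'naive' part)."""
    def fit(self, X, y, h=0.5):
        self.h = h
        self.classes_ = np.unique(y)
        self.priors_ = {k: float(np.mean(y == k)) for k in self.classes_}
        self.samples_ = {k: X[y == k] for k in self.classes_}
        return self

    def predict(self, X):
        scores = []
        for k in self.classes_:
            s = np.full(X.shape[0], np.log(self.priors_[k]))
            for j in range(X.shape[1]):            # sum of log marginals = log of the product
                s = s + kde_logpdf(X[:, j], self.samples_[k][:, j], self.h)
            scores.append(s)
        return self.classes_[np.argmax(np.stack(scores), axis=0)]

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.0, 1.0, (200, 2))])
y = np.repeat([0, 1], 200)
model = KDENaiveBayes().fit(X, y)
print("training accuracy:", np.mean(model.predict(X) == y))
```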
Canonical Examples
Estimating a bimodal density
Sample 400 points from a two-component Gaussian mixture with well-separated modes.
| Bandwidth | Visual outcome |
|---|---|
| too small | spiky; each sample point visible as a bump |
| near-optimal | two clean modes |
| too large | single mode; the two modes blurred away |
The plug-in bandwidth lands in the near-optimal regime. The Silverman rule of thumb over-smooths because $\hat\sigma$ counts the inter-mode spread; the rule treats the sample as unimodal.
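A quick numerical illustration of why the rule over-smooths, on a made-up well-separated mixture: the Silverman bandwidth is driven by the overall spread, so it comes out several times larger than the bandwidth an individual component would want.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])

h_silverman = 1.06 * x.std() * x.size ** (-0.2)    # sigma-hat is dominated by the inter-mode spread
h_component = 1.06 * 0.5 * 200 ** (-0.2)           # what a single component of width 0.5 wants
print(f"Silverman h  = {h_silverman:.2f}")         # ~1.0
print(f"component h  = {h_component:.2f}")         # ~0.2
```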
A heavy-tailed target
Sample from a standard Cauchy distribution, $f(x) = \frac{1}{\pi(1 + x^2)}$. The Cauchy has no second moment, so the sample standard deviation is dominated by the tails and the reference-rule bandwidth is meaningless; a single global bandwidth from the MISE-optimal formula also serves the sharp centre and the long tails badly. In practice a fixed $h$ over-smooths the peak while leaving isolated tail observations as spurious bumps. Local-bandwidth methods (varying $h$ with $x$) help; uniform-bandwidth KDE does not.
Common Confusions
The kernel choice barely matters; the bandwidth matters a lot
The kernel choice changes the constant in MISE by at most 10-15% across the standard nonnegative kernels (Epanechnikov is at the lower edge, Gaussian and tricube within a few percent). Changing the bandwidth by a factor of two typically changes MISE by 50-100%. Spend the effort on bandwidth, not kernel.
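A small check of the claim, computing the kernel-dependent constant $\bigl(\mu_2(K)^2 R(K)^4\bigr)^{1/5}$ by brute-force quadrature for a few standard kernels (the tricube mentioned above is swapped for the triangular kernel here purely for brevity).

```python
import numpy as np

u = np.linspace(-8, 8, 400001)
du = u[1] - u[0]

kernels = {
    "epanechnikov": np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0),
    "triangular":   np.where(np.abs(u) <= 1, 1.0 - np.abs(u), 0.0),
    "gaussian":     np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
}

for name, K in kernels.items():
    mu2 = np.sum(u**2 * K) * du     # second moment of the kernel
    RK  = np.sum(K**2) * du         # roughness int K^2
    print(f"{name:13s} C(K) = {(mu2**2 * RK**4) ** 0.2:.4f}")
# Epanechnikov minimizes C(K); the Gaussian constant is only a few percent larger.
```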
KDE is not the same as a smoothed histogram
A smoothed histogram first bins the data, then smooths the bin heights. KDE places a kernel directly on each observation. The two coincide only in the limit of zero bin width. With finite bins the smoothed-histogram estimator has additional discretization bias; the KDE does not.
A negative-lobe kernel can have lower bias but is not a density
Higher-order kernels (those with $\int u^j K(u)\,du = 0$ for $j = 1, \dots, \ell - 1$ and $\int u^{\ell} K(u)\,du \neq 0$ for some even $\ell > 2$) have bias of order $h^{\ell}$ instead of $h^2$. The MISE rate improves to $n^{-2\ell/(2\ell + 1)}$. The catch: such kernels are negative on part of their support, so $\hat f_h$ can be negative and is not a density. For pure density estimation this disqualifies them; for purposes where you only need a smooth estimate of $f$ (such as plugging it into a downstream computation), they are useful.
Exercises
Problem
Let $K(u) = \tfrac{1}{2}\mathbf{1}\{|u| \le 1\}$ (the rectangular kernel) and let $X_1, \dots, X_n$ be iid from a twice-differentiable density $f$ supported on $[0, 1]$ with $f(0) > 0$. Compute the bias of $\hat f_h$ at an interior point $x \in (h, 1 - h)$, and at the boundary point $x = 0$. What rate in $h$ does each have?
Problem
For the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$, compute $\mu_2(K)$ and $R(K)$. Then derive Silverman's rule by plugging the Gaussian reference $f = N(\mu, \sigma^2)$ into the asymptotic-optimal formula for $h^*$.
Problem
A density on $\mathbb{R}$ has a known jump discontinuity at $x_0$. The standard KDE picks up an $O(1)$ bias at $x_0$. Propose a modification that recovers the interior rate. Discuss the sample-size cost of locating the jump from data when $x_0$ is unknown.
References
Canonical:
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.6 "Kernel Density Estimation and Classification" (pp. 208-211). Quick textbook treatment with the classification framing.
- Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 6 "Density Estimation". The canonical American-statistics graduate treatment; includes plug-in bandwidth and bootstrap variance.
- Tsybakov. Introduction to Nonparametric Estimation. Springer (2009). Ch 1 (kernel estimators), Ch 2 (lower bounds). The reference for minimax rates and the Hölder-class formulation.
Foundational:
- Rosenblatt, M. (1956). "Remarks on Some Nonparametric Estimates of a Density Function." Annals of Mathematical Statistics 27(3), 832-837. Original construction.
- Parzen, E. (1962). "On Estimation of a Probability Density Function and Mode." Annals of Mathematical Statistics 33(3), 1065-1076. Asymptotic analysis.
- Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall. The early monograph; source of the rule of thumb.
Bandwidth selection:
- Sheather, S. J. and Jones, M. C. (1991). "A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation." Journal of the Royal Statistical Society B 53(3), 683-690. The plug-in default in modern statistical software.
Curse of dimensionality:
- Stone, C. J. (1980). "Optimal Rates of Convergence for Nonparametric Estimators." Annals of Statistics 8(6), 1348-1360. Establishes the minimax rate.
Next Topics
- Nadaraya-Watson kernel regression: the regression analogue. Same kernel, ratio of weighted sums.
- Local polynomial regression: fixes the boundary bias and improves the leading constant.
- Smoothing splines: a different route to a smooth nonparametric estimate, with an explicit roughness penalty.
- Naive Bayes: KDE as the density estimator inside a generative classifier.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- Common Probability Distributions
- Expectation, Variance, Covariance, and Moments
- Modes of Convergence of Random Variables
- Bias-Variance Tradeoff
- Kernels and Reproducing Kernel Hilbert Spaces
Derived topics
- Nadaraya-Watson Kernel Regression