

Kernel Density Estimation

Smooth a sample into a density by placing a small bump on each point. The Rosenblatt-Parzen estimator, its bias-variance decomposition, the optimal bandwidth scaling, why MISE converges at rate n^{-4/5} in one dimension, and why the curse of dimensionality wrecks it in high dimensions.

Advanced · Tier 1 · Stable · Supporting · ~60 min
For: ML, Stats, Research

Why This Matters

A histogram lumps observations into bins of fixed width. The result is piecewise constant, depends on bin placement, and loses information at the bin boundaries. Kernel density estimation replaces the hard bin with a smooth bump centred on each observation. The estimator becomes a continuous function, its bias and variance have a clean decomposition, and the optimal bandwidth has an explicit form in the sample size and the smoothness of the target density.

The reason it earns its own page on a site that already covers k-NN and Nadaraya-Watson regression: KDE is the cleanest illustration of the bias-variance tradeoff in nonparametrics. Bias grows with the bandwidth h (the bump becomes too wide and smooths real structure away). Variance shrinks with h (more observations contribute to each evaluation point). Their sum is minimized at a unique h^\star \propto n^{-1/5}, and the resulting mean squared error decays at the slow nonparametric rate n^{-4/5}, not the parametric n^{-1}. The slow rate is not a defect of the estimator. It is a property of the problem.

The estimator also makes the curse of dimensionality concrete. In dimension d the optimal rate degrades to n^{-4/(4+d)}. By d = 10, ignoring constants, a million samples buy roughly the same MISE as about 140 samples in one dimension. Anyone who has tried to estimate a multivariate density from finite data has run into this wall; KDE is where the wall is easiest to see analytically.

Quick Version

  • Estimator: \hat{f}_h(x) = \dfrac{1}{nh} \sum_{i=1}^n K\!\left(\dfrac{x - X_i}{h}\right)
  • Kernel K: symmetric, nonnegative, \int K = 1, \int u K(u)\,du = 0, \sigma_K^2 = \int u^2 K(u)\,du
  • Pointwise bias: \tfrac{1}{2} \sigma_K^2 \, f''(x) \, h^2 + o(h^2)
  • Pointwise variance: \dfrac{R(K)}{nh} f(x) + o((nh)^{-1}), where R(K) = \int K^2
  • Optimal h: h^\star = \left(\dfrac{R(K) \int f}{n \sigma_K^4 \int (f'')^2}\right)^{1/5} \propto n^{-1/5}
  • Optimal MISE: C(K, f) \cdot n^{-4/5} in 1D
  • Rate in \mathbb{R}^d: n^{-4/(4+d)} for twice-differentiable f

The Gaussian kernel K(u) = (2\pi)^{-1/2} e^{-u^2/2} is the textbook default. The Epanechnikov kernel K(u) = \tfrac{3}{4}(1 - u^2)\mathbf{1}_{|u|\leq 1} is asymptotically optimal among nonnegative kernels in the MISE sense (it minimizes the constant C(K, f)). The choice rarely matters in practice; the bandwidth does.

Formal Setup

Definition

Rosenblatt-Parzen Kernel Density Estimator

Given an iid sample X_1, \ldots, X_n from an unknown density f on \mathbb{R}, a symmetric nonnegative kernel K: \mathbb{R} \to \mathbb{R}_{\geq 0} with \int K = 1, and a bandwidth h > 0, define

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).

The estimator is itself a probability density: it is nonnegative whenever K is, and it integrates to 1 by the substitution u = (x - X_i)/h.

The bandwidth h controls the bump width. Small h gives a spiky estimator that follows the sample closely; large h gives a smooth estimator that washes out local structure.
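Below is a minimal NumPy sketch of the estimator, just to make the bandwidth effect concrete; the function names and the toy sample are illustrative, not taken from any particular library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, sample, h, kernel=gaussian_kernel):
    """Rosenblatt-Parzen estimator f_hat_h evaluated at each point of x_grid."""
    u = (x_grid[:, None] - sample[None, :]) / h   # (m, n) matrix of scaled distances
    return kernel(u).sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)                     # X_1, ..., X_n iid N(0, 1)
grid = np.linspace(-4, 4, 401)
f_spiky = kde(grid, sample, h=0.05)               # follows the sample too closely
f_smooth = kde(grid, sample, h=1.50)              # washes out local structure
f_mid = kde(grid, sample, h=0.37)                 # near Silverman's rule for n = 200
print(np.trapz(f_mid, grid))                      # integrates to roughly 1
```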

Definition

Mean Integrated Squared Error (MISE)

The integrated risk of f^h\hat{f}_h is

\mathrm{MISE}(h) = \mathbb{E}\!\left[\int (\hat{f}_h(x) - f(x))^2\,dx\right] = \int \mathrm{Bias}^2(\hat{f}_h(x))\,dx + \int \mathrm{Var}(\hat{f}_h(x))\,dx.

Bandwidth selection minimizes MISE over h. The minimizer balances bias (which grows in h) against variance (which shrinks in h).

Pointwise Bias-Variance Decomposition

Theorem

Asymptotic MISE for KDE

Statement

Under the kernel assumptions above, assume in addition that f is twice continuously differentiable with \int (f'')^2 < \infty, that h \to 0, and that nh \to \infty. Then the pointwise bias and variance of \hat{f}_h(x) admit the expansions

\mathbb{E}[\hat{f}_h(x)] - f(x) = \tfrac{1}{2} \sigma_K^2 \, f''(x) \, h^2 + o(h^2),

\mathrm{Var}(\hat{f}_h(x)) = \frac{R(K)}{nh} f(x) + o\!\left(\frac{1}{nh}\right),

where R(K) = \int K^2 and \sigma_K^2 = \int u^2 K(u)\,du. Integrating yields

\mathrm{MISE}(h) = \tfrac{1}{4} \sigma_K^4 h^4 \int (f'')^2 + \frac{R(K)}{nh} + o\!\left(h^4 + (nh)^{-1}\right).

Minimizing over h gives h^\star \propto n^{-1/5} and \mathrm{MISE}(h^\star) = C(K, f)\, n^{-4/5}.

Intuition

The bias term has f'' in it because the bump averages the density over a window. Where f is concave (a peak) the average pulls down. Where f is convex (a valley) the average pulls up. The second derivative measures that local curvature, and the squared bandwidth h^2 measures how wide the averaging window is.

The variance term has f(x) in it because variance in a sample average scales inversely with effective sample size, and the effective sample size at x is roughly n h f(x) (the number of observations falling inside a window of width h centred at x, which is n times the probability mass in that window).

Optimal bandwidth balances the h^4 bias term against the (nh)^{-1} variance term. Equate the orders to get h \propto n^{-1/5}.
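A quick Monte Carlo sanity check of the expansion, under the assumption of a standard normal target (so f and f'' are available in closed form); the constants and helper names below are ad hoc.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, h, x0, reps = 2000, 0.30, 0.0, 2000     # sample size, bandwidth, evaluation point, replications

def kde_at(x, sample, h):
    """Gaussian-kernel KDE evaluated at a single point x."""
    return norm.pdf((x - sample) / h).sum() / (len(sample) * h)

estimates = np.array([kde_at(x0, rng.normal(size=n), h) for _ in range(reps)])

# Theory: bias ~ (1/2) * sigma_K^2 * f''(x0) * h^2 with sigma_K^2 = 1 for the Gaussian kernel,
#         variance ~ R(K) * f(x0) / (n * h) with R(K) = 1 / (2 * sqrt(pi)).
f_x0 = norm.pdf(x0)
f2_x0 = (x0**2 - 1.0) * norm.pdf(x0)       # f''(x) = (x^2 - 1) * phi(x) for the standard normal
print("empirical bias:", estimates.mean() - f_x0, " predicted:", 0.5 * f2_x0 * h**2)
print("empirical var: ", estimates.var(), " predicted:", f_x0 / (2 * np.sqrt(np.pi) * n * h))
```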

Why It Matters

The n^{-4/5} rate is the canonical nonparametric rate for twice-differentiable targets. It is slower than the parametric n^{-1} rate because we are estimating an infinite-dimensional object. Tsybakov (2009) shows that this rate is minimax-optimal over the Hölder class with smoothness \beta = 2: no estimator does better in the worst case.

Failure Mode

Three regimes break the result. (i) h does not satisfy h \to 0 and nh \to \infty; without this the estimator is inconsistent. (ii) f is not twice differentiable (jump discontinuities, atoms, fractal support); the bias then picks up a different rate. (iii) Boundaries. At the boundary of the support, the kernel mass that spills outside cannot be cancelled, and the estimator picks up an O(h) bias rather than O(h^2). Boundary corrections (reflection, boundary kernels, local polynomial fitting) restore the interior rate.

Optional Proof: Derivation of the bias-variance expansion

This follows ESL 2nd ed. §6.6.1 (pp. 208-209) and Wasserman 2006 §6.2. Write K_h(u) = h^{-1} K(u/h). Then \hat{f}_h(x) = n^{-1} \sum_i K_h(x - X_i) is an average of n iid terms.

Bias. Using the change of variables u = (x - y)/h,

\mathbb{E}[\hat{f}_h(x)] = \int K_h(x - y) f(y)\,dy = \int K(u) f(x - hu)\,du.

Expand f(x - hu) to second order in h:

f(x - hu) = f(x) - hu f'(x) + \tfrac{1}{2} h^2 u^2 f''(x) + o(h^2).

Integrate against K. The constant term gives f(x). The linear term vanishes because \int u K(u)\,du = 0 by symmetry. The quadratic term contributes \tfrac{1}{2} h^2 \sigma_K^2 f''(x).

Variance. By independence,

\mathrm{Var}(\hat{f}_h(x)) = \frac{1}{n}\mathrm{Var}(K_h(x - X_1)).

Write the variance as \mathbb{E}[K_h(x - X_1)^2] minus the squared mean; the squared-mean term is O(1) and hence lower order. Compute

\mathbb{E}[K_h(x - X_1)^2] = \int K_h(x - y)^2 f(y)\,dy = \frac{1}{h} \int K(u)^2 f(x - hu)\,du = \frac{R(K) f(x)}{h} + O(h).

Divide by n.

MISE. Square the bias and integrate; integrate the variance over x; add. Optimize the resulting h^4 + (nh)^{-1} expression by setting the derivative in h to zero, getting h^\star = \left(R(K) \int f \, / \, n \sigma_K^4 \int (f'')^2\right)^{1/5} (with \int f = 1 since f is a density).

Bandwidth Selection in Practice

Three families of methods.

Rule of thumb (Silverman, 1986). For a Gaussian target with variance \hat{\sigma}^2 and a Gaussian kernel, h_{\mathrm{Silv}} = 1.06\, \hat{\sigma}\, n^{-1/5}. Cheap, surprisingly competitive for unimodal smooth densities, badly biased for multimodal or heavy-tailed densities.

Plug-in (Sheather and Jones, 1991). Estimate \int (f'')^2 from a pilot bandwidth and substitute into the asymptotically optimal formula. This is the recommended selector in most modern references and is available in standard statistical packages (in R, bw.SJ; note that density() still defaults to the Silverman-type rule nrd0 for backward compatibility).

Cross-validation. Leave-one-out cross-validation selects the h that minimizes

\widehat{\mathrm{MISE}}_{\mathrm{LOO}}(h) = \int \hat{f}_h^2 - \frac{2}{n} \sum_{i=1}^n \hat{f}_{h, -i}(X_i),

where \hat{f}_{h, -i} omits observation i. The criterion is an unbiased estimate of MISE up to the additive constant \int f^2, but the selected h is high-variance in practice and fluctuates substantially across resamples. A code sketch of two of these selectors follows.
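Here is a sketch, in NumPy/SciPy, of Silverman's rule and the leave-one-out CV criterion, assuming a Gaussian kernel throughout (which makes \int \hat{f}_h^2 available in closed form); the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def silverman_bandwidth(x):
    """Normal-reference rule h = 1.06 * sigma_hat * n^(-1/5)."""
    return 1.06 * x.std(ddof=1) * len(x) ** (-0.2)

def loo_cv_criterion(h, x):
    """Estimate of MISE(h) up to the additive constant int f^2, for a Gaussian kernel."""
    n = len(x)
    d = x[:, None] - x[None, :]                            # pairwise differences
    # int f_hat_h^2 has a closed form: the Gaussian kernel convolved with itself.
    int_fhat_sq = norm.pdf(d, scale=h * np.sqrt(2)).sum() / n**2
    # Leave-one-out density at each X_i: drop the diagonal (self) terms.
    k = norm.pdf(d, scale=h)                               # K_h(X_i - X_j), the 1/h is included
    np.fill_diagonal(k, 0.0)
    loo = k.sum(axis=1) / (n - 1)
    return int_fhat_sq - 2.0 * loo.mean()

rng = np.random.default_rng(2)
x = rng.normal(size=200)
h_silv = silverman_bandwidth(x)
h_cv = minimize_scalar(loo_cv_criterion, bounds=(0.05, 1.5), args=(x,), method="bounded").x
print(f"Silverman: {h_silv:.3f}   LOO-CV: {h_cv:.3f}")
```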

Example

Picking a bandwidth for a univariate sample

A sample of n = 200 from a \mathcal{N}(0, 1) density. Compute three bandwidths.

  • Silverman's rule, \hat{\sigma} = 1: h = 1.06 \cdot 1 \cdot 200^{-1/5} \approx 0.366
  • Plug-in (Sheather-Jones): typically h between 0.32 and 0.36
  • Leave-one-out CV: high variance; values from 0.25 to 0.45 across resamples

For this target Silverman's rule is essentially the asymptotically optimal bandwidth: the rule is derived by plugging a Gaussian reference density into the optimal formula, and the target here is Gaussian. For a mixture of \mathcal{N}(0, 1) and \mathcal{N}(3, 0.25), Silverman over-smooths; the plug-in is more reliable.
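For comparison with off-the-shelf tooling, a small usage sketch with scipy.stats.gaussian_kde. Note that its bw_method argument is a scaling factor multiplied by the sample standard deviation, not the bandwidth h itself, and its default is Scott's factor n^{-1/(d+4)} rather than Silverman's 1.06 constant.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
x = rng.normal(size=200)

kde_scott = gaussian_kde(x)                                    # default: Scott's factor
kde_silverman = gaussian_kde(x, bw_method="silverman")
kde_manual = gaussian_kde(x, bw_method=0.366 / x.std(ddof=1))  # target h of about 0.366

grid = np.linspace(-4, 4, 201)
density = kde_manual(grid)                                     # evaluate the estimate on a grid
print(kde_scott.factor, kde_silverman.factor)                  # bandwidth factors actually used
```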

The Curse of Dimensionality

In \mathbb{R}^d with a product kernel K(u) = \prod_j K_1(u_j) and a single bandwidth h, the asymptotic MISE becomes

\mathrm{MISE}(h) = O(h^4) + O((nh^d)^{-1}).

The variance term scales as (nh^d)^{-1} because the effective sample size in a d-dimensional ball of radius h is proportional to nh^d. Optimizing gives h^\star \propto n^{-1/(4+d)} and

\mathrm{MISE}(h^\star) \propto n^{-4/(4+d)}.

The rate degrades sharply with d. For d = 1 the rate is n^{-4/5}, matching the univariate result. For d = 10 the rate is n^{-4/14} \approx n^{-0.286}. Ignoring constants, reaching MISE = 0.01 in 1D takes roughly n \approx 300; in 10D it takes roughly n \approx 10^7; by d = 30 the requirement passes 10^{17}. The curse is real and is not patched by better kernels.
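The arithmetic behind those sample-size figures, ignoring constants, is a one-line computation:

```python
# Required n for MISE = eps at the rate n^{-4/(4+d)}, ignoring constants: n = eps^{-(4+d)/4}.
eps = 0.01
for d in (1, 2, 5, 10, 20, 30):
    print(f"d = {d:2d}   n ~ {eps ** (-(4 + d) / 4):.2e}")
```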

In practice this means KDE is essentially useless past d \approx 5 to 7 without additional structure (sparsity, low intrinsic dimension, parametric components). The same wall hits k-NN density estimation, local polynomial regression, and any other purely local estimator.

Implementation Notes

Computed naively, the estimator costs O(n) per evaluation point, so evaluating \hat{f}_h on a grid of m points costs O(mn). Two accelerations matter.

FFT-based evaluation. Bin the data on a regular grid and convolve the bin counts with the kernel via the FFT. The cost is O(n + N \log N) for N grid points, so the smoothing step no longer scales with n. This is what R's density() does. The binning step introduces a small discretization error, but the result is visually identical for any reasonable grid resolution.
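A sketch of the binned-FFT idea in NumPy, assuming simple (rather than linear) binning and a Gaussian kernel; this is illustrative, not R's exact implementation.

```python
import numpy as np

def binned_fft_kde(sample, h, grid_min, grid_max, n_grid=1024):
    """Approximate a Gaussian-kernel KDE by binning the data and convolving via FFT."""
    edges = np.linspace(grid_min, grid_max, n_grid + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    counts, _ = np.histogram(sample, bins=edges)           # O(n) binning pass
    dx = centers[1] - centers[0]
    # Kernel sampled on the same spacing, centred at index n_grid // 2, then rolled so that
    # index 0 holds K_h(0); wrap-around is negligible if the grid extends a few bandwidths
    # past the data.
    half = n_grid // 2
    u = (np.arange(n_grid) - half) * dx
    kernel = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    smoothed = np.fft.irfft(np.fft.rfft(counts) * np.fft.rfft(np.roll(kernel, -half)), n=n_grid)
    return centers, smoothed / len(sample)

rng = np.random.default_rng(4)
x = rng.normal(size=100_000)
grid, f_hat = binned_fft_kde(x, h=0.1, grid_min=-5.0, grid_max=5.0)
print(np.trapz(f_hat, grid))                               # roughly 1
```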

Tree-based approximation (dual tree). For large n and irregular evaluation points, ball-tree or kd-tree algorithms drop the per-query cost to roughly O(\log n) in moderate dimension. Gray and Moore (2001) and the mlpack implementation cover the details.

Boundary corrections are the other practical issue. The standard fix at a boundary x = 0 is to replace the kernel with a boundary kernel K_b(u) satisfying \int_0^\infty K_b = 1 and \int_0^\infty u K_b = 0. Local linear (local polynomial) smoothing (see local polynomial regression) does this automatically and is the cleaner option.
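The crudest fix, reflection across a known boundary, is easy to sketch; assume here that the support is [0, \infty) and the kernel is Gaussian. Reflection removes the order-one deficit at the boundary (without it, the naive estimator at x = 0 targets roughly f(0)/2), though it fully recovers the O(h^2) rate only when f'(0) = 0, which is one reason boundary kernels or local linear fitting are preferred.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde_reflected(x_grid, sample, h, boundary=0.0):
    """KDE for data on [boundary, inf): fold back kernel mass that spills past the boundary."""
    n = len(sample)
    u_direct = (x_grid[:, None] - sample[None, :]) / h
    u_mirror = (x_grid[:, None] - (2 * boundary - sample[None, :])) / h
    f = (gaussian_kernel(u_direct) + gaussian_kernel(u_mirror)).sum(axis=1) / (n * h)
    return np.where(x_grid >= boundary, f, 0.0)

rng = np.random.default_rng(5)
x = rng.exponential(size=500)                   # true density exp(-x) on [0, inf), f(0) = 1
grid = np.linspace(0, 5, 201)
h = 0.25
naive = gaussian_kernel((grid[:, None] - x[None, :]) / h).sum(axis=1) / (len(x) * h)
fixed = kde_reflected(grid, x, h=h)
print(naive[0], fixed[0])                       # naive is near 0.5; reflected is much closer to 1
```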

Connection to Naive Bayes and Classification

ESL 2nd ed. §6.6.2 and §6.6.3 (pp. 210-211) cover the use of KDE inside naive Bayes for classification. The naive Bayes classifier estimates the class-conditional density f_k(x) for each class k as a product of marginals \prod_j \hat{f}_{k,j}(x_j), where each \hat{f}_{k,j} is a univariate KDE. The product assumption is the "naive" part. It sidesteps the curse of dimensionality at the cost of a modelling assumption that is wrong in general but often empirically useful. The mismatch is the source of every cautionary remark about naive Bayes in the textbook.
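A compact sketch of that construction: one Gaussian-kernel KDE per class and feature, with Silverman bandwidths. The class interface and the bandwidth choice are illustrative assumptions, not ESL's prescription.

```python
import numpy as np
from scipy.stats import norm

class KDENaiveBayes:
    """Naive Bayes with one univariate Gaussian-kernel KDE per class and feature."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.samples_, self.bandwidths_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            self.samples_[c] = Xc
            # Silverman's rule per feature (an illustrative default).
            self.bandwidths_[c] = 1.06 * Xc.std(axis=0, ddof=1) * len(Xc) ** (-0.2)
        return self

    def _log_class_density(self, X, c):
        Xc, h = self.samples_[c], self.bandwidths_[c]
        logp = 0.0
        for j in range(X.shape[1]):
            # Univariate KDE of feature j for class c, evaluated at every query point.
            k = norm.pdf((X[:, j, None] - Xc[None, :, j]) / h[j])
            logp = logp + np.log(k.mean(axis=1) / h[j] + 1e-300)
        return logp

    def predict(self, X):
        scores = np.stack([np.log(self.priors_[c]) + self._log_class_density(X, c)
                           for c in self.classes_], axis=1)
        return self.classes_[scores.argmax(axis=1)]

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(2.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)
print((KDENaiveBayes().fit(X, y).predict(X) == y).mean())   # training accuracy on a toy problem
```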

Canonical Examples

Example

Estimating a bimodal density

Sample 400 points from the mixture \tfrac{1}{2}\mathcal{N}(-1.5, 0.5) + \tfrac{1}{2}\mathcal{N}(1.5, 0.5).

  • h = 0.05: spiky; each sample point visible as a bump
  • h = 0.30: two clean modes, near-optimal
  • h = 1.00: single mode; the two components blurred together

The plug-in bandwidth selects roughly h \approx 0.27. The Silverman rule of thumb selects h \approx 0.45 because \hat{\sigma} \approx 1.5 counts the inter-mode spread; the rule treats the sample as unimodal and over-smooths.

Example

A heavy-tailed target

Sample from a Cauchy distribution, n = 1000. The Cauchy has no finite variance, so moment-based rules of thumb break: \hat{\sigma} is dominated by a few extreme observations and the resulting bandwidth over-smooths the centre. A single global h is also a poor compromise here, because the tails are sparse relative to the peak; a bandwidth sized for the body leaves ragged, spurious bumps around isolated tail observations. Local-bandwidth methods (varying h with x) help; uniform-bandwidth KDE does not.
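One local-bandwidth variant is the Abramson square-root law, h_i = h_0 \sqrt{g / \tilde{f}(X_i)} with \tilde{f} a pilot estimate and g its geometric mean over the sample; the sketch below is an illustrative implementation of that rule, not a reference one.

```python
import numpy as np
from scipy.stats import norm

def adaptive_kde(x_grid, sample, h0):
    """Abramson-style adaptive KDE: wider kernels where a pilot estimate says f is small."""
    # Pilot: fixed-bandwidth Gaussian KDE evaluated at the data points themselves.
    pilot = norm.pdf((sample[:, None] - sample[None, :]) / h0).mean(axis=1) / h0
    g = np.exp(np.mean(np.log(pilot)))          # geometric-mean normalization
    h_i = h0 * np.sqrt(g / pilot)               # per-observation bandwidths
    u = (x_grid[:, None] - sample[None, :]) / h_i[None, :]
    return (norm.pdf(u) / h_i[None, :]).mean(axis=1)

rng = np.random.default_rng(7)
x = rng.standard_cauchy(size=1000)
grid = np.linspace(-15, 15, 601)
f_hat = adaptive_kde(grid, x, h0=0.4)
f_true = 1.0 / (np.pi * (1.0 + grid**2))
print(np.abs(f_hat - f_true).max())             # rough check against the true Cauchy density
```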

Common Confusions

Watch Out

The kernel choice barely matters; the bandwidth matters a lot

The kernel choice changes the constant C(K, f) in MISE by at most 10-15% across the standard nonnegative kernels (Epanechnikov is at the lower edge; Gaussian and tricube are within a few percent). Changing the bandwidth by a factor of two typically changes MISE by 50-100%. Spend the effort on the bandwidth, not the kernel.

Watch Out

KDE is not the same as a smoothed histogram

A smoothed histogram first bins the data, then smooths the bin heights. KDE places a kernel directly on each observation. The two coincide only in the limit of zero bin width. With finite bins the smoothed-histogram estimator has additional discretization bias; the KDE does not.

Watch Out

A negative-lobe kernel can have lower bias but is not a density

Higher-order kernels (those with \int u^j K(u)\,du = 0 for j = 1, 2, \ldots, r-1) have bias of order h^r instead of h^2. The MISE rate improves to n^{-2r/(2r+d)}. The catch: such kernels are negative on part of their support, so \hat{f}_h can be negative and is not a density. For pure density estimation this disqualifies them; for purposes where you only need a smooth estimate of f (such as a plug-in quantity in a downstream computation), they are useful.
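A small illustration, using the standard fourth-order Gaussian kernel K(u) = \tfrac{1}{2}(3 - u^2)\phi(u) as the concrete choice (an assumption here; any kernel with vanishing second moment behaves similarly): the estimate can dip below zero while still integrating to one.

```python
import numpy as np

def phi(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_order4(u):
    """Fourth-order Gaussian kernel: integrates to 1, zero second moment, negative for |u| > sqrt(3)."""
    return 0.5 * (3.0 - u**2) * phi(u)

def kde(x_grid, sample, h, kernel):
    u = (x_grid[:, None] - sample[None, :]) / h
    return kernel(u).sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(8)
x = rng.normal(size=300)
grid = np.linspace(-5, 5, 501)
f_hat = kde(grid, x, h=0.6, kernel=kernel_order4)
print(f_hat.min())                  # typically negative somewhere: not a density
print(np.trapz(f_hat, grid))        # still integrates to roughly 1
```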

Exercises

ExerciseCore

Problem

Let K(u) = \tfrac{1}{2}\mathbf{1}_{[-1, 1]}(u) (the rectangular kernel) and let X_1, \ldots, X_n be iid \mathrm{Uniform}([0, 1]). Compute the bias of \hat{f}_h(x) at an interior point x \in (h, 1 - h) and at the boundary point x = 0. What rate in h does each have?

ExerciseAdvanced

Problem

For the Gaussian kernel K(u) = \phi(u) = (2\pi)^{-1/2} e^{-u^2/2}, compute R(K) = \int K^2 and \sigma_K^2 = \int u^2 K(u)\,du. Then derive Silverman's rule h_{\mathrm{Silv}} = 1.06\, \hat{\sigma}\, n^{-1/5} by plugging the Gaussian reference f = \phi_\sigma into the asymptotically optimal formula.

ExerciseResearch

Problem

A density f on [0, 1] has a known jump discontinuity at x_0 \in (0, 1). The standard KDE picks up an O(h) bias at x_0. Propose a modification that recovers the O(h^2) interior rate. Discuss the sample-size cost of locating the jump from data when x_0 is unknown.

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.6 "Kernel Density Estimation and Classification" (pp. 208-211). Quick textbook treatment with the classification framing.
  • Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 6 "Density Estimation". The canonical American-statistics graduate treatment; includes plug-in bandwidth and bootstrap variance.
  • Tsybakov. Introduction to Nonparametric Estimation. Springer (2009). Ch 1 (kernel estimators), Ch 2 (lower bounds). The reference for minimax rates and the Hölder-class formulation.

Foundational:

  • Rosenblatt, M. (1956). "Remarks on Some Nonparametric Estimates of a Density Function." Annals of Mathematical Statistics 27(3), 832-837. Original construction.
  • Parzen, E. (1962). "On Estimation of a Probability Density Function and Mode." Annals of Mathematical Statistics 33(3), 1065-1076. Asymptotic analysis.
  • Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall. The early monograph; source of the rule of thumb.

Bandwidth selection:

  • Sheather, S. J. and Jones, M. C. (1991). "A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation." Journal of the Royal Statistical Society B 53(3), 683-690. The plug-in default in modern statistical software.

Curse of dimensionality:

  • Stone, C. J. (1980). "Optimal Rates of Convergence for Nonparametric Estimators." Annals of Statistics 8(6), 1348-1360. Establishes the n^{-4/(4+d)} minimax rate.


Last reviewed: May 13, 2026
