

Nadaraya-Watson Kernel Regression

Estimate a conditional mean by a weighted average of nearby labels, with weights given by a kernel. The Nadaraya-Watson estimator, its bias-variance decomposition, optimal bandwidth scaling, boundary bias, and why local polynomial regression is the practical upgrade.


Why This Matters

Given paired observations $(X_i, Y_i)$, estimating $m(x) = \mathbb{E}[Y \mid X = x]$ without a parametric form is the prototype problem of nonparametric regression. The Nadaraya-Watson estimator does this in the cleanest possible way: it returns a weighted average of the labels $Y_i$, with weights given by how close $X_i$ is to the query point $x$ under a kernel $K$ and bandwidth $h$.

The reason it earns its own page on a site that already covers k-NN: NW is the continuous-bandwidth analogue. k-NN uses a hard top-$k$ window with uniform weights inside the window. NW uses a soft window with weights that decay smoothly with distance. The two share the same bias-variance structure, but NW has cleaner asymptotic formulas, an explicit MSE-minimizing bandwidth rate, and a clean view of where local methods fail (boundaries, sparse regions).

NW is also the cleanest stepping stone to local polynomial regression and to attention. ESL 2nd ed. §6.1 (pp. 192-194) starts with NW for exactly this reason: it makes the bias-variance tradeoff in regression visually obvious, and it sets up the modification (local linear fit) that fixes NW's failure mode. Modern transformer attention is, mathematically, a learned-kernel NW estimator with $x$ replaced by an embedding; see attention as kernel regression.

Quick Version

| Object | Form |
| --- | --- |
| Estimator $\hat{m}_h(x)$ | $\dfrac{\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right)}$ |
| Pointwise bias | $h^2 \!\left(\tfrac{1}{2} m''(x) + \dfrac{m'(x) f'(x)}{f(x)}\right) \sigma_K^2 + o(h^2)$ |
| Pointwise variance | $\dfrac{R(K)\, \sigma^2(x)}{nh\, f(x)} + o\!\left(\dfrac{1}{nh}\right)$ |
| Optimal $h$ | $\propto n^{-1/5}$ in 1D |
| Optimal MSE | $O(n^{-4/5})$ in 1D |
| Boundary bias | $O(h)$, not $O(h^2)$; local linear regression restores $O(h^2)$ |

Here $\sigma^2(x) = \mathrm{Var}(Y \mid X = x)$ and $f$ is the marginal density of $X$. The bias formula has a design-density term $m'(x) f'(x) / f(x)$ that vanishes only when $X$ is uniform or when $m$ is constant. This term is what makes NW worse than local linear regression in non-uniform designs.

Formal Setup

Definition

Nadaraya-Watson Estimator

Given an iid sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a joint distribution on $\mathbb{R} \times \mathbb{R}$, a symmetric nonnegative kernel $K$ with $\int K = 1$, and a bandwidth $h > 0$, define

$$\hat{m}_h(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\, Y_i}{\sum_{i=1}^n K_h(x - X_i)},$$

where $K_h(u) = h^{-1} K(u/h)$. The estimator is a weighted average of the $Y_i$, with weights proportional to $K_h(x - X_i)$.

NW is the solution of the local-constant least-squares problem

$$\hat{m}_h(x) = \arg\min_{c \in \mathbb{R}} \sum_{i=1}^n K_h(x - X_i)\, (Y_i - c)^2.$$

Differentiating with respect to $c$ and setting the derivative to zero recovers the weighted-average formula. This view exposes the upgrade path: replace "constant $c$" with "linear function $c_0 + c_1 (X_i - x)$" and you get local linear regression, which fixes the boundary bias.
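A minimal NumPy sketch of the estimator as written above, to make the weighted average concrete. The function names, the Gaussian-kernel default, and the NaN guard for empty windows are illustrative choices, not a reference implementation.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_query, X, Y, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate at each query point.

    x_query : (m,) array of query locations
    X, Y    : (n,) arrays of training inputs and labels
    h       : bandwidth
    """
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Weight matrix: w[j, i] = K((x_j - X_i) / h)
    w = kernel((x_query[:, None] - X[None, :]) / h)
    denom = w.sum(axis=1)
    # In sparse regions the denominator can underflow to zero;
    # return NaN there rather than dividing by zero.
    denom = np.where(denom > 0, denom, np.nan)
    return (w @ Y) / denom
```

Because the kernel's normalizing constant cancels in the ratio, any positive multiple of $K$ gives the same estimate.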

Asymptotic Bias-Variance

Theorem

Asymptotic Pointwise MSE for Nadaraya-Watson

Statement

Under the assumptions above, the pointwise bias and variance of the Nadaraya-Watson estimator at an interior point $x$ satisfy

$$\mathbb{E}[\hat{m}_h(x) \mid X_1, \ldots, X_n] - m(x) = \left(\tfrac{1}{2} m''(x) + \frac{m'(x) f'(x)}{f(x)}\right) \sigma_K^2\, h^2 + o_p(h^2),$$

$$\mathrm{Var}(\hat{m}_h(x) \mid X_1, \ldots, X_n) = \frac{R(K)\, \sigma^2(x)}{n h\, f(x)} + o_p\big((nh)^{-1}\big).$$

The bandwidth that minimizes the pointwise MSE satisfies $h^\star \propto n^{-1/5}$, giving pointwise MSE $= O(n^{-4/5})$.
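The bandwidth rate is a one-line minimization of the two leading terms; a quick check, writing $B(x)$ and $V(x)$ for the bias and variance constants in the statement:

```latex
% Leading-order pointwise MSE and its minimizer in h
\mathrm{AMSE}(h) = B(x)^2 h^4 + \frac{V(x)}{nh},
\qquad
B(x) = \Big(\tfrac{1}{2} m''(x) + \tfrac{m'(x) f'(x)}{f(x)}\Big)\sigma_K^2,
\qquad
V(x) = \frac{R(K)\,\sigma^2(x)}{f(x)}.

% Setting the derivative to zero
\frac{d}{dh}\,\mathrm{AMSE}(h) = 4\,B(x)^2 h^3 - \frac{V(x)}{n h^2} = 0
\;\Longrightarrow\;
h^\star = \left(\frac{V(x)}{4\,B(x)^2\, n}\right)^{1/5} \propto n^{-1/5},
\qquad
\mathrm{AMSE}(h^\star) = O\!\big(n^{-4/5}\big).
```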

Intuition

Variance shrinks because more observations contribute to the weighted average as $nh$ grows. The variance term carries $f(x)^{-1}$ because sparse regions of the input space have less effective sample size.

Bias has two terms. The $m''(x)$ part is the kernel-density analogue: curvature in $m$ averages incorrectly under a finite window. The $m'(x) f'(x) / f(x)$ part is a design effect: in a region where the input density slopes up (say $f'(x) > 0$), the weighted average pulls toward the right-hand side, where there are more observations. If $m$ also slopes up there ($m'(x) > 0$), the resulting average is biased upward. Local linear regression cancels this term by fitting a local intercept and slope simultaneously.

Why It Matters

The $n^{-4/5}$ MSE rate matches the kernel density estimation rate (KDE) and is minimax-optimal over twice-differentiable targets in 1D (Stone, 1980). The design-density correction is the headline reason ESL 2nd ed. §6.1.1 (pp. 194-196) recommends local linear regression over NW: same rate, cleaner constant, fixes the boundary bias.

Failure Mode

Three failure modes. (i) Sparse input regions where $f(x) \approx 0$: the denominator becomes small and the estimator becomes unstable. The correct response is to widen the bandwidth locally or to refuse to predict. (ii) Boundary effects: at the edge of the support of $X$, the kernel window is half-cut off and the bias is $O(h)$, not $O(h^2)$. (iii) Heavy tails of $Y$: $\sigma^2(x) = \infty$ implies the variance term diverges. Robust modifications (median smoothing, $L_1$ local fits) recover consistency at slower rates.

Optional Proof: Derivation of the bias-variance expansion (conditional)

Following ESL 2nd ed. §6.1.1 and Wasserman 2006 §5.4. Write $\hat{m}_h(x) = N_h(x) / D_h(x)$ where

$$N_h(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - X_i}{h}\right) Y_i, \qquad D_h(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - X_i}{h}\right).$$

$D_h(x)$ is a standard kernel density estimator at $x$. By the KDE calculation,

$$\mathbb{E}[D_h(x)] = f(x) + \tfrac{1}{2} h^2 \sigma_K^2 f''(x) + o(h^2).$$

For $N_h$, use the iterated expectation $\mathbb{E}[K_h(x - X)\, Y] = \mathbb{E}[K_h(x - X)\, m(X)]$. Substituting $u = (x - y)/h$,

$$\mathbb{E}[N_h(x)] = \int K(u)\, m(x - hu)\, f(x - hu)\,du.$$

Expand $m(x - hu)\, f(x - hu)$ to second order in $h$:

$$m(x - hu)\, f(x - hu) = m(x) f(x) - h u\, (m f)'(x) + \tfrac{1}{2} h^2 u^2 (m f)''(x) + o(h^2).$$

The linear term integrates to zero by kernel symmetry. The quadratic term gives

$$\mathbb{E}[N_h(x)] = m(x) f(x) + \tfrac{1}{2} h^2 \sigma_K^2 (m f)''(x) + o(h^2).$$

Now apply $(N_h/D_h) - m = (N_h - m D_h)/D_h$. After algebra, using $(m f)'' = m'' f + 2 m' f' + m f''$, the leading bias term is

$$\frac{h^2 \sigma_K^2}{2} \cdot \frac{(m f)''(x) - m(x) f''(x)}{f(x)} = \frac{h^2 \sigma_K^2}{2} \cdot \frac{m''(x) f(x) + 2 m'(x) f'(x)}{f(x)} = h^2 \sigma_K^2 \left(\tfrac{1}{2} m''(x) + \frac{m'(x) f'(x)}{f(x)}\right).$$

The variance calculation is the same idea applied to $\mathrm{Var}(N_h) / D_h^2$: the leading term is $R(K)\, \sigma^2(x) / (n h\, f(x))$.

Boundary Bias and Why Local Linear Wins

At an interior point the bias is $O(h^2)$. At a boundary point $x = a$ (where the support of $X$ ends), half the kernel window sits outside the support and the leading term changes from $O(h^2)$ to $O(h)$. The quantitative version: at $x = a$ with a symmetric kernel,

$$\mathbb{E}[\hat{m}_h(a)] - m(a) \approx -h\, m'(a)\, \frac{\mu_K^{(1)}}{\int_0^\infty K(u)\,du} + O(h^2), \qquad \mu_K^{(1)} = \int_0^\infty u K(u)\,du,$$

where the denominator (equal to $\tfrac{1}{2}$ for a symmetric kernel integrating to one) renormalizes the half of the kernel mass that remains inside the support.

The design-density term and the boundary term both vanish if you replace the local-constant fit with a local-linear fit. ESL 2nd ed. §6.1.1 (pp. 194-196) and the canonical reference Fan and Gijbels (1996) work through this in detail. The conclusion: local linear regression has the same asymptotic rate as Nadaraya-Watson but a strictly better constant and no boundary degradation. The choice between them in practice is clear; the only reason to teach NW is pedagogical.
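A minimal sketch of the local linear fit as a kernel-weighted least-squares problem, to make the comparison concrete. The per-query loop, the unnormalized Gaussian weights, and the lstsq solve are illustrative choices rather than an optimized implementation.

```python
import numpy as np

def local_linear(x_query, X, Y, h):
    """Local linear estimate: at each query x, fit c0 + c1*(X_i - x) by
    kernel-weighted least squares; the intercept c0 is the estimate at x."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    out = np.empty(len(x_query))
    for j, x in enumerate(x_query):
        d = X - x                                  # centred inputs
        w = np.exp(-0.5 * (d / h) ** 2)            # Gaussian weights (constant cancels)
        A = np.column_stack([np.ones_like(d), d])  # local design matrix [1, X_i - x]
        WA = A * w[:, None]
        # Solve the weighted normal equations (A^T W A) beta = A^T W Y
        beta, *_ = np.linalg.lstsq(WA.T @ A, WA.T @ Y, rcond=None)
        out[j] = beta[0]                           # intercept = m_hat(x)
    return out
```

Dropping the slope column recovers the local-constant (NW) fit, which is the cleanest way to see that the two estimators differ only in the local model, not in the weighting.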

Bandwidth Selection

Same three families as for KDE.

Rule of thumb (Fan and Gijbels, 1996). Plug an estimate of $\int (m'' \cdot f)^2$ into the asymptotic-optimal formula. Conservative.

Cross-validation. Leave-one-out: $\widehat{\mathrm{ISE}}(h) = \frac{1}{n} \sum_{i=1}^n \big(Y_i - \hat{m}_{h, -i}(X_i)\big)^2$, where $\hat{m}_{h, -i}$ omits observation $i$. This is the canonical choice for kernel regression. It does not require a known design density.

Plug-in. Estimate $m''$ with an oversmoothed pilot, plug into the optimal-bandwidth formula. More efficient than CV asymptotically but sensitive to the pilot bandwidth.
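A sketch of the leave-one-out criterion from the cross-validation option above, under the assumption of a Gaussian kernel and a geometric grid of candidate bandwidths. For NW, the leave-one-out prediction at $X_i$ is obtained by simply zeroing observation $i$'s own weight.

```python
import numpy as np

def loo_cv_score(X, Y, h):
    """Leave-one-out CV score for Nadaraya-Watson at bandwidth h."""
    W = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)  # Gaussian weights
    np.fill_diagonal(W, 0.0)                                  # remove each point's own weight
    denom = W.sum(axis=1)
    safe = denom > 0
    preds = np.full(len(X), np.nan)                           # NaN where the window is empty
    preds[safe] = (W @ Y)[safe] / denom[safe]
    return np.nanmean((Y - preds) ** 2)

def select_bandwidth(X, Y, grid=None):
    """Pick the bandwidth minimizing the LOO-CV score over a grid of candidates."""
    if grid is None:
        grid = np.geomspace(0.02, 2.0, 40) * X.std()          # heuristic candidate grid
    scores = [loo_cv_score(X, Y, h) for h in grid]
    return grid[int(np.argmin(scores))]
```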

Implementation Notes

Fast NW evaluation uses the same FFT-binning trick as KDE: bin $\{(X_i, Y_i)\}$ on a grid, compute the convolutions of the bin counts and the bin label sums against $K$ via FFT, and take the ratio. The convolution costs $O(N \log N)$ for $N$ grid points, independent of $n$ (plus a single $O(n)$ binning pass).
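A sketch of the binning idea under simplifying assumptions: nearest-bin (simple) binning on a uniform grid, a Gaussian kernel, and scipy's `fftconvolve` for the FFT convolution. Production smoothers typically use linear binning and more careful edge handling.

```python
import numpy as np
from scipy.signal import fftconvolve

def binned_nw(X, Y, h, n_grid=512):
    """Approximate Nadaraya-Watson on a uniform grid via binning + FFT convolution.

    Returns the grid and the estimate on it.
    """
    lo, hi = X.min(), X.max()
    grid = np.linspace(lo, hi, n_grid)
    delta = grid[1] - grid[0]
    # Nearest-bin assignment (simple binning)
    idx = np.clip(np.round((X - lo) / delta).astype(int), 0, n_grid - 1)
    counts = np.bincount(idx, minlength=n_grid).astype(float)   # binned denominator data
    sums = np.bincount(idx, weights=Y, minlength=n_grid)        # binned numerator data
    # Kernel sampled on grid offsets, wide enough to cover its effective support
    half = min(int(np.ceil(4 * h / delta)), n_grid - 1)
    u = np.arange(-half, half + 1) * delta / h
    k = np.exp(-0.5 * u**2)                                      # unnormalized Gaussian; cancels in ratio
    num = fftconvolve(sums, k, mode="same")
    den = fftconvolve(counts, k, mode="same")
    return grid, np.where(den > 1e-12, num / np.where(den > 1e-12, den, 1.0), np.nan)
```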

For high-dimensional inputs the curse of dimensionality bites the same way it does for KDE: the rate degrades to $n^{-4/(4+d)}$. NW is essentially unused past $d \approx 5$ to $7$ without structural assumptions. The additive-model framework (generalized additive models) imposes $m(x) = \sum_j m_j(x_j)$ to restore the univariate rate component-wise, at the cost of ruling out interactions.

Canonical Example

Example

Smoothing a noisy sinusoid

Generate $n = 200$ observations $(X_i, Y_i)$ with $X_i \sim \mathrm{Uniform}([0, 2\pi])$ and $Y_i = \sin(X_i) + 0.3\, \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, 1)$.

Fit NW with the Gaussian kernel.

| $h$ | Visual outcome |
| --- | --- |
| $h = 0.05$ | Estimator follows the noise; clearly under-smoothed. |
| $h = 0.30$ | Tracks $\sin$ cleanly in the interior; boundary bias visible at $x = 0$ and $x = 2\pi$. |
| $h = 1.00$ | Smooths the peaks away; amplitude underestimated. |

The optimal bandwidth from LOO-CV is around $h \approx 0.25$ for this $n$ and noise level. The pointwise MSE at $x = \pi/2$ is roughly $0.02$ at the optimum, dominated by variance because $m''(\pi/2) = -1$ is not large. At the boundary $x = 0$ the same $h$ gives MSE around $0.06$: the boundary penalty is the difference. Refitting with local linear regression at the same $h$ cuts the boundary error roughly in half.
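A sketch that reproduces the setup so the bandwidth comparison can be re-run. Exact error numbers vary with the random seed, and the inline `nw` helper is simply the local-constant formula from above, not any particular library routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 2 * np.pi, n)
Y = np.sin(X) + 0.3 * rng.standard_normal(n)

def nw(x_query, X, Y, h):
    """Nadaraya-Watson with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_query[:, None] - X[None, :]) / h) ** 2)
    return (w @ Y) / w.sum(axis=1)

grid = np.linspace(0.0, 2 * np.pi, 400)
for h in (0.05, 0.30, 1.00):
    fit = nw(grid, X, Y, h)
    # Average squared error against the true regression function on the grid
    err = np.mean((fit - np.sin(grid)) ** 2)
    print(f"h = {h:4.2f}   mean squared error vs sin(x): {err:.4f}")
```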

Common Confusions

Watch Out

Nadaraya-Watson is not a special case of k-NN

Both are local methods, but k-NN uses a hard nearest-neighbour window whose width adapts to the local data density, with uniform weights inside; NW uses a window of fixed width $h$ with smooth weights. Shrinking $h$ does not turn NW into k-NN: with a compactly supported kernel the estimator is simply undefined wherever no observation falls in the window. They are different estimators with the same asymptotic rate.

Watch Out

The bandwidth-versus-kernel question, revisited

Same as for KDE: the bandwidth dominates. Across the standard kernels (Gaussian, Epanechnikov, tricube, uniform) the asymptotic constant varies by 10-15%. Across reasonable bandwidths, MSE varies by 100% or more.

Watch Out

Local-constant is a fit, not an interpolation

NW does not pass through the training points: for any finite $h$, each $\hat{m}_h(X_i)$ is a weighted average of all nearby labels, including the noise at neighbouring observations. Exact interpolation would require the degenerate limit of a delta-function kernel, which is not a valid kernel.

Exercises

ExerciseCore

Problem

For NW with the rectangular kernel $K(u) = \tfrac{1}{2} \mathbf{1}_{[-1, 1]}(u)$ and $X_i \sim \mathrm{Uniform}([0, 1])$, write the estimator $\hat{m}_h(x)$ at $x = 1/2$ explicitly. Argue informally why it has no design-density bias term in this case.

ExerciseAdvanced

Problem

Derive the leading-order boundary bias at $x = 0$ for the Nadaraya-Watson estimator when the support of $X$ is $[0, 1]$. Verify that local linear regression cancels this term.

ExerciseResearch

Problem

The conditional-mean estimator $\hat{m}_h(x)$ can be unstable in sparse regions of the input space. Propose a data-driven local bandwidth $h(x)$ that adapts to local input density, and discuss the bias-variance tradeoff that the adaptation creates.

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.1 "One-Dimensional Kernel Smoothers" (pp. 192-198). Treats NW, local linear, local polynomial in one tight section.
  • Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 5 "Nonparametric Regression". The graduate-statistics presentation with optimal-rate proofs.
  • Györfi, Kohler, Krzyzak, Walk. A Distribution-Free Theory of Nonparametric Regression. Springer (2002). Ch 5-6. Distribution-free rates without smoothness assumptions on the design density.

Foundational:

  • Nadaraya, E. A. (1964). "On Estimating Regression." Theory of Probability and Its Applications 9(1), 141-142. Two-page note.
  • Watson, G. S. (1964). "Smooth Regression Analysis." Sankhyā: The Indian Journal of Statistics A 26, 359-372. Concurrent independent derivation.

Local polynomial extension:

  • Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall. The reference for local linear / local polynomial regression and the boundary-bias analysis.

Minimax rates:

  • Stone, C. J. (1982). "Optimal Global Rates of Convergence for Nonparametric Regression." Annals of Statistics 10(4), 1040-1053. Establishes the n4/5n^{-4/5} rate is unimprovable.
