

Nadaraya-Watson Kernel Regression

Estimate a conditional mean by a weighted average of nearby labels, with weights given by a kernel. The Nadaraya-Watson estimator, its bias-variance decomposition, optimal bandwidth scaling, boundary bias, and why local polynomial regression is the practical upgrade.


Why This Matters

Given paired observations $(X_i, Y_i)$, estimating $m(x) = \mathbb{E}[Y \mid X = x]$ without a parametric form is the prototype problem of nonparametric regression. The Nadaraya-Watson estimator does this in the cleanest possible way: it returns a weighted average of the labels $Y_i$, with weights given by how close $X_i$ is to the query point $x$ under a kernel $K$ and bandwidth $h$.

The reason it earns its own page on a site that already covers k-NN: NW is the continuous-bandwidth analogue. k-NN uses a hard top-$k$ window with uniform weights inside the window. NW uses a soft window with weights that decay smoothly with distance. The two share the same bias-variance structure, but NW has cleaner asymptotic formulas, an explicit MSE-minimizing bandwidth rate, and a clean view of where local methods fail (boundaries, sparse regions).

NW is also the cleanest stepping stone to local polynomial regression and to attention. ESL 2nd ed. §6.1 (pp. 192-194) starts with NW for exactly this reason: it makes the bias-variance tradeoff in regression visually obvious, and it sets up the modification (local linear fit) that fixes NW's failure mode. Modern transformer attention is, mathematically, a learned-kernel NW estimator with $x$ replaced by an embedding; see attention as kernel regression.

Quick Version

| Object | Form |
| --- | --- |
| Estimator $\hat{m}_h(x)$ | $\dfrac{\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right)}$ |
| Pointwise bias | $h^2 \!\left(\tfrac{1}{2} m''(x) + \dfrac{m'(x) f'(x)}{f(x)}\right) \sigma_K^2 + o(h^2)$ |
| Pointwise variance | $\dfrac{R(K)\, \sigma^2(x)}{nh\, f(x)} + o\!\left(\dfrac{1}{nh}\right)$ |
| Optimal $h$ | $\propto n^{-1/5}$ in 1D |
| Optimal MSE | $O(n^{-4/5})$ in 1D |
| Boundary bias | $O(h)$, not $O(h^2)$; local linear regression restores $O(h^2)$ |

Here $\sigma^2(x) = \mathrm{Var}(Y \mid X = x)$ and $f$ is the marginal density of $X$. The bias formula has a design-density term $m'(x) f'(x) / f(x)$ that vanishes only when $X$ is uniform or when $m$ is constant. This term is what makes NW worse than local linear regression in non-uniform designs.

Formal Setup

Definition

Nadaraya-Watson Estimator

Given an iid sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a joint distribution on $\mathbb{R} \times \mathbb{R}$, a symmetric nonnegative kernel $K$ with $\int K = 1$, and a bandwidth $h > 0$, define

$$\hat{m}_h(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\, Y_i}{\sum_{i=1}^n K_h(x - X_i)},$$

where $K_h(u) = h^{-1} K(u/h)$. The estimator is a weighted average of the $Y_i$, with weights proportional to $K_h(x - X_i)$.

NW is the solution of the local-constant least-squares problem

$$\hat{m}_h(x) = \arg\min_{c \in \mathbb{R}} \sum_{i=1}^n K_h(x - X_i)\, (Y_i - c)^2.$$

Differentiating with respect to $c$ and setting the derivative to zero recovers the weighted-average formula. This view exposes the upgrade path: replace "constant $c$" with "linear function $c_0 + c_1 (X_i - x)$" and you get local linear regression, which fixes the boundary bias.
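A minimal NumPy sketch of the estimator as written above, to make the weighted average concrete. The function names, the Gaussian-kernel default, and the NaN guard for empty windows are illustrative choices, not a reference implementation.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_query, X, Y, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate at each query point.

    x_query : (m,) array of query locations
    X, Y    : (n,) arrays of training inputs and labels
    h       : bandwidth
    """
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Weight matrix: w[j, i] = K((x_j - X_i) / h)
    w = kernel((x_query[:, None] - X[None, :]) / h)
    denom = w.sum(axis=1)
    # In sparse regions the denominator can underflow to zero;
    # return NaN there rather than dividing by zero.
    denom = np.where(denom > 0, denom, np.nan)
    return (w @ Y) / denom
```

Because the kernel's normalizing constant cancels in the ratio, any positive multiple of $K$ gives the same estimate.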

Asymptotic Bias-Variance

Theorem

Asymptotic Pointwise MSE for Nadaraya-Watson

Statement

Under the assumptions above, the pointwise bias and variance of the Nadaraya-Watson estimator at an interior point $x$ satisfy

$$\mathbb{E}[\hat{m}_h(x) \mid X_1, \ldots, X_n] - m(x) = \left(\tfrac{1}{2} m''(x) + \frac{m'(x) f'(x)}{f(x)}\right) \sigma_K^2\, h^2 + o_p(h^2),$$

$$\mathrm{Var}(\hat{m}_h(x) \mid X_1, \ldots, X_n) = \frac{R(K)\, \sigma^2(x)}{n h\, f(x)} + o_p\big((nh)^{-1}\big).$$

The bandwidth that minimizes the pointwise MSE satisfies $h^\star \propto n^{-1/5}$, giving pointwise MSE $= O(n^{-4/5})$.
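The bandwidth rate is a one-line minimization of the two leading terms; a quick check, writing $B(x)$ and $V(x)$ for the bias and variance constants in the statement:

```latex
% Leading-order pointwise MSE and its minimizer in h
\mathrm{AMSE}(h) = B(x)^2 h^4 + \frac{V(x)}{nh},
\qquad
B(x) = \Big(\tfrac{1}{2} m''(x) + \tfrac{m'(x) f'(x)}{f(x)}\Big)\sigma_K^2,
\qquad
V(x) = \frac{R(K)\,\sigma^2(x)}{f(x)}.

% Setting the derivative to zero
\frac{d}{dh}\,\mathrm{AMSE}(h) = 4\,B(x)^2 h^3 - \frac{V(x)}{n h^2} = 0
\;\Longrightarrow\;
h^\star = \left(\frac{V(x)}{4\,B(x)^2\, n}\right)^{1/5} \propto n^{-1/5},
\qquad
\mathrm{AMSE}(h^\star) = O\!\big(n^{-4/5}\big).
```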

Intuition

Variance shrinks because more observations contribute to the weighted average as $nh$ grows. The variance term carries $f(x)^{-1}$ because sparse regions of the input space have less effective sample size.

Bias has two terms. The $m''(x)$ part is the kernel-density analogue: curvature in $m$ averages incorrectly under a finite window. The $m'(x) f'(x) / f(x)$ part is a design effect: in a region where the input density slopes up (say $f'(x) > 0$), the weighted average pulls toward the right-hand side, where there are more observations. If $m$ also slopes up there ($m'(x) > 0$), the resulting average is biased upward. Local linear regression cancels this term by fitting a local intercept and slope simultaneously.

Why It Matters

The $n^{-4/5}$ MSE rate matches the kernel density estimation rate (KDE) and is minimax-optimal over twice-differentiable targets in 1D (Stone, 1980). The design-density correction is the headline reason ESL 2nd ed. §6.1.1 (pp. 194-196) recommends local linear regression over NW: same rate, cleaner constant, fixes the boundary bias.

Failure Mode

Three failure modes. (i) Sparse input regions where $f(x) \approx 0$: the denominator becomes small and the estimator becomes unstable. The correct response is to widen the bandwidth locally or to refuse to predict. (ii) Boundary effects: at the edge of the support of $X$, the kernel window is half-cut off and the bias is $O(h)$, not $O(h^2)$. (iii) Heavy tails of $Y$: $\sigma^2(x) = \infty$ implies the variance term diverges. Robust modifications (median smoothing, $L_1$ local fits) recover consistency at slower rates.

Optional Proof: Derivation of the bias-variance expansion (conditional)

Following ESL 2nd ed. §6.1.1 and Wasserman 2006 §5.4. Write $\hat{m}_h(x) = N_h(x) / D_h(x)$ where

$$N_h(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - X_i}{h}\right) Y_i, \qquad D_h(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - X_i}{h}\right).$$

$D_h(x)$ is a standard kernel density estimator at $x$. By the KDE calculation,

$$\mathbb{E}[D_h(x)] = f(x) + \tfrac{1}{2} h^2 \sigma_K^2 f''(x) + o(h^2).$$

For $N_h$, use the iterated expectation $\mathbb{E}[K_h(x - X)\, Y] = \mathbb{E}[K_h(x - X)\, m(X)]$. Substituting $u = (x - y)/h$,

$$\mathbb{E}[N_h(x)] = \int K(u)\, m(x - hu)\, f(x - hu)\,du.$$

Expand $m(x - hu)\, f(x - hu)$ to second order in $h$:

$$m(x - hu)\, f(x - hu) = m(x) f(x) - h u\, (m f)'(x) + \tfrac{1}{2} h^2 u^2 (m f)''(x) + o(h^2).$$

The linear term integrates to zero by kernel symmetry. The quadratic term gives

$$\mathbb{E}[N_h(x)] = m(x) f(x) + \tfrac{1}{2} h^2 \sigma_K^2 (m f)''(x) + o(h^2).$$

Now apply $(N_h/D_h) - m = (N_h - m D_h)/D_h$. After algebra, using $(m f)'' = m'' f + 2 m' f' + m f''$, the leading bias term is

$$\frac{h^2 \sigma_K^2}{2} \cdot \frac{(m f)''(x) - m(x) f''(x)}{f(x)} = \frac{h^2 \sigma_K^2}{2} \cdot \frac{m''(x) f(x) + 2 m'(x) f'(x)}{f(x)} = h^2 \sigma_K^2 \left(\tfrac{1}{2} m''(x) + \frac{m'(x) f'(x)}{f(x)}\right).$$

The variance calculation is the same idea applied to $\mathrm{Var}(N_h) / D_h^2$: the leading term is $R(K)\, \sigma^2(x) / (n h\, f(x))$.

Boundary Bias and Why Local Linear Wins

At an interior point the bias is $O(h^2)$. At a boundary point $x = a$ (where the support of $X$ ends), half the kernel window sits outside the support and the leading term changes from $O(h^2)$ to $O(h)$. The quantitative version: at $x = a$ with a symmetric kernel,

$$\mathbb{E}[\hat{m}_h(a)] - m(a) \approx -h\, m'(a)\, \frac{\mu_K^{(1)}}{\int_0^\infty K(u)\,du} + O(h^2), \qquad \mu_K^{(1)} = \int_0^\infty u K(u)\,du,$$

where the denominator (equal to $\tfrac{1}{2}$ for a symmetric kernel integrating to one) renormalizes the half of the kernel mass that remains inside the support.

The design-density term and the boundary term both vanish if you replace the local-constant fit with a local-linear fit. ESL 2nd ed. §6.1.1 (pp. 194-196) and the canonical reference Fan and Gijbels (1996) work through this in detail. The conclusion: local linear regression has the same asymptotic rate as Nadaraya-Watson but a strictly better constant and no boundary degradation. The choice between them in practice is clear; the only reason to teach NW is pedagogical.
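A minimal sketch of the local linear fit as a kernel-weighted least-squares problem, to make the comparison concrete. The per-query loop, the unnormalized Gaussian weights, and the lstsq solve are illustrative choices rather than an optimized implementation.

```python
import numpy as np

def local_linear(x_query, X, Y, h):
    """Local linear estimate: at each query x, fit c0 + c1*(X_i - x) by
    kernel-weighted least squares; the intercept c0 is the estimate at x."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    out = np.empty(len(x_query))
    for j, x in enumerate(x_query):
        d = X - x                                  # centred inputs
        w = np.exp(-0.5 * (d / h) ** 2)            # Gaussian weights (constant cancels)
        A = np.column_stack([np.ones_like(d), d])  # local design matrix [1, X_i - x]
        WA = A * w[:, None]
        # Solve the weighted normal equations (A^T W A) beta = A^T W Y
        beta, *_ = np.linalg.lstsq(WA.T @ A, WA.T @ Y, rcond=None)
        out[j] = beta[0]                           # intercept = m_hat(x)
    return out
```

Dropping the slope column recovers the local-constant (NW) fit, which is the cleanest way to see that the two estimators differ only in the local model, not in the weighting.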

Bandwidth Selection

Same three families as for KDE.

Rule of thumb (Fan and Gijbels, 1996). Plug an estimate of $\int (m'' \cdot f)^2$ into the asymptotic-optimal formula. Conservative.

Cross-validation. Leave-one-out: $\widehat{\mathrm{ISE}}(h) = \frac{1}{n} \sum_{i=1}^n \big(Y_i - \hat{m}_{h, -i}(X_i)\big)^2$, where $\hat{m}_{h, -i}$ omits observation $i$. This is the canonical choice for kernel regression. It does not require a known design density.

Plug-in. Estimate $m''$ with an oversmoothed pilot, plug into the optimal-bandwidth formula. More efficient than CV asymptotically but sensitive to the pilot bandwidth.
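A sketch of the leave-one-out criterion from the cross-validation option above, under the assumption of a Gaussian kernel and a geometric grid of candidate bandwidths. For NW, the leave-one-out prediction at $X_i$ is obtained by simply zeroing observation $i$'s own weight.

```python
import numpy as np

def loo_cv_score(X, Y, h):
    """Leave-one-out CV score for Nadaraya-Watson at bandwidth h."""
    W = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)  # Gaussian weights
    np.fill_diagonal(W, 0.0)                                  # remove each point's own weight
    denom = W.sum(axis=1)
    safe = denom > 0
    preds = np.full(len(X), np.nan)                           # NaN where the window is empty
    preds[safe] = (W @ Y)[safe] / denom[safe]
    return np.nanmean((Y - preds) ** 2)

def select_bandwidth(X, Y, grid=None):
    """Pick the bandwidth minimizing the LOO-CV score over a grid of candidates."""
    if grid is None:
        grid = np.geomspace(0.02, 2.0, 40) * X.std()          # heuristic candidate grid
    scores = [loo_cv_score(X, Y, h) for h in grid]
    return grid[int(np.argmin(scores))]
```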

Implementation Notes

Fast NW evaluation uses the same FFT-binning trick as KDE: bin $\{(X_i, Y_i)\}$ on a grid, compute the convolutions of the bin counts and the bin label sums against $K$ via FFT, and take the ratio. The convolution costs $O(N \log N)$ for $N$ grid points, independent of $n$ (plus a single $O(n)$ binning pass).
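A sketch of the binning idea under simplifying assumptions: nearest-bin (simple) binning on a uniform grid, a Gaussian kernel, and scipy's `fftconvolve` for the FFT convolution. Production smoothers typically use linear binning and more careful edge handling.

```python
import numpy as np
from scipy.signal import fftconvolve

def binned_nw(X, Y, h, n_grid=512):
    """Approximate Nadaraya-Watson on a uniform grid via binning + FFT convolution.

    Returns the grid and the estimate on it.
    """
    lo, hi = X.min(), X.max()
    grid = np.linspace(lo, hi, n_grid)
    delta = grid[1] - grid[0]
    # Nearest-bin assignment (simple binning)
    idx = np.clip(np.round((X - lo) / delta).astype(int), 0, n_grid - 1)
    counts = np.bincount(idx, minlength=n_grid).astype(float)   # binned denominator data
    sums = np.bincount(idx, weights=Y, minlength=n_grid)        # binned numerator data
    # Kernel sampled on grid offsets, wide enough to cover its effective support
    half = min(int(np.ceil(4 * h / delta)), n_grid - 1)
    u = np.arange(-half, half + 1) * delta / h
    k = np.exp(-0.5 * u**2)                                      # unnormalized Gaussian; cancels in ratio
    num = fftconvolve(sums, k, mode="same")
    den = fftconvolve(counts, k, mode="same")
    return grid, np.where(den > 1e-12, num / np.where(den > 1e-12, den, 1.0), np.nan)
```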

For high-dimensional inputs the curse of dimensionality bites the same way it does for KDE: the rate degrades to $n^{-4/(4+d)}$. NW is essentially unused past $d \approx 5$ to $7$ without structural assumptions. The additive-model framework (generalized additive models) imposes $m(x) = \sum_j m_j(x_j)$ to restore the univariate rate component-wise, at the cost of ruling out interactions.

Canonical Example

Example

Smoothing a noisy sinusoid

Generate $n = 200$ observations $(X_i, Y_i)$ with $X_i \sim \mathrm{Uniform}([0, 2\pi])$ and $Y_i = \sin(X_i) + 0.3\, \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, 1)$.

Fit NW with the Gaussian kernel.

| $h$ | Visual outcome |
| --- | --- |
| $h = 0.05$ | Estimator follows the noise; clearly under-smoothed. |
| $h = 0.30$ | Tracks $\sin$ cleanly in the interior; boundary bias visible at $x = 0$ and $x = 2\pi$. |
| $h = 1.00$ | Smooths the peaks away; amplitude underestimated. |

The optimal bandwidth from LOO-CV is around $h \approx 0.25$ for this $n$ and noise level. The pointwise MSE at $x = \pi/2$ is roughly $0.02$ at the optimum, dominated by variance because $m''(\pi/2) = -1$ is not large. At the boundary $x = 0$ the same $h$ gives MSE around $0.06$: the boundary penalty is the difference. Refitting with local linear regression at the same $h$ cuts the boundary error roughly in half.
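A sketch that reproduces the setup so the bandwidth comparison can be re-run. Exact error numbers vary with the random seed, and the inline `nw` helper is simply the local-constant formula from above, not any particular library routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 2 * np.pi, n)
Y = np.sin(X) + 0.3 * rng.standard_normal(n)

def nw(x_query, X, Y, h):
    """Nadaraya-Watson with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_query[:, None] - X[None, :]) / h) ** 2)
    return (w @ Y) / w.sum(axis=1)

grid = np.linspace(0.0, 2 * np.pi, 400)
for h in (0.05, 0.30, 1.00):
    fit = nw(grid, X, Y, h)
    # Average squared error against the true regression function on the grid
    err = np.mean((fit - np.sin(grid)) ** 2)
    print(f"h = {h:4.2f}   mean squared error vs sin(x): {err:.4f}")
```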

Common Confusions

Watch Out

Nadaraya-Watson is not a special case of k-NN

Both are local methods, but k-NN uses a hard nearest-neighbour window whose width adapts to the local data density, with uniform weights inside; NW uses a window of fixed width $h$ with smooth weights. Shrinking $h$ does not turn NW into k-NN: with a compactly supported kernel the estimator is simply undefined wherever no observation falls in the window. They are different estimators with the same asymptotic rate.

Watch Out

The bandwidth-versus-kernel question, revisited

Same as for KDE: the bandwidth dominates. Across the standard kernels (Gaussian, Epanechnikov, tricube, uniform) the asymptotic constant varies by 10-15%. Across reasonable bandwidths, MSE varies by 100% or more.

Watch Out

Local-constant is a fit, not an interpolation

NW does not pass through the training points: for any finite $h$, each $\hat{m}_h(X_i)$ is a weighted average of all nearby labels, including the noise at neighbouring observations. Exact interpolation would require the degenerate limit of a delta-function kernel, which is not a valid kernel.

Exercises

ExerciseCore

Problem

For NW with the rectangular kernel $K(u) = \tfrac{1}{2} \mathbf{1}_{[-1, 1]}(u)$ and $X_i \sim \mathrm{Uniform}([0, 1])$, write the estimator $\hat{m}_h(x)$ at $x = 1/2$ explicitly. Argue informally why it has no design-density bias term in this case.

ExerciseAdvanced

Problem

Derive the leading-order boundary bias at $x = 0$ for the Nadaraya-Watson estimator when the support of $X$ is $[0, 1]$. Verify that local linear regression cancels this term.

ExerciseResearch

Problem

The conditional-mean estimator $\hat{m}_h(x)$ can be unstable in sparse regions of the input space. Propose a data-driven local bandwidth $h(x)$ that adapts to local input density, and discuss the bias-variance tradeoff that the adaptation creates.

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.1 "One-Dimensional Kernel Smoothers" (pp. 192-198). Treats NW, local linear, local polynomial in one tight section.
  • Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 5 "Nonparametric Regression". The graduate-statistics presentation with optimal-rate proofs.
  • Györfi, Kohler, Krzyzak, Walk. A Distribution-Free Theory of Nonparametric Regression. Springer (2002). Ch 5-6. Distribution-free rates without smoothness assumptions on the design density.

Foundational:

  • Nadaraya, E. A. (1964). "On Estimating Regression." Theory of Probability and Its Applications 9(1), 141-142. Two-page note.
  • Watson, G. S. (1964). "Smooth Regression Analysis." Sankhyā: The Indian Journal of Statistics A 26, 359-372. Concurrent independent derivation.

Local polynomial extension:

  • Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall. The reference for local linear / local polynomial regression and the boundary-bias analysis.

Minimax rates:

  • Stone, C. J. (1982). "Optimal Global Rates of Convergence for Nonparametric Regression." Annals of Statistics 10(4), 1040-1053. Establishes the n4/5n^{-4/5} rate is unimprovable.
