Learning Theory
Nadaraya-Watson Kernel Regression
Estimate a conditional mean by a weighted average of nearby labels, with weights given by a kernel. The Nadaraya-Watson estimator, its bias-variance decomposition, optimal bandwidth scaling, boundary bias, and why local polynomial regression is the practical upgrade.
Prerequisites
Why This Matters
Given paired observations $(x_1, y_1), \ldots, (x_n, y_n)$, estimating the regression function $m(x) = \mathbb{E}[Y \mid X = x]$ without a parametric form is the prototype problem of nonparametric regression. The Nadaraya-Watson estimator does this in the cleanest possible way: it returns a weighted average of the labels $y_i$, with weights given by how close $x_i$ is to the query point $x$ under a kernel $K$ and bandwidth $h$.
The reason it earns its own page on a site that already covers k-NN: NW is the continuous-bandwidth analogue. k-NN uses a hard top-$k$ window with uniform weights inside the window. NW uses a soft window with weights that decay smoothly with distance. The two share the same bias-variance structure, but NW has cleaner asymptotic formulas, an explicit optimal-bandwidth rate, and a clean view of where local methods fail (boundaries, sparse regions).
NW is also the cleanest stepping stone to local polynomial regression and to attention. ESL 2nd ed. §6.1 (pp. 192-194) starts with NW for exactly this reason: it makes the bias-variance tradeoff in regression visually obvious, and it sets up the modification (local linear fit) that fixes NW's failure mode. Modern transformer attention is, mathematically, a learned-kernel NW estimator with the fixed kernel on raw inputs replaced by a learned similarity between query and key embeddings; see attention as kernel regression.
Quick Version
| Object | Form |
|---|---|
| Estimator | $\hat m_h(x) = \dfrac{\sum_{i=1}^n K_h(x - x_i)\, y_i}{\sum_{j=1}^n K_h(x - x_j)}$ |
| Pointwise bias | $\dfrac{h^2}{2}\,\mu_2(K)\left( m''(x) + \dfrac{2\, m'(x)\, f'(x)}{f(x)} \right)$ |
| Pointwise variance | $\dfrac{R(K)\,\sigma^2(x)}{n h\, f(x)}$ |
| Optimal $h$ | $h^* \asymp n^{-1/5}$ in 1D |
| Optimal MSE | $O(n^{-4/5})$ in 1D |
| Boundary bias | $O(h)$, not $O(h^2)$. Local linear regression restores $O(h^2)$. |

Here $K_h(u) = h^{-1} K(u/h)$, $\mu_2(K) = \int u^2 K(u)\, du$, $R(K) = \int K(u)^2\, du$, $\sigma^2(x) = \operatorname{Var}(Y \mid X = x)$, and $f$ is the marginal density of $X$. The bias formula has a design-density term $2\, m'(x) f'(x)/f(x)$ that vanishes only when $f$ is uniform or when $m$ is constant. This term is what makes NW worse than local linear regression in non-uniform designs.
Formal Setup
Nadaraya-Watson Estimator
Given an iid sample $(x_1, y_1), \ldots, (x_n, y_n)$ from a joint distribution on $\mathbb{R} \times \mathbb{R}$, a symmetric nonnegative kernel $K$ with $\int K(u)\, du = 1$, and a bandwidth $h > 0$, define

$$\hat m_h(x) = \frac{\sum_{i=1}^n K_h(x - x_i)\, y_i}{\sum_{j=1}^n K_h(x - x_j)}, \qquad \text{where } K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right).$$

The estimator is the weighted average of the $y_i$ with weights proportional to the kernel evaluated at the scaled distance $(x - x_i)/h$.
NW is the solution of the local-constant least-squares problem

$$\hat m_h(x) = \arg\min_{c \in \mathbb{R}} \sum_{i=1}^n K_h(x - x_i)\,(y_i - c)^2.$$

Differentiating with respect to $c$ and setting the derivative to zero recovers the weighted-average formula. This view exposes the upgrade path: replace "constant $c$" with "linear function $a + b\,(x_i - x)$" and you get local linear regression, which fixes the boundary bias.
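A minimal sketch of the estimator as written above, using NumPy and a Gaussian kernel; the function name `nw_estimate` and the broadcasting layout are illustrative choices, not from the page:

```python
import numpy as np

def nw_estimate(x_query, x, y, h):
    """Nadaraya-Watson estimate at each query point, Gaussian kernel."""
    # scaled distances: rows index query points, columns index training points
    u = (np.asarray(x_query, float)[:, None] - np.asarray(x, float)[None, :]) / h
    w = np.exp(-0.5 * u**2)                    # unnormalised kernel weights K((x - x_i)/h)
    return (w @ np.asarray(y, float)) / w.sum(axis=1)   # weighted average of the labels
```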
Asymptotic Bias-Variance
Asymptotic Pointwise MSE for Nadaraya-Watson
Statement
Under the assumptions above, the pointwise bias and variance of the Nadaraya-Watson estimator at an interior point $x$ with $f(x) > 0$ satisfy

$$\operatorname{Bias}\big[\hat m_h(x)\big] = \frac{h^2}{2}\,\mu_2(K)\left( m''(x) + \frac{2\, m'(x)\, f'(x)}{f(x)} \right) + o(h^2),$$

$$\operatorname{Var}\big[\hat m_h(x)\big] = \frac{R(K)\,\sigma^2(x)}{n h\, f(x)} + o\!\left(\frac{1}{nh}\right).$$

The optimal bandwidth $h^* \asymp n^{-1/5}$ minimizes pointwise MSE, giving pointwise MSE of order $n^{-4/5}$.
Intuition
Variance shrinks because more observations contribute to the weighted average as $nh$ grows. The variance term has $f(x)$ in the denominator because sparse regions of the input space have less effective sample size (roughly $n h f(x)$ observations fall inside the window).
Bias has two terms. The $m''(x)$ part is the kernel-density analogue: curvature in $m$ averages incorrectly under a finite window. The $2\, m'(x) f'(x)/f(x)$ part is a design effect: in a region where the input density slopes up (say $f'(x) > 0$), the weighted average pulls toward the right-hand side of the window, where there are more observations. If $m$ also slopes up there ($m'(x) > 0$), the resulting average is biased upward. Local linear regression cancels this term by fitting a local intercept and slope simultaneously.
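A small simulation of the design effect under assumed choices (density $f(x) = 2x$ on $(0,1]$, noiseless target $m(x) = x$, Gaussian kernel, $h = 0.1$, evaluation at $x_0 = 0.5$): the asymptotic formula predicts a bias of about $\frac{h^2}{2}\mu_2(K)\cdot\frac{2 m' f'}{f} = 2h^2 = 0.02$ even though $m'' = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, x0, reps = 5000, 0.1, 0.5, 200
estimates = []
for _ in range(reps):
    x = np.sqrt(rng.uniform(size=n))   # design density f(x) = 2x: more mass to the right of x0
    y = x                              # m(x) = x, no noise: any error at x0 is pure bias
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    estimates.append(w @ y / w.sum())

print(np.mean(estimates) - x0)         # approx +0.02, matching (h^2/2) * 2 m'(x0) f'(x0) / f(x0)
```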
Why It Matters
The $n^{-4/5}$ MSE rate matches the kernel density estimation rate (KDE) and is minimax-optimal over twice-differentiable targets in 1D (Stone, 1982). The design-density correction is the headline reason ESL 2nd ed. §6.1.1 (pp. 194-196) recommends local linear regression over NW: same rate, cleaner constant, fixes the boundary bias.
Failure Mode
Three failure modes. (i) Sparse input regions where $f(x) \approx 0$: the denominator becomes small and the estimator becomes unstable. The correct response is to widen the bandwidth locally or to refuse to predict. (ii) Boundary effects: at the edge of the support of $f$, the kernel window is half-cut off and the bias is $O(h)$, not $O(h^2)$. (iii) Heavy tails of the noise: infinite conditional variance $\sigma^2(x)$ implies the variance term diverges. Robust modifications (median smoothing, local robust fits) recover consistency at slower rates.
Optional proof: derivation of the bias-variance expansion (conditional)
Following ESL 2nd ed. §6.1.1 and Wasserman 2006 §5.4. Write $\hat m_h(x) = \hat g_h(x) / \hat f_h(x)$, where

$$\hat g_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - x_i)\, y_i, \qquad \hat f_h(x) = \frac{1}{n}\sum_{i=1}^n K_h(x - x_i),$$

and $\hat f_h$ is a standard kernel density estimator at $x$. By the KDE calculation,

$$\mathbb{E}\big[\hat f_h(x)\big] = f(x) + \frac{h^2}{2}\,\mu_2(K)\, f''(x) + o(h^2).$$

For $\hat g_h$ use the iterated expectation $\mathbb{E}[K_h(x - X)\, Y] = \mathbb{E}[K_h(x - X)\, m(X)]$. Substitute $u = (t - x)/h$:

$$\mathbb{E}\big[\hat g_h(x)\big] = \int K(u)\, m(x + hu)\, f(x + hu)\, du.$$

Expand $m(x + hu)\, f(x + hu)$ to second order in $hu$. The linear term integrates to zero by kernel symmetry. The quadratic term gives

$$\mathbb{E}\big[\hat g_h(x)\big] = m(x) f(x) + \frac{h^2}{2}\,\mu_2(K)\, (mf)''(x) + o(h^2).$$

Now apply the ratio expansion $\hat m_h = \hat g_h / \hat f_h$. After algebra, using $(mf)'' = m'' f + 2 m' f' + m f''$, the $m f''$ contribution cancels against the $f''$ term from the denominator and the leading bias term is

$$\frac{h^2}{2}\,\mu_2(K)\left( m''(x) + \frac{2\, m'(x)\, f'(x)}{f(x)} \right).$$

The variance calculation is the same idea applied to second moments: the leading term is $\dfrac{R(K)\,\sigma^2(x)}{n h\, f(x)}$.
Boundary Bias and Why Local Linear Wins
At an interior point the bias is $O(h^2)$. At the boundary (say $x = 0$, where the support of $f$ ends), half the kernel window sits outside the support and the leading term changes from $O(h^2)$ to $O(h)$. The quantitative version: at $x = 0$ with a symmetric kernel supported on $[-1, 1]$,

$$\operatorname{Bias}\big[\hat m_h(0)\big] = h\, m'(0)\, \frac{\int_0^1 u\, K(u)\, du}{\int_0^1 K(u)\, du} + O(h^2).$$
The design-density term and the boundary term both vanish if you replace the local-constant fit with a local-linear fit. ESL 2nd ed. §6.1.1 (pp. 194-196) and the canonical reference Fan and Gijbels (1996) work through this in detail. The conclusion: local linear regression has the same asymptotic rate as Nadaraya-Watson but a strictly better constant and no boundary degradation. The choice between them in practice is clear; the only reason to teach NW is pedagogical.
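For comparison, a sketch of the local-linear fit the section recommends: solve the weighted least-squares problem in (intercept, slope) at each query point and read off the intercept. The Gaussian kernel and the loop-based layout are assumptions made for clarity, not an optimised implementation.

```python
import numpy as np

def local_linear(x_query, x, y, h):
    """Local linear regression estimate at each query point, Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    out = np.empty(len(x_query))
    for q, x0 in enumerate(np.asarray(x_query, float)):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # kernel weights around x0
        X = np.column_stack([np.ones_like(x), x - x0])   # local design matrix [1, x_i - x0]
        A = X.T @ (w[:, None] * X)                       # weighted normal equations
        b = X.T @ (w * y)
        out[q] = np.linalg.solve(A, b)[0]                # the local intercept is m-hat(x0)
    return out
```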
Bandwidth Selection
Same three families as for KDE.
Rule of thumb (Fan and Gijbels, 1996). Plug crude global estimates of $m''$ and $\sigma^2$ (for example from a low-degree polynomial fit) into the asymptotically optimal bandwidth formula. Conservative.
Cross-validation. Leave-one-out: choose $h$ to minimize $\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \hat m_h^{(-i)}(x_i)\big)^2$, where $\hat m_h^{(-i)}$ omits observation $i$. This is the canonical choice for kernel regression. It does not require a known design density.
Plug-in. Estimate $m''$ and $\sigma^2$ by an oversmoothed pilot fit, then plug into the optimal-bandwidth formula. More efficient than CV asymptotically but sensitive to the pilot bandwidth.
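A sketch of the leave-one-out criterion for a Gaussian-kernel NW fit. Because removing observation $i$ just deletes one term from the numerator and the denominator, the leave-one-out prediction at $x_i$ equals $(\hat m_h(x_i) - w_{ii} y_i)/(1 - w_{ii})$, so no refitting is needed. The helper name and the bandwidth grid below are illustrative.

```python
import numpy as np

def loo_cv_error(x, y, h):
    """Exact leave-one-out CV error for Nadaraya-Watson with a Gaussian kernel."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u**2)
    w /= w.sum(axis=1, keepdims=True)              # rows of the smoother (hat) matrix
    fitted = w @ y
    self_w = np.diag(w)
    loo = (fitted - self_w * y) / (1.0 - self_w)   # unstable if a point is isolated (self_w ~ 1)
    return np.mean((y - loo) ** 2)

# choose h on a log-spaced grid (illustrative range)
# hs = np.logspace(-2, 0, 30)
# h_star = hs[np.argmin([loo_cv_error(x, y, h) for h in hs])]
```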
Implementation Notes
Fast NW evaluation uses the same FFT-binning trick as KDE: bin the data on a grid, compute the convolutions of the bin counts and the bin label sums against the kernel via FFT, and take the ratio. Cost $O(n + M \log M)$ for $M$ grid points; the FFT step does not depend on $n$.
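A sketch of that binned evaluation; simple nearest-grid-point binning and `scipy.signal.fftconvolve` are assumed here (linear binning would be slightly more accurate).

```python
import numpy as np
from scipy.signal import fftconvolve

def nw_binned(x, y, h, n_grid=512):
    """Approximate NW fit on a regular grid via binning + FFT convolution (Gaussian kernel)."""
    lo, hi = x.min(), x.max()
    grid = np.linspace(lo, hi, n_grid)
    delta = grid[1] - grid[0]
    idx = np.clip(np.rint((x - lo) / delta).astype(int), 0, n_grid - 1)
    counts = np.bincount(idx, minlength=n_grid)              # binned denominator (density part)
    sums = np.bincount(idx, weights=y, minlength=n_grid)     # binned numerator (label sums)
    offsets = delta * np.arange(-(n_grid - 1), n_grid)       # kernel sampled at all grid offsets
    k = np.exp(-0.5 * (offsets / h) ** 2)
    num = fftconvolve(sums, k, mode="same")                  # one FFT convolution each
    den = fftconvolve(counts, k, mode="same")
    return grid, num / np.maximum(den, 1e-12)                # guard against empty regions
```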
For high-dimensional inputs the curse of dimensionality bites the same way it does for KDE: the MSE rate degrades to $n^{-4/(4+d)}$ in dimension $d$. NW is essentially unused beyond a few dimensions without structural assumptions. The additive-model framework (generalized additive models) imposes $m(x) = \sum_j m_j(x_j)$ to restore the univariate rate component-wise, at the cost of ruling out interactions.
Canonical Example
Smoothing a noisy sinusoid
Generate $n$ observations $y_i = m(x_i) + \varepsilon_i$ with $m$ a sinusoid and $\varepsilon_i$ iid Gaussian noise.
Fit NW with the Gaussian kernel at a range of bandwidths.
| Bandwidth | Visual outcome |
|---|---|
| Too small | Estimator follows the noise; clearly under-smoothed. |
| Near optimal | Tracks $m$ cleanly in the interior; boundary bias visible at the two ends of the design interval. |
| Too large | Smooths the peaks away; amplitude underestimated. |
LOO-CV picks an intermediate bandwidth for this sample size and noise level. At the optimum, the pointwise MSE at an interior point is dominated by variance because $n$ is not large. At the boundary, the same bandwidth gives a noticeably larger MSE: the boundary penalty is the difference. Refitting with local linear regression at the same bandwidth cuts the boundary error roughly in half.
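A sketch that reproduces the experiment qualitatively; the sample size, noise level, target function, and bandwidth grid are illustrative stand-ins, not the page's original values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, n)                     # assumed design on [0, 1]

def m(t):
    return np.sin(4 * np.pi * t)                 # assumed sinusoidal target

y = m(x) + 0.3 * rng.normal(size=n)              # assumed noise level

grid = np.linspace(0.0, 1.0, 400)
for h in (0.01, 0.06, 0.3):                      # under-, roughly-, over-smoothed (illustrative)
    u = (grid[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u**2)
    fit = (w @ y) / w.sum(axis=1)
    interior = (grid > 0.1) & (grid < 0.9)
    print(h,
          np.mean((fit[interior] - m(grid[interior])) ** 2),    # interior MSE
          np.mean((fit[~interior] - m(grid[~interior])) ** 2))  # near-boundary MSE
```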
Common Confusions
Nadaraya-Watson is not a special case of k-NN
Both are local methods, but k-NN uses a hard nearest-neighbour window whose width adapts to the local density (the distance to the $k$-th neighbour), with uniform weights inside. NW uses a fixed-width soft window with smoothly decaying weights. As $h \to 0$ with a compactly supported kernel, NW becomes undefined wherever no training point falls within the window; it does not reduce to 1-NN. They are different estimators with the same asymptotic rate.
The bandwidth-versus-kernel question, revisited
Same as for KDE: the bandwidth dominates. Across the standard kernels (Gaussian, Epanechnikov, tricube, uniform) the asymptotic constant varies by 10-15%. Across reasonable bandwidths, MSE varies by 100% or more.
Local-constant is a fit, not an interpolation
NW does not pass through the training points exactly for any $h > 0$. Each fitted value $\hat m_h(x_i)$ is a weighted average of all nearby labels, including the noise at neighbouring observations. The estimator interpolates only in the degenerate limit where $K$ is replaced by a Dirac delta, which is not a valid kernel.
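A quick numerical check of this point, with illustrative data and bandwidth: even at a fairly small $h$, the fitted value at a training input mixes in neighbouring labels.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=50)

h = 0.05
w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
fitted = (w @ y) / w.sum(axis=1)
print(np.max(np.abs(fitted - y)))   # clearly nonzero: the fit does not pass through the data
```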
Exercises
Problem
For NW with the rectangular kernel and a uniform design density, write the estimator at an interior point $x$ explicitly. Argue informally why it has no design-density bias term in this case.
Problem
Derive the leading-order boundary bias at $x = 0$ for the Nadaraya-Watson estimator when the support of $f$ is $[0, 1]$. Verify that local linear regression cancels this term.
Problem
The conditional-mean estimator can be unstable in sparse regions of the input space. Propose a data-driven local bandwidth that adapts to local input density, and discuss the bias-variance tradeoff that the adaptation creates.
References
Canonical:
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.1 "One-Dimensional Kernel Smoothers" (pp. 192-198). Treats NW, local linear, local polynomial in one tight section.
- Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 5 "Nonparametric Regression". The graduate-statistics presentation with optimal-rate proofs.
- Györfi, Kohler, Krzyzak, Walk. A Distribution-Free Theory of Nonparametric Regression. Springer (2002). Ch 5-6. Distribution-free rates without smoothness assumptions on the design density.
Foundational:
- Nadaraya, E. A. (1964). "On Estimating Regression." Theory of Probability and Its Applications 9(1), 141-142. Two-page note.
- Watson, G. S. (1964). "Smooth Regression Analysis." Sankhyā: The Indian Journal of Statistics A 26, 359-372. Concurrent independent derivation.
Local polynomial extension:
- Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall. The reference for local linear / local polynomial regression and the boundary-bias analysis.
Minimax rates:
- Stone, C. J. (1982). "Optimal Global Rates of Convergence for Nonparametric Regression." Annals of Statistics 10(4), 1040-1053. Establishes that the $n^{-4/5}$ rate is unimprovable.
Next Topics
- Local polynomial regression: the practical upgrade. Same rate, no boundary degradation.
- Smoothing splines: a different nonparametric route, with a roughness penalty rather than a local window.
- Attention as kernel regression: transformer attention seen through the NW lens.
- Kernel density estimation: the density analogue. Same kernel machinery on a different target.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Expectation, Variance, Covariance, and Moments (layer 0 · tier 1)
- Linear Regression (layer 1 · tier 1)
- Kernel Density Estimation (layer 2 · tier 1)
- Bias-Variance Tradeoff (layer 2 · tier 2)
- Kernels and Reproducing Kernel Hilbert Spaces (layer 3 · tier 2)
Derived topics
- Local Polynomial Regression (layer 2 · tier 1)