Learning Theory
Local Polynomial Regression
Replace the local-constant fit of Nadaraya-Watson with a local-degree-p polynomial fit. Same n^{-4/5} rate as NW, no O(h) boundary bias, automatic design-density correction, and a bias asymmetry between odd and even degrees that makes p=1 the practical default.
Why This Matters
The Nadaraya-Watson estimator fits a constant at every query point. Local polynomial regression replaces the constant with a low-degree polynomial. The change is small to describe and large in effect: the boundary bias goes from $O(h)$ to $O(h^2)$, the design-density correction term in the interior bias drops out, and the implementation is a single weighted least squares per query point. ESL 2nd ed. §6.1.1 (p. 194) introduces this as the canonical fix for NW's boundary problem.
The reason this earns its own page on a site that already has NW: local
polynomial regression is the version that actually ships in statistical
software. R's lowess() and loess(), statsmodels' lowess, and essentially
every textbook implementation of kernel regression published after 1996
fit a local polynomial, typically local linear, not a local constant.
The asymptotic theory is in Fan and Gijbels (1996),
which is also where the "odd degrees are better" observation appears
explicitly. The same machinery becomes the per-coordinate smoother in
generalized additive models fit by backfitting, one component at a time.
Quick Version
| Object | Form |
|---|---|
| Estimator at $x_0$ | Solve $\min_{\beta}\ \sum_{i=1}^n K\!\big(\tfrac{x_i-x_0}{h}\big)\big(y_i-\sum_{j=0}^p \beta_j (x_i-x_0)^j\big)^2$; return $\hat m(x_0)=\hat\beta_0$. |
| Local-linear bias ($p=1$) | $\tfrac{h^2}{2}\,\mu_2(K)\,m''(x_0)$ in interior and $O(h^2)$ at boundary |
| Local-linear variance ($p=1$) | $\dfrac{\sigma^2 R(K)}{n h\, f(x_0)}$ |
| Optimal $h$ | $h^\ast \asymp n^{-1/5}$ |
| Optimal MSE | $O(n^{-4/5})$ |
| Local cubic bias ($p=3$) | $O(h^4)$ at the cost of higher variance |
| Odd-degree advantage | $p=1$ beats $p=0$; $p=3$ beats $p=2$ in interior bias |
Here $\mu_2(K)=\int u^2 K(u)\,du$ and $R(K)=\int K(u)^2\,du$. The design-density term $m'(x_0)f'(x_0)/f(x_0)$ that plagues NW vanishes for local-linear and above. The boundary bias drops to $O(h^2)$ at the boundary; for NW it stays at $O(h)$. These two improvements are essentially free.
Formal Setup
Local Polynomial Estimator of Degree p
Given data $(x_1, y_1), \dots, (x_n, y_n)$, a kernel $K$, a bandwidth $h > 0$, and a polynomial degree $p \ge 0$, solve
$$\hat\beta(x_0) = \arg\min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^n K\!\left(\frac{x_i - x_0}{h}\right)\Big(y_i - \sum_{j=0}^{p} \beta_j (x_i - x_0)^j\Big)^2.$$
The local polynomial estimator is $\hat m(x_0) = \hat\beta_0(x_0)$, the intercept of the local fit. The slope $\hat\beta_1(x_0)$ is an estimator of $m'(x_0)$, the quadratic coefficient $2\hat\beta_2(x_0)$ estimates $m''(x_0)$, and so on: $j!\,\hat\beta_j(x_0)$ estimates $m^{(j)}(x_0)$.
In matrix form, with $X_{x_0}$ the $n \times (p+1)$ design with rows $(1, (x_i - x_0), \dots, (x_i - x_0)^p)$ and $W_{x_0} = \operatorname{diag}\!\big(K((x_i - x_0)/h)\big)$,
$$\hat\beta(x_0) = \big(X_{x_0}^\top W_{x_0} X_{x_0}\big)^{-1} X_{x_0}^\top W_{x_0}\, y.$$
The case $p = 0$ recovers Nadaraya-Watson. The case $p = 1$ is "local linear regression" and is the default in practice.
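The weighted least squares above is a few lines of numpy. A minimal sketch of the estimator as defined here (function names and the Epanechnikov default are illustrative, not from any particular library):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

def local_poly(x0, x, y, h, p=1, kernel=epanechnikov):
    """Degree-p local polynomial estimate of m(x0): solve the
    kernel-weighted least squares problem above, return beta_0."""
    w = kernel((x - x0) / h)
    # Rows of the local design: (1, (x_i - x0), ..., (x_i - x0)^p).
    X = np.vander(x - x0, N=p + 1, increasing=True)
    XtW = X.T * w                     # (p+1) x n, column i scaled by w_i
    # Weighted normal equations (X^T W X) beta = X^T W y; assumes at
    # least p+1 observations carry nonzero weight (widen h otherwise).
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0]
```

Evaluating on a grid is then `np.array([local_poly(g, x, y, h) for g in grid])`.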
Bias-Variance for Local Linear
Asymptotic Pointwise MSE for Local Linear Regression
Statement
Under standard regularity conditions ($m$ twice continuously differentiable near $x_0$, design density $f$ continuous and positive at $x_0$, symmetric kernel with finite moments, $h \to 0$, $nh \to \infty$), the local linear estimator ($p=1$) at an interior point $x_0$ satisfies
$$\operatorname{Bias}\big(\hat m(x_0)\big) = \frac{h^2}{2}\,\mu_2(K)\,m''(x_0) + o(h^2), \qquad \operatorname{Var}\big(\hat m(x_0)\big) = \frac{\sigma^2 R(K)}{n h\, f(x_0)} + o\!\Big(\frac{1}{nh}\Big).$$
At a boundary point $x_0 = ch$ with $c \in [0,1)$, the bias remains $O(h^2)$, with leading constant depending on a boundary-adjusted moment of $K$ over $[-c, \infty)$.
Intuition
The local-constant fit absorbs only the level, so its bias picks up both curvature and design-density slope: the NW interior bias is $h^2\mu_2(K)\big(\tfrac12 m''(x_0) + \tfrac{m'(x_0)f'(x_0)}{f(x_0)}\big)$, because the constant is averaged over a tilted window. The local-linear fit also absorbs the slope. Curvature remains because the polynomial degree is too low to absorb the quadratic part of $m$, but the design-density term cancels: it lives in the linear part of the local Taylor expansion, and that linear part is now fitted away.
At the boundary, the same cancellation works because the local-linear fit adjusts the slope at the boundary. NW cannot do this because it has no slope parameter; the boundary-window asymmetry then leaks into the level.
Why It Matters
This is the result that elevates local polynomial regression over NW in every applied context. Same asymptotic rate; strictly better bias; no boundary penalty; no extra computation past one weighted least squares. ESL 2nd ed. §6.1.1 (p. 194) and Fan and Gijbels (1996) Ch 3 are the standard references.
Failure Mode
The local linear estimator has higher finite-sample variance than NW by a factor that depends on the kernel and the local design. The trade is worth it almost always, but if the bias is genuinely zero (the truth is a constant, say) NW has lower finite-sample MSE. Higher-degree fits ($p \ge 2$) have lower bias but the variance constant grows with $p$; degree $p=1$ is the universal default and $p=3$ is the right choice when the curvature is large and the sample size supports it.
Optional Proof: Derivation of the local linear bias-variance formulas
Following Fan and Gijbels (1996) Ch 3 and ESL 2nd ed. §6.1.1.
Define the local moments $S_{n,j} = \sum_{i=1}^n K_h(x_i - x_0)(x_i - x_0)^j$ and $T_{n,j} = \sum_{i=1}^n K_h(x_i - x_0)(x_i - x_0)^j\, y_i$, with $K_h(u) = h^{-1}K(u/h)$. For local linear ($p=1$),
$$\hat m(x_0) \;=\; \frac{S_{n,2}\,T_{n,0} - S_{n,1}\,T_{n,1}}{S_{n,0}\,S_{n,2} - S_{n,1}^2}.$$
Expectation calculations use $\mathbb{E}[S_{n,j}]/n = h^{j}\big(f(x_0)\,\mu_j(K) + h\,f'(x_0)\,\mu_{j+1}(K) + o(h)\big)$, where $\mu_j(K) = \int u^j K(u)\,du$. Since $K$ is symmetric, $\mu_j(K) = 0$ for odd $j$. Hence $\mathbb{E}[S_{n,1}]/n = h^2 f'(x_0)\,\mu_2(K) + o(h^2)$ is one order of $h$ smaller than the naive scaling, which is the key fact that wipes out the design-density correction.
Plug the Taylor expansion $m(x_i) = m(x_0) + m'(x_0)(x_i - x_0) + \tfrac12 m''(x_0)(x_i - x_0)^2 + o\big((x_i - x_0)^2\big)$ into the numerator and use the moment relations. The constant and linear parts of $m$ are absorbed exactly by the local-linear fit. The quadratic part contributes $\tfrac{h^2}{2}\,\mu_2(K)\,m''(x_0)$. The contribution of the design-density slope enters through $S_{n,1}$, but it is multiplied by a factor that is itself of order $S_{n,1}$, so it vanishes at order $h^2$.
At a boundary the kernel moments change because the kernel is effectively truncated to $[-c, \infty)$ when $x_0 = ch$. The local-linear estimator re-weights to make the truncated first moment effectively zero in the cancellation; the leading bias remains $O(h^2)$ with a boundary-adjusted constant. See Fan and Gijbels (1996) §3.3.
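The leading bias term is easy to check empirically with the `local_poly` sketch above. Under a uniform design ($f' = 0$, so even NW has no density term and curvature bias is isolated), the theory predicts $\mathbb{E}\,\hat m(x_0) - m(x_0) \approx \tfrac{h^2}{2}\,\mu_2(K)\,m''(x_0)$, with $\mu_2 = 1/5$ for the Epanechnikov kernel. The target function, constants, and seed below are illustrative:

```python
rng = np.random.default_rng(0)
x0, h, n, reps = 0.5, 0.1, 2000, 400
m = lambda t: np.sin(3 * t)               # so m''(t) = -9 sin(3t)

est = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, 1.0, n)          # uniform design: f' = 0
    y = m(x) + 0.1 * rng.standard_normal(n)
    est[r] = local_poly(x0, x, y, h, p=1)

mc_bias = est.mean() - m(x0)
theory = 0.5 * h**2 * (1 / 5) * (-9.0 * np.sin(3 * x0))  # mu_2(Epan) = 1/5
# Expect rough agreement near -0.009 (higher-order terms remain).
print(f"Monte Carlo bias {mc_bias:+.4f}  vs  theory {theory:+.4f}")
```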
Why Odd Degrees Are Better
A purely asymptotic fact, derived in Fan and Gijbels (1996) Ch 3.
The asymptotic interior bias of the degree-$p$ fit is
$$\mathbb{E}[\hat m_p(x_0)] - m(x_0) \;=\; \frac{h^{p+1}}{(p+1)!}\,\mu_{p+1}(K_p^\ast)\, m^{(p+1)}(x_0) + o(h^{p+1}),$$
where $K_p^\ast$ is the equivalent kernel of the degree-$p$ fit.
For $p$ even, $p+1$ is odd, so $\mu_{p+1}(K_p^\ast) = 0$ by symmetry, but the design-density correction does not vanish: the leading bias has rate $h^{p+2}$ times a nonzero constant that depends on the input distribution (through $f'/f$) as well as on $m^{(p+1)}$ and $m^{(p+2)}$. For $p$ odd, $p+1$ is even, so $\mu_{p+1}(K_p^\ast) \neq 0$, but the design-density correction vanishes. The pair $(p, p+1)$ with $p$ even therefore shares the same leading bias order $h^{p+2}$, but the odd-degree leading constant is simpler.
What "odd degrees are better" buys you in practice: going from (NW) to (local linear) gets you a free upgrade. Going from to does not improve the rate at all (both are bias). The next non-trivial upgrade is , which moves bias from to at the cost of substantially higher variance.
Bandwidth Selection
Same machinery as NW. Leave-one-out cross-validation is the dominant choice in practice for local linear; plug-in methods using an estimated $m''$ work and are slightly more efficient asymptotically.
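Because the local polynomial fit is a linear smoother, the leave-one-out residual at $x_i$ is $(y_i - \hat m(x_i))/(1 - L_{ii})$, and at its own query point $L_{ii} = K(0)\,[(X^\top W X)^{-1}]_{00}$, since row $i$ of the local design is $(1, 0, \dots, 0)$. A minimal sketch, given data arrays `x, y` and reusing `epanechnikov` from above; the candidate-bandwidth grid is illustrative:

```python
def loo_cv_score(h, x, y, p=1, kernel=epanechnikov):
    """LOO-CV score of a degree-p local polynomial smoother,
    via the linear-smoother identity resid_i / (1 - L_ii)."""
    n, K0 = len(x), float(kernel(np.array(0.0)))
    resid = np.empty(n)
    for i in range(n):
        w = kernel((x - x[i]) / h)
        X = np.vander(x - x[i], N=p + 1, increasing=True)
        XtW = X.T * w
        A = np.linalg.inv(XtW @ X)
        yhat = (A @ (XtW @ y))[0]
        Lii = K0 * A[0, 0]            # row i of X is e_1, weight K(0)
        resid[i] = (y[i] - yhat) / (1.0 - Lii)
    return np.mean(resid**2)

hs = np.linspace(0.05, 0.5, 20)       # illustrative candidate grid
h_star = min(hs, key=lambda h: loo_cv_score(h, x, y))
```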
A practical refinement: variable bandwidth. Different bandwidths at
different $x_0$ help in regions where curvature or data density varies. The
standard fix is to let $h(x_0)$ grow where the data are sparse, which
flattens the variance across the support. This is what loess does by
selecting an effective neighbourhood size (a span) rather than a global
bandwidth.
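A minimal span-based bandwidth in the loess spirit; the fraction `frac` is the tuning parameter and the names are illustrative:

```python
def span_bandwidth(x0, x, frac=0.3):
    """Distance to the ceil(frac*n)-th nearest observation, so a
    fixed fraction of the data lies in the window at every x0."""
    k = max(1, int(np.ceil(frac * len(x))))
    return np.sort(np.abs(x - x0))[k - 1]

# Drop-in use: local_poly(x0, x, y, h=span_bandwidth(x0, x))
```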
Implementation Notes
The per-query-point cost is $O(n(p+1)^2)$ for naive evaluation: compute all $n$ weights, build the weighted normal equations, solve a $(p+1)\times(p+1)$ system in $O((p+1)^3)$. Evaluating on a grid of $m$ points costs $O(mn(p+1)^2)$.
Two practical accelerations.
Truncate the kernel. For kernels with bounded support (Epanechnikov, tricube) only the observations within $h$ of the query point have nonzero weight. Use a sort or a kd-tree on the $x_i$ to find them in $O(\log n + k)$ per query, dropping the per-query cost to $O(k(p+1)^2)$ where $k$ is the local count.
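A sketch of this windowed evaluation, assuming the inputs have been sorted once up front (names illustrative):

```python
def local_poly_window(x0, xs, ys, h, p=1, kernel=epanechnikov):
    """Same fit as local_poly, restricted to the k points with
    |x - x0| <= h, found by binary search; xs must be sorted."""
    lo = np.searchsorted(xs, x0 - h, side="left")
    hi = np.searchsorted(xs, x0 + h, side="right")
    x, y = xs[lo:hi], ys[lo:hi]       # only these have nonzero weight
    w = kernel((x - x0) / h)
    # Assumes the window holds at least p+1 points; widen h otherwise.
    X = np.vander(x - x0, N=p + 1, increasing=True)
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)[0]
```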
Updating along a grid. Adjacent grid points share most of their
neighbours. Incrementally update the kernel sums and the normal-equation
matrix as points enter and leave the window while you slide, rather than
rebuilding them from scratch. This is how the classic loess algorithm
amortizes the per-point cost.
The R reference implementation is loess() in the standard stats
package; the canonical software paper is Cleveland and Devlin (1988).
Canonical Example
Boundary fix compared head-to-head
Generate $(x_i, y_i)_{i=1}^{n}$ with the $x_i$ drawn iid on $[0,1]$ and $y_i = m(x_i) + \varepsilon_i$ for a smooth $m$, with iid Gaussian $\varepsilon_i$.
Fit three estimators with the Epanechnikov kernel, each with a bandwidth chosen by leave-one-out CV.
| Estimator | MSE on interior | MSE on boundary |
|---|---|---|
| Nadaraya-Watson ($p=0$) | 0.020 | 0.18 |
| Local linear ($p=1$) | 0.018 | 0.038 |
| Local cubic ($p=3$) | 0.016 | 0.041 |
Local linear cuts the boundary MSE by roughly a factor of five without paying anything in the interior. Local cubic improves the interior fit further (the function is smooth, so the $O(h^4)$ bias kicks in) but adds variance, visible as a small jump in the boundary MSE compared to local linear.
The bandwidth that LOO-CV selects is noticeably larger for $p=3$ than for $p=1$: higher-degree fits need wider windows because the local fit has more parameters to estimate.
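A sketch of the experiment, reusing `local_poly_window` from the implementation notes. The particular $m$, noise level, bandwidths, and interior cutoff below are illustrative stand-ins, so the MSEs will not reproduce the table exactly, but the ordering (NW's boundary blow-up, local linear's fix, local cubic's variance tax) should:

```python
rng = np.random.default_rng(1)
n = 500
x = np.sort(rng.uniform(0.0, 1.0, n))          # sorted for the window search
m = lambda t: np.sin(4 * t)                     # smooth truth (illustrative)
y = m(x) + 0.3 * rng.standard_normal(n)

grid = np.linspace(0.0, 1.0, 201)
interior = (grid >= 0.1) & (grid <= 0.9)        # illustrative cutoff
for p, h in [(0, 0.08), (1, 0.08), (3, 0.16)]:  # in practice pick h by LOO-CV
    fit = np.array([local_poly_window(g, x, y, h, p=p) for g in grid])
    sq = (fit - m(grid))**2
    print(f"p={p}: interior MSE {sq[interior].mean():.4f}, "
          f"boundary MSE {sq[~interior].mean():.4f}")
```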
Common Confusions
Local linear is not the same as a linear regression on transformed coordinates
Local linear fits a different polynomial at each query point. There is no global linear function; the output $\hat m(x)$ is genuinely nonlinear in $x$. The "linear" refers to the polynomial degree of the local fit, not to a global model.
Going from p=1 to p=2 is dominated by going to p=3
With optimal bandwidth, the interior MSE rate of the degree-$p$ fit is $n^{-(2p+2)/(2p+3)}$ for $p$ odd, and the even degree below it, $p-1$, achieves the same rate, so the rates come in pairs: $p\in\{0,1\}$ give $n^{-4/5}$ and $p\in\{2,3\}$ give $n^{-8/9}$. Within a pair the odd degree has the same asymptotic variance and the simpler bias constant, so the even member is dominated: $p=2$ buys the same rate as $p=3$ at the same variance, but with a design-dependent bias constant and worse boundary behavior. Use $p=1$ or $p=3$.
Variable bandwidth is a different decision from kernel choice
Choosing a kernel and choosing whether the bandwidth varies with $x_0$ are orthogonal. The loess of Cleveland is local polynomial fitting with tricube weighting AND a span-based variable bandwidth (the bandwidth is set so that a fixed fraction of the data lies in the window at every $x_0$). The two choices are independent and can be mixed freely.
Exercises
Problem
Show that the local-linear estimator at $x_0$ can be written as a linear combination of the $y_i$: $\hat m(x_0) = \sum_{i=1}^{n} l_i(x_0)\, y_i$, with weights $l_i(x_0)$ depending on the $x_i$ and the kernel but not on the $y_i$. Verify $\sum_i l_i(x_0) = 1$ and $\sum_i l_i(x_0)\,(x_i - x_0) = 0$.
Problem
A function $m$ is smooth in the interior but has a kink at $x^\ast$ (i.e. $m$ is continuous but $m'$ has a jump). Show that the local-linear estimator has bias $O(h)$ in a neighbourhood of $x^\ast$, not $O(h^2)$. Sketch a wavelet- or local-polynomial-with-kink-detection estimator that recovers the $O(h^2)$ bias away from the kink.
Problem
For local cubic ($p=3$) regression with bandwidth $h$ and the Gaussian kernel, compute the leading bias constant for the $h^4$ term. Compare with the leading variance constant in the asymptotic MSE. At what bandwidth does the variance term dominate the bias term?
References
Canonical:
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.1.1 "Local Linear Regression" (pp. 194-196), §6.1.2 "Local Polynomial Regression" (pp. 197-198). The textbook account, including the boundary-fix motivation.
- Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall. The definitive monograph; asymptotic theory, bandwidth selection, variable bandwidth.
- Wasserman. All of Nonparametric Statistics. Springer (2006). §5.4 (pp. 71-77). Concise graduate-statistics treatment.
Foundational:
- Stone, C. J. (1977). "Consistent Nonparametric Regression." Annals of Statistics 5(4), 595-620. Original framework.
- Cleveland, W. S. (1979). "Robust Locally Weighted Regression and Smoothing Scatterplots." Journal of the American Statistical Association 74(368), 829-836. Introduces the loess refinement (robust local linear with iterative re-weighting).
- Cleveland, W. S. and Devlin, S. J. (1988). "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting." Journal of the American Statistical Association 83(403), 596-610. The standard reference for the software.
Higher dimensions:
- Ruppert, D. and Wand, M. P. (1994). "Multivariate Locally Weighted Least Squares Regression." Annals of Statistics 22(3), 1346-1370. Extension to $\mathbb{R}^d$ with multivariate kernels.
Generalized additive models bridge:
- Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall. Local polynomial as the building block for the backfitting algorithm.
Next Topics
- Smoothing splines: the penalty-based alternative to local fitting. Equivalent kernel; cleaner global fit.
- Generalized additive models: uses local polynomial as the per-coordinate smoother.
- B-splines: basis-function fits without a kernel window.
- Nadaraya-Watson kernel regression: the local-constant predecessor.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Linear Regression (layer 1 · tier 1)
- Nadaraya-Watson Kernel Regression (layer 2 · tier 1)
- Bias-Variance Tradeoff (layer 2 · tier 2)
- Kernels and Reproducing Kernel Hilbert Spaces (layer 3 · tier 2)
Derived topics
No published topic currently declares this as a prerequisite.