
Learning Theory

Local Polynomial Regression

Replace the local-constant fit of Nadaraya-Watson with a local degree-$p$ polynomial fit. Same $n^{-4/5}$ rate as NW, no extra boundary bias, automatic removal of the design-density bias term, and the bias-variance asymmetry between odd and even degrees that makes $p = 1$ the practical default.

Advanced · Tier 1 · Stable · Supporting · ~50 min
For: ML, Stats, Research

Why This Matters

The Nadaraya-Watson estimator fits a constant at every query point. Local polynomial regression replaces the constant with a low-degree polynomial. The change is small to describe and large in effect: the boundary bias goes from $O(h)$ to $O(h^2)$, the design-density correction term in the interior bias drops out, and the implementation is a single weighted least squares per query point. ESL 2nd ed. §6.1.1 (p. 194) introduces this as the canonical fix for NW's boundary problem.

The reason this earns its own page on a site that already has NW: local polynomial regression is the version that actually ships in statistical software. R's loess(), statsmodels' lowess, and essentially every kernel-regression implementation written since the mid-1990s fit a local polynomial of degree one or two by default, not a local constant. The asymptotic theory is in Fan and Gijbels (1996), which is also where the "odd degrees are better" observation appears explicitly. The same machinery becomes the building block for generalized additive models when each component is fit independently.

Quick Version

  • Estimator at $x$: solve $\min_{\beta} \sum_i K_h(x - X_i)\,\big(Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p\big)^2$; return $\hat{\beta}_0$.
  • Local-linear bias ($p = 1$): $\tfrac{1}{2} m''(x)\, \sigma_K^2 h^2 + o(h^2)$ in the interior; the same $O(h^2)$ order at the boundary, with a boundary-adjusted constant.
  • Local-linear variance ($p = 1$): $\dfrac{R(K)\, \sigma^2(x)}{n h f(x)} + o\big((nh)^{-1}\big)$.
  • Optimal $h$: $\propto n^{-1/5}$.
  • Optimal MSE: $O(n^{-4/5})$.
  • Local cubic bias ($p = 3$): $O(h^4)$, at the cost of a larger variance constant.
  • Odd-degree advantage: $p = 1$ beats $p = 0$, and $p = 3$ beats $p = 2$, in interior bias.

The design-density term $m'(x)\, f'(x)/f(x)$ that plagues NW vanishes for local-linear and every other odd-degree fit. The boundary bias drops to $O(h^2)$; for NW it stays at $O(h)$ there. These two improvements are essentially free.

Formal Setup

Definition

Local Polynomial Estimator of Degree p

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$, a kernel $K$, a bandwidth $h > 0$, and a polynomial degree $p \geq 0$, solve
$$\hat{\beta}(x) = \arg\min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^n K_h(x - X_i) \Big(Y_i - \sum_{j=0}^p \beta_j (X_i - x)^j\Big)^2.$$
The local polynomial estimator is $\hat{m}^{(p)}_h(x) = \hat{\beta}_0(x)$, the intercept of the local fit. The slope $\hat{\beta}_1(x)$ is an estimator of $m'(x)$, the local second derivative $2\hat{\beta}_2(x)$ estimates $m''(x)$, and so on.

In matrix form, with $\boldsymbol{X}_x$ the $n \times (p+1)$ design matrix with rows $(1, X_i - x, \ldots, (X_i - x)^p)$ and $\boldsymbol{W}_x = \mathrm{diag}\big(K_h(x - X_i)\big)$,
$$\hat{\beta}(x) = (\boldsymbol{X}_x^\top \boldsymbol{W}_x \boldsymbol{X}_x)^{-1} \boldsymbol{X}_x^\top \boldsymbol{W}_x \boldsymbol{Y}.$$
The case $p = 0$ recovers Nadaraya-Watson. The case $p = 1$ is "local linear regression" and is the default in practice.
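A minimal NumPy sketch of this weighted least squares, evaluated at a single query point. The function name, the default Epanechnikov weight, and the argument layout are illustrative, not taken from any library.

```python
import numpy as np

def local_poly_fit(x0, X, Y, h, p=1, kernel=None):
    """Degree-p local polynomial estimate of m(x0) by weighted least squares."""
    if kernel is None:
        kernel = lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)  # Epanechnikov on [-1, 1]
    w = kernel((x0 - X) / h) / h                      # K_h(x0 - X_i)
    D = np.vander(X - x0, N=p + 1, increasing=True)   # rows (1, X_i - x0, ..., (X_i - x0)^p)
    WD = w[:, None] * D
    beta = np.linalg.solve(D.T @ WD, WD.T @ Y)        # (X'WX)^{-1} X'W Y
    return beta[0]                                    # intercept; beta[1] estimates m'(x0)
```

Evaluating $\hat{m}^{(p)}_h$ on a grid is just a loop of such solves, one per query point.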

Bias-Variance for Local Linear

Theorem

Asymptotic Pointwise MSE for Local Linear Regression

Statement

Under standard smoothness conditions ($m$ twice continuously differentiable near $x$, $f$ continuously differentiable and positive at $x$, $\sigma^2$ continuous at $x$), the local linear estimator ($p = 1$) at an interior point $x$ satisfies
$$\mathbb{E}\big[\hat{m}^{(1)}_h(x) \mid X_1, \ldots, X_n\big] - m(x) = \tfrac{1}{2} m''(x)\, \sigma_K^2 h^2 + o_p(h^2),$$
$$\mathrm{Var}\big(\hat{m}^{(1)}_h(x) \mid X_1, \ldots, X_n\big) = \frac{R(K)\, \sigma^2(x)}{n h f(x)} + o_p\big((nh)^{-1}\big).$$
At a boundary point $x = a$, the bias remains $O(h^2)$ with a leading constant depending on a boundary-adjusted moment of $K$ over $[0, \infty)$.

Intuition

The local-constant fit absorbs only the level. Bias picks up both curvature and design-density slope because the constant is averaged over a tilted window. The local-linear fit also absorbs the slope. Curvature remains because the polynomial degree is too low to absorb the quadratic part of $m$, but the design-density term cancels: it lives in the linear part of the local Taylor expansion, and that linear part is now fitted away.
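For contrast, the Nadaraya-Watson ($p = 0$) interior bias carries exactly the term that the linear part of the fit removes. This is the standard expansion (see e.g. Fan and Gijbels (1996) Ch 3), stated here for reference:
$$\mathbb{E}\big[\hat{m}^{(0)}_h(x) \mid X_1, \ldots, X_n\big] - m(x) = \Big(\tfrac{1}{2} m''(x) + \frac{m'(x)\, f'(x)}{f(x)}\Big)\, \sigma_K^2 h^2 + o_p(h^2),$$
versus $\tfrac{1}{2} m''(x)\, \sigma_K^2 h^2 + o_p(h^2)$ for local linear.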

At the boundary, the same cancellation works because the local-linear fit adjusts the slope at the boundary. NW cannot do this because it has no slope parameter; the boundary-window asymmetry then leaks into the level.

Why It Matters

This is the result that elevates local polynomial regression over NW in every applied context. Same asymptotic rate; a cleaner bias with no design-density term; no boundary penalty; no extra computation beyond one small weighted least squares per query point. ESL 2nd ed. §6.1.1 (p. 194) and Fan and Gijbels (1996) Ch 3 are the standard references.

Failure Mode

In the interior the local linear estimator has the same leading-order variance as NW, but in finite samples it is somewhat more variable, especially in sparse regions and near the boundary, where the local design matrix can be close to singular. The trade is worth it almost always, but if the bias is genuinely zero (the truth is a constant, say) NW can have lower finite-sample MSE. Higher-degree fits ($p = 2, 3, \ldots$) have lower bias but a variance constant that grows with the degree; $p = 1$ is the universal default and $p = 3$ is the right choice when the curvature is large and the sample size supports it.

Optional Proof: Derivation of the local linear bias-variance formulas

Following Fan and Gijbels (1996) Ch 3 and ESL 2nd ed. §6.1.1.

Define the local moments $s_{n,j}(x) = \tfrac{1}{nh} \sum_i K_h(x - X_i)\,(X_i - x)^j$. For local linear ($p = 1$),
$$\hat{m}^{(1)}_h(x) = \frac{s_{n,2}(x) \sum_i K_h(x - X_i)\, Y_i \;-\; s_{n,1}(x) \sum_i K_h(x - X_i)\, Y_i (X_i - x)}{n h\,\big(s_{n,0}(x)\, s_{n,2}(x) - s_{n,1}(x)^2\big)}.$$

Expectation calculations use $\mathbb{E}[s_{n,j}(x)] = h^j \sigma_K^{(j)} f(x) + O(h^{j+1})$, where $\sigma_K^{(j)} = \int u^j K(u)\, du$. Since $K$ is symmetric, $\sigma_K^{(j)} = 0$ for odd $j$. Hence $\mathbb{E}[s_{n,1}(x)] = O(h^2)$, which is the key fact that wipes out the design-density correction.

Plug the Taylor expansion $m(X_i) = m(x) + m'(x)(X_i - x) + \tfrac{1}{2} m''(x)(X_i - x)^2 + o\big((X_i - x)^2\big)$ into the numerator and use the moment relations. The constant and linear parts of $m$ are absorbed exactly by the local-linear fit. The quadratic part contributes $\tfrac{1}{2} m''(x)\, \sigma_K^2 h^2$. The design-density slope $f'$ enters only through $s_{n,1}(x)$, whose leading term is proportional to $\sigma_K^{(1)} = 0$, so its contribution is pushed to higher order and drops out of the $h^2$ bias.

At a boundary point $x = a$ the kernel moments change because the kernel is effectively truncated to $[0, \infty)$. The local-linear estimator re-weights so that the truncated first moment $\sigma_K^{(1)}\big|_{\mathrm{trunc}}$ is effectively zero in the cancellation; the leading bias remains $O(h^2)$ with a boundary-adjusted constant. See Fan and Gijbels (1996) §3.3.

Why Odd Degrees Are Better

A purely-asymptotic fact, derived in Fan and Gijbels (1996) Ch 3.

The asymptotic interior bias of $\hat{m}^{(p)}_h(x)$ depends on the parity of $p$. For $p$ odd (so $p + 1$ even) it is
$$\frac{m^{(p+1)}(x)}{(p+1)!}\, \kappa_{p+1}\, h^{p+1} + o(h^{p+1}),$$
where $\kappa_{p+1}$ is a kernel-dependent moment constant ($\kappa_2 = \sigma_K^2$ for $p = 1$) and the design density does not appear. For $p$ even (so $p + 1$ odd), the would-be leading term is multiplied by $\sigma_K^{(p+1)} = 0$ and vanishes by symmetry; the leading bias is then of order $h^{p+2}$, with a constant that involves both $m^{(p+2)}(x)$ and a design-density correction proportional to $m^{(p+1)}(x)\, f'(x)/f(x)$.

The upshot is that degrees pair up: the even degree $p$ and the odd degree $p + 1$ have interior bias of the same order $h^{p+2}$, but the odd fit has no design-density term in its leading constant, keeps the same bias order at the boundary (the even fit drops back to $O(h^{p+1})$ there), and pays no extra leading-order variance. This is why $p = 1$ dominates $p = 0$ and $p = 3$ dominates $p = 2$.

What "odd degrees are better" buys you in practice: going from p=0p = 0 (NW) to p=1p = 1 (local linear) gets you a free upgrade. Going from p=1p = 1 to p=2p = 2 does not improve the rate at all (both are O(h2)O(h^2) bias). The next non-trivial upgrade is p=1p=3p = 1 \to p = 3, which moves bias from O(h2)O(h^2) to O(h4)O(h^4) at the cost of substantially higher variance.

Bandwidth Selection

Same machinery as NW. Leave-one-out cross-validation is the dominant choice in practice for local linear; plug-in methods using an estimated $m''$ work and are slightly more efficient asymptotically.
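A minimal sketch of leave-one-out CV over a bandwidth grid, reusing the local_poly_fit sketch from the definition above; the grid is supplied by the caller, and this brute-force version refits $n$ times per candidate bandwidth.

```python
import numpy as np

def loo_cv_bandwidth(X, Y, h_grid, p=1):
    """Return the bandwidth in h_grid minimizing the leave-one-out squared error."""
    n = len(X)
    scores = []
    for h in h_grid:
        sse = 0.0
        for i in range(n):
            keep = np.arange(n) != i              # drop the i-th observation
            pred = local_poly_fit(X[i], X[keep], Y[keep], h, p=p)
            sse += (Y[i] - pred) ** 2
        scores.append(sse / n)
    return h_grid[int(np.argmin(scores))]
```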

A practical refinement: variable bandwidth. Different bandwidths at different $x$ help when the curvature or the design density varies across the support. A simple and common choice is a nearest-neighbour (span-based) bandwidth that grows where the estimated density $\hat{f}(x)$ is small, keeping the effective local sample size roughly constant. This is what loess does by selecting an effective neighbourhood size rather than a global bandwidth.

Implementation Notes

The per-query-point cost is $O(n)$ for naive evaluation: compute all weights $K_h(x - X_i)$, build the weighted normal equations, solve a $(p+1) \times (p+1)$ system. Evaluating on a grid of $m$ points costs $O(mn)$.

Two practical accelerations.

Truncate the kernel. For kernels with bounded support (Epanechnikov, tricube) only the observations within $h$ of the query point have nonzero weight. Use a sort or a kd-tree on the $X_i$ to find them in $O(\log n)$ per query, dropping the per-query cost to $O(k)$ where $k$ is the local count.
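A minimal sketch of the truncation idea, assuming the $X_i$ have been sorted once up front so a binary search finds the active window; the function name and weight choice are illustrative.

```python
import numpy as np

def local_linear_sorted(x0, Xs, Ys, h):
    """Local linear fit at x0 using only the points within h of x0 (Xs sorted ascending)."""
    lo = np.searchsorted(Xs, x0 - h, side="left")
    hi = np.searchsorted(Xs, x0 + h, side="right")
    Xw, Yw = Xs[lo:hi], Ys[lo:hi]                  # the k points with nonzero weight
    u = (Xw - x0) / h
    w = 0.75 * (1.0 - u**2)                        # Epanechnikov weights; support already enforced
    D = np.column_stack([np.ones_like(Xw), Xw - x0])
    WD = w[:, None] * D
    beta = np.linalg.solve(D.T @ WD, WD.T @ Yw)
    return beta[0]
```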

Updating along a grid. Adjacent grid points share most of their neighbours, so the window membership and the associated sums can be updated incrementally as you slide rather than rebuilt from scratch. (The classic loess implementation gets a similar amortization a different way: it computes exact fits only at a small set of points and interpolates between them.)

The R reference implementation is loess() in the standard stats package; the canonical software paper is Cleveland and Devlin (1988).

Canonical Example

Example

Boundary fix compared head-to-head

Generate $n = 300$ points with $X_i \sim \mathrm{Uniform}([0, 1])$ and $Y_i = \cos(2\pi X_i) + 0.3\, \varepsilon_i$, with $\varepsilon_i$ iid $\mathcal{N}(0, 1)$.

Fit three estimators with the Epanechnikov kernel, with the bandwidth for each chosen by leave-one-out CV.

MSE on the interior $[0.1, 0.9]$ and on the boundary $[0, 0.05] \cup [0.95, 1]$:

  • Nadaraya-Watson ($p = 0$): interior 0.020, boundary 0.18
  • Local linear ($p = 1$): interior 0.018, boundary 0.038
  • Local cubic ($p = 3$): interior 0.016, boundary 0.041

Local linear cuts the boundary MSE by roughly $5\times$ without paying anything in the interior. Local cubic improves the interior bias further (the function is smooth, so the $O(h^4)$ bias kicks in) but adds variance, visible as a small jump in the boundary MSE compared to local linear.

The bandwidth that LOO-CV selects is roughly $h \approx 0.08$ for $p = 1$ and roughly $h \approx 0.12$ for $p = 3$: higher-degree fits need wider windows because the local fit has more parameters to estimate.
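A sketch of this simulation, reusing the local_poly_fit function from the definition above, with the bandwidths fixed at the values quoted rather than re-run through CV. The exact MSE numbers will move with the seed, grid, and bandwidths, but the NW boundary blow-up is robust.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(0.0, 1.0, n)
Y = np.cos(2 * np.pi * X) + 0.3 * rng.standard_normal(n)
m_true = lambda x: np.cos(2 * np.pi * x)

grid = np.linspace(0.0, 1.0, 201)
interior = (grid >= 0.1) & (grid <= 0.9)
boundary = (grid <= 0.05) | (grid >= 0.95)

for p, h in [(0, 0.08), (1, 0.08), (3, 0.12)]:
    fit = np.array([local_poly_fit(x0, X, Y, h, p=p) for x0 in grid])
    sq_err = (fit - m_true(grid)) ** 2
    print(f"p={p}: interior MSE {sq_err[interior].mean():.3f}, "
          f"boundary MSE {sq_err[boundary].mean():.3f}")
```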

Common Confusions

Watch Out

Local linear is not the same as a linear regression on transformed coordinates

Local linear fits a different polynomial at each query point. There is no global linear function; the output $\hat{m}^{(1)}_h$ is genuinely nonlinear in $x$. The "linear" refers to the polynomial degree of the local fit, not to a global model.

Watch Out

Even degrees are dominated by the next odd degree

The interior MSE rate of $\hat{m}^{(p)}_h$ is $n^{-2(p+1)/(2p+3)}$ for odd $p$, and the even degree just below it attains the same rate: $p = 0$ and $p = 1$ both give $n^{-4/5}$, and $p = 2$ and $p = 3$ both give $n^{-8/9}$. The rate gains therefore come in steps of two (e.g. $p = 1 \to p = 3$). Within each pair the even fit carries a design-dependent bias constant, a boundary bias one order worse than its interior bias, and no leading-order variance saving, so it is never the right choice. Use $p = 1$ or $p = 3$.

Watch Out

Variable bandwidth is a different decision from kernel choice

Choosing a kernel and choosing whether the bandwidth varies with $x$ are orthogonal decisions. Cleveland's loess is a local polynomial fit (degree 1 or 2) with tricube weighting AND a span-based variable bandwidth (the bandwidth at each $x$ is set so that a fixed fraction of the data lies in the window). The two choices are independent and can be mixed freely.
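A minimal sketch of the span-based bandwidth on its own, the variable-bandwidth half of that recipe: at each query point the bandwidth is the distance to the $k$-th nearest neighbour, with $k$ a fixed fraction of the sample. The function name and the default span are illustrative.

```python
import numpy as np

def knn_bandwidth(x0, X, span=0.3):
    """Span-based bandwidth: distance from x0 to its ceil(span * n)-th nearest neighbour."""
    k = max(int(np.ceil(span * len(X))), 2)   # at least 2 points so a local line is identifiable
    return np.sort(np.abs(X - x0))[k - 1]
```

Feeding knn_bandwidth(x0, X) in place of a fixed $h$ into the local_poly_fit sketch above gives a crude loess-style smoother.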

Exercises

ExerciseCore

Problem

Show that the local-linear estimator at $x$ can be written as a linear combination of the $Y_i$: $\hat{m}^{(1)}_h(x) = \sum_i w_i(x)\, Y_i$ with weights $w_i(x)$ depending on the $X_j$ and the kernel but not on the $Y_j$. Verify $\sum_i w_i(x) = 1$ and $\sum_i (X_i - x)\, w_i(x) = 0$.

ExerciseAdvanced

Problem

A function $m$ is smooth in the interior $[0.1, 0.9]$ but has a kink at $x = 0.5$ (i.e. $m$ is continuous but $m'$ has a jump). Show that the local-linear estimator has bias $O(h)$ in a neighbourhood of $x = 0.5$, not $O(h^2)$. Sketch a wavelet- or local-polynomial-with-kink-detection estimator that recovers $O(h^2)$ away from the kink.

ExerciseResearch

Problem

For local cubic ($p = 3$) regression with bandwidth $h$ and the Gaussian kernel, compute the leading bias constant $\sigma_K^{(4)} / 24$ for the $h^4$ term. Compare with the leading variance constant in the asymptotic MSE. At what bandwidth does the variance dominate the bias?

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.1.1 "Local Linear Regression" (pp. 194-196), §6.1.2 "Local Polynomial Regression" (pp. 197-198). The textbook account, including the boundary-fix motivation.
  • Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall. The definitive monograph; asymptotic theory, bandwidth selection, variable bandwidth.
  • Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 5.4 (pp. 71-77). Concise graduate-statistics treatment.

Foundational:

  • Stone, C. J. (1977). "Consistent Nonparametric Regression." Annals of Statistics 5(4), 595-620. Original framework.
  • Cleveland, W. S. (1979). "Robust Locally Weighted Regression and Smoothing Scatterplots." Journal of the American Statistical Association 74(368), 829-836. Introduces the loess refinement (robust local linear with iterative re-weighting).
  • Cleveland, W. S. and Devlin, S. J. (1988). "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting." Journal of the American Statistical Association 83(403), 596-610. The standard reference for the software.

Higher dimensions:

  • Ruppert, D. and Wand, M. P. (1994). "Multivariate Locally Weighted Least Squares Regression." Annals of Statistics 22(3), 1346-1370. Extension to $\mathbb{R}^d$ with multivariate kernels.

Generalized additive models bridge:

  • Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall. Local polynomial as the building block for the backfitting algorithm.


Last reviewed: May 13, 2026
