
Learning Theory

Local Polynomial Regression

Replace the local-constant fit of Nadaraya-Watson with a local degree-$p$ polynomial fit. Same $n^{-4/5}$ rate as NW, no extra boundary bias, automatic removal of the design-density bias term, and the bias-variance asymmetry between odd and even degrees that makes $p = 1$ the practical default.

Advanced · Tier 1 · Stable · Supporting · ~50 min
For: ML, Stats, Research

Why This Matters

The Nadaraya-Watson estimator fits a constant at every query point. Local polynomial regression replaces the constant with a low-degree polynomial. The change is small to describe and large in effect: the boundary bias goes from $O(h)$ to $O(h^2)$, the design-density correction term in the interior bias drops out, and the implementation is a single weighted least squares per query point. ESL 2nd ed. §6.1.1 (p. 194) introduces this as the canonical fix for NW's boundary problem.

The reason this earns its own page on a site that already has NW: local polynomial regression is the version that actually ships in statistical software. R's loess(), statsmodels' lowess, and essentially every kernel-regression implementation written since the mid-1990s fit a local polynomial of degree one or two by default, not a local constant. The asymptotic theory is in Fan and Gijbels (1996), which is also where the "odd degrees are better" observation appears explicitly. The same machinery becomes the building block for generalized additive models when each component is fit independently.

Quick Version

  • Estimator at $x$: solve $\min_{\beta} \sum_i K_h(x - X_i)\,\big(Y_i - \beta_0 - \beta_1 (X_i - x) - \cdots - \beta_p (X_i - x)^p\big)^2$; return $\hat{\beta}_0$.
  • Local-linear bias ($p = 1$): $\tfrac{1}{2} m''(x)\, \sigma_K^2 h^2 + o(h^2)$ in the interior; the same $O(h^2)$ order at the boundary, with a boundary-adjusted constant.
  • Local-linear variance ($p = 1$): $\dfrac{R(K)\, \sigma^2(x)}{n h f(x)} + o\big((nh)^{-1}\big)$.
  • Optimal $h$: $\propto n^{-1/5}$.
  • Optimal MSE: $O(n^{-4/5})$.
  • Local cubic bias ($p = 3$): $O(h^4)$, at the cost of a larger variance constant.
  • Odd-degree advantage: $p = 1$ beats $p = 0$, and $p = 3$ beats $p = 2$, in interior bias.

The design-density term $m'(x)\, f'(x)/f(x)$ that plagues NW vanishes for local-linear and every other odd-degree fit. The boundary bias drops to $O(h^2)$; for NW it stays at $O(h)$ there. These two improvements are essentially free.

Formal Setup

Definition

Local Polynomial Estimator of Degree p

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$, a kernel $K$, a bandwidth $h > 0$, and a polynomial degree $p \geq 0$, solve
$$\hat{\beta}(x) = \arg\min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^n K_h(x - X_i) \Big(Y_i - \sum_{j=0}^p \beta_j (X_i - x)^j\Big)^2.$$
The local polynomial estimator is $\hat{m}^{(p)}_h(x) = \hat{\beta}_0(x)$, the intercept of the local fit. The slope $\hat{\beta}_1(x)$ is an estimator of $m'(x)$, the local second derivative $2\hat{\beta}_2(x)$ estimates $m''(x)$, and so on.

In matrix form, with $\boldsymbol{X}_x$ the $n \times (p+1)$ design matrix with rows $(1, X_i - x, \ldots, (X_i - x)^p)$ and $\boldsymbol{W}_x = \mathrm{diag}\big(K_h(x - X_i)\big)$,
$$\hat{\beta}(x) = (\boldsymbol{X}_x^\top \boldsymbol{W}_x \boldsymbol{X}_x)^{-1} \boldsymbol{X}_x^\top \boldsymbol{W}_x \boldsymbol{Y}.$$
The case $p = 0$ recovers Nadaraya-Watson. The case $p = 1$ is "local linear regression" and is the default in practice.
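A minimal NumPy sketch of this weighted least squares, evaluated at a single query point. The function name, the default Epanechnikov weight, and the argument layout are illustrative, not taken from any library.

```python
import numpy as np

def local_poly_fit(x0, X, Y, h, p=1, kernel=None):
    """Degree-p local polynomial estimate of m(x0) by weighted least squares."""
    if kernel is None:
        kernel = lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)  # Epanechnikov on [-1, 1]
    w = kernel((x0 - X) / h) / h                      # K_h(x0 - X_i)
    D = np.vander(X - x0, N=p + 1, increasing=True)   # rows (1, X_i - x0, ..., (X_i - x0)^p)
    WD = w[:, None] * D
    beta = np.linalg.solve(D.T @ WD, WD.T @ Y)        # (X'WX)^{-1} X'W Y
    return beta[0]                                    # intercept; beta[1] estimates m'(x0)
```

Evaluating $\hat{m}^{(p)}_h$ on a grid is just a loop of such solves, one per query point.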

Bias-Variance for Local Linear

Theorem

Asymptotic Pointwise MSE for Local Linear Regression

Statement

Under standard smoothness conditions ($m$ twice continuously differentiable near $x$, $f$ continuously differentiable and positive at $x$, $\sigma^2$ continuous at $x$), the local linear estimator ($p = 1$) at an interior point $x$ satisfies
$$\mathbb{E}\big[\hat{m}^{(1)}_h(x) \mid X_1, \ldots, X_n\big] - m(x) = \tfrac{1}{2} m''(x)\, \sigma_K^2 h^2 + o_p(h^2),$$
$$\mathrm{Var}\big(\hat{m}^{(1)}_h(x) \mid X_1, \ldots, X_n\big) = \frac{R(K)\, \sigma^2(x)}{n h f(x)} + o_p\big((nh)^{-1}\big).$$
At a boundary point $x = a$, the bias remains $O(h^2)$ with a leading constant depending on a boundary-adjusted moment of $K$ over $[0, \infty)$.

Intuition

The local-constant fit absorbs only the level. Bias picks up both curvature and design-density slope because the constant is averaged over a tilted window. The local-linear fit also absorbs the slope. Curvature remains because the polynomial degree is too low to absorb the quadratic part of $m$, but the design-density term cancels: it lives in the linear part of the local Taylor expansion, and that linear part is now fitted away.
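For contrast, the Nadaraya-Watson ($p = 0$) interior bias carries exactly the term that the linear part of the fit removes. This is the standard expansion (see e.g. Fan and Gijbels (1996) Ch 3), stated here for reference:
$$\mathbb{E}\big[\hat{m}^{(0)}_h(x) \mid X_1, \ldots, X_n\big] - m(x) = \Big(\tfrac{1}{2} m''(x) + \frac{m'(x)\, f'(x)}{f(x)}\Big)\, \sigma_K^2 h^2 + o_p(h^2),$$
versus $\tfrac{1}{2} m''(x)\, \sigma_K^2 h^2 + o_p(h^2)$ for local linear.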

At the boundary, the same cancellation works because the local-linear fit adjusts the slope at the boundary. NW cannot do this because it has no slope parameter; the boundary-window asymmetry then leaks into the level.

Why It Matters

This is the result that elevates local polynomial regression over NW in every applied context. Same asymptotic rate; a cleaner bias with no design-density term; no boundary penalty; no extra computation beyond one small weighted least squares per query point. ESL 2nd ed. §6.1.1 (p. 194) and Fan and Gijbels (1996) Ch 3 are the standard references.

Failure Mode

In the interior the local linear estimator has the same leading-order variance as NW, but in finite samples it is somewhat more variable, especially in sparse regions and near the boundary, where the local design matrix can be close to singular. The trade is worth it almost always, but if the bias is genuinely zero (the truth is a constant, say) NW can have lower finite-sample MSE. Higher-degree fits ($p = 2, 3, \ldots$) have lower bias but a variance constant that grows with the degree; $p = 1$ is the universal default and $p = 3$ is the right choice when the curvature is large and the sample size supports it.

Optional Proof: Derivation of the local linear bias-variance formulas

Following Fan and Gijbels (1996) Ch 3 and ESL 2nd ed. §6.1.1.

Define the local moments $s_{n,j}(x) = \tfrac{1}{nh} \sum_i K_h(x - X_i)\,(X_i - x)^j$. For local linear ($p = 1$),
$$\hat{m}^{(1)}_h(x) = \frac{s_{n,2}(x) \sum_i K_h(x - X_i)\, Y_i \;-\; s_{n,1}(x) \sum_i K_h(x - X_i)\, Y_i (X_i - x)}{n h\,\big(s_{n,0}(x)\, s_{n,2}(x) - s_{n,1}(x)^2\big)}.$$

Expectation calculations use $\mathbb{E}[s_{n,j}(x)] = h^j \sigma_K^{(j)} f(x) + O(h^{j+1})$, where $\sigma_K^{(j)} = \int u^j K(u)\, du$. Since $K$ is symmetric, $\sigma_K^{(j)} = 0$ for odd $j$. Hence $\mathbb{E}[s_{n,1}(x)] = O(h^2)$, which is the key fact that wipes out the design-density correction.

Plug the Taylor expansion $m(X_i) = m(x) + m'(x)(X_i - x) + \tfrac{1}{2} m''(x)(X_i - x)^2 + o\big((X_i - x)^2\big)$ into the numerator and use the moment relations. The constant and linear parts of $m$ are absorbed exactly by the local-linear fit. The quadratic part contributes $\tfrac{1}{2} m''(x)\, \sigma_K^2 h^2$. The design-density slope $f'$ enters only through $s_{n,1}(x)$, whose leading term is proportional to $\sigma_K^{(1)} = 0$, so its contribution is pushed to higher order and drops out of the $h^2$ bias.

At a boundary point $x = a$ the kernel moments change because the kernel is effectively truncated to $[0, \infty)$. The local-linear estimator re-weights so that the truncated first moment $\sigma_K^{(1)}\big|_{\mathrm{trunc}}$ is effectively zero in the cancellation; the leading bias remains $O(h^2)$ with a boundary-adjusted constant. See Fan and Gijbels (1996) §3.3.

Why Odd Degrees Are Better

A purely-asymptotic fact, derived in Fan and Gijbels (1996) Ch 3.

The asymptotic interior bias of $\hat{m}^{(p)}_h(x)$ depends on the parity of $p$. For $p$ odd (so $p + 1$ even) it is
$$\frac{m^{(p+1)}(x)}{(p+1)!}\, \kappa_{p+1}\, h^{p+1} + o(h^{p+1}),$$
where $\kappa_{p+1}$ is a kernel-dependent moment constant ($\kappa_2 = \sigma_K^2$ for $p = 1$) and the design density does not appear. For $p$ even (so $p + 1$ odd), the would-be leading term is multiplied by $\sigma_K^{(p+1)} = 0$ and vanishes by symmetry; the leading bias is then of order $h^{p+2}$, with a constant that involves both $m^{(p+2)}(x)$ and a design-density correction proportional to $m^{(p+1)}(x)\, f'(x)/f(x)$.

The upshot is that degrees pair up: the even degree $p$ and the odd degree $p + 1$ have interior bias of the same order $h^{p+2}$, but the odd fit has no design-density term in its leading constant, keeps the same bias order at the boundary (the even fit drops back to $O(h^{p+1})$ there), and pays no extra leading-order variance. This is why $p = 1$ dominates $p = 0$ and $p = 3$ dominates $p = 2$.

What "odd degrees are better" buys you in practice: going from p=0p = 0 (NW) to p=1p = 1 (local linear) gets you a free upgrade. Going from p=1p = 1 to p=2p = 2 does not improve the rate at all (both are O(h2)O(h^2) bias). The next non-trivial upgrade is p=1p=3p = 1 \to p = 3, which moves bias from O(h2)O(h^2) to O(h4)O(h^4) at the cost of substantially higher variance.

Bandwidth Selection

Same machinery as NW. Leave-one-out cross-validation is the dominant choice in practice for local linear; plug-in methods using an estimated $m''$ work and are slightly more efficient asymptotically.
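A minimal sketch of leave-one-out CV over a bandwidth grid, reusing the local_poly_fit sketch from the definition above; the grid is supplied by the caller, and this brute-force version refits $n$ times per candidate bandwidth.

```python
import numpy as np

def loo_cv_bandwidth(X, Y, h_grid, p=1):
    """Return the bandwidth in h_grid minimizing the leave-one-out squared error."""
    n = len(X)
    scores = []
    for h in h_grid:
        sse = 0.0
        for i in range(n):
            keep = np.arange(n) != i              # drop the i-th observation
            pred = local_poly_fit(X[i], X[keep], Y[keep], h, p=p)
            sse += (Y[i] - pred) ** 2
        scores.append(sse / n)
    return h_grid[int(np.argmin(scores))]
```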

A practical refinement: variable bandwidth. Different bandwidths at different $x$ help when the curvature or the design density varies across the support. A simple and common choice is a nearest-neighbour (span-based) bandwidth that grows where the estimated density $\hat{f}(x)$ is small, keeping the effective local sample size roughly constant. This is what loess does by selecting an effective neighbourhood size rather than a global bandwidth.

Implementation Notes

The per-query-point cost is $O(n)$ for naive evaluation: compute all weights $K_h(x - X_i)$, build the weighted normal equations, solve a $(p+1) \times (p+1)$ system. Evaluating on a grid of $m$ points costs $O(mn)$.

Two practical accelerations.

Truncate the kernel. For kernels with bounded support (Epanechnikov, tricube) only the observations within $h$ of the query point have nonzero weight. Use a sort or a kd-tree on the $X_i$ to find them in $O(\log n)$ per query, dropping the per-query cost to $O(k)$ where $k$ is the local count.
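A minimal sketch of the truncation idea, assuming the $X_i$ have been sorted once up front so a binary search finds the active window; the function name and weight choice are illustrative.

```python
import numpy as np

def local_linear_sorted(x0, Xs, Ys, h):
    """Local linear fit at x0 using only the points within h of x0 (Xs sorted ascending)."""
    lo = np.searchsorted(Xs, x0 - h, side="left")
    hi = np.searchsorted(Xs, x0 + h, side="right")
    Xw, Yw = Xs[lo:hi], Ys[lo:hi]                  # the k points with nonzero weight
    u = (Xw - x0) / h
    w = 0.75 * (1.0 - u**2)                        # Epanechnikov weights; support already enforced
    D = np.column_stack([np.ones_like(Xw), Xw - x0])
    WD = w[:, None] * D
    beta = np.linalg.solve(D.T @ WD, WD.T @ Yw)
    return beta[0]
```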

Updating along a grid. Adjacent grid points share most of their neighbours, so the window membership and the associated sums can be updated incrementally as you slide rather than rebuilt from scratch. (The classic loess implementation gets a similar amortization a different way: it computes exact fits only at a small set of points and interpolates between them.)

The R reference implementation is loess() in the standard stats package; the canonical software paper is Cleveland and Devlin (1988).

Canonical Example

Example

Boundary fix compared head-to-head

Generate $n = 300$ points with $X_i \sim \mathrm{Uniform}([0, 1])$ and $Y_i = \cos(2\pi X_i) + 0.3\, \varepsilon_i$, with $\varepsilon_i$ iid $\mathcal{N}(0, 1)$.

Fit three estimators with the Epanechnikov kernel, with the bandwidth for each chosen by leave-one-out CV.

MSE on the interior $[0.1, 0.9]$ and on the boundary $[0, 0.05] \cup [0.95, 1]$:

  • Nadaraya-Watson ($p = 0$): interior 0.020, boundary 0.18
  • Local linear ($p = 1$): interior 0.018, boundary 0.038
  • Local cubic ($p = 3$): interior 0.016, boundary 0.041

Local linear cuts the boundary MSE by roughly $5\times$ without paying anything in the interior. Local cubic improves the interior bias further (the function is smooth, so the $O(h^4)$ bias kicks in) but adds variance, visible as a small jump in the boundary MSE compared to local linear.

The bandwidth that LOO-CV selects is roughly $h \approx 0.08$ for $p = 1$ and roughly $h \approx 0.12$ for $p = 3$: higher-degree fits need wider windows because the local fit has more parameters to estimate.
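A sketch of this simulation, reusing the local_poly_fit function from the definition above, with the bandwidths fixed at the values quoted rather than re-run through CV. The exact MSE numbers will move with the seed, grid, and bandwidths, but the NW boundary blow-up is robust.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(0.0, 1.0, n)
Y = np.cos(2 * np.pi * X) + 0.3 * rng.standard_normal(n)
m_true = lambda x: np.cos(2 * np.pi * x)

grid = np.linspace(0.0, 1.0, 201)
interior = (grid >= 0.1) & (grid <= 0.9)
boundary = (grid <= 0.05) | (grid >= 0.95)

for p, h in [(0, 0.08), (1, 0.08), (3, 0.12)]:
    fit = np.array([local_poly_fit(x0, X, Y, h, p=p) for x0 in grid])
    sq_err = (fit - m_true(grid)) ** 2
    print(f"p={p}: interior MSE {sq_err[interior].mean():.3f}, "
          f"boundary MSE {sq_err[boundary].mean():.3f}")
```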

Common Confusions

Watch Out

Local linear is not the same as a linear regression on transformed coordinates

Local linear fits a different polynomial at each query point. There is no global linear function; the output $\hat{m}^{(1)}_h$ is genuinely nonlinear in $x$. The "linear" refers to the polynomial degree of the local fit, not to a global model.

Watch Out

Even degrees are dominated by the next odd degree

The interior MSE rate of $\hat{m}^{(p)}_h$ is $n^{-2(p+1)/(2p+3)}$ for odd $p$, and the even degree just below it attains the same rate: $p = 0$ and $p = 1$ both give $n^{-4/5}$, and $p = 2$ and $p = 3$ both give $n^{-8/9}$. The rate gains therefore come in steps of two (e.g. $p = 1 \to p = 3$). Within each pair the even fit carries a design-dependent bias constant, a boundary bias one order worse than its interior bias, and no leading-order variance saving, so it is never the right choice. Use $p = 1$ or $p = 3$.

Watch Out

Variable bandwidth is a different decision from kernel choice

Choosing a kernel and choosing whether the bandwidth varies with $x$ are orthogonal decisions. Cleveland's loess is a local polynomial fit (degree 1 or 2) with tricube weighting AND a span-based variable bandwidth (the bandwidth at each $x$ is set so that a fixed fraction of the data lies in the window). The two choices are independent and can be mixed freely.
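A minimal sketch of the span-based bandwidth on its own, the variable-bandwidth half of that recipe: at each query point the bandwidth is the distance to the $k$-th nearest neighbour, with $k$ a fixed fraction of the sample. The function name and the default span are illustrative.

```python
import numpy as np

def knn_bandwidth(x0, X, span=0.3):
    """Span-based bandwidth: distance from x0 to its ceil(span * n)-th nearest neighbour."""
    k = max(int(np.ceil(span * len(X))), 2)   # at least 2 points so a local line is identifiable
    return np.sort(np.abs(X - x0))[k - 1]
```

Feeding knn_bandwidth(x0, X) in place of a fixed $h$ into the local_poly_fit sketch above gives a crude loess-style smoother.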

Exercises

ExerciseCore

Problem

Show that the local-linear estimator at $x$ can be written as a linear combination of the $Y_i$: $\hat{m}^{(1)}_h(x) = \sum_i w_i(x)\, Y_i$ with weights $w_i(x)$ depending on the $X_j$ and the kernel but not on the $Y_j$. Verify $\sum_i w_i(x) = 1$ and $\sum_i (X_i - x)\, w_i(x) = 0$.

ExerciseAdvanced

Problem

A function $m$ is smooth in the interior $[0.1, 0.9]$ but has a kink at $x = 0.5$ (i.e. $m$ is continuous but $m'$ has a jump). Show that the local-linear estimator has bias $O(h)$ in a neighbourhood of $x = 0.5$, not $O(h^2)$. Sketch a wavelet- or local-polynomial-with-kink-detection estimator that recovers $O(h^2)$ away from the kink.

ExerciseResearch

Problem

For local cubic ($p = 3$) regression with bandwidth $h$ and the Gaussian kernel, compute the leading bias constant $\sigma_K^{(4)} / 24$ for the $h^4$ term. Compare with the leading variance constant in the asymptotic MSE. At what bandwidth does the variance dominate the bias?

References

Canonical:

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 6 "Kernel Smoothing Methods", §6.1.1 "Local Linear Regression" (pp. 194-196), §6.1.2 "Local Polynomial Regression" (pp. 197-198). The textbook account, including the boundary-fix motivation.
  • Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall. The definitive monograph; asymptotic theory, bandwidth selection, variable bandwidth.
  • Wasserman. All of Nonparametric Statistics. Springer (2006). Ch 5.4 (pp. 71-77). Concise graduate-statistics treatment.

Foundational:

  • Stone, C. J. (1977). "Consistent Nonparametric Regression." Annals of Statistics 5(4), 595-620. Original framework.
  • Cleveland, W. S. (1979). "Robust Locally Weighted Regression and Smoothing Scatterplots." Journal of the American Statistical Association 74(368), 829-836. Introduces the loess refinement (robust local linear with iterative re-weighting).
  • Cleveland, W. S. and Devlin, S. J. (1988). "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting." Journal of the American Statistical Association 83(403), 596-610. The standard reference for the software.

Higher dimensions:

  • Ruppert, D. and Wand, M. P. (1994). "Multivariate Locally Weighted Least Squares Regression." Annals of Statistics 22(3), 1346-1370. Extension to $\mathbb{R}^d$ with multivariate kernels.

Generalized additive models bridge:

  • Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall. Local polynomial as the building block for the backfitting algorithm.


Last reviewed: May 13, 2026
