
ML Methods

K-Nearest Neighbors

Classify by majority vote of the k closest training points: no training phase, universal consistency as n and k grow, and the curse of dimensionality that makes distance meaningless in high dimensions.


Why This Matters

K-nearest neighbors is the simplest nonparametric classifier: store all the training data, and at prediction time, find the $k$ closest points and take a majority vote. There is no training phase at all.

Despite this simplicity, KNN has a remarkable theoretical property: it is universally consistent. As the number of training points $n \to \infty$ and $k \to \infty$ with $k/n \to 0$, KNN converges to the Bayes optimal classifier. This means KNN can learn any decision boundary, given enough data.

The catch is the curse of dimensionality. In high dimensions, the notion of "nearest" breaks down because all points become approximately equidistant. Understanding this tradeoff between the theoretical beauty of consistency and the practical failure in high dimensions is essential.

Mental Model

Imagine the training points scattered in space, each colored by its class. To classify a new point, draw a small ball around it, find the $k$ nearest training points inside (or nearest to) that ball, and predict the majority class. As you get more data, the ball shrinks, and the local vote becomes a better estimate of the true class probability at that point.

Formal Setup and Notation

We have training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, C\}$ (for $C$ classes). Let $\mathcal{D}$ be the joint distribution over $(X, Y)$.

Definition

K-Nearest Neighbors Classifier

Given a query point $x$, let $x_{(1)}, x_{(2)}, \ldots, x_{(k)}$ be the $k$ training points closest to $x$ in Euclidean distance (or another metric), with corresponding labels $y_{(1)}, \ldots, y_{(k)}$. The KNN classifier predicts:

$$\hat{y}(x) = \arg\max_{c \in \{1,\ldots,C\}} \sum_{j=1}^{k} \mathbf{1}[y_{(j)} = c]$$

That is, predict the class that appears most frequently among the $k$ nearest neighbors.
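The definition translates directly into a brute-force classifier. Below is a minimal NumPy sketch (the function name `knn_predict` and the toy two-cluster data are ours, not from any library):

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Majority vote over the k nearest training points (Euclidean)."""
    preds = []
    for x in X_query:
        # Distance from the query to all n training points: O(nd).
        dists = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest training points.
        nn_idx = np.argsort(dists)[:k]
        # Majority vote over their labels.
        labels, counts = np.unique(y_train[nn_idx], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Two well-separated 2-D Gaussian clusters, one per class.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 0.5, (50, 2)),
                     rng.normal(3, 0.5, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_predict(X_train, y_train, np.array([[0.1, 0.2], [2.9, 3.1]])))
# → [0 1]
```

Note that there is no `fit` step: all the work happens inside `knn_predict`, one full distance pass per query.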

Definition

Bayes Optimal Classifier

The Bayes optimal classifier predicts the class with the highest posterior probability:

$$h^*(x) = \arg\max_c \, P(Y = c \mid X = x)$$

Its error rate $R^* = \mathbb{E}[\mathbf{1}[h^*(X) \neq Y]]$ is the Bayes risk, the lowest achievable error rate for any classifier.

Core Definitions

Definition

Lazy Learning

KNN is a lazy learner: it performs no computation during the training phase (just stores the data). All computation is deferred to prediction time. This contrasts with eager learners like SVMs or neural networks that build a model during training.

Definition

Curse of Dimensionality (for KNN)

In high dimensions, the volume of a ball grows exponentially with $d$. To capture a fixed fraction $f$ of the data in a ball of radius $r$ in $\mathbb{R}^d$, we need $r \propto f^{1/d}$. When $d$ is large, even a ball containing a small fraction of points has a radius close to the diameter of the entire dataset. All points become nearly equidistant, and the concept of "nearest" loses meaning.

Main Theorems

Theorem

1-NN Error Bound (Cover and Hart)

Statement

Let $R_{1\text{NN}}$ denote the asymptotic error rate of the 1-nearest neighbor classifier as $n \to \infty$. Then:

$$R^* \leq R_{1\text{NN}} \leq 2R^* - \frac{C}{C-1}(R^*)^2 \leq 2R^*$$

where $R^*$ is the Bayes risk and $C$ is the number of classes. For binary classification ($C = 2$): $R_{1\text{NN}} \leq 2R^*(1 - R^*)$.
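To get a feel for the bound, it helps to evaluate it numerically (the helper name `cover_hart_bound` is ours):

```python
def cover_hart_bound(bayes_risk, num_classes=2):
    """Cover-Hart upper bound on the asymptotic 1-NN error rate."""
    C = num_classes
    return 2 * bayes_risk - (C / (C - 1)) * bayes_risk ** 2

# The binary case reduces to 2 R* (1 - R*): for small R*, 1-NN is at most
# roughly twice as bad as Bayes; more classes loosen the bound toward 2 R*.
print(cover_hart_bound(0.05))                  # ≈ 0.095
print(cover_hart_bound(0.05, num_classes=10))  # ≈ 0.0972
```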

Intuition

The 1-NN classifier effectively sees two independent draws from the distribution near each query point: the query itself and its nearest neighbor. It gets the answer right when both draws agree (which happens most of the time if $R^*$ is small). The factor of 2 arises because 1-NN uses one neighbor's label as a "noisy proxy" for the Bayes decision.

Proof Sketch

As $n \to \infty$, the nearest neighbor $x_{(1)} \to x$ almost surely (by density of the training set). So $Y_{(1)}$ is approximately an independent draw from $P(Y \mid X = x)$. The 1-NN classifier errs when the nearest neighbor has the wrong label. Compute this probability using the conditional class probabilities and bound it using the Bayes risk.

Why It Matters

This theorem says 1-NN is at most twice as bad as the Bayes optimal classifier asymptotically. For a method with zero parameters and zero assumptions about the data distribution, this is surprisingly good. It establishes KNN as a strong baseline.

Failure Mode

The bound is asymptotic ($n \to \infty$). In finite samples and high dimensions, 1-NN can be far worse than $2R^*$ because the nearest neighbor may not actually be "near" in any meaningful sense.

Theorem

Universal Consistency of KNN (Stone's Theorem)

Statement

If $k = k(n)$ satisfies $k(n) \to \infty$ and $k(n)/n \to 0$ as $n \to \infty$, then the $k$-NN classifier is universally consistent:

$$R_{k\text{NN}} \to R^*$$

as $n \to \infty$, for any underlying distribution $\mathcal{D}$.

Intuition

As $n$ grows, the $k$ nearest neighbors all converge to the query point (because $k/n \to 0$, so the neighborhood shrinks). The majority vote over $k$ labels then estimates $P(Y = c \mid X = x)$ accurately (because $k \to \infty$ gives the law of large numbers). The combination drives the error to the Bayes rate.
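This convergence can be watched empirically. Here is a sketch on a synthetic 1-D problem with a known Bayes risk of 0.1 (the distribution and the helper `knn_error` are our own toy setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_error(n, k, n_test=2000):
    """Test error of k-NN on a 1-D toy problem where
    P(Y=1 | x) = 0.9 for x > 0.5 and 0.1 otherwise, so the Bayes
    classifier predicts 1[x > 0.5] and the Bayes risk is 0.1."""
    def sample(m):
        x = rng.random(m)
        y = (rng.random(m) < np.where(x > 0.5, 0.9, 0.1)).astype(int)
        return x, y
    x_tr, y_tr = sample(n)
    x_te, y_te = sample(n_test)
    errors = 0
    for xq, yq in zip(x_te, y_te):
        # Indices of the k nearest training points to the query xq.
        nn = np.argpartition(np.abs(x_tr - xq), k - 1)[:k]
        vote = int(y_tr[nn].sum() * 2 > k)   # majority vote
        errors += vote != yq
    return errors / n_test

# k ~ sqrt(n) satisfies k -> inf and k/n -> 0; the test error
# drifts toward the Bayes risk 0.1 as n grows.
for n in [100, 1000, 10000]:
    print(n, knn_error(n, int(np.sqrt(n))))
```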

Proof Sketch

Stone (1977) proved a general result: any weighted local averaging rule is universally consistent if the weights satisfy certain conditions (they sum to 1, each weight goes to 0, and the weights concentrate near the query point). KNN with $k \to \infty$, $k/n \to 0$ satisfies these conditions. The proof uses the fact that the nearest-neighbor distance shrinks to zero while the number of neighbors grows, ensuring both locality and averaging.

Why It Matters

Universal consistency means KNN can learn any decision boundary given enough data, with no assumptions on the data distribution. This is a property that parametric models (like logistic regression) do not have. It places KNN among the theoretically most powerful classifiers.

Failure Mode

Universal consistency is an asymptotic property. The rate of convergence can be extremely slow in high dimensions. The sample size needed to achieve a given accuracy grows exponentially with the dimension $d$ (curse of dimensionality).

The Curse of Dimensionality in Detail

Consider points uniformly distributed in the unit cube $[0,1]^d$. To capture a fraction $f$ of the data in a hypercube neighborhood, the side length must be $f^{1/d}$. For $f = 0.01$ (1% of the data) in $d = 100$ dimensions, the side length is $0.01^{1/100} \approx 0.955$. The "local" neighborhood spans 95.5% of the range of each feature. The neighborhood is not local at all.
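The side-length calculation is one line of code; a quick sketch (assuming, as above, uniform data in the unit cube):

```python
# Side length of a hypercube neighborhood capturing a fraction f of
# data uniformly distributed in [0, 1]^d.
def neighborhood_side(f, d):
    return f ** (1.0 / d)

for d in [1, 2, 10, 100]:
    print(d, round(neighborhood_side(0.01, d), 3))
# → 0.01 in 1-D, 0.1 in 2-D, 0.631 in 10-D, 0.955 in 100-D
```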

This has a concrete consequence: in high dimensions, the distances from a query point to its nearest and farthest neighbors become nearly equal. Formally, for random points in $\mathbb{R}^d$:

$$\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \quad \text{as } d \to \infty$$

under mild conditions. When all distances are approximately equal, ranking by distance is meaningless.
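The concentration effect is easy to observe by simulation (the relative-contrast helper below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(d_max - d_min) / d_min for distances from the origin to n
    points drawn uniformly from [0, 1]^d."""
    dists = np.linalg.norm(rng.random((n, d)), axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast collapses as d grows: nearest and farthest become alike.
for d in [2, 10, 100, 1000]:
    print(d, round(relative_contrast(d), 2))
```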

Computational Considerations

The naive KNN classifier requires $O(nd)$ time per query: compute the distance from the query to all $n$ training points in $d$ dimensions. This is expensive for large datasets.

Data structures that speed up nearest neighbor search:

  • KD-trees: partition space along coordinate axes. Average query time $O(d \log n)$ in low dimensions, but degrades to $O(nd)$ when $d$ is large.
  • Ball trees: partition space using nested hyperspheres. Better than KD-trees in moderate dimensions.
  • Locality-sensitive hashing (LSH): approximate nearest neighbors in sublinear time. Trades exactness for speed.

Canonical Examples

Example

KNN on the iris dataset

The iris dataset has $n = 150$ points in $d = 4$ dimensions with $C = 3$ classes. KNN with $k = 5$ achieves about 97% accuracy. The low dimension means distance is meaningful, and the classes are well-separated. This is the ideal regime for KNN: moderate $n$, low $d$, clear structure.

Example

KNN fails on high-dimensional text

In text classification with bag-of-words features, $d$ can be 10,000+. Most documents end up nearly equidistant from any query document, so KNN with Euclidean distance performs poorly. Using cosine similarity or TF-IDF weighting partially alleviates this, but KNN is generally outperformed by methods like naive Bayes or SVMs in this regime.

Common Confusions

Watch Out

The choice of k is not about bias-variance in the usual parametric sense

Increasing $k$ does not add "model complexity": KNN fits no parameters. Instead, $k$ controls the smoothness of the decision boundary. Small $k$ gives a jagged boundary (high variance, low bias). Large $k$ gives a smooth boundary (low variance, high bias). But this is local smoothing, not parameter estimation.

Watch Out

KNN is sensitive to the distance metric, not just k

The choice of distance metric (Euclidean, Manhattan, Mahalanobis, cosine) profoundly affects KNN. Features on different scales make Euclidean distance dominated by the largest-scale feature. Always normalize features or use a learned metric.
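To see the scaling problem concretely, here is a sketch with one informative feature and one large-scale noise feature (the data and the helper `loo_knn_accuracy` are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Feature 0 separates the classes (scale ~1); feature 1 is pure noise
# on a much larger scale (~1000) and dominates Euclidean distance.
n = 200
y = np.repeat([0, 1], n // 2)
X = np.column_stack([
    np.where(y == 0, 0.0, 2.0) + rng.normal(0, 0.3, n),  # signal
    rng.normal(0, 1000, n),                              # noise
])

def loo_knn_accuracy(X, y, k=5):
    """Leave-one-out accuracy of k-NN with Euclidean distance."""
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        nn = np.argsort(d)[:k]
        correct += (y[nn].sum() * 2 > k) == y[i]
    return correct / len(X)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize features
print("raw:", loo_knn_accuracy(X, y))               # near chance (~0.5)
print("standardized:", loo_knn_accuracy(X_std, y))  # near 1.0
```

With raw features, neighbor ranking is determined almost entirely by the uninformative large-scale feature; standardizing restores the signal.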

Summary

  • KNN predicts by majority vote of the $k$ nearest training points
  • No training phase (lazy learning); all computation at prediction time
  • 1-NN asymptotic error is at most $2R^*$ (Cover and Hart)
  • Universal consistency: KNN converges to the Bayes optimal classifier if $k \to \infty$, $k/n \to 0$
  • Curse of dimensionality: in high $d$, all points are nearly equidistant
  • Naive prediction cost is $O(nd)$ per query; KD-trees and LSH help in low/moderate $d$

Exercises

ExerciseCore

Problem

If the Bayes risk is $R^* = 0.1$ for a binary classification problem, what is the upper bound on the asymptotic 1-NN error rate?

ExerciseCore

Problem

You have $n = 1000$ points in $d = 2$ dimensions and use $k = 31$. Does this choice of $k$ satisfy the conditions for universal consistency? What about $k = n = 1000$?

ExerciseAdvanced

Problem

Show quantitatively why KNN breaks in high dimensions. Compute the expected distance from the origin to its nearest neighbor among $n$ points drawn uniformly from the unit cube $[0,1]^d$. How does this scale with $d$?

References

Canonical:

  • Cover & Hart, "Nearest neighbor pattern classification" (1967)
  • Stone, "Consistent nonparametric regression" (1977)
  • Fix & Hodges, "Discriminatory analysis: nonparametric discrimination" (1951)

Current:

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 13

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 19

  • Bishop, Pattern Recognition and Machine Learning (2006), Section 2.5

Next Topics

The natural next steps from KNN:

Last reviewed: April 2026
