ML Methods
K-Nearest Neighbors
Classify by majority vote of the k closest training points: no training phase, universal consistency as n and k grow, and the curse of dimensionality that makes distance meaningless in high dimensions.
Why This Matters
K-nearest neighbors is the simplest nonparametric classifier: store all the training data, and at prediction time, find the closest points and take a majority vote. There is no training phase at all.
Despite this simplicity, KNN has a remarkable theoretical property: it is universally consistent. As the number of training points $n \to \infty$ and $k \to \infty$ with $k/n \to 0$, KNN converges to the Bayes optimal classifier. This means KNN can learn any decision boundary, given enough data.
The catch is the curse of dimensionality. In high dimensions, the notion of "nearest" breaks down because all points become approximately equidistant. Understanding this tradeoff between the theoretical beauty of consistency and the practical failure in high dimensions is essential.
Mental Model
Imagine the training points scattered in space, each colored by its class. To classify a new point, draw a small ball around it, find the nearest training points inside (or nearest to) that ball, and predict the majority class. As you get more data, the ball shrinks, and the local vote becomes a better estimate of the true class probability at that point.
Formal Setup and Notation
We have training data $(x_1, y_1), \ldots, (x_n, y_n)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, C\}$ (for $C$ classes). Let $P$ be the joint distribution over $(X, Y)$.
K-Nearest Neighbors Classifier
Given a query point $x$, let $x_{(1)}, \ldots, x_{(k)}$ be the $k$ training points closest to $x$ in Euclidean distance (or another metric), with corresponding labels $y_{(1)}, \ldots, y_{(k)}$. The KNN classifier predicts:

$$\hat{y}(x) = \arg\max_{c \in \{1, \ldots, C\}} \sum_{i=1}^{k} \mathbb{1}\!\left[y_{(i)} = c\right]$$

That is, predict the class that appears most frequently among the $k$ nearest neighbors.
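As a concrete sketch, the majority-vote rule can be written in a few lines of NumPy. This is a hypothetical brute-force helper for illustration, not an optimized implementation (the function name and signature are my own):

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict labels for X_query by majority vote over the k nearest
    training points under Euclidean distance (brute force, O(n d) per query)."""
    # Pairwise distances: one row per query point, one column per training point.
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query.
    nn_idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    nn_labels = y_train[nn_idx]
    # Majority vote per row (labels assumed to be non-negative integers).
    return np.array([np.bincount(row).argmax() for row in nn_labels])

X_train = np.array([[0.0], [0.1], [0.9], [1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.05], [0.95]]), k=3))
```

Note that there is nothing to fit: `knn_predict` receives the raw training data at query time, which is exactly the "lazy learning" property discussed below.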
Bayes Optimal Classifier
The Bayes optimal classifier predicts the class with the highest posterior probability:

$$h^*(x) = \arg\max_{c} P(Y = c \mid X = x)$$

Its error rate $R^* = P(h^*(X) \neq Y)$ is the Bayes risk, the lowest achievable error rate for any classifier.
Core Definitions
Lazy Learning
KNN is a lazy learner: it performs no computation during the training phase (just stores the data). All computation is deferred to prediction time. This contrasts with eager learners like SVMs or neural networks that build a model during training.
Curse of Dimensionality (for KNN)
In high dimensions, the volume of a ball of fixed radius changes exponentially with $d$. To capture a fixed fraction $p$ of data spread uniformly over $[0, 1]^d$, a ball needs radius $r \approx p^{1/d}$ (up to a dimension-dependent constant), which approaches 1 as $d$ grows. When $d$ is large, even a ball containing a small fraction of points has a radius close to the diameter of the entire dataset. All points become nearly equidistant, and the concept of "nearest" loses meaning.
Main Theorems
1-NN Error Bound (Cover and Hart)
Statement
Let $R_{1\mathrm{NN}}$ denote the asymptotic error rate of the 1-nearest neighbor classifier as $n \to \infty$. Then:

$$R^* \leq R_{1\mathrm{NN}} \leq R^*\left(2 - \frac{C}{C - 1}\, R^*\right)$$

where $R^*$ is the Bayes risk and $C$ is the number of classes. For binary classification ($C = 2$): $R_{1\mathrm{NN}} \leq 2R^*(1 - R^*) \leq 2R^*$.
Intuition
The 1-NN classifier effectively sees two independent draws from the distribution near each query point: the query itself and its nearest neighbor. It gets the answer right when both draws agree (which happens most of the time if $R^*$ is small). The factor of 2 arises because 1-NN uses one neighbor's label as a "noisy proxy" for the Bayes decision.
Proof Sketch
As $n \to \infty$, the nearest neighbor $x_{(1)} \to x$ almost surely (by density of the training set). So $y_{(1)}$ is approximately an independent draw from $P(Y \mid X = x)$. The 1-NN classifier errs when the nearest neighbor has the wrong label. Compute this probability using the conditional class probabilities $\eta_c(x) = P(Y = c \mid X = x)$ and bound it using the Bayes risk $R^*$.
Why It Matters
This theorem says 1-NN is at most twice as bad as the Bayes optimal classifier asymptotically. For a method with zero parameters and zero assumptions about the data distribution, this is surprisingly good. It establishes KNN as a strong baseline.
Failure Mode
The bound is asymptotic ($n \to \infty$). In finite samples and high dimensions, 1-NN can be far worse than $2R^*$ because the nearest neighbor may not actually be "near" in any meaningful sense.
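The Cover–Hart bound can be checked empirically in a regime where the asymptotics kick in quickly: a one-dimensional problem with a known posterior. The simulation below is a sketch under assumptions I chose (the posterior shape, sample sizes, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary problem with known posterior:
# X ~ Uniform[0, 1], P(Y=1 | X=x) = 0.8 if x < 0.5 else 0.2.
# The Bayes risk is R* = 0.2, so Cover-Hart predicts an asymptotic
# 1-NN error of at most 2 R* (1 - R*) = 0.32.
def sample(n):
    X = rng.uniform(0, 1, n)
    Y = (rng.uniform(0, 1, n) < np.where(X < 0.5, 0.8, 0.2)).astype(int)
    return X, Y

X_tr, Y_tr = sample(2000)
X_te, Y_te = sample(2000)

# 1-NN prediction in 1-D by brute force.
nn = np.abs(X_te[:, None] - X_tr[None, :]).argmin(axis=1)
err_1nn = np.mean(Y_tr[nn] != Y_te)

bayes_risk = 0.2
print(f"1-NN error: {err_1nn:.3f}  "
      f"(Bayes risk: {bayes_risk}, Cover-Hart ceiling: {2 * bayes_risk * (1 - bayes_risk):.2f})")
```

The measured error should land between the Bayes risk and the ceiling, illustrating both halves of the bound.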
Universal Consistency of KNN (Stone's Theorem)
Statement
If $k = k_n$ satisfies $k_n \to \infty$ and $k_n / n \to 0$ as $n \to \infty$, then the $k$-NN classifier $\hat{h}_n$ is universally consistent:

$$R(\hat{h}_n) \to R^*$$

as $n \to \infty$, for any underlying distribution $P$.
Intuition
As $n$ grows, the $k$ nearest neighbors all converge to the query point (because $k/n \to 0$, so the neighborhood shrinks). The majority vote over $k$ labels then estimates $P(Y = c \mid X = x)$ accurately (because $k \to \infty$ gives the law of large numbers). The combination drives the error to the Bayes rate.
Proof Sketch
Stone (1977) proved a general result: any weighted local averaging rule is universally consistent if the weights satisfy certain conditions (they sum to 1, each weight goes to 0, and the weights concentrate near the query point). KNN with $k_n \to \infty$ and $k_n / n \to 0$ satisfies these conditions. The proof uses the fact that the nearest-neighbor distance shrinks to zero while the number of neighbors grows, ensuring both locality and averaging.
Why It Matters
Universal consistency means KNN can learn any decision boundary given enough data, with no assumptions on the data distribution. This is a property that parametric models (like logistic regression) do not have. It places KNN among the theoretically most powerful classifiers.
Failure Mode
Universal consistency is an asymptotic property. The rate of convergence can be extremely slow in high dimensions. The sample size needed to achieve a given accuracy grows exponentially with dimension (curse of dimensionality).
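Consistency can be observed directly by growing $n$ with $k = \sqrt{n}$ (which satisfies $k \to \infty$, $k/n \to 0$) and watching the test error approach the Bayes risk. A sketch, with a synthetic problem and sample sizes of my choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

# P(Y=1 | X=x) = 0.9 if x < 0.5 else 0.1, X uniform on [0, 1].
# Bayes risk R* = 0.1. With k = sqrt(n), the k-NN error should
# approach 0.1 as n grows.
def sample(n):
    X = rng.uniform(0, 1, n)
    Y = (rng.uniform(0, 1, n) < np.where(X < 0.5, 0.9, 0.1)).astype(int)
    return X, Y

def knn_error(n_train, n_test=2000):
    k = int(np.sqrt(n_train))
    X_tr, Y_tr = sample(n_train)
    X_te, Y_te = sample(n_test)
    # Indices of the k nearest training points for each test point.
    idx = np.abs(X_te[:, None] - X_tr[None, :]).argpartition(k - 1, axis=1)[:, :k]
    votes = (Y_tr[idx].mean(axis=1) >= 0.5).astype(int)
    return np.mean(votes != Y_te)

errors = {n: knn_error(n) for n in (50, 500, 5000)}
print(errors)  # error drifts toward the Bayes risk of 0.1 as n grows
```

The convergence here is fast only because $d = 1$; in higher dimensions the same experiment needs exponentially more data, which is the curse of dimensionality discussed next.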
The Curse of Dimensionality in Detail
Consider $n$ points uniformly distributed in the unit cube $[0, 1]^d$. To capture a fraction $p$ of the data in a hypercube neighborhood, the side length must be $s = p^{1/d}$. For $p = 0.01$ (1% of data) in $d = 100$ dimensions, the side length is $0.01^{1/100} \approx 0.955$. The "local" neighborhood spans 95.5% of the range of each feature. The neighborhood is not local at all.
This has a concrete consequence: in high dimensions, the distance from a query point to its nearest neighbor and to its farthest neighbor become nearly equal. Formally, for $n$ random points in $[0, 1]^d$:

$$\frac{\mathrm{dist}_{\max} - \mathrm{dist}_{\min}}{\mathrm{dist}_{\min}} \to 0 \quad \text{in probability as } d \to \infty$$

under mild conditions. When all distances are approximately equal, ranking by distance is meaningless.
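Distance concentration is easy to see numerically. The following sketch (dimensions and sample size chosen arbitrarily) measures the max/min distance ratio from a random query to uniform points:

```python
import numpy as np

rng = np.random.default_rng(2)

# Ratio of farthest to nearest distance from a random query point to
# n = 1000 points uniform in [0, 1]^d. As d grows, the ratio approaches 1:
# all points become nearly equidistant.
ratios = {}
for d in (2, 10, 100, 1000):
    pts = rng.uniform(0, 1, (1000, d))
    q = rng.uniform(0, 1, d)
    dists = np.linalg.norm(pts - q, axis=1)
    ratios[d] = dists.max() / dists.min()
    print(f"d = {d:4d}: max/min distance ratio = {ratios[d]:.2f}")
```

In low dimensions the nearest neighbor is dramatically closer than the farthest point; by $d = 1000$ the ratio is close to 1 and "nearest" carries almost no information.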
Computational Considerations
The naive KNN classifier requires $O(nd)$ time per query: compute the distance from the query to all $n$ training points in $d$ dimensions. This is expensive for large datasets.
Data structures that speed up nearest neighbor search:
- KD-trees: partition space along coordinate axes. $O(\log n)$ average query time in low dimensions, but degrades to brute-force $O(n)$ behavior when $d$ is large.
- Ball trees: partition space using nested hyperspheres. Better than KD-trees in moderate dimensions.
- Locality-sensitive hashing (LSH): approximate nearest neighbors in sublinear time. Trades exactness for speed.
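As a usage sketch, SciPy's `cKDTree` (assuming SciPy is available) builds a KD-tree once and then answers k-nearest-neighbor queries without scanning all points; the data shapes and sizes below are arbitrary:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 3))       # low-dimensional: KD-trees work well here
queries = rng.standard_normal((10, 3))

tree = cKDTree(X)                        # build once
dist, idx = tree.query(queries, k=5)     # 5 nearest neighbors per query point

print(dist.shape, idx.shape)             # one row of distances/indices per query
```

The tree pays an up-front construction cost in exchange for fast repeated queries, which fits KNN's lazy-learning profile: all the work still happens at prediction time, just more efficiently.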
Canonical Examples
KNN on the iris dataset
The iris dataset has $n = 150$ points in $d = 4$ dimensions with $C = 3$ classes. KNN with a small $k$ achieves about 97% accuracy. The low dimension means distance is meaningful, and the classes are well separated. This is the ideal regime for KNN: moderate $n$, low $d$, clear structure.
KNN fails on high-dimensional text
In text classification with bag-of-words features, $d$ can be 10,000+. Most documents end up nearly equidistant from any query document. KNN with Euclidean distance performs poorly. Using cosine similarity or TF-IDF weighting partially alleviates this, but KNN is generally outperformed by methods like naive Bayes or SVMs in this regime.
Common Confusions
The choice of k is not about bias-variance in the usual parametric sense
Increasing $k$ does not add "model complexity": KNN has no parameters to fit. Instead, $k$ controls the smoothness of the decision boundary. Small $k$ gives a jagged boundary (high variance, low bias). Large $k$ gives a smooth boundary (low variance, high bias). But this is local smoothing, not parameter estimation.
KNN is sensitive to the distance metric, not just k
The choice of distance metric (Euclidean, Manhattan, Mahalanobis, cosine) profoundly affects KNN. Features on different scales make Euclidean distance dominated by the largest-scale feature. Always normalize features or use a learned metric.
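The scale-dominance problem is easy to demonstrate. In the sketch below (feature ranges and sizes are my own choices), one feature spans $[0, 1]$ and another spans $[0, 1000]$; without normalization, Euclidean distance is driven almost entirely by the second feature:

```python
import numpy as np

rng = np.random.default_rng(4)

# Feature 0 lives in [0, 1], feature 1 in [0, 1000].
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
q = np.array([0.5, 500.0])

# Squared contribution of each feature to the distances: feature 1 dominates.
contrib = (X - q) ** 2
print("mean squared contribution per feature:", contrib.mean(axis=0))

# Z-score each feature (in practice, fit mean/std on training data only).
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xz, qz = (X - mu) / sigma, (q - mu) / sigma

raw_nn = np.linalg.norm(X - q, axis=1).argmin()
scaled_nn = np.linalg.norm(Xz - qz, axis=1).argmin()
print("nearest neighbor before/after scaling:", raw_nn, scaled_nn)
```

After z-scoring, both features have unit variance and contribute comparably, so the nearest neighbor can change; the same reasoning motivates Mahalanobis or learned metrics.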
Summary
- KNN predicts by majority vote of the nearest training points
- No training phase (lazy learning); all computation at prediction time
- 1-NN asymptotic error is at most $2R^*$, twice the Bayes risk (Cover and Hart)
- Universal consistency: KNN converges to the Bayes optimal classifier if $k \to \infty$ and $k/n \to 0$
- Curse of dimensionality: in high $d$, all points are nearly equidistant
- Naive prediction cost is $O(nd)$ per query; KD-trees and LSH help in low/moderate $d$
Exercises
Problem
If the Bayes risk is $R^*$ for a binary classification problem, what is the upper bound on the asymptotic 1-NN error rate?
Problem
You have $n$ training points in $d$ dimensions and use $k = \sqrt{n}$. Does this choice of $k$ satisfy the conditions for universal consistency? What about a fixed $k$ that does not grow with $n$?
Problem
Show quantitatively why KNN breaks in high dimensions. Compute the expected distance from the origin to its nearest neighbor among $n$ points drawn uniformly from the unit cube $[0, 1]^d$. How does this scale with $d$?
References
Canonical:
- Cover & Hart, "Nearest neighbor pattern classification" (1967)
- Stone, "Consistent nonparametric regression" (1977)
- Fix & Hodges, "Discriminatory analysis: nonparametric discrimination" (1951)
Current:
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 13
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 19
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 2
Next Topics
The natural next steps from KNN:
- Bias-variance tradeoff: understanding the effect of $k$ on KNN error
- Decision trees and ensembles: another nonparametric method with very different properties
Last reviewed: April 2026