Skip to main content

ML Methods

K-Nearest Neighbors

Classify by majority vote of the k closest training points: no training phase, universal consistency as n and k grow, and the curse of dimensionality that makes distance meaningless in high dimensions.

CoreTier 2StableSupporting~40 min

Why This Matters

K-nearest neighbors is the simplest nonparametric classifier: store all the training data, and at prediction time, find the kk closest points and take a majority vote. There is no training phase at all.

Despite this simplicity, KNN has a remarkable theoretical property: it is universally consistent. As the number of training points nn \to \infty and kk \to \infty with k/n0k/n \to 0, KNN converges to the Bayes optimal classifier. This means KNN can learn any decision boundary, given enough data.

The catch is the curse of dimensionality. In high dimensions, the notion of "nearest" breaks down because all points become approximately equidistant. Understanding this tradeoff between the theoretical beauty of consistency and the practical failure in high dimensions is essential.

Mental Model

Imagine the training points scattered in space, each colored by its class. To classify a new point, draw a small ball around it, find the kk nearest training points inside (or nearest to) that ball, and predict the majority class. As you get more data, the ball shrinks, and the local vote becomes a better estimate of the true class probability at that point.

Formal Setup and Notation

We have training data {(x1,y1),,(xn,yn)}\{(x_1, y_1), \ldots, (x_n, y_n)\} with xiRdx_i \in \mathbb{R}^d and yi{1,,C}y_i \in \{1, \ldots, C\} (for CC classes). Let D\mathcal{D} be the joint distribution over (X,Y)(X, Y).

Definition

K-Nearest Neighbors Classifier

Given a query point xx, let x(1),x(2),,x(k)x_{(1)}, x_{(2)}, \ldots, x_{(k)} be the kk training points closest to xx in Euclidean distance (or another metric), with corresponding labels y(1),,y(k)y_{(1)}, \ldots, y_{(k)}. The KNN classifier predicts:

y^(x)=argmaxc{1,,C}j=1k1[y(j)=c]\hat{y}(x) = \arg\max_{c \in \{1,\ldots,C\}} \sum_{j=1}^{k} \mathbf{1}[y_{(j)} = c]

That is, predict the class that appears most frequently among the kk nearest neighbors.

Definition

Bayes Optimal Classifier

The Bayes optimal classifier predicts the class with the highest posterior probability:

h(x)=argmaxcP(Y=cX=x)h^*(x) = \arg\max_c \, P(Y = c \mid X = x)

Its error rate R=E[1[h(X)Y]]R^* = \mathbb{E}[\mathbf{1}[h^*(X) \neq Y]] is the Bayes risk, the lowest achievable error rate for any classifier.

Core Definitions

Definition

Lazy Learning

KNN is a lazy learner: it performs no computation during the training phase (just stores the data). All computation is deferred to prediction time. This contrasts with eager learners like SVMs or neural networks that build a model during training.

Definition

Curse of Dimensionality (for KNN)

In high dimensions, the volume of a ball grows exponentially with dd. To capture a fixed fraction ff of the data in a ball of radius rr in Rd\mathbb{R}^d, we need rf1/dr \propto f^{1/d}. When dd is large, even a ball containing a small fraction of points has a radius close to the diameter of the entire dataset. All points become nearly equidistant, and the concept of "nearest" loses meaning.

Main Theorems

Theorem

1-NN Error Bound (Cover and Hart)

Statement

Let R1NNR_{1\text{NN}} denote the asymptotic error rate of the 1-nearest neighbor classifier as nn \to \infty. Then:

RR1NN2RCC1(R)22RR^* \leq R_{1\text{NN}} \leq 2R^* - \frac{C}{C-1}(R^*)^2 \leq 2R^*

where RR^* is the Bayes risk and CC is the number of classes. For binary classification (C=2C = 2): R1NN2R(1R)R_{1\text{NN}} \leq 2R^*(1 - R^*).

Intuition

The 1-NN classifier effectively sees two independent draws from the distribution near each query point: the query itself and its nearest neighbor. It gets the answer right when both draws agree (which happens most of the time if RR^* is small). The factor of 2 arises because 1-NN uses one neighbor's label as a "noisy proxy" for the Bayes decision.

Proof Sketch

As nn \to \infty, the nearest neighbor x(1)xx_{(1)} \to x almost surely (by density of the training set). So Y(1)Y_{(1)} is approximately an independent draw from P(YX=x)P(Y | X = x). The 1-NN classifier errs when the nearest neighbor has the wrong label. Compute this probability using the conditional class probabilities and bound using the Bayes risk.

Why It Matters

This theorem says 1-NN is at most twice as bad as the Bayes optimal classifier asymptotically. For a method with zero parameters and zero assumptions about the data distribution, this is surprisingly good. It establishes KNN as a strong baseline.

Failure Mode

The bound is asymptotic (nn \to \infty). In finite samples and high dimensions, 1-NN can be far worse than 2R2R^* because the nearest neighbor may not actually be "near" in any meaningful sense.

Theorem

Universal Consistency of KNN (Stone's Theorem)

Statement

If k=k(n)k = k(n) satisfies k(n)k(n) \to \infty and k(n)/n0k(n)/n \to 0 as nn \to \infty, then the kk-NN classifier is universally consistent:

RkNNRR_{k\text{NN}} \to R^*

as nn \to \infty, for any underlying distribution D\mathcal{D}.

Intuition

As nn grows, the kk nearest neighbors all converge to the query point (because k/n0k/n \to 0, so the neighborhood shrinks). The majority vote over kk labels then estimates P(Y=cX=x)P(Y = c | X = x) accurately (because kk \to \infty gives the law of large numbers). The combination drives the error to the Bayes rate.

Proof Sketch

Stone (1977) proved a general result: any weighted local averaging rule is universally consistent if the weights satisfy certain conditions (they sum to 1, each weight goes to 0, and the weights concentrate near the query point). KNN with kk \to \infty, k/n0k/n \to 0 satisfies these conditions. The proof uses the fact that the nearest-neighbor distance shrinks to zero while the number of neighbors grows, ensuring both locality and averaging.

Why It Matters

Universal consistency means KNN can learn any decision boundary given enough data, with no assumptions on the data distribution. This is a property that parametric models (like logistic regression) do not have. It places KNN among the theoretically most powerful classifiers.

Failure Mode

Universal consistency is an asymptotic property. The rate of convergence can be extremely slow in high dimensions. The sample size needed to achieve a given accuracy grows exponentially with dimension dd (curse of dimensionality).

The Curse of Dimensionality in Detail

Consider points uniformly distributed in the unit cube [0,1]d[0,1]^d. To capture a fraction ff of the data in a hypercube neighborhood, the side length must be f1/df^{1/d}. For f=0.01f = 0.01 (1% of data) in d=100d = 100 dimensions, the side length is 0.011/1000.9550.01^{1/100} \approx 0.955. The "local" neighborhood spans 95.5% of the range of each feature. The neighborhood is not local at all.

This has a concrete consequence: in high dimensions, the distance from a query point to its nearest neighbor and its farthest neighbor become nearly equal. Formally, for random points in Rd\mathbb{R}^d:

dmaxdmindmin0as d\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \quad \text{as } d \to \infty

under mild conditions. When all distances are approximately equal, ranking by distance is meaningless.

Computational Considerations

The naive KNN classifier requires O(nd)O(nd) time per query: compute the distance from the query to all nn training points in dd dimensions. This is expensive for large datasets.

Data structures that speed up nearest neighbor search:

  • KD-trees: partition space along coordinate axes. Average query time O(dlogn)O(d \log n) in low dimensions, but degrades to O(nd)O(nd) when dd is large.
  • Ball trees: partition space using nested hyperspheres. Better than KD-trees in moderate dimensions.
  • Locality-sensitive hashing (LSH) (Indyk & Motwani, 1998): approximate nearest neighbors in sublinear time. Trades exactness for speed.

Modern Vector Search

Classical KNN with exact Euclidean distance is rarely used directly beyond a few hundred thousand points. Modern retrieval systems, including semantic search and embeddings pipelines and multimodal RAG, run approximate KNN ("ANN") over dense embedding vectors. Two dominant families:

  • Graph-based (HNSW): Hierarchical Navigable Small World graphs (Malkov & Yashunin, arXiv:1603.09320, TPAMI 2020). Hierarchy of proximity graphs; queries greedily descend. Default in FAISS, Milvus, Qdrant, Weaviate.
  • Quantization-based (IVF-PQ, ScaNN): inverted file index with product quantization (Jégou, Douze, Schmid, TPAMI 2011) and anisotropic vector quantization (Guo et al., ScaNN, ICML 2020). Scales to billion-vector corpora.

These methods target approximate recall@k rather than the exact nearest neighbor. Their theoretical behavior differs from classical KNN: consistency results for Stone-style averaging do not translate directly, and the curse of dimensionality re-emerges when embedding manifolds are nearly isotropic. The teaching connection is direct: every vector-DB query is a descendant of the KNN rule, with an index layer traded for speed.

Canonical Examples

Example

KNN on the iris dataset

The iris dataset has n=150n = 150 points in d=4d = 4 dimensions with C=3C = 3 classes. KNN with k=5k = 5 achieves about 97% accuracy. The low dimension means distance is meaningful, and the classes are well-separated. This is the ideal regime for KNN: moderate nn, low dd, clear structure.

Example

KNN fails on high-dimensional text

In text classification with bag-of-words features, dd can be 10,000+. Most documents have similar cosine distances to any query document. KNN with Euclidean distance performs poorly. Using cosine similarity or TF-IDF weighting partially alleviates this, but KNN is generally outperformed by methods like naive Bayes or SVMs in this regime.

Common Confusions

Watch Out

The choice of k is not about bias-variance in the usual parametric sense

Increasing kk does not add "model complexity". KNN has no parameters. Instead, kk controls the smoothness of the decision boundary. Small kk gives a jagged boundary (high variance, low bias). Large kk gives a smooth boundary (low variance, high bias). But this is local smoothing, not parameter estimation.

Watch Out

KNN is sensitive to the distance metric, not just k

The choice of distance metric (Euclidean, Manhattan, Mahalanobis, cosine) profoundly affects KNN. Features on different scales make Euclidean distance dominated by the largest-scale feature. Always normalize features or use a learned metric.

Summary

  • KNN predicts by majority vote of the kk nearest training points
  • No training phase (lazy learning); all computation at prediction time
  • 1-NN asymptotic error is at most 2R2R^* (Cover and Hart)
  • Universal consistency: KNN converges to Bayes optimal if kk \to \infty, k/n0k/n \to 0
  • Curse of dimensionality: in high dd, all points are equidistant
  • Naive prediction cost is O(nd)O(nd) per query; KD-trees and LSH help in low/moderate dd
Optional Deeper DetailDiscriminant adaptive nearest neighbors (DANN): learning a local metricShow

Advanced section adapted from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd ed., Springer 2009), §13.4.1 "Discriminant Adaptive Nearest Neighbors," pp. 475-478, and Hastie, T. and Tibshirani, R. (1996), "Discriminant Adaptive Nearest Neighbor Classification," IEEE TPAMI 18(6), 607-616.

Euclidean KNN treats every coordinate direction as equally informative. When classes have different within-class variances or when discriminative directions are not axis-aligned, Euclidean distance is the wrong metric for classification. DANN learns a local Mahalanobis-style metric that stretches distance in directions where the class signal is weak and compresses it in directions of strong class separation.

Local-LDA setup. At a query point x0x_0, compute weighted within-class and between-class scatter matrices using only the points in a local neighborhood (initial Euclidean k-NN of x0x_0). Let

W  =  c=1CπcΣc,B  =  c=1Cπc(μcμ)(μcμ)W \;=\; \sum_{c=1}^C \pi_c \, \Sigma_c, \qquad B \;=\; \sum_{c=1}^C \pi_c (\mu_c - \mu)(\mu_c - \mu)^\top

be the local pooled within-class covariance and the local between-class covariance, with πc\pi_c the local class proportions, μc\mu_c the local class means, and μ\mu the local overall mean.

The DANN metric at x0x_0 is

ΣDANN(x0)  =  W1/2[W1/2BW1/2+εI]W1/2\Sigma_{\text{DANN}}(x_0) \;=\; W^{-1/2}\, [W^{-1/2} B W^{-1/2} + \varepsilon I] \, W^{-1/2}

where ε\varepsilon is a small ridge regularizer (typically ε=1\varepsilon = 1). The DANN distance between x0x_0 and a candidate neighbor xx is

dDANN(x0,x)2  =  (xx0)ΣDANN(x0)(xx0).d_{\text{DANN}}(x_0, x)^2 \;=\; (x - x_0)^\top \Sigma_{\text{DANN}}(x_0) (x - x_0).

The construction has two graduate-grade ideas in one formula:

  1. Sphering by W1/2W^{-1/2}: in the sphered space, the within-class covariance is identity. Distance comparisons no longer favor low-variance directions just because they are tight; the metric now reflects true within-class proximity.

  2. Adding W1/2BW1/2W^{-1/2} B W^{-1/2}: in the sphered space, this is the eigen-decomposition of the local LDA discriminant directions. Adding it stretches the metric along between-class directions, so a step orthogonal to the class boundary counts more than a step along it.

The εI\varepsilon I regularizer keeps the metric well-conditioned when BB is rank-deficient (e.g., two classes in d>1d > 1 dim makes BB rank-1).

Algorithm.

  1. Find initial Euclidean k-NN of x0x_0 (e.g., k=50k = 50).
  2. Compute local WW and BB from those points.
  3. Form ΣDANN(x0)\Sigma_{\text{DANN}}(x_0).
  4. Re-rank the candidates by DANN distance and take the top kk for the final vote.

The local matrices are recomputed at every query, so DANN is more expensive than vanilla KNN. The payoff: ESL §13.4.1 reports substantial accuracy gains on classification tasks where Euclidean KNN underperforms LDA but the underlying boundary is non-linear (so global LDA also underperforms).

Connection to deep learning. Learned distance metrics for retrieval (siamese networks, triplet losses, metric learning) are the modern descendants of DANN. The DANN insight survives: a metric tuned to the local class structure outperforms a global metric tuned to the average.

Exercises

ExerciseCore

Problem

If the Bayes risk is R=0.1R^* = 0.1 for a binary classification problem, what is the upper bound on the asymptotic 1-NN error rate?

ExerciseCore

Problem

You have n=1000n = 1000 points in d=2d = 2 dimensions and use k=31k = 31. Does this choice of kk satisfy the conditions for universal consistency? What about k=n=1000k = n = 1000?

ExerciseAdvanced

Problem

Show quantitatively why KNN breaks in high dimensions. Compute the expected distance from the origin to its nearest neighbor among nn points drawn uniformly from the unit cube [0,1]d[0,1]^d. How does this scale with dd?

References

Canonical:

  • Fix & Hodges, "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties," USAF School of Aviation Medicine, Report 4 (1951); reprinted in Int. Statistical Review 57(3), 238-247 (1989).
  • Cover & Hart, "Nearest Neighbor Pattern Classification," IEEE Trans. Inf. Theory 13(1), 21-27 (1967). The R1NN2RR_{1\text{NN}} \leq 2R^* bound.
  • Stone, "Consistent Nonparametric Regression," Annals of Statistics 5(4), 595-620 (1977). Universal consistency of weighted local averaging, specialized to k-NN.

Curse of dimensionality:

  • Beyer, Goldstein, Ramakrishnan, Shaft, "When Is 'Nearest Neighbor' Meaningful?," ICDT 1999, 217-235. Formalizes the dmax/dmin1d_{\max}/d_{\min} \to 1 distance-concentration result.

Approximate nearest neighbors:

  • Indyk & Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," STOC 1998, 604-613. Locality-sensitive hashing.
  • Malkov & Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE TPAMI 42(4), 824-836 (2020); arXiv:1603.09320. HNSW.
  • Johnson, Douze, Jégou, "Billion-scale similarity search with GPUs," IEEE Trans. Big Data 7(3), 535-547 (2021); arXiv:1702.08734. FAISS.
  • Guo, Sun, Lindgren, Geng, Simcha, Chern, Kumar, "Accelerating Large-Scale Inference with Anisotropic Vector Quantization," ICML 2020. ScaNN.

Textbook:

  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. §2.3.2 (k-NN as a local prediction rule), §13.3 (k-NN classifiers, asymptotic bound, finite-sample rate), §13.3.2 (Cover-Hart and improvements), §13.4 (adaptive nearest neighbors including DANN), §13.5 (computational considerations including edited and condensed nearest neighbors).
  • Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition (Springer 1996), Chapters 5-6, 11. The definitive treatment of k-NN consistency rates.
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (Cambridge 2014), Chapter 19.

Next Topics

The natural next steps from KNN:

Last reviewed: April 13, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.

Required prerequisites

2

Derived topics

3