K-Nearest Neighbors

Sneiderman, Robby

Classify by majority vote of the k closest training points: no training phase, universal consistency as n and k grow, and the curse of dimensionality that makes distance meaningless in high dimensions.

Prerequisites

Common Probability Distributions, Order Statistics

Why This Matters

K-nearest neighbors is the simplest nonparametric classifier: store all the training data, and at prediction time, find the $k$ closest points and take a majority vote. There is no training phase at all.

Despite this simplicity, KNN has a remarkable theoretical property: it is universally consistent. As the number of training points $n \to \infty$ and $k \to \infty$ with $k/n \to 0$ , KNN converges to the Bayes optimal classifier. This means KNN can learn any decision boundary, given enough data.

The catch is the curse of dimensionality. In high dimensions, the notion of "nearest" breaks down because all points become approximately equidistant. Understanding this tradeoff between the theoretical beauty of consistency and the practical failure in high dimensions is essential.

Mental Model

Imagine the training points scattered in space, each colored by its class. To classify a new point, grow a ball around it until it contains the $k$ nearest training points, and predict the majority class among them. As you get more data, the ball shrinks, and the local vote becomes a better estimate of the true class probabilities at that point.

Formal Setup and Notation

We have training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, C\}$ (for $C$ classes). Let $\mathcal{D}$ be the joint distribution over $(X, Y)$ .

Definition

K-Nearest Neighbors Classifier

Given a query point $x$ , let $x_{(1)}, x_{(2)}, \ldots, x_{(k)}$ be the $k$ training points closest to $x$ in Euclidean distance (or another metric), with corresponding labels $y_{(1)}, \ldots, y_{(k)}$ . The KNN classifier predicts:

$\hat{y}(x) = \arg\max_{c \in \{1,\ldots,C\}} \sum_{j=1}^{k} \mathbf{1}[y_{(j)} = c]$

That is, predict the class that appears most frequently among the $k$ nearest neighbors.
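The definition translates almost line for line into code. The following is a minimal sketch using only the Python standard library; the function name `knn_predict`, the toy data, and the tie-breaking rule (smallest label wins) are assumptions made for illustration, since the definition above does not specify how ties are resolved.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k training points nearest to query."""
    # Rank training indices by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    # Count the labels of the k nearest neighbors.
    votes = Counter(train_y[i] for i in order[:k])
    # Assumption: break ties in favor of the smallest label (an arbitrary choice).
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# Hypothetical toy data: two well-separated clusters with labels 1 and 2.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [1, 1, 1, 2, 2, 2]
print(knn_predict(train_x, train_y, (0.15, 0.15), k=3))
print(knn_predict(train_x, train_y, (0.95, 1.0), k=3))
```

Note that there is no fitting step: the "model" is the stored pair `train_x`, `train_y`, and all the work happens inside the call.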

Definition

Bayes Optimal Classifier

The Bayes optimal classifier predicts the class with the highest posterior probability:

$h^*(x) = \arg\max_c \, P(Y = c \mid X = x)$

Its error rate $R^* = \mathbb{E}[\mathbf{1}[h^*(X) \neq Y]]$ is the Bayes risk, the lowest achievable error rate for any classifier.

Core Definitions

Definition

Lazy Learning

KNN is a lazy learner: it performs no computation during the training phase (just stores the data). All computation is deferred to prediction time. This contrasts with eager learners like SVMs or neural networks that build a model during training.

Definition

Curse of Dimensionality (for KNN)

In $\mathbb{R}^d$, the volume of a ball of radius $r$ scales as $r^d$. To capture a fixed fraction $f$ of uniformly distributed data in a ball of radius $r$, we therefore need $r \propto f^{1/d}$, which approaches the full extent of the data as $d$ grows. When $d$ is large, even a ball containing a small fraction of the points has a radius comparable to the diameter of the entire dataset. All points become nearly equidistant, and the concept of "nearest" loses meaning.

Main Theorems

Theorem

1-NN Error Bound (Cover and Hart)

Statement

Let $R_{1\text{NN}}$ denote the asymptotic error rate of the 1-nearest neighbor classifier as $n \to \infty$ . Then:

$R^* \leq R_{1\text{NN}} \leq 2R^* - \frac{C}{C-1}(R^*)^2 \leq 2R^*$

where $R^*$ is the Bayes risk and $C$ is the number of classes. For binary classification ( $C = 2$ ): $R_{1\text{NN}} \leq 2R^*(1 - R^*)$ .
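To get a feel for how tight the bound is, it can help to evaluate it for a few values of $R^*$. This is a small sketch; the function name `cover_hart_upper_bound` and the sample values are chosen here for illustration.

```python
def cover_hart_upper_bound(bayes_risk, num_classes):
    """Asymptotic 1-NN upper bound 2R* - C/(C-1) * (R*)^2 from the statement above."""
    c = num_classes
    return 2 * bayes_risk - (c / (c - 1)) * bayes_risk ** 2

# For small Bayes risk the bound is close to 2R*; the quadratic term matters as R* grows.
for r_star in (0.01, 0.05, 0.2):
    print(r_star, cover_hart_upper_bound(r_star, num_classes=2))
```

For binary problems the bound equals $0.5$ at $R^* = 0.5$, consistent with the fact that no classifier can be worse than chance on a problem where the labels carry no information.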

Intuition

The 1-NN classifier effectively sees two independent draws from the distribution near each query point: the query itself and its nearest neighbor. It gets the answer right when both draws agree (which happens most of the time if $R^*$ is small). The factor of 2 arises because 1-NN uses one neighbor's label as a "noisy proxy" for the Bayes decision.

Proof Sketch

As $n \to \infty$, the nearest neighbor $x_{(1)} \to x$ almost surely (the training points become dense around any $x$ in the support of the distribution). So $Y_{(1)}$ is approximately an independent draw from $P(Y \mid X = x)$. The 1-NN classifier errs when the neighbor's label differs from the query's true label, that is, when two independent draws from $P(Y \mid X = x)$ disagree. Compute this probability from the conditional class probabilities and bound it in terms of the Bayes risk.
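For the binary case, the computation in the sketch can be written out explicitly. The symbols $\eta(x)$ and $r^*(x)$ are notation introduced here for illustration and are not used elsewhere on this page. Write $\eta(x) = P(Y = 1 \mid X = x)$ and $r^*(x) = \min\{\eta(x),\, 1 - \eta(x)\}$, so that $R^* = \mathbb{E}[r^*(X)]$. In the limit, the query label and the neighbor label behave like two independent draws from $P(Y \mid X = x)$, and the 1-NN rule errs when they disagree:

$P(Y \neq Y_{(1)} \mid X = x) \;=\; 2\,\eta(x)\,(1 - \eta(x)) \;=\; 2\, r^*(x)\,(1 - r^*(x))$

Taking expectations over $X$ and using $\mathbb{E}[r^*(X)^2] \geq (\mathbb{E}[r^*(X)])^2 = (R^*)^2$:

$R_{1\text{NN}} \;=\; 2R^* - 2\,\mathbb{E}[r^*(X)^2] \;\leq\; 2R^* - 2(R^*)^2 \;=\; 2R^*(1 - R^*)$

which matches the $C = 2$ case of the statement.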

Why It Matters

This theorem says 1-NN is at most twice as bad as the Bayes optimal classifier asymptotically. For a method with zero parameters and zero assumptions about the data distribution, this is surprisingly good. It establishes KNN as a strong baseline.

Failure Mode

The bound is asymptotic ( $n \to \infty$ ). In finite samples and high dimensions, 1-NN can be far worse than $2R^*$ because the nearest neighbor may not actually be "near" in any meaningful sense.

Theorem

Universal Consistency of KNN (Stone's Theorem)

Statement

If $k = k(n)$ satisfies $k(n) \to \infty$ and $k(n)/n \to 0$ as $n \to \infty$ , then the $k$ -NN classifier is universally consistent:

$R_{k\text{NN}} \to R^*$

as $n \to \infty$ , for any underlying distribution $\mathcal{D}$ .

Intuition

As $n$ grows, the $k$ nearest neighbors all converge to the query point (because $k/n \to 0$ , so the neighborhood shrinks). The majority vote over $k$ labels then estimates $P(Y = c | X = x)$ accurately (because $k \to \infty$ gives the law of large numbers). The combination drives the error to the Bayes rate.

Proof Sketch

Stone (1977) proved a general result: any weighted local averaging rule is universally consistent if the weights satisfy certain conditions (they sum to 1, each weight goes to 0, and the weights concentrate near the query point). KNN with $k \to \infty$ , $k/n \to 0$ satisfies these conditions. The proof uses the fact that the nearest-neighbor distance shrinks to zero while the number of neighbors grows, ensuring both locality and averaging.

Why It Matters

Universal consistency means KNN can learn any decision boundary given enough data, with no assumptions on the data distribution. This is a property that parametric models (like logistic regression) do not have. It places KNN among the theoretically most powerful classifiers.

Failure Mode

Universal consistency is an asymptotic property. The rate of convergence can be extremely slow in high dimensions. The sample size needed to achieve a given accuracy grows exponentially with dimension $d$ (curse of dimensionality).
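In low dimensions the convergence can be observed directly. The sketch below is an illustrative simulation, not part of Stone's argument: it assumes a made-up one-dimensional distribution with $X$ uniform on $[0,1]$ and $P(Y = 1 \mid X = x)$ equal to $0.8$ for $x > 0.5$ and $0.2$ otherwise, so the Bayes risk is $0.2$. The schedule $k \approx \sqrt{n}$ is one common choice satisfying $k \to \infty$ and $k/n \to 0$.

```python
import math
import random

def simulate_knn_error(n, k, n_test=500, seed=0):
    """Estimate the k-NN test error on a toy 1-D problem with Bayes risk 0.2."""
    rng = random.Random(seed)

    def sample():
        x = rng.random()
        # Assumed toy distribution: P(Y = 1 | X = x) is 0.8 on the right half, 0.2 on the left.
        p_one = 0.8 if x > 0.5 else 0.2
        y = 1 if rng.random() < p_one else 0
        return x, y

    train = [sample() for _ in range(n)]
    errors = 0
    for _ in range(n_test):
        x, y = sample()
        # Labels of the k training points closest to x.
        nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
        ones = sum(label for _, label in nearest)
        # Majority vote; an exact tie (possible for even k) is resolved as class 0.
        prediction = 1 if 2 * ones > k else 0
        errors += prediction != y
    return errors / n_test

for n in (100, 400, 1600):
    k = max(1, round(math.sqrt(n)))  # k grows while k/n shrinks
    print(n, k, simulate_knn_error(n, k))
```

The estimated error should hover near the Bayes risk of $0.2$ as $n$ grows, up to the sampling noise of a 500-point test set. Repeating the experiment in higher dimensions would require far larger $n$ to see the same effect.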

The Curse of Dimensionality in Detail

Consider points uniformly distributed in the unit cube $[0,1]^d$ . To capture a fraction $f$ of the data in a hypercube neighborhood, the side length must be $f^{1/d}$ . For $f = 0.01$ (1% of data) in $d = 100$ dimensions, the side length is $0.01^{1/100} \approx 0.955$ . The "local" neighborhood spans 95.5% of the range of each feature. The neighborhood is not local at all.
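The arithmetic is easy to reproduce for other dimensions. A minimal sketch, with the helper name `neighborhood_side` chosen here for illustration:

```python
def neighborhood_side(fraction, dim):
    """Side length of a sub-cube of [0,1]^dim expected to hold `fraction` of uniform data."""
    return fraction ** (1.0 / dim)

# Side length needed to capture 1% of the data as the dimension grows.
for dim in (1, 2, 10, 100, 1000):
    print(dim, round(neighborhood_side(0.01, dim), 3))
```

In one dimension a 1% neighborhood covers 1% of the axis; by $d = 10$ it already covers more than half of every axis.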

This has a concrete consequence: in high dimensions, the distance from a query point to its nearest neighbor and its farthest neighbor become nearly equal. Formally, for random points in $\mathbb{R}^d$ :

$\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \quad \text{as } d \to \infty$

under mild conditions. When all distances are approximately equal, ranking by distance is meaningless.
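Distance concentration is straightforward to observe numerically. The following sketch draws one query and a set of points uniformly from $[0,1]^d$ and reports the relative contrast $(d_{\max} - d_{\min})/d_{\min}$; the sample size of 200 and the function name `relative_contrast` are arbitrary choices for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min from one uniform query to n_points uniform points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast should fall steadily as the dimension increases: in two dimensions the farthest point is many times farther than the nearest, while in a thousand dimensions the two distances differ by only a small fraction.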

Computational Considerations

The naive KNN classifier requires $O(nd)$ time per query: compute the distance from the query to all $n$ training points in $d$ dimensions. This is expensive for large datasets.

Data structures that speed up nearest neighbor search:

KD-trees: partition space along coordinate axes. Average query time $O(d \log n)$ in low dimensions, but degrades to $O(nd)$ when $d$ is large.
Ball trees: partition space using nested hyperspheres. Better than KD-trees in moderate dimensions.
Locality-sensitive hashing (LSH) (Indyk & Motwani, 1998): approximate nearest neighbors in sublinear time. Trades exactness for speed.
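To make the KD-tree idea concrete, here is a compact sketch of an exact nearest-neighbor search. It is a teaching implementation under simplifying assumptions (dictionary nodes, median split by full sort, a single nearest neighbor rather than $k$), not the layout used by any particular library.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively split points at the median along one coordinate axis."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return the (distance, point) pair for the exact nearest neighbor of query."""
    if node is None:
        return best
    dist = math.dist(node["point"], query)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    # Descend first into the side of the splitting plane that contains the query.
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
tree = build_kdtree(points)
query = (0.3, 0.7)
print(nearest(tree, query))
print(min((math.dist(p, query), p) for p in points))  # brute-force check
```

The pruning test `abs(diff) < best[0]` is what saves work in low dimensions. In high dimensions the splitting plane is almost always closer than the current best match, so both subtrees get searched and the cost drifts back toward the brute-force scan.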

Modern Vector Search

Exact Euclidean KNN is rarely used directly beyond a few hundred thousand points. Modern retrieval systems, including semantic search, embedding pipelines, and multimodal RAG, run approximate nearest neighbor (ANN) search over dense embedding vectors. Two families dominate:

Graph-based (HNSW): Hierarchical Navigable Small World graphs (Malkov & Yashunin, arXiv:1603.09320, TPAMI 2020). A hierarchy of proximity graphs that queries descend greedily. Available in FAISS, Milvus, Qdrant, and Weaviate.
Quantization-based (IVF-PQ, ScaNN): inverted file index with product quantization (Jégou, Douze, Schmid, TPAMI 2011) and anisotropic vector quantization (Guo et al., ScaNN, ICML 2020). Scales to billion-vector corpora.

These methods aim for high recall@k against the exact nearest neighbors rather than for exactness. Their theoretical behavior differs from classical KNN: consistency results for Stone-style averaging do not translate directly, and the curse of dimensionality re-emerges when embedding manifolds are nearly isotropic. The teaching connection is direct: every vector-DB query is a descendant of the KNN rule, with an index layer that trades exactness for speed.
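Recall@k compares what an approximate index returns with what an exact search would have returned. A minimal sketch of the metric follows; the function name `recall_at_k` and the id lists are hypothetical and do not come from any particular library.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true k nearest neighbors that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

exact = [17, 4, 92, 8, 51]    # hypothetical ids from an exact search
approx = [17, 92, 4, 33, 51]  # hypothetical ids from an ANN index
print(recall_at_k(approx, exact, k=5))
```

The metric ignores ordering within the top $k$: here the approximate result recovers four of the five true neighbors, so the recall is 0.8 even though the ranks differ.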

Canonical Examples

Example

KNN on the iris dataset

The iris dataset has $n = 150$ points in $d = 4$ dimensions with $C = 3$ classes. KNN with $k = 5$ achieves about 97% accuracy. The low dimension means distance is meaningful, and the classes are well-separated. This is the ideal regime for KNN: moderate $n$ , low $d$ , clear structure.

Example

KNN fails on high-dimensional text

In text classification with bag-of-words features, $d$ can be 10,000 or more. Most documents are at similar distances from any query document, so KNN with Euclidean distance performs poorly. Using cosine similarity or TF-IDF weighting partially alleviates this, but KNN is generally outperformed by methods like naive Bayes or SVMs in this regime.
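One reason cosine similarity helps on term-count vectors is that it ignores document length, which Euclidean distance does not. The sketch below uses three tiny, made-up count vectors to illustrate that single effect; it does not reproduce the high-dimensional concentration problem itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

short_doc = (2, 0, 1)    # hypothetical term counts
long_doc = (20, 0, 10)   # same word proportions, ten times longer
other_doc = (0, 2, 1)    # different vocabulary, similar length

# Euclidean distance: the long document looks far away, the unrelated one looks close.
print(math.dist(short_doc, long_doc), math.dist(short_doc, other_doc))
# Cosine similarity: the long document is a perfect match in direction.
print(cosine_similarity(short_doc, long_doc), cosine_similarity(short_doc, other_doc))
```

Under Euclidean distance the nearest neighbor of `short_doc` is the unrelated `other_doc`; under cosine similarity it is `long_doc`, which has the same word proportions.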

Common Confusions

Watch Out

The choice of k is not about bias-variance in the usual parametric sense

Increasing $k$ does not add "model complexity". KNN has no parameters. Instead, $k$ controls the smoothness of the decision boundary. Small $k$ gives a jagged boundary (high variance, low bias). Large $k$ gives a smooth boundary (low variance, high bias). But this is local smoothing, not parameter estimation.

Watch Out

KNN is sensitive to the distance metric, not just k

The choice of distance metric (Euclidean, Manhattan, Mahalanobis, cosine) profoundly affects KNN. Features on different scales make Euclidean distance dominated by the largest-scale feature. Always normalize features or use a learned metric.
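The effect of feature scale is easy to demonstrate. In this sketch, the data (height in meters, income in dollars) and the helper names `standardize` and `nearest_index` are invented for illustration; standardization to zero mean and unit variance is one common normalization among several.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_index(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))

# Hypothetical data: (height in meters, income in dollars).
rows = [(1.60, 50_000), (1.95, 50_500), (1.61, 52_000), (1.80, 90_000)]
print(nearest_index(rows, 0))               # raw features: income dominates
print(nearest_index(standardize(rows), 0))  # standardized: both features count
```

On the raw features, the nearest neighbor of row 0 is row 1, whose income is closest even though the height differs by 35 cm. After standardization it is row 2, which is close in both height and income.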

Summary

KNN predicts by majority vote of the $k$ nearest training points
No training phase (lazy learning); all computation at prediction time
1-NN asymptotic error is at most $2R^*$ (Cover and Hart)
Universal consistency: KNN converges to Bayes optimal if $k \to \infty$ , $k/n \to 0$
Curse of dimensionality: in high $d$, all points are nearly equidistant
Naive prediction cost is $O(nd)$ per query; KD-trees and ball trees help in low/moderate $d$, and LSH gives approximate answers in sublinear time

Optional Deeper Detail: Discriminant adaptive nearest neighbors (DANN): learning a local metric

Advanced section adapted from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd ed., Springer 2009), §13.4.1 "Discriminant Adaptive Nearest Neighbors," pp. 475-478, and Hastie, T. and Tibshirani, R. (1996), "Discriminant Adaptive Nearest Neighbor Classification," IEEE TPAMI 18(6), 607-616.

Euclidean KNN treats every coordinate direction as equally informative. When classes have different within-class variances or when discriminative directions are not axis-aligned, Euclidean distance is the wrong metric for classification. DANN learns a local Mahalanobis-style metric that stretches the neighborhood along directions where the class probabilities change little and shrinks it along directions of strong class separation.

Local-LDA setup. At a query point $x_0$ , compute weighted within-class and between-class scatter matrices using only the points in a local neighborhood (initial Euclidean k-NN of $x_0$ ). Let

$W \;=\; \sum_{c=1}^C \pi_c \, \Sigma_c, \qquad B \;=\; \sum_{c=1}^C \pi_c (\mu_c - \mu)(\mu_c - \mu)^\top$

be the local pooled within-class covariance and the local between-class covariance, with $\pi_c$ the local class proportions, $\mu_c$ the local class means, and $\mu$ the local overall mean.

The DANN metric at $x_0$ is

$\Sigma_{\text{DANN}}(x_0) \;=\; W^{-1/2}\, [W^{-1/2} B W^{-1/2} + \varepsilon I] \, W^{-1/2}$

where $\varepsilon$ is a small ridge regularizer (typically $\varepsilon = 1$ ). The DANN distance between $x_0$ and a candidate neighbor $x$ is

$d_{\text{DANN}}(x_0, x)^2 \;=\; (x - x_0)^\top \Sigma_{\text{DANN}}(x_0) (x - x_0).$

The construction has two graduate-grade ideas in one formula:

Sphering by $W^{-1/2}$ : in the sphered space, the within-class covariance is identity. Distance comparisons no longer favor low-variance directions just because they are tight; the metric now reflects true within-class proximity.
Adding $W^{-1/2} B W^{-1/2}$: this is the between-class covariance expressed in the sphered space, and its leading eigenvectors are the local LDA discriminant directions. Adding it stretches the metric along between-class directions, so a step orthogonal to the class boundary counts more than a step along it.

The $\varepsilon I$ regularizer keeps the metric well-conditioned when $B$ is rank-deficient (e.g., two classes in $d > 1$ dim makes $B$ rank-1).
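One way to read the formula is to rewrite it in sphered coordinates. The symbols $z$ and $B^*$ below are introduced here for illustration and are not part of the notation above. Set $z = W^{-1/2}(x - x_0)$ and $B^* = W^{-1/2} B W^{-1/2}$. Because $W^{-1/2}$ is symmetric, the DANN distance becomes

$d_{\text{DANN}}(x_0, x)^2 \;=\; z^\top (B^* + \varepsilon I)\, z \;=\; z^\top B^* z \;+\; \varepsilon \, \|z\|^2$

The second term is $\varepsilon$ times the ordinary squared Euclidean distance in the sphered space. The first term adds an extra penalty only for displacement along the between-class directions, and it vanishes for any $z$ orthogonal to the range of $B^*$. Without the $\varepsilon$ term, such directions would cost nothing at all, which is why the regularizer is needed when $B$ is rank-deficient.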

Algorithm.

Find initial Euclidean k-NN of $x_0$ (e.g., $k = 50$ ).
Compute local $W$ and $B$ from those points.
Form $\Sigma_{\text{DANN}}(x_0)$ .
Re-rank the candidates by DANN distance and take the top $k$ for the final vote.

The local matrices are recomputed at every query, so DANN is more expensive than vanilla KNN. The payoff: ESL §13.4.1 reports substantial accuracy gains on classification tasks where Euclidean KNN underperforms LDA but the underlying boundary is non-linear (so global LDA also underperforms).

Connection to deep learning. Learned distance metrics for retrieval (siamese networks, triplet losses, metric learning) are the modern descendants of DANN. The DANN insight survives: a metric tuned to the local class structure outperforms a global metric tuned to the average.

Exercises

Exercise (Core)

Problem

If the Bayes risk is $R^* = 0.1$ for a binary classification problem, what is the upper bound on the asymptotic 1-NN error rate?

Exercise (Core)

Problem

You have $n = 1000$ points in $d = 2$ dimensions and use $k = 31$ . Does this choice of $k$ satisfy the conditions for universal consistency? What about $k = n = 1000$ ?

Exercise (Advanced)

Problem

Show quantitatively why KNN breaks in high dimensions. Compute the expected distance from the origin to its nearest neighbor among $n$ points drawn uniformly from the unit cube $[0,1]^d$ . How does this scale with $d$ ?

References

Canonical:

Fix & Hodges, "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties," USAF School of Aviation Medicine, Report 4 (1951); reprinted in Int. Statistical Review 57(3), 238-247 (1989).
Cover & Hart, "Nearest Neighbor Pattern Classification," IEEE Trans. Inf. Theory 13(1), 21-27 (1967). The $R_{1\text{NN}} \leq 2R^*$ bound.
Stone, "Consistent Nonparametric Regression," Annals of Statistics 5(4), 595-620 (1977). Universal consistency of weighted local averaging, specialized to k-NN.

Curse of dimensionality:

Beyer, Goldstein, Ramakrishnan, Shaft, "When Is 'Nearest Neighbor' Meaningful?," ICDT 1999, 217-235. Formalizes the $d_{\max}/d_{\min} \to 1$ distance-concentration result.

Approximate nearest neighbors:

Indyk & Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," STOC 1998, 604-613. Locality-sensitive hashing.
Malkov & Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE TPAMI 42(4), 824-836 (2020); arXiv:1603.09320. HNSW.
Johnson, Douze, Jégou, "Billion-scale similarity search with GPUs," IEEE Trans. Big Data 7(3), 535-547 (2021); arXiv:1702.08734. FAISS.
Guo, Sun, Lindgren, Geng, Simcha, Chern, Kumar, "Accelerating Large-Scale Inference with Anisotropic Vector Quantization," ICML 2020. ScaNN.

Textbook:

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. §2.3.2 (k-NN as a local prediction rule), §13.3 (k-NN classifiers, asymptotic bound, finite-sample rate), §13.3.2 (Cover-Hart and improvements), §13.4 (adaptive nearest neighbors including DANN), §13.5 (computational considerations including edited and condensed nearest neighbors).
Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition (Springer 1996), Chapters 5-6, 11. The definitive treatment of k-NN consistency rates.
Shalev-Shwartz & Ben-David, Understanding Machine Learning (Cambridge 2014), Chapter 19.

Next Topics

The natural next steps from KNN:

Bias-variance tradeoff: understanding the effect of $k$ on KNN error
Decision trees and ensembles: another nonparametric method with very different properties

Last reviewed: April 13, 2026
