Attention as Kernel Regression
Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values, with the softmax kernel K(q,k) = exp(q^T k / sqrt(d)). Connects attention to classical nonparametric statistics and motivates linear attention via random features.
Why This Matters
The self-attention mechanism in transformers computes a weighted average of value vectors, with weights determined by query-key similarity. This is exactly the structure of Nadaraya-Watson kernel regression, a classical nonparametric estimator from the 1960s. Recognizing this connection provides three benefits: it explains why attention works (kernel smoothing is a well-studied and principled estimator), it clarifies the role of the scaling factor, and it motivates linear-time attention approximations via random features.
Mental Model
In Nadaraya-Watson regression, you predict a value at a query point by taking a weighted average of observed values, where the weights depend on the distance between the query and each observed input. In attention, you predict the output at a query position by taking a weighted average of value vectors, where the weights depend on the similarity between the query and each key. The softmax over query-key dot products defines an implicit kernel function.
Nadaraya-Watson Kernel Regression
Nadaraya-Watson Estimator
Given data pairs (x_1, y_1), ..., (x_n, y_n) and a kernel function K, the Nadaraya-Watson estimator at a query point x is:
m_hat(x) = [sum_i K(x, x_i) y_i] / [sum_j K(x, x_j)]
This is a kernel-weighted average: each y_i contributes in proportion to how similar x_i is to x under the kernel K.
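The estimator above can be sketched in a few lines of NumPy. The Gaussian kernel, the bandwidth of 0.3, and the sine-curve toy data are illustrative choices, not from the text:

```python
import numpy as np

def nadaraya_watson(x_query, x_data, y_data, bandwidth=0.3):
    """Kernel-weighted average of y_data, with weights given by a
    Gaussian kernel between x_query and each x_data point."""
    # K(x, x_i) = exp(-(x - x_i)^2 / (2 h^2))
    weights = np.exp(-((x_query - x_data) ** 2) / (2 * bandwidth ** 2))
    weights = weights / weights.sum()   # normalize so weights sum to 1
    return np.dot(weights, y_data)     # kernel-weighted average

# Noisy samples from y = sin(x)
rng = np.random.default_rng(0)
x_data = np.linspace(0, 2 * np.pi, 50)
y_data = np.sin(x_data) + 0.1 * rng.standard_normal(50)

# Estimate at x = pi/2, where the true function value is 1.0
estimate = nadaraya_watson(np.pi / 2, x_data, y_data)
```

Note the bias-variance role of the bandwidth here: a larger `bandwidth` averages over more neighbors (smoother, more biased), which is exactly the tradeoff the scaling factor controls in attention.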
Attention as Kernel Smoothing
Standard single-head self-attention computes, for position i in a sequence of length n:
out_i = sum_j [exp(q_i^T k_j / sqrt(d)) / sum_l exp(q_i^T k_l / sqrt(d))] v_j
where q_i = W_Q x_i, k_j = W_K x_j, v_j = W_V x_j.
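A minimal NumPy sketch of this computation (the shapes and random inputs are illustrative; the max-subtraction is the standard numerical-stability trick for softmax):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Single-head attention: out_i = sum_j softmax_j(q_i^T k_j / sqrt(d)) v_j."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, A = softmax_attention(Q, K, V)
```

Each row of `A` is a probability distribution over positions, so each output row is a convex combination of the rows of `V`.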
Softmax Attention is Nadaraya-Watson Regression with the Softmax Kernel
Statement
Softmax attention is a Nadaraya-Watson estimator with:
- Query points: q_i = W_Q x_i (projected inputs)
- Data points: the key-value pairs (k_j, v_j) (projected inputs)
- Kernel function: K(q, k) = exp(q^T k / sqrt(d))
Specifically:
out_i = [sum_j K(q_i, k_j) v_j] / [sum_l K(q_i, k_l)]
The attention weights a_ij = K(q_i, k_j) / sum_l K(q_i, k_l) are the normalized kernel weights.
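The identification can be checked numerically: attention computed the usual way and a Nadaraya-Watson estimate under the softmax kernel produce the same output. A sketch with illustrative shapes and random inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 16
q = rng.standard_normal(d)          # one query
K = rng.standard_normal((n, d))     # keys: the "data locations"
V = rng.standard_normal((n, d))     # values: the "observed responses"

# Attention view: softmax over scaled dot products, then average values.
scores = K @ q / np.sqrt(d)
attn_weights = np.exp(scores) / np.exp(scores).sum()
attn_out = attn_weights @ V

# Nadaraya-Watson view: kernel K(q, k) = exp(q^T k / sqrt(d)), normalized.
kernel_vals = np.exp(K @ q / np.sqrt(d))
nw_out = (kernel_vals @ V) / kernel_vals.sum()

assert np.allclose(attn_out, nw_out)   # identical by construction
```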
Intuition
Attention literally is kernel smoothing. The query asks "what is the value here?", the keys index where information is stored, and the kernel determines how much each stored value contributes to the answer. Keys similar to the query contribute more. The softmax ensures the weights sum to 1, making the output a convex combination of values, just like Nadaraya-Watson.
Proof Sketch
This is a direct identification of terms. The softmax attention formula and the Nadaraya-Watson formula have identical structure once you define K(q, k) = exp(q^T k / sqrt(d)). The only step is verifying that this K is a valid kernel (positive definite), which follows from the fact that the exponential of a positive definite kernel is positive definite (Schur product theorem applied to the power series of exp).
Why It Matters
This identification connects attention to 60 years of nonparametric statistics. Properties of kernel smoothers carry over: attention is a consistent estimator of the conditional expectation under appropriate regularity conditions. The inverse bandwidth is the 1/sqrt(d) scaling: a larger inverse bandwidth means a narrower kernel, sharper attention weights, and more "peaked" retrieval.
Failure Mode
The analogy is exact for the functional form but differs in key respects. In classical kernel regression, the data points are fixed and the kernel is chosen by the statistician. In attention, the "data points" (keys and values) are themselves learned functions of the input, and the "kernel" is parameterized by W_Q and W_K. The learned nature of keys and values is what gives attention its expressive power beyond fixed kernel regression.
The Softmax Kernel
Softmax Kernel
The softmax kernel is:
K(q, k) = exp(q^T k / sqrt(d))
This is not a standard kernel in the Mercer sense on the original input space. It is a kernel on the projected d-dimensional space where queries and keys live. As a function of the inner product q^T k, it is a positive definite kernel by the Schur product theorem (the exponential of a PD kernel is PD).
The 1/sqrt(d) scaling acts as an inverse bandwidth parameter. Without it, for random q and k in R^d with i.i.d. unit-variance components, the dot product q^T k has variance proportional to d. The scaling normalizes the variance to 1, keeping the softmax in a regime where gradients are well-behaved. This is analogous to choosing the bandwidth in kernel density estimation: too large a bandwidth and the kernel is flat (uniform attention); too small and the kernel is a spike (attention concentrated on a single token).
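The variance claim is easy to verify empirically. A quick Monte Carlo sketch (sample sizes and dimensions are illustrative): unscaled dot products of standard-normal vectors have variance about d, while the scaled version stays near 1 regardless of d.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 20000
variances = {}

for d in (16, 256):
    q = rng.standard_normal((num_samples, d))
    k = rng.standard_normal((num_samples, d))
    raw = (q * k).sum(axis=1)       # unscaled dot products: Var ~ d
    scaled = raw / np.sqrt(d)       # with 1/sqrt(d) scaling: Var ~ 1
    variances[d] = (raw.var(), scaled.var())
```

Inspecting `variances` shows the raw variance growing roughly 16x between d=16 and d=256 while the scaled variance is near 1 in both cases, which is why the softmax stays in a well-behaved regime.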
Random Feature Approximation and Linear Attention
The quadratic cost of standard attention comes from computing all n^2 pairwise kernel values K(q_i, k_j). Random feature approximation replaces the exact kernel with an approximate factored form.
Random Features Approximate Softmax Attention in Linear Time
Statement
If there exists a random feature map phi: R^d -> R^m such that:
K(q, k) = exp(q^T k / sqrt(d)) ≈ phi(q)^T phi(k)
then attention can be approximated as:
out_i ≈ [S^T phi(q_i)] / [z^T phi(q_i)], where S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)
The key observation: S and z can be precomputed once in O(n m d) time. Then each query is processed in O(m d) time. Total cost: O(n m d) instead of O(n^2 d). When m << n, this is linear in sequence length.
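The rearrangement can be sketched generically for any positive feature map. As an illustration the sketch below uses the elu(x)+1 map (the choice made in Katharopoulos et al., 2020, which approximates a different kernel than softmax); the shapes and inputs are illustrative:

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    """Attention with a factored kernel phi(q)^T phi(k): O(n*m*d) total,
    never forming the n x n attention matrix."""
    Qf, Kf = phi(Q), phi(K)                # (n, m) feature maps
    S = Kf.T @ V                           # (m, d): sum_j phi(k_j) v_j^T, once
    z = Kf.sum(axis=0)                     # (m,):   sum_j phi(k_j), once
    return (Qf @ S) / (Qf @ z)[:, None]    # each query costs O(m*d)

def elu_plus_one(X):
    """Positive feature map elu(x) + 1, elementwise."""
    return np.where(X > 0, X + 1.0, np.exp(X))

rng = np.random.default_rng(0)
n, d = 10, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V, elu_plus_one)
```

Because the features are strictly positive, the implicit weights are nonnegative and normalized, so each output row is still a convex combination of value rows.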
Intuition
The trick is to factor the kernel: instead of computing all pairwise kernel values, decompose K(q, k) ≈ phi(q)^T phi(k). Then the attention sum becomes a matrix multiplication that can be rearranged to avoid forming the n x n attention matrix. This is the same idea as using random Fourier features to speed up kernel methods.
Why It Matters
Performers (Choromanski et al., 2021) use positive random features, phi(x) = (1/sqrt(m)) * exp(-||x||^2 / 2) * (exp(w_1^T x), ..., exp(w_m^T x)) with w_i ~ N(0, I_d). This gives unbiased nonnegative estimates of the softmax kernel and enables linear-time attention. Random Feature Attention (RFA) uses a similar approach. These methods make long-context transformers computationally feasible.
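A sketch of the positive random feature estimator. It targets the unscaled kernel exp(q^T k) (in practice the 1/sqrt(d) scaling is absorbed by rescaling queries and keys); the dimensions, the 0.3 input scale, and m = 4096 are illustrative choices to keep the estimator's variance visibly small:

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x)_i = exp(w_i^T x - ||x||^2 / 2) / sqrt(m).
    For w_i ~ N(0, I), phi(q)^T phi(k) is an unbiased,
    strictly positive estimate of exp(q^T k)."""
    m = W.shape[0]
    sq_norms = (X ** 2).sum(axis=-1, keepdims=True)
    return np.exp(X @ W.T - sq_norms / 2) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 4096
q = 0.3 * rng.standard_normal(d)
k = 0.3 * rng.standard_normal(d)
W = rng.standard_normal((m, d))    # random projections, w_i ~ N(0, I_d)

exact = np.exp(q @ k)
approx = (positive_random_features(q[None], W)
          @ positive_random_features(k[None], W).T).item()
```

Unbiasedness follows from E[exp(w^T (q + k))] = exp(||q + k||^2 / 2) for w ~ N(0, I); multiplying by exp(-||q||^2 / 2 - ||k||^2 / 2) leaves exactly exp(q^T k). Positivity of every estimate is what keeps the normalized attention weights well-defined.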
Failure Mode
The approximation quality depends on m. For small m, the variance of the random feature estimate is high, and the approximated attention weights can differ substantially from exact softmax attention. In practice, Performers require m on the order of d log d or larger to match the quality of exact attention. For many practical sequence lengths, exact attention with FlashAttention is faster than random-feature approximation, and linear attention methods have not replaced exact attention in state-of-the-art models.
What the Kernel View Explains
Why attention is permutation equivariant. Nadaraya-Watson regression does not depend on the ordering of the data points, only on their kernel similarity to the query. Similarly, attention (without positional encoding) is permutation equivariant.
Why scaling by 1/sqrt(d) matters. In kernel regression, the bandwidth controls the bias-variance tradeoff. Too large a bandwidth (small scaling) gives uniform weights and high bias. Too small a bandwidth (large scaling) gives peaked weights and high variance. The 1/sqrt(d) scaling normalizes the variance of the dot product so that the softmax operates in a balanced regime.
Why multi-head attention helps. Multiple heads correspond to multiple kernel regressors with different learned kernels (different W_Q, W_K projections). Each head can specialize in detecting different types of patterns, like using a mixture of kernels with different bandwidths.
Common Confusions
The softmax kernel is not a Gaussian RBF kernel
The Gaussian RBF kernel is K(q, k) = exp(-||q - k||^2 / (2 sigma^2)). The softmax kernel is exp(q^T k / sqrt(d)). These are related but different. Expanding ||q - k||^2 = ||q||^2 - 2 q^T k + ||k||^2, the Gaussian kernel (with sigma = 1) becomes exp(-||q||^2 / 2) * exp(q^T k) * exp(-||k||^2 / 2). The softmax kernel is the middle factor (up to the 1/sqrt(d) scaling), without the norm-dependent terms. This distinction matters for the random feature analysis.
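The expansion can be verified numerically. A sketch with sigma = 1, no 1/sqrt(d) scaling, and illustrative random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.standard_normal(5), rng.standard_normal(5)

rbf = np.exp(-np.sum((q - k) ** 2) / 2)               # Gaussian RBF, sigma = 1
middle = np.exp(q @ k)                                # softmax-style factor
norm_terms = np.exp(-q @ q / 2) * np.exp(-k @ k / 2)  # norm-dependent factors

# The RBF kernel factors into the softmax-style term times norm terms.
assert np.isclose(rbf, middle * norm_terms)
```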
Linear attention is not just removing the softmax
Removing the softmax and computing (Q K^T) V as Q (K^T V) gives linear attention in O(n d^2) time. But this changes the kernel: without the softmax, the implicit kernel is just the dot product q^T k, which can be negative and is not a proper density kernel. The resulting attention weights are not a valid probability distribution. Random-feature-based linear attention instead approximates the softmax kernel, preserving the probabilistic interpretation while achieving linear cost.
Summary
- Softmax attention = Nadaraya-Watson kernel regression with kernel K(q, k) = exp(q^T k / sqrt(d))
- Attention weights are normalized kernel weights summing to 1
- The 1/sqrt(d) scaling is an inverse bandwidth parameter
- The softmax kernel is positive definite by the Schur product theorem
- Random feature approximation: factor the kernel to avoid the attention matrix
- Performers use positive random features for linear-time attention
- Multi-head attention corresponds to a mixture of kernels with different learned parameters
Exercises
Problem
Show that when all query-key dot products are equal (q_i^T k_j = c for all j), softmax attention reduces to a uniform average of the values. What does this correspond to in the kernel regression interpretation?
Problem
In standard attention, the complexity is O(n^2 d) because we compute the full n x n attention matrix. In the random feature approximation with feature dimension m, the complexity is O(n m d). For what relationship between m and n does the random feature approach become faster? If m is on the order of d log d (a typical requirement for good approximation), for what sequence lengths is the approximation beneficial?
References
Canonical:
- Nadaraya, "On Estimating Regression" (Theory of Probability and its Applications, 1964)
- Watson, "Smooth Regression Analysis" (Sankhya, 1964)
Current:
- Tsai et al., "Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel" (2019)
- Choromanski et al., "Rethinking Attention with Performers" (2021)
- Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (2020)
Next Topics
The natural next steps from attention as kernel regression:
- Neural tangent kernel: another connection between neural networks and kernel methods
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)