
LLM Construction

Attention as Kernel Regression

Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values, with the softmax kernel K(q,k) = exp(q^T k / sqrt(d)). Connects attention to classical nonparametric statistics and motivates linear attention via random features.

Advanced · Tier 3 · Current · ~45 min

Why This Matters

The self-attention mechanism in transformers computes a weighted average of value vectors, with weights determined by query-key similarity. This is exactly the structure of Nadaraya-Watson kernel regression, a classical nonparametric estimator from the 1960s. Recognizing this connection provides three benefits: it explains why attention works (kernel smoothing is a well-studied and principled estimator), it clarifies the role of the $1/\sqrt{d_k}$ scaling factor, and it motivates linear-time attention approximations via random features.

Mental Model

In Nadaraya-Watson regression, you predict a value at a query point by taking a weighted average of observed values, where the weights depend on the distance between the query and each observed input. In attention, you predict the output at a query position by taking a weighted average of value vectors, where the weights depend on the similarity between the query and each key. The softmax over query-key dot products defines an implicit kernel function.

Nadaraya-Watson Kernel Regression

Definition

Nadaraya-Watson Estimator

Given data pairs $(x_1, y_1), \ldots, (x_n, y_n)$ and a kernel function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$, the Nadaraya-Watson estimator at a query point $x$ is:

$$\hat{f}(x) = \frac{\sum_{i=1}^n K(x, x_i) \, y_i}{\sum_{j=1}^n K(x, x_j)}$$

This is a kernel-weighted average: each $y_i$ contributes in proportion to how similar $x_i$ is to $x$ under the kernel $K$.
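The estimator is a few lines of NumPy. A minimal sketch with a Gaussian kernel (the kernel choice and the `bandwidth` value here are illustrative assumptions, not prescribed by the definition):

```python
import numpy as np

def nadaraya_watson(x_query, x_data, y_data, bandwidth=1.0):
    """Nadaraya-Watson estimate at x_query with a Gaussian kernel."""
    # Kernel weight for each data point: similarity of x_i to the query.
    weights = np.exp(-((x_query - x_data) ** 2) / (2 * bandwidth ** 2))
    # Kernel-weighted average of the observed values.
    return np.sum(weights * y_data) / np.sum(weights)

# Noisy observations of y = sin(x) on [0, 2*pi].
rng = np.random.default_rng(0)
x_data = np.linspace(0, 2 * np.pi, 50)
y_data = np.sin(x_data) + 0.1 * rng.standard_normal(50)

# Estimate near the peak of the sine; should land close to sin(pi/2) = 1.
estimate = nadaraya_watson(np.pi / 2, x_data, y_data, bandwidth=0.3)
```

Nearby points dominate the average; shrinking `bandwidth` makes the estimate more local (lower bias, higher variance).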

Attention as Kernel Smoothing

Standard single-head self-attention computes, for position $i$ in a sequence of length $n$:

$$\text{Attn}(Q, K, V)_i = \frac{\sum_{j=1}^n \exp(q_i^\top k_j / \sqrt{d_k}) \, v_j}{\sum_{j=1}^n \exp(q_i^\top k_j / \sqrt{d_k})}$$

where $q_i = x_i W_Q$, $k_j = x_j W_K$, $v_j = x_j W_V$.
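In NumPy the formula above is also a few lines; this is a minimal single-head sketch (no masking, and the projections $W_Q$, $W_K$, $W_V$ are assumed to have been applied already):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Row i of the output is a kernel-weighted average of the rows of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # q_i^T k_j / sqrt(d_k)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over j
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 4
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out = softmax_attention(Q, K, V)   # shape (n, d_v)
```

(A production implementation would subtract the row-wise maximum of `scores` before exponentiating, for numerical stability; the softmax is invariant to that shift.)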

Proposition

Softmax Attention is Nadaraya-Watson Regression with the Softmax Kernel

Statement

Softmax attention is a Nadaraya-Watson estimator with:

  • Query points: $q_i$ (projected inputs)
  • Data points: $(k_j, v_j)$ (projected key-value pairs)
  • Kernel function: $K_{\text{soft}}(q, k) = \exp(q^\top k / \sqrt{d_k})$

Specifically:

$$\text{Attn}(Q, K, V)_i = \frac{\sum_j K_{\text{soft}}(q_i, k_j) \, v_j}{\sum_j K_{\text{soft}}(q_i, k_j)} = \hat{f}_{\text{NW}}(q_i)$$

The attention weights $\alpha_{ij} = K_{\text{soft}}(q_i, k_j) / \sum_l K_{\text{soft}}(q_i, k_l)$ are the normalized kernel weights.
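The identification can be checked numerically: computing the output via the softmax and via the explicit Nadaraya-Watson formula with the softmax kernel gives the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 3
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

# Softmax attention, computed the usual way (with the standard max-shift).
scores = Q @ K.T / np.sqrt(d_k)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out_softmax = attn @ V

# Nadaraya-Watson with the softmax kernel K_soft(q, k) = exp(q^T k / sqrt(d_k)).
K_soft = np.exp(Q @ K.T / np.sqrt(d_k))
out_nw = (K_soft @ V) / K_soft.sum(axis=-1, keepdims=True)

assert np.allclose(out_softmax, out_nw)  # identical up to floating point
```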

Intuition

Attention literally is kernel smoothing. The query asks "what is the value here?", the keys index where information is stored, and the kernel determines how much each stored value contributes to the answer. Keys similar to the query contribute more. The softmax ensures the weights sum to 1, making the output a convex combination of values, just like Nadaraya-Watson.

Proof Sketch

This is a direct identification of terms. The softmax attention formula and the Nadaraya-Watson formula have identical structure once you define $K_{\text{soft}}(q, k) = \exp(q^\top k / \sqrt{d_k})$. The only step is verifying that $K_{\text{soft}}$ is a valid kernel (positive definite), which follows from the fact that the exponential of a positive definite kernel is positive definite (apply the Schur product theorem term by term to the power series of $\exp$).

Why It Matters

This identification connects attention to 60 years of nonparametric statistics. Properties of kernel smoothers carry over: attention is a consistent estimator of the conditional expectation $\mathbb{E}[V \mid Q = q]$ under appropriate regularity conditions. The $1/\sqrt{d_k}$ scaling plays the role of an inverse bandwidth: multiplying the logits by a larger factor gives a narrower kernel, sharper attention weights, and more "peaked" retrieval, while a smaller factor flattens the weights toward a uniform average.

Failure Mode

The analogy is exact for the functional form but differs in key respects. In classical kernel regression, the data points $(x_i, y_i)$ are fixed and the kernel is chosen by the statistician. In attention, the "data points" (keys and values) are themselves learned functions of the input, and the "kernel" is parameterized by $W_Q$ and $W_K$. The learned nature of keys and values is what gives attention its expressive power beyond fixed kernel regression.

The Softmax Kernel

Definition

Softmax Kernel

The softmax kernel is:

$$K_{\text{soft}}(q, k) = \exp\left(\frac{q^\top k}{\sqrt{d_k}}\right)$$

This is not a standard kernel in the Mercer sense on the original input space $\mathcal{X}$. It is a kernel on the projected space $\mathbb{R}^{d_k}$ where queries and keys live. As a function of the inner product $q^\top k$, it is a positive definite kernel by the Schur product theorem (the exponential of a PD kernel is PD).

The $1/\sqrt{d_k}$ scaling acts as an inverse bandwidth parameter. Without it, for random $q$ and $k$ in $\mathbb{R}^{d_k}$, the dot product $q^\top k$ has variance proportional to $d_k$. The scaling normalizes the variance to $O(1)$, keeping the softmax in a regime where gradients are well-behaved. This is analogous to choosing the bandwidth in kernel density estimation: too large and the kernel is flat (uniform attention); too small and the kernel is a spike (attention on a single token).
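The variance claim is easy to verify empirically; a quick check with random unit-variance queries and keys (the sample size of 10,000 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 256):
    # 10,000 random query/key pairs with unit-variance components.
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = np.einsum("ij,ij->i", q, k)           # per-pair dot products
    raw_var = dots.var()                         # grows like d_k
    scaled_var = (dots / np.sqrt(d_k)).var()     # stays near 1
    print(f"d_k={d_k:4d}  var(q.k)={raw_var:7.1f}  var(q.k/sqrt(d_k))={scaled_var:.2f}")
```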

Random Feature Approximation and Linear Attention

The quadratic cost $O(n^2 d_k)$ of standard attention comes from computing all pairwise kernel values $K_{\text{soft}}(q_i, k_j)$. Random feature approximation replaces the exact kernel with an approximate factored form.

Proposition

Random Features Approximate Softmax Attention in Linear Time

Statement

If there exists a random feature map $\phi: \mathbb{R}^{d_k} \to \mathbb{R}^D$ such that:

$$\mathbb{E}[\phi(q)^\top \phi(k)] = K_{\text{soft}}(q, k) = \exp(q^\top k / \sqrt{d_k})$$

then attention can be approximated as:

$$\text{Attn}(Q, K, V)_i \approx \frac{\phi(q_i)^\top \sum_j \phi(k_j) v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)}$$

The key observation: $\sum_j \phi(k_j) v_j^\top$ and $\sum_j \phi(k_j)$ can be precomputed once in $O(nDd_v)$ time. Then each query is processed in $O(Dd_v)$ time. Total cost: $O(nDd_v)$ instead of $O(n^2 d_k)$. When $D \ll n$, this is linear in sequence length.
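The rearrangement itself does not depend on how well the feature map approximates the softmax kernel. A sketch using the simple positive map $\phi(x) = \text{elu}(x) + 1$ from Katharopoulos et al. (which induces a different kernel than softmax, but makes the factorization exact rather than approximate):

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a simple positive feature map (Katharopoulos et al.).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Factored form: the n x n attention matrix is never materialized."""
    Qf, Kf = phi(Q), phi(K)                 # (n, D) feature matrices
    S = Kf.T @ V                            # (D, d_v): sum_j phi(k_j) v_j^T
    z = Kf.sum(axis=0)                      # (D,):    sum_j phi(k_j)
    return (Qf @ S) / (Qf @ z)[:, None]     # each query costs O(D d_v)

def quadratic_attention(Q, K, V):
    """Same kernel, via the explicit n x n weight matrix."""
    W = phi(Q) @ phi(K).T
    return (W @ V) / W.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((7, 8)) for _ in range(3))
assert np.allclose(linear_attention(Q, K, V), quadratic_attention(Q, K, V))
```

The two functions compute the same output; only `linear_attention` avoids the $n \times n$ intermediate.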

Intuition

The trick is to factor the kernel: instead of computing all $n^2$ pairwise kernel values, decompose $K(q_i, k_j) \approx \phi(q_i)^\top \phi(k_j)$. Then the attention sum becomes a matrix product that can be reassociated to avoid the $n \times n$ attention matrix. This is the same idea as using random Fourier features to speed up kernel methods.

Why It Matters

Performers (Choromanski et al., 2021) use positive random features: rescale inputs to $x' = x / d_k^{1/4}$ and apply $\phi(x') = \exp(-\|x'\|^2/2) \cdot [\exp(w_1^\top x'), \ldots, \exp(w_D^\top x')] / \sqrt{D}$ with $w_i \sim \mathcal{N}(0, I_{d_k})$. This gives unbiased nonnegative estimates of the softmax kernel and enables linear-time attention. Random Feature Attention (RFA) uses a similar approach. These methods make long-context transformers computationally feasible.
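A sketch of the Performer-style positive random features, assuming inputs are first rescaled by $d_k^{-1/4}$ so that the estimated kernel is $\exp(q^\top k / \sqrt{d_k})$; the feature count `D` is taken large here so the Monte Carlo estimate is visibly close to the exact kernel value:

```python
import numpy as np

def positive_features(X, W):
    """Positive random features applied to pre-scaled inputs (rows of X)."""
    D = W.shape[0]
    # exp(w_i^T x - ||x||^2 / 2) / sqrt(D) for each row x of X; always >= 0.
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(D)

rng = np.random.default_rng(0)
d_k, D = 4, 20_000
q = 0.5 * rng.standard_normal(d_k)
k = 0.5 * rng.standard_normal(d_k)

# Rescale so that E[phi(q')^T phi(k')] = exp(q^T k / sqrt(d_k)).
qs, ks = q / d_k**0.25, k / d_k**0.25
W = rng.standard_normal((D, d_k))          # w_i ~ N(0, I)

estimate = (positive_features(qs[None], W) @ positive_features(ks[None], W).T)[0, 0]
exact = np.exp(q @ k / np.sqrt(d_k))
# estimate approaches exact as D grows, and every feature is nonnegative.
```

Nonnegativity is the point of this construction: trigonometric random features can produce negative kernel estimates, which destabilize the attention normalization.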

Failure Mode

The approximation quality depends on $D$. For small $D$, the variance of the random feature estimate is high, and the approximated attention weights can differ substantially from exact softmax attention. In practice, Performers require $D$ on the order of $d_k \log d_k$ or larger to match the quality of exact attention. For many practical sequence lengths ($n < 8192$), exact attention with FlashAttention is faster than random-feature approximation. Linear attention methods have not replaced exact attention in state-of-the-art models.

What the Kernel View Explains

Why attention is permutation equivariant. Nadaraya-Watson regression does not depend on the ordering of the data points, only on their kernel similarity to the query. Similarly, attention (without positional encoding) is permutation equivariant.

Why scaling by $1/\sqrt{d_k}$ matters. In kernel regression, the bandwidth controls the bias-variance tradeoff. Too large a bandwidth (logits scaled down too aggressively) gives near-uniform weights and high bias. Too small a bandwidth (logits scaled up too aggressively) gives peaked weights and high variance. The $1/\sqrt{d_k}$ scaling normalizes the variance of the dot product so that the softmax operates in a balanced regime.

Why multi-head attention helps. Multiple heads correspond to multiple kernel regressors with different learned kernels (different $W_Q$, $W_K$ projections). Each head can specialize in detecting different types of patterns, like using a mixture of kernels with different bandwidths.

Common Confusions

Watch Out

The softmax kernel is not a Gaussian RBF kernel

The Gaussian RBF kernel is $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$. The softmax kernel is $K(q, k) = \exp(q^\top k / \sqrt{d_k})$. These are related but different. Expanding $\|q - k\|^2 = \|q\|^2 - 2 q^\top k + \|k\|^2$, the Gaussian kernel becomes $\exp(-\|q\|^2/2\sigma^2) \cdot \exp(q^\top k/\sigma^2) \cdot \exp(-\|k\|^2/2\sigma^2)$. The softmax kernel is the middle factor, without the norm-dependent terms. This distinction matters for the random feature analysis.
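The factorization can be checked directly (the bandwidth `sigma2` is an arbitrary choice for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
sigma2 = 2.0   # arbitrary bandwidth sigma^2

rbf = np.exp(-np.sum((q - k) ** 2) / (2 * sigma2))
middle = np.exp(q @ k / sigma2)            # the softmax-kernel-style factor
prefactors = np.exp(-(q @ q) / (2 * sigma2)) * np.exp(-(k @ k) / (2 * sigma2))

# The Gaussian RBF kernel = norm-dependent prefactors * softmax-style factor.
assert np.isclose(rbf, prefactors * middle)
```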

Watch Out

Linear attention is not just removing the softmax

Removing the softmax and computing $Q(K^\top V)$ directly gives linear attention in $O(n d_k d_v)$ time. But this changes the kernel: without the softmax, the implicit kernel is just the dot product $K(q, k) = q^\top k$, which can be negative and is not a proper density kernel. The resulting attention weights are not a valid probability distribution. Random-feature-based linear attention approximates the softmax kernel, preserving the probabilistic interpretation while achieving linear cost.
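The $O(n d_k d_v)$ claim is just associativity of matrix multiplication; note the absence of any normalization, matching the caveat that these weights are not a probability distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 100, 16, 16
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

# (Q K^T) V materializes the n x n matrix: O(n^2) time and memory.
out_quadratic = (Q @ K.T) @ V
# Q (K^T V) reassociates the same product: O(n d_k d_v), no n x n matrix.
out_linear = Q @ (K.T @ V)

assert np.allclose(out_quadratic, out_linear)
```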

Summary

  • Softmax attention = Nadaraya-Watson kernel regression with kernel $K(q, k) = \exp(q^\top k / \sqrt{d_k})$
  • Attention weights are normalized kernel weights summing to 1
  • The $1/\sqrt{d_k}$ scaling is an inverse bandwidth parameter
  • The softmax kernel is positive definite by the Schur product theorem
  • Random feature approximation: factor the kernel to avoid the $n \times n$ attention matrix
  • Performers use positive random features for $O(nDd_v)$ linear-time attention
  • Multi-head attention corresponds to a mixture of kernels with different learned parameters

Exercises

ExerciseCore

Problem

Show that when all query-key dot products are equal ($q_i^\top k_j = c$ for all $j$), softmax attention reduces to a uniform average of the values. What does this correspond to in the kernel regression interpretation?

ExerciseAdvanced

Problem

In standard attention, the complexity is $O(n^2 d_k)$ because we compute the $n \times n$ attention matrix. In the random feature approximation with feature dimension $D$, the complexity is $O(nDd_v)$. For what relationship between $n$ and $D$ does the random feature approach become faster? If $D = d_k \log d_k$ (a typical requirement for good approximation), for what sequence lengths $n$ is the approximation beneficial?

References

Canonical:

  • Nadaraya, "On Estimating Regression" (Theory of Probability and its Applications, 1964)
  • Watson, "Smooth Regression Analysis" (Sankhya, 1964)

Current:

  • Tsai et al., "Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel" (2019)
  • Choromanski et al., "Rethinking Attention with Performers" (2021)
  • Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (2020)

Next Topics

The natural next steps from attention as kernel regression:

  • Neural tangent kernel: another connection between neural networks and kernel methods

Last reviewed: April 2026
