Attention as Kernel Regression
Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values, with the softmax kernel K(q,k) = exp(q^T k / sqrt(d)). Connects attention to classical nonparametric statistics and motivates linear attention via random features.
Why This Matters
The self-attention mechanism in transformers computes a weighted average of value vectors, with weights determined by query-key similarity. This is exactly the structure of Nadaraya-Watson kernel regression, a classical nonparametric estimator from the 1960s. Recognizing this connection provides three benefits: it explains why attention works (kernel smoothing is a well-studied and principled estimator), it clarifies the role of the scaling factor, and it motivates linear-time attention approximations via random features.
Mental Model
In Nadaraya-Watson regression, you predict a value at a query point by taking a weighted average of observed values, where the weights depend on the distance between the query and each observed input. In attention, you predict the output at a query position by taking a weighted average of value vectors, where the weights depend on the similarity between the query and each key. The softmax over query-key dot products defines an implicit kernel function.
Nadaraya-Watson Kernel Regression
Nadaraya-Watson Estimator
Given data pairs (x_1, y_1), ..., (x_n, y_n) and a kernel function K, the Nadaraya-Watson estimator at a query point x is:
m_hat(x) = [sum_i K(x, x_i) y_i] / [sum_j K(x, x_j)]
This is a kernel-weighted average: each y_i contributes in proportion to how similar x_i is to x under the kernel K.
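The estimator above can be sketched in a few lines of NumPy. The Gaussian kernel, the bandwidth of 0.3, and the sine-curve toy data are illustrative choices, not from the text:

```python
import numpy as np

def nadaraya_watson(x_query, x_data, y_data, bandwidth=0.3):
    """Kernel-weighted average of y_data, with weights given by a
    Gaussian kernel between x_query and each x_data point."""
    # K(x, x_i) = exp(-(x - x_i)^2 / (2 h^2))
    weights = np.exp(-((x_query - x_data) ** 2) / (2 * bandwidth ** 2))
    weights = weights / weights.sum()   # normalize so weights sum to 1
    return np.dot(weights, y_data)     # kernel-weighted average

# Noisy samples from y = sin(x)
rng = np.random.default_rng(0)
x_data = np.linspace(0, 2 * np.pi, 50)
y_data = np.sin(x_data) + 0.1 * rng.standard_normal(50)

# Estimate at x = pi/2, where the true function value is 1.0
estimate = nadaraya_watson(np.pi / 2, x_data, y_data)
```

Note the bias-variance role of the bandwidth here: a larger `bandwidth` averages over more neighbors (smoother, more biased), which is exactly the tradeoff the scaling factor controls in attention.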
Attention as Kernel Smoothing
Standard single-head self-attention computes, for position i in a sequence of length n:
out_i = sum_j [exp(q_i^T k_j / sqrt(d)) / sum_l exp(q_i^T k_l / sqrt(d))] v_j
where q_i = W_Q x_i, k_j = W_K x_j, v_j = W_V x_j.
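A minimal NumPy sketch of this computation (the shapes and random inputs are illustrative; the max-subtraction is the standard numerical-stability trick for softmax):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Single-head attention: out_i = sum_j softmax_j(q_i^T k_j / sqrt(d)) v_j."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, A = softmax_attention(Q, K, V)
```

Each row of `A` is a probability distribution over positions, so each output row is a convex combination of the rows of `V`.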
Softmax Attention is Nadaraya-Watson Regression with the Softmax Kernel
Statement
Softmax attention is a Nadaraya-Watson estimator with:
- Query points: q_i = W_Q x_i (projected inputs)
- Data points: the key-value pairs (k_j, v_j) (projected inputs)
- Kernel function: K(q, k) = exp(q^T k / sqrt(d))
Specifically:
out_i = [sum_j K(q_i, k_j) v_j] / [sum_l K(q_i, k_l)]
The attention weights a_ij = K(q_i, k_j) / sum_l K(q_i, k_l) are the normalized kernel weights.
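The identification can be checked numerically: attention computed the usual way and a Nadaraya-Watson estimate under the softmax kernel produce the same output. A sketch with illustrative shapes and random inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 16
q = rng.standard_normal(d)          # one query
K = rng.standard_normal((n, d))     # keys: the "data locations"
V = rng.standard_normal((n, d))     # values: the "observed responses"

# Attention view: softmax over scaled dot products, then average values.
scores = K @ q / np.sqrt(d)
attn_weights = np.exp(scores) / np.exp(scores).sum()
attn_out = attn_weights @ V

# Nadaraya-Watson view: kernel K(q, k) = exp(q^T k / sqrt(d)), normalized.
kernel_vals = np.exp(K @ q / np.sqrt(d))
nw_out = (kernel_vals @ V) / kernel_vals.sum()

assert np.allclose(attn_out, nw_out)   # identical by construction
```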
Intuition
Attention literally is kernel smoothing. The query asks "what is the value here?", the keys index where information is stored, and the kernel determines how much each stored value contributes to the answer. Keys similar to the query contribute more. The softmax ensures the weights sum to 1, making the output a convex combination of values, just like Nadaraya-Watson.
Proof Sketch
This is a direct identification of terms. The softmax attention formula and the Nadaraya-Watson formula have identical structure once you define K(q, k) = exp(q^T k / sqrt(d)). The only step is verifying that this K is a valid kernel (positive definite), which follows from the fact that the exponential of a positive definite kernel is positive definite (Schur product theorem applied to the power series of exp).
Why It Matters
This identification connects attention to 60 years of nonparametric statistics. Properties of kernel smoothers carry over: attention is a consistent estimator of the conditional expectation under appropriate regularity conditions. The inverse bandwidth is the 1/sqrt(d) scaling: a larger inverse bandwidth means a narrower kernel, sharper attention weights, and more "peaked" retrieval.
Failure Mode
The analogy is exact for the functional form but differs in key respects. In classical kernel regression, the data points are fixed and the kernel is chosen by the statistician. In attention, the "data points" (keys and values) are themselves learned functions of the input, and the "kernel" is parameterized by W_Q and W_K. The learned nature of keys and values is what gives attention its expressive power beyond fixed kernel regression.
The Softmax Kernel
Softmax Kernel
The softmax kernel is:
K(q, k) = exp(q^T k / sqrt(d))
This is not a standard kernel in the Mercer sense on the original input space. It is a kernel on the projected d-dimensional space where queries and keys live. As a function of the inner product q^T k, it is a positive definite kernel by the Schur product theorem (the exponential of a PD kernel is PD).
The 1/sqrt(d) scaling acts as an inverse bandwidth parameter. Without it, for random q and k in R^d with i.i.d. unit-variance components, the dot product q^T k has variance proportional to d. The scaling normalizes the variance to 1, keeping the softmax in a regime where gradients are well-behaved. This is analogous to choosing the bandwidth in kernel density estimation: too large a bandwidth and the kernel is flat (uniform attention); too small and the kernel is a spike (attention concentrated on a single token).
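The variance claim is easy to verify empirically. A quick Monte Carlo sketch (sample sizes and dimensions are illustrative): unscaled dot products of standard-normal vectors have variance about d, while the scaled version stays near 1 regardless of d.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 20000
variances = {}

for d in (16, 256):
    q = rng.standard_normal((num_samples, d))
    k = rng.standard_normal((num_samples, d))
    raw = (q * k).sum(axis=1)       # unscaled dot products: Var ~ d
    scaled = raw / np.sqrt(d)       # with 1/sqrt(d) scaling: Var ~ 1
    variances[d] = (raw.var(), scaled.var())
```

Inspecting `variances` shows the raw variance growing roughly 16x between d=16 and d=256 while the scaled variance is near 1 in both cases, which is why the softmax stays in a well-behaved regime.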
Random Feature Approximation and Linear Attention
The quadratic cost of standard attention comes from computing all n^2 pairwise kernel values K(q_i, k_j). Random feature approximation replaces the exact kernel with an approximate factored form.
Random Features Approximate Softmax Attention in Linear Time
Statement
If there exists a random feature map phi: R^d -> R^m such that:
K(q, k) = exp(q^T k / sqrt(d)) ≈ phi(q)^T phi(k)
then attention can be approximated as:
out_i ≈ [S^T phi(q_i)] / [z^T phi(q_i)], where S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)
The key observation: S and z can be precomputed once in O(n m d) time. Then each query is processed in O(m d) time. Total cost: O(n m d) instead of O(n^2 d). When m << n, this is linear in sequence length.
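The rearrangement can be sketched generically for any positive feature map. As an illustration the sketch below uses the elu(x)+1 map (the choice made in Katharopoulos et al., 2020, which approximates a different kernel than softmax); the shapes and inputs are illustrative:

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    """Attention with a factored kernel phi(q)^T phi(k): O(n*m*d) total,
    never forming the n x n attention matrix."""
    Qf, Kf = phi(Q), phi(K)                # (n, m) feature maps
    S = Kf.T @ V                           # (m, d): sum_j phi(k_j) v_j^T, once
    z = Kf.sum(axis=0)                     # (m,):   sum_j phi(k_j), once
    return (Qf @ S) / (Qf @ z)[:, None]    # each query costs O(m*d)

def elu_plus_one(X):
    """Positive feature map elu(x) + 1, elementwise."""
    return np.where(X > 0, X + 1.0, np.exp(X))

rng = np.random.default_rng(0)
n, d = 10, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V, elu_plus_one)
```

Because the features are strictly positive, the implicit weights are nonnegative and normalized, so each output row is still a convex combination of value rows.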
Intuition
The trick is to factor the kernel: instead of computing all pairwise kernel values, decompose K(q, k) ≈ phi(q)^T phi(k). Then the attention sum becomes a matrix multiplication that can be rearranged to avoid forming the n x n attention matrix. This is the same idea as using random Fourier features to speed up kernel methods.
Why It Matters
Performers (Choromanski et al., 2021) use positive random features, phi(x) = (1/sqrt(m)) * exp(-||x||^2 / 2) * (exp(w_1^T x), ..., exp(w_m^T x)) with w_i ~ N(0, I_d). This gives unbiased nonnegative estimates of the softmax kernel and enables linear-time attention. Random Feature Attention (RFA) uses a similar approach. These methods make long-context transformers computationally feasible.
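A sketch of the positive random feature estimator. It targets the unscaled kernel exp(q^T k) (in practice the 1/sqrt(d) scaling is absorbed by rescaling queries and keys); the dimensions, the 0.3 input scale, and m = 4096 are illustrative choices to keep the estimator's variance visibly small:

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x)_i = exp(w_i^T x - ||x||^2 / 2) / sqrt(m).
    For w_i ~ N(0, I), phi(q)^T phi(k) is an unbiased,
    strictly positive estimate of exp(q^T k)."""
    m = W.shape[0]
    sq_norms = (X ** 2).sum(axis=-1, keepdims=True)
    return np.exp(X @ W.T - sq_norms / 2) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 4096
q = 0.3 * rng.standard_normal(d)
k = 0.3 * rng.standard_normal(d)
W = rng.standard_normal((m, d))    # random projections, w_i ~ N(0, I_d)

exact = np.exp(q @ k)
approx = (positive_random_features(q[None], W)
          @ positive_random_features(k[None], W).T).item()
```

Unbiasedness follows from E[exp(w^T (q + k))] = exp(||q + k||^2 / 2) for w ~ N(0, I); multiplying by exp(-||q||^2 / 2 - ||k||^2 / 2) leaves exactly exp(q^T k). Positivity of every estimate is what keeps the normalized attention weights well-defined.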
Failure Mode
The approximation quality depends on m. For small m, the variance of the random feature estimate is high, and the approximated attention weights can differ substantially from exact softmax attention. In practice, Performers require m on the order of d log d or larger to match the quality of exact attention. For many practical sequence lengths, exact attention with FlashAttention is faster than random-feature approximation, and linear attention methods have not replaced exact attention in state-of-the-art models.
What the Kernel View Explains
Why attention is permutation equivariant. Nadaraya-Watson regression does not depend on the ordering of the data points, only on their kernel similarity to the query. Similarly, attention (without positional encoding) is permutation equivariant.
Why scaling by 1/sqrt(d) matters. In kernel regression, the bandwidth controls the bias-variance tradeoff. Too large a bandwidth (small scaling) gives uniform weights and high bias. Too small a bandwidth (large scaling) gives peaked weights and high variance. The 1/sqrt(d) scaling normalizes the variance of the dot product so that the softmax operates in a balanced regime.
Why multi-head attention helps. Multiple heads correspond to multiple kernel regressors with different learned kernels (different W_Q, W_K projections). Each head can specialize in detecting different types of patterns, like using a mixture of kernels with different bandwidths.
Common Confusions
The softmax kernel is not a Gaussian RBF kernel
The Gaussian RBF kernel is K(q, k) = exp(-||q - k||^2 / (2 sigma^2)). The softmax kernel is exp(q^T k / sqrt(d)). These are related but different. Expanding ||q - k||^2 = ||q||^2 - 2 q^T k + ||k||^2, the Gaussian kernel (with sigma = 1) becomes exp(-||q||^2 / 2) * exp(q^T k) * exp(-||k||^2 / 2). The softmax kernel is the middle factor (up to the 1/sqrt(d) scaling), without the norm-dependent terms. This distinction matters for the random feature analysis.
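The expansion can be verified numerically. A sketch with sigma = 1, no 1/sqrt(d) scaling, and illustrative random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.standard_normal(5), rng.standard_normal(5)

rbf = np.exp(-np.sum((q - k) ** 2) / 2)               # Gaussian RBF, sigma = 1
middle = np.exp(q @ k)                                # softmax-style factor
norm_terms = np.exp(-q @ q / 2) * np.exp(-k @ k / 2)  # norm-dependent factors

# The RBF kernel factors into the softmax-style term times norm terms.
assert np.isclose(rbf, middle * norm_terms)
```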
Linear attention is not just removing the softmax
Removing the softmax and computing (Q K^T) V as Q (K^T V) gives linear attention in O(n d^2) time. But this changes the kernel: without the softmax, the implicit kernel is just the dot product q^T k, which can be negative and is not a proper density kernel. The resulting attention weights are not a valid probability distribution. Random-feature-based linear attention instead approximates the softmax kernel, preserving the probabilistic interpretation while achieving linear cost.
Summary
- Softmax attention = Nadaraya-Watson kernel regression with kernel K(q, k) = exp(q^T k / sqrt(d))
- Attention weights are normalized kernel weights summing to 1
- The 1/sqrt(d) scaling is an inverse bandwidth parameter
- The softmax kernel is positive definite by the Schur product theorem
- Random feature approximation: factor the kernel to avoid the attention matrix
- Performers use positive random features for linear-time attention
- Multi-head attention corresponds to a mixture of kernels with different learned parameters
Exercises
Problem
Show that when all query-key dot products are equal (q_i^T k_j = c for all j), softmax attention reduces to a uniform average of the values. What does this correspond to in the kernel regression interpretation?
Problem
In standard attention, the complexity is O(n^2 d) because we compute the full n x n attention matrix. In the random feature approximation with feature dimension m, the complexity is O(n m d). For what relationship between m and n does the random feature approach become faster? If m is on the order of d log d (a typical requirement for good approximation), for what sequence lengths is the approximation beneficial?
References
Canonical:
- Nadaraya, "On Estimating Regression" (Theory of Probability and its Applications, 1964)
- Watson, "Smooth Regression Analysis" (Sankhya, 1964)
Current:
- Tsai et al., "Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel" (2019)
- Choromanski et al., "Rethinking Attention with Performers" (2021)
- Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (2020)
Next Topics
The natural next steps from attention as kernel regression:
- Neural tangent kernel: another connection between neural networks and kernel methods
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)