Attention as Kernel Regression
intermediate (6/10) · conceptual
Softmax attention can be interpreted as a Nadaraya–Watson kernel regression estimator with kernel K(q, k) = exp(qᵀk / √d). What does the √d factor in the denominator correspond to in the kernel regression interpretation?
A. It converts the dot-product kernel into an equivalent radial basis function (RBF) kernel
B. It acts as a bandwidth that normalizes dot-product variance and prevents softmax saturation
C. It normalizes the kernel to a valid probability distribution by forcing the weights to sum to one
D. It is a learned temperature parameter that adapts alongside the query and key projections
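The equivalence behind the question can be checked numerically. A minimal sketch (assumed illustration with made-up random data, not part of the quiz): computing single-query softmax attention over values V, then computing the Nadaraya–Watson estimate with kernel K(q, k) = exp(qᵀk / √d), and confirming the two outputs match.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5
q = rng.normal(size=d)        # one query vector
K = rng.normal(size=(n, d))   # n key vectors
V = rng.normal(size=(n, 3))   # n value vectors

# Softmax attention: weights = softmax(q K^T / sqrt(d)), output = weights @ V
logits = K @ q / np.sqrt(d)
weights = np.exp(logits - logits.max())   # max-shift for numerical stability
weights /= weights.sum()
attn_out = weights @ V

# Nadaraya-Watson estimator with kernel K(q, k_i) = exp(q . k_i / sqrt(d)):
# output = sum_i K(q, k_i) v_i / sum_j K(q, k_j)
kernel = np.exp(K @ q / np.sqrt(d))
nw_out = (kernel[:, None] * V).sum(axis=0) / kernel.sum()

print(np.allclose(attn_out, nw_out))
```

Dividing the normalized kernel weights out, the softmax denominator plays exactly the role of the Nadaraya–Watson normalizer, while √d controls the kernel's bandwidth.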