Attention Mechanism Theory
10 questions · Difficulty 4-9 · Intermediate (8 intermediate, 2 advanced)
Question 1 / 10 · intermediate (4/10) · conceptual
In the transformer self-attention mechanism, why are the attention scores divided by the square root of the key dimension before applying softmax?
A. The scaling ensures that the attention weights sum to exactly 1.0 after softmax, which would not be guaranteed without normalization by the dimension.
B. The division reduces the computational cost of attention from quadratic to linear in the sequence length by approximating the full attention matrix.
C. Large dot products saturate the softmax function, making gradients vanishingly small. Dividing by sqrt(d_k) keeps the variance of the scores near 1 and softmax in its sensitive region.
D. It compensates for the fact that keys and queries are learned independently, and their dot products would otherwise be biased toward positive values.
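The reasoning behind this question is the variance argument from Vaswani et al. (2017): a dot product of d_k unit-variance components has variance about d_k, so its magnitude grows with sqrt(d_k). Below is a minimal numerical sketch of that argument, assuming queries and keys with i.i.d. standard-normal components; the choice d_k = 512 and all variable names are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: queries and keys with i.i.d. unit-variance
# components, as in the scaled dot-product attention argument.
d_k = 512
q = rng.standard_normal((1000, d_k))
k = rng.standard_normal((1000, d_k))

# Raw dot products: a sum of d_k unit-variance terms, so variance ~ d_k.
raw = (q * k).sum(axis=1)
print(f"raw std    ~ {raw.std():.1f}  (sqrt(d_k) = {np.sqrt(d_k):.1f})")

# Scaled scores: dividing by sqrt(d_k) restores variance ~ 1.
scaled = raw / np.sqrt(d_k)
print(f"scaled std ~ {scaled.std():.2f}")

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

# Softmax over unscaled scores saturates toward a one-hot distribution,
# which is what makes the gradients vanishingly small.
print("softmax(raw[:5])    =", np.round(softmax(raw[:5]), 4))
print("softmax(scaled[:5]) =", np.round(softmax(scaled[:5]), 4))
```

Running the sketch should print a raw standard deviation near sqrt(512) ≈ 22.6 against roughly 1.0 after scaling, and the softmax over the unscaled scores comes out essentially one-hot while the scaled version stays spread out, keeping softmax in its sensitive region.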