Attention Sinks and Retrieval Decay
3 questions · Difficulty 5-6 · Intermediate

Question 1 of 3 · state theorem · intermediate (5/10)
Attention sinks (Xiao et al. 2023) are an empirical phenomenon in transformer language models. What are they?
A. Tokens that receive zero attention from any head, becoming 'dead' in the computation
B. The information bottleneck where all context must flow through a single token's representation
C. A few early tokens (often the first token) receive an outsized share of attention across heads and layers, acting as a 'sink' that stabilizes softmax computations
D. Attention patterns where the last token dominates, making the model 'recency-biased'
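If you want to inspect the empirical attention pattern yourself before answering, the sketch below is one way to do it. It is a minimal illustration, not part of the quiz: it assumes the HuggingFace transformers library, uses "gpt2" as an arbitrary small causal LM, and an arbitrary prompt, then reports how much attention each layer places on the very first token of the sequence.

```python
# Minimal sketch (assumptions: HuggingFace transformers installed, "gpt2" as the
# model, an arbitrary prompt) that measures attention mass on the first token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; any small causal LM exposes the same API
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Attention sinks concentrate probability mass on a few early tokens."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one (batch, num_heads, seq_len, seq_len) tensor per layer
for layer_idx, attn in enumerate(outputs.attentions):
    # Fraction of each query position's attention that lands on token 0,
    # averaged over heads and query positions.
    sink_mass = attn[0, :, :, 0].mean().item()
    print(f"layer {layer_idx:2d}: mean attention to first token = {sink_mass:.3f}")
```

On GPT-2-style models this kind of probe typically shows a disproportionate share of attention landing on position 0 in many layers, which is the pattern Xiao et al. (2023) describe.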