
LLM Construction

Attention Sinks and Retrieval Decay

Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window.

Advanced · Tier 2 · Frontier · ~45 min

Why This Matters

Large language models have context windows of 100K+ tokens, but they do not use all positions equally. Two empirical phenomena constrain what LLMs can actually do with long contexts:

  1. Attention sinks: models allocate disproportionate attention to the first few tokens regardless of their semantic content.
  2. Retrieval decay: the ability to find and use information degrades based on where that information sits in the context.

These are not bugs in specific models. They appear across architectures and training procedures. Understanding them is necessary for designing effective prompts, building retrieval-augmented systems, and understanding the limitations of in-context learning.

Mental Model

Picture a person reading a very long document. They remember the beginning well (primacy), remember the end well (recency), and forget things in the middle. Transformers exhibit a similar pattern, but for a different reason: the attention mechanism, not memory decay, creates the bias.

The first token becomes a "sink" for attention mass that has nowhere else to go. Middle positions get less attention than beginning or end positions. The result is a U-shaped retrieval curve across position.

Attention Sinks

Definition

Attention Sink

An attention sink is a token position (typically the first token or the BOS token) that receives a disproportionately large share of attention weight across many heads and layers, regardless of the token's semantic content.

For a sequence of length $n$, if $\alpha_{i,j}$ denotes the attention weight from query position $i$ to key position $j$, then position $j = 1$ is a sink if:

$\frac{1}{n} \sum_{i=1}^{n} \alpha_{i,1} \gg \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq 1} \alpha_{i,j}$

Why does this happen? Softmax attention must produce a valid probability distribution over keys. When a query has no strong match among the keys, the attention mass must go somewhere. The initial tokens, having been seen by every subsequent query during autoregressive training, become a convenient "dump" for unused attention. The model learns this pattern during pretraining: the first token serves as a no-op attention target.
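The "dump" effect above can be seen directly in the softmax arithmetic. The following minimal numpy sketch shows that when all key logits are roughly equal, attention is uniform, and that a small additive bias toward position 1 (the bias value here is a hypothetical illustration, not a measured model parameter) captures most of the attention mass:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

n = 16
logits = np.zeros(n)            # query matches no key strongly
uniform = softmax(logits)
print(uniform[0])               # 1/16 = 0.0625: attention spreads evenly

logits_sink = logits.copy()
logits_sink[0] += 3.0           # hypothetical learned bias toward position 1
sink = softmax(logits_sink)
print(sink[0])                  # most of the mass now lands on the sink
```

A bias of only a few logits is enough to shift the first position from 6% to well over half of the total attention, which matches the intuition that small learned biases dominate when no key is semantically preferred.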

StreamingLLM

Definition

StreamingLLM

StreamingLLM is an inference strategy for processing sequences longer than the training context window. Instead of using a sliding window that evicts the oldest KV entries, StreamingLLM keeps the KV entries for a small number of initial "sink" tokens (typically 1 to 4) plus the most recent $w$ tokens.

The KV cache at step $t$ contains entries for positions $\{1, \ldots, k\} \cup \{t - w + 1, \ldots, t\}$, where $k$ is the number of sink tokens retained.

A standard sliding-window KV cache (which evicts the oldest tokens) causes perplexity to spike as soon as the initial tokens leave the window. StreamingLLM avoids this by always retaining the sink tokens.
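The cache policy described above can be sketched as a short function that returns which positions survive at step $t$. This is an illustrative sketch of the retention rule, not the actual StreamingLLM implementation:

```python
def streaming_cache_positions(t, k=4, w=512):
    """Positions (1-indexed) kept in the KV cache at step t:
    the first k sink tokens plus the most recent w tokens."""
    sinks = list(range(1, min(k, t) + 1))
    # recent window starts after the sinks so positions are not duplicated
    recent = list(range(max(t - w + 1, k + 1), t + 1))
    return sinks + recent

# Early in the stream, everything fits; later the cache is bounded at k + w.
print(len(streaming_cache_positions(10)))    # 10: all tokens still cached
print(len(streaming_cache_positions(1000)))  # 516: 4 sinks + 512 recent
```

However long the stream grows, the cache never exceeds $k + w$ entries, which is the property that makes the memory cost constant.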

Main Theorems

Proposition

Attention Sink Concentration

Statement

Consider a query $q$ attending to keys $k_1, \ldots, k_n$ under causal masking. If $q^T k_j \approx c$ for all $j$ (no key is strongly preferred), then the attention distribution is approximately uniform. However, during training, the model learns biases such that $q^T k_1$ is consistently larger than $q^T k_j$ for $j > 1$ when the query lacks a clear semantic match. This concentrates attention on position 1.

Empirically, for GPT-style models, the first token receives 5 to 20 times the average attention weight in layers 2 through 6.

Intuition

Softmax forces a valid distribution. The model needs a "default" key to attend to when nothing is relevant. The first token is visible to every query (it is never masked), so it becomes the universal default during training. The model learns key/query biases that make the first position attractive.

Proof Sketch

This is primarily an empirical observation. The mechanism can be understood through the softmax temperature: when all logits are similar, small additive biases in $q^T k_1$ create large effects on the probability of position 1. The model parameters encode these biases through training.

Why It Matters

Attention sinks explain why removing the first token from the KV cache during streaming inference causes perplexity to blow up. They also suggest that attention weight magnitude does not always indicate semantic relevance: high attention to the first token is structural, not semantic.

Failure Mode

Models trained with explicit "sink tokens" (e.g., a dedicated padding token at position 0) can mitigate this effect. Architectures using linear attention or non-softmax normalization may not exhibit attention sinks because they do not force a probability distribution.

Proposition

StreamingLLM Perplexity Stability

Statement

Let $\text{PPL}_{\text{full}}(t)$ be the perplexity at position $t$ using the full KV cache (all prior tokens). Let $\text{PPL}_{\text{stream}}(t)$ be the perplexity using only the sink tokens and the most recent $w$ tokens. Empirically:

$\text{PPL}_{\text{stream}}(t) \approx \text{PPL}_{\text{full}}(t) + \epsilon$

where $\epsilon$ is small (typically under 0.5 perplexity points) for $w \geq 256$ and $k \geq 2$ sink tokens. Without the sink tokens, perplexity diverges as $t$ exceeds the window size.

Intuition

The sink tokens stabilize the attention distribution. The recent tokens provide local context for prediction. Together they approximate the full cache well enough for language modeling, because most prediction depends on recent context and the model's parametric knowledge, not on tokens thousands of positions back.

Proof Sketch

Empirical result from Xiao et al. (2023). Tested across Llama-2, MPT, Falcon, and Pythia model families. Perplexity was measured on sequences of up to 4 million tokens.

Why It Matters

StreamingLLM enables deploying LLMs on infinite-length streams (e.g., ongoing conversations, log monitoring) with bounded memory. The KV cache stays at $O(k + w)$ instead of growing linearly with sequence length.
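The memory difference between the two regimes is easy to compute. The sketch below uses hypothetical dimensions for a 7B-class model (32 layers, 32 heads, head dimension 128, fp16); the exact numbers vary by model, but the linear-vs-constant contrast does not:

```python
def kv_cache_bytes(num_tokens, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    """Rough KV-cache size in bytes. The factor of 2 covers keys and values.
    Dimensions are hypothetical 7B-class values, not any specific model's."""
    return 2 * layers * heads * head_dim * dtype_bytes * num_tokens

full = kv_cache_bytes(1_000_000)        # cache every token of a 1M-token stream
stream = kv_cache_bytes(4 + 512)        # StreamingLLM: k = 4 sinks + w = 512 recent

print(f"full cache:      {full / 2**30:.1f} GiB")   # hundreds of GiB
print(f"streaming cache: {stream / 2**20:.1f} MiB") # a few hundred MiB
```

At roughly 0.5 MiB per token under these assumptions, the full cache is infeasible on any single accelerator, while the streaming cache fits comfortably.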

Failure Mode

StreamingLLM does not actually use the information in evicted tokens. Tasks requiring retrieval from the distant past (e.g., "what was the third sentence of this conversation?") will fail. This is stable inference, not genuine long-range understanding.

Lost-in-the-Middle Effect

Liu et al. (2023) demonstrated a consistent pattern: when relevant information is placed at different positions in a long context, retrieval accuracy follows a U-shaped curve. Models retrieve best from the beginning and end of the context, and worst from the middle.

For a context of $n$ documents where one contains the answer:

  • Placing the answer in position 1: highest accuracy
  • Placing the answer in position $n$: second-highest accuracy
  • Placing the answer near position $n/2$: lowest accuracy

This pattern held across model sizes (4B to 70B) and context lengths (4K to 128K tokens). It is not a small effect: accuracy can drop by 20 to 40 percentage points between optimal and worst positions.
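Experiments of this kind sweep the answer document across every position and measure accuracy at each slot. A minimal sketch of that harness, assuming a generic document list and question (the prompt template here is illustrative, not the one used by Liu et al.):

```python
def build_position_sweep(answer_doc, distractor_docs, question):
    """Yield (position, prompt) pairs that place the answer document at each
    slot among the distractors -- the setup used to trace the U-shaped curve."""
    n = len(distractor_docs) + 1
    for pos in range(n):
        docs = distractor_docs[:pos] + [answer_doc] + distractor_docs[pos:]
        body = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
        yield pos + 1, f"{body}\n\nQuestion: {question}\nAnswer:"

# One prompt per candidate position; score each against the model under test.
for pos, prompt in build_position_sweep("Paris is the capital of France.",
                                        ["Berlin is in Germany."] * 3,
                                        "What is the capital of France?"):
    print(pos, len(prompt))
```

Plotting accuracy against `pos` for a real model reproduces the U-shape: high at position 1, high at position $n$, and a dip in the middle.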

Common Confusions

Watch Out

Attention sinks are not a training bug

Attention sinks emerge from the interaction between softmax normalization and autoregressive masking. They appear in all standard transformer LLMs. Training longer or on more data does not eliminate them. They are a structural property of softmax attention with causal masks.

Watch Out

Long context window does not mean long context usage

A model with a 128K token context window can process 128K tokens. That does not mean it uses all 128K tokens effectively. The lost-in-the-middle effect shows that information in the middle of the context is partially ignored. Effective context length is shorter than the maximum context length.

Watch Out

StreamingLLM does not extend context understanding

StreamingLLM enables stable inference on arbitrarily long streams, but it only uses $k + w$ tokens of context at any point. It solves the inference stability problem, not the long-range retrieval problem.

Exercises

ExerciseCore

Problem

A transformer has a context window of 4096 tokens and uses a StreamingLLM cache with $k = 4$ sink tokens and $w = 512$ recent tokens. What is the total KV cache size? If the model processes a 1 million token stream, how much memory does this save compared to caching all tokens?

ExerciseAdvanced

Problem

You are building a retrieval-augmented generation system. You retrieve 20 relevant documents and concatenate them into the context. Based on the lost-in-the-middle effect, how should you order these documents to maximize the probability that the model uses the most relevant ones?

References

Canonical:

  • Xiao et al., Efficient Streaming Language Models with Attention Sinks (2023)
  • Liu et al., Lost in the Middle: How Language Models Use Long Contexts (2023)

Current:

  • Sun et al., A Length-Extrapolatable Transformer (2023)
  • Press et al., Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (2022)

Next Topics

  • Context engineering: practical strategies for structuring prompts given these attention biases

Last reviewed: April 2026
