
LLM Construction

Attention Sinks and Retrieval Decay

Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window.

Advanced · Tier 2 · Frontier · ~45 min

Why This Matters

Large language models have context windows of 100K+ tokens, but they do not use all positions equally. Two empirical phenomena constrain what LLMs can actually do with long contexts:

  1. Attention sinks: models allocate disproportionate attention to the first few tokens regardless of their semantic content.
  2. Retrieval decay: the ability to find and use information degrades based on where that information sits in the context.

These are not bugs in specific models. They appear across architectures and training procedures. Understanding them is necessary for designing effective prompts, building retrieval-augmented systems, and understanding the limitations of in-context learning.

Mental Model

Picture a person reading a very long document. They remember the beginning well (primacy), remember the end well (recency), and forget things in the middle. Transformers exhibit a similar pattern, but for a different reason: the attention mechanism, not memory decay, creates the bias.

The first token becomes a "sink" for attention mass that has nowhere else to go. Middle positions get less attention than beginning or end positions. The result is a U-shaped retrieval curve across position.

Attention Sinks

Definition

Attention Sink

An attention sink is a token position (typically the first token or the BOS token) that receives a disproportionately large share of attention weight across many heads and layers, regardless of the token's semantic content.

For a sequence of length $n$, if $\alpha_{i,j}$ denotes the attention weight from query position $i$ to key position $j$, then position $j = 1$ is a sink if:

$\frac{1}{n} \sum_{i=1}^{n} \alpha_{i,1} \gg \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq 1} \alpha_{i,j}$

Why does this happen? Softmax attention must produce a valid probability distribution over keys. When a query has no strong match among the keys, the attention mass must go somewhere. The initial tokens, having been seen by every subsequent query during autoregressive training, become a convenient "dump" for unused attention. The model learns this pattern during pretraining: the first token serves as a no-op attention target.
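The "dump" effect above can be seen directly in the softmax arithmetic. The following minimal numpy sketch shows that when all key logits are roughly equal, attention is uniform, and that a small additive bias toward position 1 (the bias value here is a hypothetical illustration, not a measured model parameter) captures most of the attention mass:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

n = 16
logits = np.zeros(n)            # query matches no key strongly
uniform = softmax(logits)
print(uniform[0])               # 1/16 = 0.0625: attention spreads evenly

logits_sink = logits.copy()
logits_sink[0] += 3.0           # hypothetical learned bias toward position 1
sink = softmax(logits_sink)
print(sink[0])                  # most of the mass now lands on the sink
```

A bias of only a few logits is enough to shift the first position from 6% to well over half of the total attention, which matches the intuition that small learned biases dominate when no key is semantically preferred.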

StreamingLLM

Definition

StreamingLLM

StreamingLLM is an inference strategy for processing sequences longer than the training context window. Instead of using a sliding window that evicts the oldest KV entries, StreamingLLM keeps the KV entries for a small number of initial "sink" tokens (typically 1 to 4) plus the most recent $w$ tokens.

The KV cache at step $t$ contains entries for positions $\{1, \ldots, k\} \cup \{t - w + 1, \ldots, t\}$, where $k$ is the number of sink tokens retained.

A standard sliding-window KV cache (which evicts the oldest tokens) causes perplexity to spike as soon as the initial tokens leave the window. StreamingLLM avoids this by always retaining the sink tokens.
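The cache policy described above can be sketched as a short function that returns which positions survive at step $t$. This is an illustrative sketch of the retention rule, not the actual StreamingLLM implementation:

```python
def streaming_cache_positions(t, k=4, w=512):
    """Positions (1-indexed) kept in the KV cache at step t:
    the first k sink tokens plus the most recent w tokens."""
    sinks = list(range(1, min(k, t) + 1))
    # recent window starts after the sinks so positions are not duplicated
    recent = list(range(max(t - w + 1, k + 1), t + 1))
    return sinks + recent

# Early in the stream, everything fits; later the cache is bounded at k + w.
print(len(streaming_cache_positions(10)))    # 10: all tokens still cached
print(len(streaming_cache_positions(1000)))  # 516: 4 sinks + 512 recent
```

However long the stream grows, the cache never exceeds $k + w$ entries, which is the property that makes the memory cost constant.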

Main Theorems

Proposition

Attention Sink Concentration

Statement

Consider a query $q$ attending to keys $k_1, \ldots, k_n$ under causal masking. If $q^T k_j \approx c$ for all $j$ (no key is strongly preferred), then the attention distribution is approximately uniform. However, during training, the model learns biases such that $q^T k_1$ is consistently larger than $q^T k_j$ for $j > 1$ when the query lacks a clear semantic match. This concentrates attention on position 1.

Empirically, for GPT-style models, the first token receives 5 to 20 times the average attention weight in layers 2 through 6.

Intuition

Softmax forces a valid distribution. The model needs a "default" key to attend to when nothing is relevant. The first token is visible to every query (it is never masked), so it becomes the universal default during training. The model learns key/query biases that make the first position attractive.

Proof Sketch

This is primarily an empirical observation. The mechanism can be understood through the softmax temperature: when all logits are similar, small additive biases in $q^T k_1$ create large effects on the probability of position 1. The model parameters encode these biases through training.

Why It Matters

Attention sinks explain why removing the first token from the KV cache during streaming inference causes perplexity to blow up. They also suggest that attention weight magnitude does not always indicate semantic relevance: high attention to the first token is structural, not semantic.

Failure Mode

Models trained with explicit "sink tokens" (e.g., a dedicated padding token at position 0) can mitigate this effect. Architectures using linear attention or non-softmax normalization may not exhibit attention sinks because they do not force a probability distribution.

Proposition

StreamingLLM Perplexity Stability

Statement

Let $\text{PPL}_{\text{full}}(t)$ be the perplexity at position $t$ using the full KV cache (all prior tokens). Let $\text{PPL}_{\text{stream}}(t)$ be the perplexity using only the sink tokens and the most recent $w$ tokens. Empirically:

$\text{PPL}_{\text{stream}}(t) \approx \text{PPL}_{\text{full}}(t) + \epsilon$

where $\epsilon$ is small (typically under 0.5 perplexity points) for $w \geq 256$ and $k \geq 2$ sink tokens. Without the sink tokens, perplexity diverges as $t$ exceeds the window size.

Intuition

The sink tokens stabilize the attention distribution. The recent tokens provide local context for prediction. Together they approximate the full cache well enough for language modeling, because most prediction depends on recent context and the model's parametric knowledge, not on tokens thousands of positions back.

Proof Sketch

Empirical result from Xiao et al. (2023). Tested across Llama-2, MPT, Falcon, and Pythia model families. Perplexity was measured on sequences of up to 4 million tokens.

Why It Matters

StreamingLLM enables deploying LLMs on infinite-length streams (e.g., ongoing conversations, log monitoring) with bounded memory. The KV cache stays at $O(k + w)$ instead of growing linearly with sequence length.
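The memory difference between the two regimes is easy to compute. The sketch below uses hypothetical dimensions for a 7B-class model (32 layers, 32 heads, head dimension 128, fp16); the exact numbers vary by model, but the linear-vs-constant contrast does not:

```python
def kv_cache_bytes(num_tokens, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    """Rough KV-cache size in bytes. The factor of 2 covers keys and values.
    Dimensions are hypothetical 7B-class values, not any specific model's."""
    return 2 * layers * heads * head_dim * dtype_bytes * num_tokens

full = kv_cache_bytes(1_000_000)        # cache every token of a 1M-token stream
stream = kv_cache_bytes(4 + 512)        # StreamingLLM: k = 4 sinks + w = 512 recent

print(f"full cache:      {full / 2**30:.1f} GiB")   # hundreds of GiB
print(f"streaming cache: {stream / 2**20:.1f} MiB") # a few hundred MiB
```

At roughly 0.5 MiB per token under these assumptions, the full cache is infeasible on any single accelerator, while the streaming cache fits comfortably.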

Failure Mode

StreamingLLM does not actually use the information in evicted tokens. Tasks requiring retrieval from the distant past (e.g., "what was the third sentence of this conversation?") will fail. This is stable inference, not genuine long-range understanding.

Lost-in-the-Middle Effect

Liu et al. (2023) demonstrated a consistent pattern: when relevant information is placed at different positions in a long context, retrieval accuracy follows a U-shaped curve. Models retrieve best from the beginning and end of the context, and worst from the middle.

For a context of $n$ documents where one contains the answer:

  • Placing the answer in position 1: highest accuracy
  • Placing the answer in position $n$: second-highest accuracy
  • Placing the answer near position $n/2$: lowest accuracy

This pattern held across model sizes (4B to 70B) and context lengths (4K to 128K tokens). It is not a small effect: accuracy can drop by 20 to 40 percentage points between optimal and worst positions.
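Experiments of this kind sweep the answer document across every position and measure accuracy at each slot. A minimal sketch of that harness, assuming a generic document list and question (the prompt template here is illustrative, not the one used by Liu et al.):

```python
def build_position_sweep(answer_doc, distractor_docs, question):
    """Yield (position, prompt) pairs that place the answer document at each
    slot among the distractors -- the setup used to trace the U-shaped curve."""
    n = len(distractor_docs) + 1
    for pos in range(n):
        docs = distractor_docs[:pos] + [answer_doc] + distractor_docs[pos:]
        body = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
        yield pos + 1, f"{body}\n\nQuestion: {question}\nAnswer:"

# One prompt per candidate position; score each against the model under test.
for pos, prompt in build_position_sweep("Paris is the capital of France.",
                                        ["Berlin is in Germany."] * 3,
                                        "What is the capital of France?"):
    print(pos, len(prompt))
```

Plotting accuracy against `pos` for a real model reproduces the U-shape: high at position 1, high at position $n$, and a dip in the middle.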

Common Confusions

Watch Out

Attention sinks are not a training bug

Attention sinks emerge from the interaction between softmax normalization and autoregressive masking. They appear in all standard transformer LLMs. Training longer or on more data does not eliminate them. They are a structural property of softmax attention with causal masks.

Watch Out

Long context window does not mean long context usage

A model with a 128K token context window can process 128K tokens. That does not mean it uses all 128K tokens effectively. The lost-in-the-middle effect shows that information in the middle of the context is partially ignored. Effective context length is shorter than the maximum context length.

Watch Out

StreamingLLM does not extend context understanding

StreamingLLM enables stable inference on arbitrarily long streams, but it only uses $k + w$ tokens of context at any point. It solves the inference stability problem, not the long-range retrieval problem.

Exercises

ExerciseCore

Problem

A transformer has a context window of 4096 tokens and uses a StreamingLLM cache with $k = 4$ sink tokens and $w = 512$ recent tokens. What is the total KV cache size? If the model processes a 1 million token stream, how much memory does this save compared to caching all tokens?

ExerciseAdvanced

Problem

You are building a retrieval-augmented generation system. You retrieve 20 relevant documents and concatenate them into the context. Based on the lost-in-the-middle effect, how should you order these documents to maximize the probability that the model uses the most relevant ones?

References

Canonical:

  • Xiao et al., Efficient Streaming Language Models with Attention Sinks (2023)
  • Liu et al., Lost in the Middle: How Language Models Use Long Contexts (2023)

Current:

  • Sun et al., A Length-Extrapolatable Transformer (2023)
  • Press et al., Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (2022)

Next Topics

  • Context engineering: practical strategies for structuring prompts given these attention biases

Last reviewed: April 2026
