LLM Construction
Positional Encoding
Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.
Why This Matters
Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, a transformer cannot distinguish "the dog bit the man" from "the man bit the dog": both produce structurally identical attention patterns. Positional encoding is how transformers learn about word order, and the choice of encoding scheme has profound effects on context length generalization, training stability, and model quality.
This topic covers the mathematical theory behind the major positional encoding schemes, from the original sinusoidal encoding to RoPE (which is now the de facto standard) and ALiBi.
Mental Model
Imagine each token in a sequence wears a jersey with its position number. The attention mechanism needs to see these jerseys to know who came first, second, third. There are three structurally different ways to assign jerseys:
- Absolute position (sinusoidal/learned): Add a position-specific vector directly to each token embedding. Position 5 always gets the same vector, regardless of context.
- Relative position (RoPE): Encode positions so that the attention score between tokens depends only on their relative distance, not their absolute positions.
- Attention bias (ALiBi): Do not modify embeddings at all. Instead, add a distance-dependent penalty directly to the attention logits.
Why Position Information Is Needed
Permutation Equivariance of Attention
For a permutation matrix $P$, self-attention (without positional encoding) satisfies:

$\mathrm{Attention}(PX) = P \cdot \mathrm{Attention}(X)$
This means that permuting the input tokens permutes the output tokens in the same way. The attention operation itself encodes no information about which token came first. It treats position 1 and position 1000 identically.
Without positional encoding, a transformer is a function on multisets, not sequences. It could learn that "cat" and "sat" appear together but not that "cat" precedes "sat." For language (where word order is essential for meaning) and for autoregressive generation (where the model must predict the next token specifically), position information is mandatory.
Sinusoidal Positional Encoding
Vaswani et al. (2017) proposed adding fixed sinusoidal vectors to the input embeddings. For position $pos$ and dimension index $i$:

$PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$
$PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$

The input to the transformer becomes $X + PE$, where each row of $PE$ is the positional encoding for that position.
Why sinusoids? Each pair of dimensions $(2i, 2i+1)$ oscillates at a different frequency $\omega_i = 10000^{-2i/d_{\text{model}}}$. Low-frequency dimensions change slowly across positions (useful for representing coarse position), while high-frequency dimensions change rapidly (useful for fine-grained position distinctions). Together, the dimensions form a unique "fingerprint" for each position.
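The formula above is straightforward to implement; this is a minimal NumPy sketch (function name and shapes are my own, not from the source):

```python
import numpy as np

def sinusoidal_pe(n_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (Vaswani et al. 2017).

    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    positions = np.arange(n_positions)[:, None]               # (n, 1)
    freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)  # (d/2,) frequencies
    angles = positions * freqs                                # (n, d/2)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dims: sine
    pe[:, 1::2] = np.cos(angles)  # odd dims: cosine
    return pe

pe = sinusoidal_pe(128, 64)
# Each (sin, cos) pair lies on the unit circle, so every 2D band has norm 1.
```

Note that each frequency band is a point on the unit circle, which is exactly the geometric picture the rotation argument below relies on.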
Sinusoidal Encoding Enables Linear Position Offsets
Statement
For any fixed offset $k$, the positional encoding at position $pos + k$ is a linear function of the encoding at position $pos$. Specifically, for each frequency band $i$:

$\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}$

where $\omega_i = 10000^{-2i/d_{\text{model}}}$.
The matrix is a rotation matrix $R(\omega_i k)$. So shifting position by $k$ is equivalent to rotating each 2D frequency component by angle $\omega_i k$.
Intuition
The sinusoidal encoding represents position as a point on a collection of circles (one per frequency band). Moving forward by $k$ positions rotates the point on each circle by a frequency-dependent angle $\omega_i k$. Since rotation is a linear operation, the model can learn to detect relative position offsets using linear projections; no nonlinear computation is needed.
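The linear-offset property can be verified numerically. This standalone sketch (values chosen arbitrarily) checks that a fixed rotation maps the encoding at one position to the encoding $k$ positions later, band by band:

```python
import numpy as np

d_model, pos, k = 8, 7, 5
freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)

for w in freqs:
    # One frequency band as a 2D point (sin, cos).
    enc = lambda p: np.array([np.sin(w * p), np.cos(w * p)])
    # Rotation by angle w*k should map position `pos` to `pos + k`.
    R = np.array([[np.cos(w * k),  np.sin(w * k)],
                  [-np.sin(w * k), np.cos(w * k)]])
    assert np.allclose(R @ enc(pos), enc(pos + k))
```

The rotation matrix depends only on the offset $k$, not on $pos$, which is why a linear map suffices.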
Why It Matters
This linear relationship means that a linear attention head can, in principle, learn to attend based on relative position. The rotation structure of the sinusoidal encoding is the direct precursor to RoPE, which takes this idea further by applying rotations to queries and keys rather than adding vectors to embeddings.
Failure Mode
In practice, the additive sinusoidal encoding works poorly for long sequences. Because position information is mixed into the token embedding by addition, the model must disentangle content and position information. For long sequences, the position signal becomes a small perturbation on the content signal, making it hard for the model to use. Modern LLMs have abandoned sinusoidal encoding in favor of RoPE.
Learned Absolute Positions
Learned Positional Embedding
Instead of fixed sinusoidal vectors, learn a position embedding table $P \in \mathbb{R}^{L_{\max} \times d_{\text{model}}}$, where $L_{\max}$ is the maximum sequence length. The input becomes:

$X + P_{[:n]}$

where $P_{[:n]}$ selects the first $n$ rows for a sequence of length $n$.
Learned positions are more flexible than sinusoidal (the model can learn any position pattern) but have a critical limitation: they cannot extrapolate beyond $L_{\max}$. Position 10001 has no embedding if the model was trained with $L_{\max} = 10000$. This makes context length extension impossible without retraining.
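The hard length limit falls directly out of the table lookup. A minimal sketch (random table standing in for learned parameters; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
L_max, d_model = 16, 8
# In a real model this table is a trained parameter; random here for illustration.
pos_table = rng.normal(size=(L_max, d_model))

def add_positions(x: np.ndarray) -> np.ndarray:
    """Add learned absolute position embeddings to token embeddings x (n, d)."""
    n = x.shape[0]
    if n > L_max:
        # No row exists for positions >= L_max: extrapolation is impossible.
        raise ValueError(f"sequence length {n} exceeds trained maximum {L_max}")
    return x + pos_table[:n]

out = add_positions(rng.normal(size=(10, d_model)))  # fine: 10 <= 16
```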
GPT-2 and the original GPT-3 used learned absolute positions. BERT also uses learned positions. Modern LLMs have moved away from this approach.
Rotary Position Embedding (RoPE)
RoPE (Su et al., 2021) encodes position by rotating the query and key vectors rather than adding to the input embeddings.
For a query vector $q$ at position $m$ and key vector $k$ at position $n$, RoPE applies position-dependent rotations:

$q_m = R_{\Theta,m}\, q, \qquad k_n = R_{\Theta,n}\, k$

where $R_{\Theta,m}$ is a block-diagonal rotation matrix. In each 2D block $i$:

$R(m\theta_i) = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$

with frequencies $\theta_i = 10000^{-2i/d}$ (the same base frequencies as sinusoidal encoding).
The full rotation matrix $R_{\Theta,m}$ is block-diagonal with $d/2$ such blocks.
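Because the matrix is block-diagonal, applying it never requires materializing a full $d \times d$ matrix: each (even, odd) pair of dimensions is rotated independently. A minimal sketch of this rotation (not any particular library's implementation):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to a single query/key vector x of even dimension d.

    Each pair (x[2i], x[2i+1]) is rotated by angle pos * theta_i,
    with theta_i = base^(-2i/d).
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin  # 2D rotation, even component
    out[1::2] = x[0::2] * sin + x[1::2] * cos  # 2D rotation, odd component
    return out
```

Since each block is a pure rotation, `rope` preserves the norm of the vector: position information lives entirely in the phase, not the amplitude.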
RoPE Encodes Relative Position in Attention Scores
Statement
The attention logit between position $m$ (query) and position $n$ (key) under RoPE depends only on the relative distance $m - n$:

$q_m^\top k_n = (R_{\Theta,m}\, q)^\top (R_{\Theta,n}\, k) = q^\top R_{\Theta,n-m}\, k$

This follows from the orthogonality of rotation matrices: $R_{\Theta,m}^\top R_{\Theta,n} = R_{\Theta,n-m}$.
Explicitly, the attention logit decomposes into $d/2$ terms, one per frequency band:

$q_m^\top k_n = \sum_{i=0}^{d/2-1} \Big[ (q_{2i} k_{2i} + q_{2i+1} k_{2i+1}) \cos\big((m-n)\theta_i\big) + (q_{2i} k_{2i+1} - q_{2i+1} k_{2i}) \sin\big((m-n)\theta_i\big) \Big]$

Each term depends only on the relative position $m - n$ and the frequency $\theta_i$.
Intuition
Rotating the query by angle $m\theta$ and the key by angle $n\theta$, then taking their dot product, gives a result that depends only on the angle difference $(m - n)\theta$. This is exactly how relative position should work: the attention between tokens 5 and 3 should be the same as between tokens 105 and 103, because the relative offset is the same.
Each frequency band captures a different scale of relative position: high-frequency bands distinguish nearby positions, low-frequency bands distinguish distant positions. Together, the frequency bands provide a rich encoding of relative distance.
Proof Sketch
The rotation matrix $R_{\Theta,m}$ is orthogonal ($R_{\Theta,m}^\top R_{\Theta,m} = I$) and satisfies the group property $R_{\Theta,m}^\top R_{\Theta,n} = R_{\Theta,n-m}$.
Therefore:

$q_m^\top k_n = (R_{\Theta,m}\, q)^\top (R_{\Theta,n}\, k) = q^\top R_{\Theta,m}^\top R_{\Theta,n}\, k = q^\top R_{\Theta,n-m}\, k$

This depends on $m$ and $n$ only through the difference $n - m$.
Expanding the 2D rotation for block $i$: $R(m\theta_i)$ rotates the 2D sub-vector $(q_{2i}, q_{2i+1})$ by angle $m\theta_i$. The dot product of the rotated sub-vectors gives the cosine and sine terms in the explicit formula.
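The relative-position identity is easy to confirm numerically. This standalone sketch (positions and seed chosen arbitrarily) checks that two query/key pairs with the same offset produce the same logit:

```python
import numpy as np

def rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """RoPE-style rotation of each (even, odd) pair by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same relative offset (2) at different absolute positions -> same logit.
s1 = rotate(q, 5) @ rotate(k, 3)
s2 = rotate(q, 105) @ rotate(k, 103)
assert np.isclose(s1, s2)
```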
Why It Matters
RoPE is used in Llama, Mistral, Qwen, and virtually all modern open-source LLMs. GPT-NeoX-20B (Black et al. 2022, arXiv 2204.06745) was the first large-scale decoder-only LM to use RoPE, predating Llama. Its success comes from three properties: (1) it naturally encodes relative position without any additive modification to the embeddings, (2) base RoPE adds zero trainable parameters (variants like YaRN and learned RoPE do add small numbers of trainable parameters, so this property is not universal across the RoPE family), and (3) it composes cleanly with the attention mechanism because the position information lives in the phase of the query-key dot product rather than in the amplitude of the embeddings.
Production models often increase the RoPE base to improve long-context behavior: Llama 3 (Dubey et al. 2024) uses base $500{,}000$, and Mistral-7B raised its base from $10{,}000$ in v0.1 to $10^6$ in later releases. A larger base stretches the low-frequency wavelengths so that longer absolute positions still correspond to angles seen in training.
Failure Mode
RoPE's context length extrapolation is limited. The base frequencies span a wide range: $\theta_0 = 1$ is the highest frequency (shortest period, $2\pi \approx 6.3$ positions), and $\theta_{d/2-1} \approx 1/10000$ is the lowest frequency (longest period $\approx 2\pi \cdot 10000$, which for base $10{,}000$ is on the order of $10^4$ positions). The real failure at test-time lengths beyond training is not literal wrap-around of angles past $2\pi$, but distribution shift: the model encounters angle combinations that never appeared during training, and attention heads trained on one angular regime behave unpredictably on another. Techniques like NTK-aware scaling, YaRN (Peng 2023, arXiv 2309.00071), and dynamic NTK interpolation modify the frequency base to keep test-time angles inside the training distribution without full retraining.
ALiBi: Attention with Linear Biases
ALiBi Provides Position via Attention Bias
Statement
ALiBi (Press et al., 2022) does not modify the input embeddings or the Q/K projections at all. Instead, it adds a position-dependent bias directly to the attention logits:

$\mathrm{softmax}\!\Big( \frac{q_i^\top k_j}{\sqrt{d}} - m_h \cdot |i - j| \Big)$

where $m_h$ is a head-specific slope and $|i - j|$ is the absolute distance between positions $i$ (query) and $j$ (key). Press et al. 2022 set the slopes as a geometric sequence starting at $2^{-8/H}$ with common ratio $2^{-8/H}$: for head $h \in \{1, \dots, H\}$, the slope is $m_h = 2^{-8h/H}$.
The linear bias penalizes distant tokens, making attention prefer nearby positions. Different heads use different slopes, so some heads attend locally and others more globally.
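The slope schedule and bias matrix are simple to construct; a minimal NumPy sketch (function names are my own):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric slope sequence m_h = 2^(-8h/H), h = 1..H (Press et al. 2022)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Per-head bias added to attention logits: -m_h * (i - j) for j <= i.

    The j > i entries are positive here but are hidden by the causal
    mask in practice, so only the lower triangle matters.
    """
    slopes = alibi_slopes(n_heads)             # (H,)
    i = np.arange(seq_len)[:, None]            # query positions
    j = np.arange(seq_len)[None, :]            # key positions
    return -slopes[:, None, None] * (i - j)    # (H, L, L)

bias = alibi_bias(8, 4)
# For 8 heads the slopes are 1/2, 1/4, ..., 1/256: steep heads attend
# locally, shallow heads attend more globally.
```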
Intuition
ALiBi says: "All else being equal, prefer to attend to nearby tokens." The linear penalty acts as a soft window. Tokens that are far away must have very high content similarity (large $q^\top k$) to overcome the distance penalty. This is a reasonable inductive bias: in natural language, nearby words are usually more relevant than distant ones.
Why It Matters
ALiBi was designed specifically for context length extrapolation. Because the position information is a simple additive bias rather than a modification to the embeddings, the scheme generalizes smoothly to positions never seen during training: position 20000 just gets a proportionally larger penalty. In the original perplexity experiments of Press et al. 2022, ALiBi extrapolated better than sinusoidal and learned positional encodings. Kazemnejad et al. 2023 (arXiv 2305.19466) later showed that the picture is task-dependent: on some length-generalization benchmarks (algorithmic tasks, certain reasoning tasks), ALiBi does not generalize better than other schemes, and in some cases no positional encoding at all outperforms ALiBi. Treat the "extrapolates" claim as "extrapolates on perplexity for natural language, sometimes fails elsewhere."
Failure Mode
ALiBi assumes that attention should decay with distance, which is not always true. Tasks requiring long-range dependencies (e.g., matching an opening bracket with a closing bracket 10000 tokens later) are penalized by the linear bias. RoPE does not have this bias: tokens at any distance can attend to each other based purely on content similarity. This is one reason RoPE has become more popular than ALiBi for general-purpose LLMs.
Context Length Extrapolation
A critical practical question: can a model trained on sequences of length $L$ perform well on sequences of length $L' > L$?
- Learned positions: No extrapolation. Positions beyond $L_{\max}$ have no embedding.
- Sinusoidal: Theoretically extrapolates (the formula works for any position) but performs poorly in practice.
- ALiBi: Good extrapolation for moderate extensions (2-4x) because the linear bias generalizes smoothly.
- RoPE (vanilla, i.e., the original formulation without frequency scaling): Degrades beyond training length due to unseen rotation angles.
- RoPE + frequency scaling (YaRN, NTK-aware): Extends RoPE by modifying the frequency base. NTK-aware scaling (bloc97, 2023) replaces the base $b = 10000$ with $b' = b \cdot s^{d/(d-2)}$ for context-length scaling factor $s$, so $\theta_i' = \big(b \cdot s^{d/(d-2)}\big)^{-2i/d}$. The exponent is chosen so that the highest-frequency band (smallest $i$) is nearly unchanged while lower-frequency bands stretch roughly by the target factor $s$, keeping short-range position resolution intact while extending long-range reach. Naively setting $\theta_i' = \theta_i / s$ compresses high-frequency bands too aggressively and degrades local attention. YaRN (Peng et al. 2023) refines NTK-aware scaling further by splitting the frequency spectrum and scaling low-frequency bands while leaving high-frequency bands alone.
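The per-band effect of NTK-aware base rescaling can be inspected directly. This sketch (my own illustration, using the $b' = b \cdot s^{d/(d-2)}$ formula above) computes the wavelength stretch factor for each band:

```python
import numpy as np

def rope_freqs(d: int, base: float) -> np.ndarray:
    """RoPE frequencies theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return base ** (-np.arange(0, d, 2) / d)

d, base, s = 128, 10000.0, 8.0
scaled_base = base * s ** (d / (d - 2))  # NTK-aware rescaled base

old = rope_freqs(d, base)
new = rope_freqs(d, scaled_base)
ratio = old / new  # per-band wavelength stretch = s^(2i/(d-2))

# Band i=0 (highest frequency) is unchanged (ratio 1): local resolution kept.
# Band i=d/2-1 (lowest frequency) stretches by exactly s: long-range reach.
```

Contrast with the naive $\theta_i' = \theta_i / s$, which would give a uniform stretch of $s$ for every band, including the high-frequency ones that encode nearby-token distinctions.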
Modern long-context models (128K+ tokens) typically use RoPE with some form of frequency scaling, sometimes combined with continued pretraining on longer sequences.
Common Confusions
RoPE modifies Q and K, not the input embeddings
Sinusoidal and learned positional encodings are added to the input before any projections. RoPE is applied after the Q and K projections but before the dot product. This is a crucial distinction: RoPE does not pollute the residual stream with position information. The values are completely position-free in RoPE, which means the information flowing through residual connections is purely semantic.
RoPE and sinusoidal use the same frequencies for different purposes
Both use the frequencies $\theta_i = 10000^{-2i/d}$. But sinusoidal encoding adds $\big(\sin(pos\,\theta_i), \cos(pos\,\theta_i)\big)$ to the embedding, while RoPE rotates the Q and K vectors by angle $m\theta_i$ in each 2D subspace. The rotation approach is strictly better because it encodes relative position in the attention logit directly, without contaminating the embedding space.
ALiBi and RoPE are not easily comparable
ALiBi modifies attention logits; RoPE modifies Q and K vectors. They encode different inductive biases: ALiBi assumes attention should decay with distance; RoPE assumes only that attention should depend on relative position, not absolute position. Neither is strictly better; the choice depends on the use case. In practice, RoPE dominates because it is more expressive (the model can learn to attend to distant tokens when needed).
Summary
- Attention is permutation-equivariant: without position encoding, transformers cannot represent word order
- Sinusoidal: add vectors at different frequencies to embeddings; linear offset relationship
- Learned positions: flexible but cannot extrapolate beyond training length
- RoPE: rotate Q and K by position-dependent angles; dot product depends on relative position
- RoPE key identity: $R_{\Theta,m}^\top R_{\Theta,n} = R_{\Theta,n-m}$ (orthogonal group property)
- ALiBi: add linear distance penalty to attention logits with head slopes $m_h = 2^{-8h/H}$; good perplexity extrapolation on language modeling but length generalization is task-dependent (Kazemnejad et al. 2023)
- RoPE is the default for modern LLMs (Llama, Mistral, Qwen)
- Context length extension: modify RoPE frequency base (NTK-aware, YaRN) to stretch wavelengths
Exercises
Problem
For sinusoidal positional encoding with $d_{\text{model}} = 4$ (so two frequency bands), compute the encoding vectors for positions 0, 1, and 2. Verify that the encoding at position 2 can be obtained from the encoding at position 0 by a linear transformation (rotation in each 2D subspace).
Problem
Prove that RoPE attention scores depend only on relative position. Starting from $q_m = R_{\Theta,m}\, q$ and $k_n = R_{\Theta,n}\, k$, where $R_{\Theta,m}$ is a block-diagonal rotation matrix, show that $q_m^\top k_n = q^\top R_{\Theta,n-m}\, k$ and that this depends on $m$ and $n$ only through their difference.
Problem
A model is trained with RoPE using base frequency $b = 10000$ and context length 4096. You want to extend it to 32768 tokens (8x) without retraining. The NTK-aware scaling approach (bloc97 2023) replaces $b$ with $b' = b \cdot s^{d/(d-2)}$ for context-extension factor $s$. For $s = 8$ and $d = 128$, compute $b'$ and show the effect on the frequency bands $\theta_i' = (b')^{-2i/d}$. Which frequency bands are barely changed and which are stretched most, and why is this the right direction (in contrast to the naive choice $\theta_i' = \theta_i / s$)?
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (2017). Sinusoidal positional encoding.
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). RoPE.
Current:
- Press, Smith, Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (2022), arXiv 2108.12409. ALiBi. Head slopes $m_h = 2^{-8h/H}$.
- Kazemnejad et al., "The Impact of Positional Encoding on Length Generalization in Transformers" (2023), arXiv 2305.19466. Shows ALiBi length generalization is task-dependent.
- Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models" (2023), arXiv 2309.00071.
- bloc97, "NTK-Aware Scaled RoPE" (2023). NTK-aware interpolation. Base rescaling $b' = b \cdot s^{d/(d-2)}$.
- Black et al., "GPT-NeoX-20B: An Open-Source Autoregressive Language Model" (2022), arXiv 2204.06745. First large-scale decoder-only LM to use RoPE.
- Dubey et al., "The Llama 3 Herd of Models" (2024). RoPE base $500{,}000$ for long-context training.
- Jiang et al., "Mistral 7B" (2023), arXiv 2310.06825. RoPE base $10{,}000$ in v0.1; later releases raise it to $10^6$.
Next Topics
Positional encoding connects to:
- Attention mechanism theory: the attention operation that positional encoding modifies
- KV cache: how position encoding interacts with cached key-value pairs during generation
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)