
LLM Construction

Positional Encoding

Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.


Why This Matters

Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, a transformer cannot distinguish "the dog bit the man" from "the man bit the dog": both produce structurally identical attention patterns. Positional encoding is how transformers learn about word order, and the choice of encoding scheme has profound effects on context length generalization, training stability, and model quality.

This topic covers the mathematical theory behind the major positional encoding schemes, from the original sinusoidal encoding to RoPE (which is now the de facto standard) and ALiBi.

Mental Model

Imagine each token in a sequence wears a jersey with its position number. The attention mechanism needs to see these jerseys to know who came first, second, third. There are three structurally different ways to assign jerseys:

  1. Absolute position (sinusoidal/learned): Add a position-specific vector directly to each token embedding. Position 5 always gets the same vector, regardless of context.
  2. Relative position (RoPE): Encode positions so that the attention score between tokens depends only on their relative distance, not their absolute positions.
  3. Attention bias (ALiBi): Do not modify embeddings at all. Instead, add a distance-dependent penalty directly to the attention logits.

Why Position Information Is Needed

Definition

Permutation Equivariance of Attention

For a permutation matrix $P$, self-attention satisfies:

$$\text{Attention}(PX) = P \cdot \text{Attention}(X)$$

This means that permuting the input tokens permutes the output tokens in the same way. The attention operation itself encodes no information about which token came first. It treats position 1 and position 1000 identically.

Without positional encoding, a transformer is a function on multisets, not sequences. It could learn that "cat" and "sat" appear together but not that "cat" precedes "sat." For language (where word order is essential for meaning) and for autoregressive generation (where the model must predict the next token specifically), position information is mandatory.
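The equivariance property is easy to check numerically. Below is a minimal sketch in plain NumPy (single head, no causal mask, random weights; all names here are illustrative, not from any library):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Single-head self-attention with no positional encoding and no mask.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

P = np.eye(n)[rng.permutation(n)]  # random permutation matrix

# Attention(P X) == P Attention(X): permuting inputs permutes outputs.
lhs = attention(P @ X, Wq, Wk, Wv)
rhs = P @ attention(X, Wq, Wk, Wv)
print(np.allclose(lhs, rhs))  # True
```

Nothing in the computation depends on row order, which is exactly why position must be injected explicitly.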

Sinusoidal Positional Encoding

Definition

Sinusoidal Positional Encoding

Vaswani et al. (2017) proposed adding fixed sinusoidal vectors to the input embeddings. For position $\text{pos}$ and dimension $i$:

$$\text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d}}\right), \qquad \text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

The input to the transformer becomes $X + \text{PE}$, where each row of $\text{PE}$ is the positional encoding for that position.

Why sinusoids? Each pair of dimensions $(2i, 2i+1)$ oscillates at a different frequency $\omega_i = 1/10000^{2i/d}$. Low-frequency dimensions change slowly across positions (useful for representing coarse position), while high-frequency dimensions change rapidly (useful for fine-grained position distinctions). Together, the $d$ dimensions form a unique "fingerprint" for each position.
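A minimal NumPy sketch of the encoding table (the function name `sinusoidal_pe` is illustrative):

```python
import numpy as np

def sinusoidal_pe(n_positions, d):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2) frequency bands
    angles = pos / (10000 ** (2 * i / d))        # (n_positions, d/2)
    pe = np.empty((n_positions, d))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = sinusoidal_pe(16, 8)
print(pe[0])  # position 0: [0, 1, 0, 1, 0, 1, 0, 1]
```

Each row has squared norm $d/2$ (since $\sin^2 + \cos^2 = 1$ per band), so every position sits on the same sphere; only the direction encodes position.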

Proposition

Sinusoidal Encoding Enables Linear Position Offsets

Statement

For any fixed offset $k$, the positional encoding at position $\text{pos} + k$ is a linear function of the encoding at position $\text{pos}$. Specifically, for each frequency band $i$:

$$\begin{pmatrix} \sin(\omega_i(\text{pos}+k)) \\ \cos(\omega_i(\text{pos}+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \cdot \text{pos}) \\ \cos(\omega_i \cdot \text{pos}) \end{pmatrix}$$

where $\omega_i = 1/10000^{2i/d}$.

The matrix is a rotation matrix $R_{\omega_i k}$. So shifting position by $k$ is equivalent to rotating each 2D frequency component by $\omega_i k$.

Intuition

The sinusoidal encoding represents position as a point on a collection of circles (one per frequency band). Moving forward by $k$ positions rotates the point on each circle by a frequency-dependent angle. Since rotation is a linear operation, the model can learn to detect relative position offsets using linear projections; no nonlinear computation is needed.

Why It Matters

This linear relationship means that a linear attention head can, in principle, learn to attend based on relative position. The rotation structure of the sinusoidal encoding is the direct precursor to RoPE, which takes this idea further by applying rotations to queries and keys rather than adding vectors to embeddings.
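The rotation identity can be verified numerically. A small self-contained check, using the same $\omega_i$ and the rotation matrix from the proposition:

```python
import numpy as np

d, base = 8, 10000.0
omega = 1.0 / base ** (2 * np.arange(d // 2) / d)   # one frequency per band

def pe_bands(pos):
    # One (sin, cos) pair per frequency band at this position, shape (d/2, 2).
    return np.stack([np.sin(omega * pos), np.cos(omega * pos)], axis=-1)

pos, k = 7, 5
shifted = pe_bands(pos + k)

# Rotate each band of pe_bands(pos) by omega_i * k and compare.
rotated = np.empty_like(shifted)
for i, w in enumerate(omega):
    c, s = np.cos(w * k), np.sin(w * k)
    R = np.array([[c, s], [-s, c]])   # rotation matrix from the proposition
    rotated[i] = R @ pe_bands(pos)[i]

print(np.allclose(shifted, rotated))  # True
```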

Failure Mode

In practice, the additive sinusoidal encoding works poorly for long sequences. Because position information is mixed into the token embedding by addition, the model must disentangle content and position information. For long sequences, the position signal becomes a small perturbation on the content signal, making it hard for the model to use. Modern LLMs have abandoned sinusoidal encoding in favor of RoPE.

Learned Absolute Positions

Definition

Learned Positional Embedding

Instead of fixed sinusoidal vectors, learn a position embedding table $E_{\text{pos}} \in \mathbb{R}^{n_{\max} \times d}$ where $n_{\max}$ is the maximum sequence length. The input becomes:

$$X + E_{\text{pos}}[1{:}n]$$

where $E_{\text{pos}}[1{:}n]$ selects the first $n$ rows for a sequence of length $n$.

Learned positions are more flexible than sinusoidal (the model can learn any position pattern) but have a critical limitation: they cannot extrapolate beyond $n_{\max}$. Position 10001 has no embedding if the model was trained with $n_{\max} = 10000$. This makes context length extension impossible without retraining.
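The hard length limit is visible in a few lines. A minimal sketch with a random table standing in for learned weights:

```python
import numpy as np

n_max, d = 16, 8
rng = np.random.default_rng(0)
E_pos = rng.normal(size=(n_max, d))  # stands in for a learned embedding table

def add_positions(X):
    n = X.shape[0]
    if n > n_max:
        # No row exists for positions beyond the training maximum.
        raise ValueError(f"sequence length {n} exceeds n_max={n_max}")
    return X + E_pos[:n]

X = rng.normal(size=(10, d))
out = add_positions(X)                          # fine: 10 <= 16
# add_positions(rng.normal(size=(20, d)))       # would raise: no extrapolation
```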

GPT-2 and the original GPT-3 used learned absolute positions. BERT also uses learned positions. Modern LLMs have moved away from this approach.

Rotary Position Embedding (RoPE)

Definition

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating the query and key vectors rather than adding to the input embeddings.

For a query vector $q$ at position $m$ and key vector $k$ at position $n$, RoPE applies position-dependent rotations:

$$\tilde{q}_m = R_m q, \qquad \tilde{k}_n = R_n k$$

where $R_m$ is a block-diagonal rotation matrix. In each 2D block $(2i, 2i+1)$:

$$R_m^{(i)} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$

with frequencies $\theta_i = 10000^{-2i/d}$ (the same base frequencies as sinusoidal encoding).

The full rotation matrix $R_m$ is block-diagonal with $d/2$ such $2 \times 2$ blocks.
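A minimal NumPy sketch of the rotation, assuming the interleaved even/odd pairing of dimensions described above (`rope_rotate` is an illustrative name, not a library function):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Rotate each (2i, 2i+1) pair of x by angle pos * theta_i.
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]   # first row of each 2x2 block
    out[1::2] = s * x[0::2] + c * x[1::2]   # second row of each 2x2 block
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
q5, k3 = rope_rotate(q, 5), rope_rotate(k, 3)
print(q5 @ k3)  # attention logit; depends only on the offset 5 - 3 = 2
```

Because rotations are norm-preserving, RoPE changes only the phase of the query-key dot product, never the magnitude of $q$ or $k$.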

Theorem

RoPE Encodes Relative Position in Attention Scores

Statement

The attention logit between position $m$ (query) and position $n$ (key) under RoPE depends only on the relative distance $m - n$:

$$\tilde{q}_m^\top \tilde{k}_n = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$

This follows from the orthogonality of rotation matrices: $R_m^\top R_n = R_{n-m}$.

Explicitly, the attention logit decomposes into $d/2$ terms, one per frequency band:

$$\tilde{q}_m^\top \tilde{k}_n = \sum_{i=0}^{d/2-1} \left[(q_{2i} k_{2i} + q_{2i+1} k_{2i+1})\cos((m-n)\theta_i) + (q_{2i} k_{2i+1} - q_{2i+1} k_{2i})\sin((m-n)\theta_i)\right]$$

Each term depends on the relative position $m - n$ and the frequency $\theta_i$.

Intuition

Rotating the query by angle $m\theta$ and the key by angle $n\theta$, then taking their dot product, gives a result that depends on the angle difference $(m-n)\theta$. This is exactly how relative position should work: the attention between tokens 5 and 3 should be the same as between tokens 105 and 103, because the relative offset is the same.

Each frequency band captures a different scale of relative position: high-frequency bands distinguish nearby positions, low-frequency bands distinguish distant positions. Together, the $d/2$ frequency bands provide a rich encoding of relative distance.

Proof Sketch

The rotation matrix $R_m$ is orthogonal ($R_m^\top R_m = I$) and satisfies the group property $R_m R_n = R_{m+n}$.

Therefore: $R_m^\top R_n = R_{-m} R_n = R_{n-m}$.

The dot product: $\tilde{q}_m^\top \tilde{k}_n = (R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$.

This depends on $m$ and $n$ only through $n - m$.

Expanding the 2D rotation for block $i$: $R_{n-m}^{(i)}$ rotates the 2D sub-vector by angle $(n-m)\theta_i$. The dot product of the rotated sub-vectors gives the cosine and sine terms in the explicit formula.
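The identity is easy to confirm numerically. A self-contained check, re-implementing the block rotation from the definition (interleaved even/odd pairing assumed):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Apply the block-diagonal rotation R_pos to x, pair by pair.
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]
    out[1::2] = s * x[0::2] + c * x[1::2]
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same offset (2), different absolute positions: identical logits.
a = rope_rotate(q, 5)   @ rope_rotate(k, 3)
b = rope_rotate(q, 105) @ rope_rotate(k, 103)
print(np.allclose(a, b))  # True
```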

Why It Matters

RoPE is used in Llama, Mistral, Qwen, and virtually all modern open-source LLMs. GPT-NeoX-20B (Black et al. 2022, arXiv 2204.06745) was the first large-scale decoder-only LM to use RoPE, predating Llama. Its success comes from three properties: (1) it naturally encodes relative position without any additive modification to the embeddings, (2) base RoPE adds zero trainable parameters (variants like YaRN and learned RoPE do add small numbers of trainable parameters, so this property is not universal across the RoPE family), and (3) it composes cleanly with the attention mechanism because the position information lives in the phase of the query-key dot product rather than in the amplitude of the embeddings.

Production models often increase the RoPE base to improve long-context behavior: Llama 3 (Dubey et al. 2024) uses base $b = 500{,}000$, and Mistral-7B uses $b = 1{,}000{,}000$. A larger base stretches the low-frequency wavelengths so that longer absolute positions still correspond to angles seen in training.

Failure Mode

RoPE's context length extrapolation is limited. The base frequencies $\theta_i = 10000^{-2i/d}$ span a wide range: $\theta_0 = 1$ is the highest frequency (shortest period $2\pi$), and $\theta_{d/2-1} = 10000^{-(d-2)/d}$ is the lowest frequency (longest period $2\pi/\theta_{d/2-1}$, which for $d = 128$ is on the order of $6 \times 10^4$ positions). The real failure at test-time lengths beyond training is not literal wrap-around of angles past $2\pi$, but distribution shift: the model encounters angle combinations $(m\theta_0, m\theta_1, \ldots)$ that never appeared during training, and attention heads trained on one angular regime behave unpredictably in another. Techniques like NTK-aware scaling, YaRN (Peng et al. 2023, arXiv 2309.00071), and dynamic NTK interpolation modify the frequency base to keep test-time angles inside the training distribution without full retraining.

ALiBi: Attention with Linear Biases

Proposition

ALiBi Provides Position via Attention Bias

Statement

ALiBi (Press et al., 2022) does not modify the input embeddings or the Q/K projections at all. Instead, it adds a position-dependent bias directly to the attention logits:

$$\text{Attention}_{\text{ALiBi}} = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} - \lambda \cdot |m - n|\right)V$$

where $\lambda > 0$ is a head-specific slope and $|m - n|$ is the absolute distance between positions $m$ (query) and $n$ (key). Press et al. 2022 set the slopes as a geometric sequence starting at $2^{-8/H}$ with common ratio $2^{-8/H}$: for head $h \in \{1, \ldots, H\}$, the slope is $m_h = 2^{-8h/H}$.

The linear bias penalizes distant tokens, making attention prefer nearby positions. Different heads use different slopes, so some heads attend locally and others more globally.
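A minimal sketch of the bias tensor with the Press et al. slope schedule $m_h = 2^{-8h/H}$ (the function name `alibi_bias` is illustrative):

```python
import numpy as np

def alibi_bias(n, num_heads):
    # Head-specific slopes m_h = 2^(-8h/H) for h = 1..H (Press et al. 2022).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(n)
    dist = np.abs(pos[None, :] - pos[:, None])   # |m - n| distance matrix
    # Shape (H, n, n); added to the attention logits before softmax.
    return -slopes[:, None, None] * dist[None]

bias = alibi_bias(n=5, num_heads=8)
print(bias[0])  # head 1 has the steepest slope (0.5): strongest locality bias
```

Later heads get geometrically smaller slopes, so they attend more globally; the diagonal is always zero (no penalty for attending to the current position).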

Intuition

ALiBi says: "All else being equal, prefer to attend to nearby tokens." The linear penalty $\lambda|m-n|$ acts as a soft window. Tokens that are far away must have very high content similarity (large $q^\top k$) to overcome the distance penalty. This is a reasonable inductive bias: in natural language, nearby words are usually more relevant than distant ones.

Why It Matters

ALiBi was designed specifically for context length extrapolation. Because the position information is a simple additive bias rather than a modification to the embeddings, the scheme generalizes smoothly to positions never seen during training: position 20000 just gets a proportionally larger penalty. In the original perplexity experiments of Press et al. 2022, ALiBi extrapolated better than sinusoidal and learned positional encodings. Kazemnejad et al. 2023 (arXiv 2305.19466) later showed that the picture is task-dependent: on some length-generalization benchmarks (algorithmic tasks, certain reasoning tasks), ALiBi does not generalize better than other schemes, and in some cases no positional encoding at all outperforms ALiBi. Treat the "extrapolates" claim as "extrapolates on perplexity for natural language, sometimes fails elsewhere."

Failure Mode

ALiBi assumes that attention should decay with distance, which is not always true. Tasks requiring long-range dependencies (e.g., matching an opening bracket with a closing bracket 10000 tokens later) are penalized by the linear bias. RoPE does not have this bias: tokens at any distance can attend to each other based purely on content similarity. This is one reason RoPE has become more popular than ALiBi for general-purpose LLMs.

Context Length Extrapolation

A critical practical question: can a model trained on sequences of length $L_{\text{train}}$ perform well on sequences of length $L_{\text{test}} > L_{\text{train}}$?

  • Learned positions: No extrapolation. Positions beyond $L_{\text{train}}$ have no embedding.
  • Sinusoidal: Theoretically extrapolates (the formula works for any position) but performs poorly in practice.
  • ALiBi: Good extrapolation for moderate extensions (2-4x) because the linear bias generalizes smoothly.
  • RoPE (vanilla, i.e., the original formulation without frequency scaling): Degrades beyond training length due to unseen rotation angles.
  • RoPE + frequency scaling (YaRN, NTK-aware): Extends RoPE by modifying the frequency base. NTK-aware scaling (bloc97, 2023) replaces the base $b = 10000$ with $b' = b \cdot s^{d/(d-2)}$ for context-length scaling factor $s$, so $\theta_i = (b')^{-2i/d}$. The $d/(d-2)$ exponent is chosen so that the highest-frequency band (smallest $i$) is nearly unchanged while the lowest-frequency band stretches by exactly the target factor $s$, keeping short-range position resolution intact while extending long-range reach. The naive choice $b' = b \cdot s$ stretches the lowest band only by $s^{(d-2)/d} < s$, leaving the longest-range angles slightly outside the trained range. YaRN (Peng et al. 2023) refines NTK-aware scaling further by splitting the frequency spectrum and scaling low-frequency bands while leaving high-frequency bands alone.

Modern long-context models (128K+ tokens) typically use RoPE with some form of frequency scaling, sometimes combined with continued pretraining on longer sequences.
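The NTK-aware base rescaling can be checked in a few lines: the per-band wavelength stretch works out to $\theta_i^{\text{old}}/\theta_i^{\text{new}} = s^{2i/(d-2)}$, which runs from $1$ at the highest-frequency band to exactly $s$ at the lowest.

```python
import numpy as np

d, base, s = 128, 10000.0, 8
base_scaled = base * s ** (d / (d - 2))      # NTK-aware: b' = b * s^(d/(d-2))

i = np.arange(d // 2)
theta_old = base ** (-2 * i / d)
theta_new = base_scaled ** (-2 * i / d)

ratio = theta_old / theta_new                # wavelength stretch per band
print(ratio[0])    # highest-frequency band: 1.0 (untouched)
print(ratio[-1])   # lowest-frequency band: ~8 (stretched by the full factor s)
```

This is the desired behavior: local position resolution (high-frequency bands) is preserved while long-range reach grows by the target factor.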

Common Confusions

Watch Out

RoPE modifies Q and K, not the input embeddings

Sinusoidal and learned positional encodings are added to the input $X$ before any projections. RoPE is applied after the Q and K projections but before the dot product. This is a crucial distinction: RoPE does not pollute the residual stream with position information. The values $V$ are completely position-free in RoPE, which means the information flowing through residual connections is purely semantic.

Watch Out

RoPE and sinusoidal use the same frequencies for different purposes

Both use $\theta_i = 10000^{-2i/d}$. But sinusoidal encoding adds $[\sin(m\theta_i), \cos(m\theta_i)]$ to the embedding, while RoPE rotates the Q and K vectors by angle $m\theta_i$ in each 2D subspace. The rotation approach is strictly better because it encodes relative position in the attention logit directly, without contaminating the embedding space.

Watch Out

ALiBi and RoPE are not easily comparable

ALiBi modifies attention logits; RoPE modifies Q and K vectors. They encode different inductive biases: ALiBi assumes attention should decay with distance; RoPE assumes only that attention should depend on relative position, not absolute position. Neither is strictly better; the choice depends on the use case. In practice, RoPE dominates because it is more expressive (the model can learn to attend to distant tokens when needed).

Summary

  • Attention is permutation-equivariant: without position encoding, transformers cannot represent word order
  • Sinusoidal: add $[\sin, \cos]$ vectors at different frequencies to embeddings; linear offset relationship
  • Learned positions: flexible but cannot extrapolate beyond training length
  • RoPE: rotate Q and K by position-dependent angles; dot product depends on relative position $m - n$
  • RoPE key identity: $R_m^\top R_n = R_{n-m}$ (orthogonal group property)
  • ALiBi: add linear distance penalty to attention logits with head slopes $m_h = 2^{-8h/H}$; good perplexity extrapolation on language modeling but length generalization is task-dependent (Kazemnejad et al. 2023)
  • RoPE is the default for modern LLMs (Llama, Mistral, Qwen)
  • Context length extension: modify RoPE frequency base (NTK-aware, YaRN) to stretch wavelengths

Exercises

ExerciseCore

Problem

For sinusoidal positional encoding with $d = 4$ (so two frequency bands), compute the encoding vectors for positions 0, 1, and 2. Verify that the encoding at position 2 can be obtained from the encoding at position 0 by a linear transformation (rotation in each 2D subspace).

ExerciseAdvanced

Problem

Prove that RoPE attention scores depend only on relative position. Starting from $\tilde{q}_m = R_m q$ and $\tilde{k}_n = R_n k$ where $R_m$ is a block-diagonal rotation matrix, show that $\tilde{q}_m^\top \tilde{k}_n = q^\top R_{n-m} k$ and that this depends on $m$ and $n$ only through their difference.

ExerciseResearch

Problem

A model is trained with RoPE using base frequency $b = 10000$ and context length 4096. You want to extend it to 32768 tokens (8x) without retraining. The NTK-aware scaling approach (bloc97 2023) replaces $b$ with $b' = b \cdot s^{d/(d-2)}$ for context-extension factor $s$. For $d = 128$ and $s = 8$, compute $b'$ and show the effect on the frequency bands. Which frequency bands are barely changed and which are stretched most, and why is this the right direction (in contrast to the naive choice $b' = b \cdot s$)?


References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (2017). sinusoidal positional encoding
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). RoPE

Current:

  • Press, Smith, Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (2022), arXiv 2108.12409. ALiBi. Head slopes $m_h = 2^{-8h/H}$.
  • Kazemnejad et al., "The Impact of Positional Encoding on Length Generalization in Transformers" (2023), arXiv 2305.19466. Shows ALiBi length generalization is task-dependent.
  • Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models" (2023), arXiv 2309.00071.
  • bloc97, "NTK-Aware Scaled RoPE" (2023). NTK-aware interpolation. Base rescaling $b' = b \cdot s^{d/(d-2)}$.
  • Black et al., "GPT-NeoX-20B: An Open-Source Autoregressive Language Model" (2022), arXiv 2204.06745. First large-scale decoder-only LM to use RoPE.
  • Dubey et al., "The Llama 3 Herd of Models" (2024). RoPE base $500{,}000$ for long-context training.
  • Jiang et al., "Mistral 7B" (2023), arXiv 2310.06825. RoPE base $1{,}000{,}000$.

Next Topics

Positional encoding connects to:

  • Attention mechanism theory: the attention operation that positional encoding modifies
  • KV cache: how position encoding interacts with cached key-value pairs during generation

Last reviewed: April 2026
