
LLM Construction

Positional Encoding

Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.


Why This Matters

Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, a transformer cannot distinguish "the dog bit the man" from "the man bit the dog": both produce structurally identical attention patterns. Positional encoding is how transformers learn about word order, and the choice of encoding scheme has profound effects on context length generalization, training stability, and model quality.

This topic covers the mathematical theory behind the major positional encoding schemes, from the original sinusoidal encoding to RoPE (which is now the de facto standard) and ALiBi.

Mental Model

Imagine each token in a sequence wears a jersey with its position number. The attention mechanism needs to see these jerseys to know who came first, second, third. There are three structurally different ways to assign jerseys:

  1. Absolute position (sinusoidal/learned): Add a position-specific vector directly to each token embedding. Position 5 always gets the same vector, regardless of context.
  2. Relative position (RoPE): Encode positions so that the attention score between tokens depends only on their relative distance, not their absolute positions.
  3. Attention bias (ALiBi): Do not modify embeddings at all. Instead, add a distance-dependent penalty directly to the attention logits.

Why Position Information Is Needed

Definition

Permutation Equivariance of Attention

For a permutation matrix $P$, self-attention satisfies:

$$\text{Attention}(PX) = P \cdot \text{Attention}(X)$$

This means that permuting the input tokens permutes the output tokens in the same way. The attention operation itself encodes no information about which token came first. It treats position 1 and position 1000 identically.

Without positional encoding, a transformer is a function on multisets, not sequences. It could learn that "cat" and "sat" appear together but not that "cat" precedes "sat." For language (where word order is essential for meaning) and for autoregressive generation (where the model must predict the next token specifically), position information is mandatory.
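The equivariance property is easy to check numerically. Below is a minimal sketch in plain NumPy (single head, no causal mask, random weights; all names here are illustrative, not from any library):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Single-head self-attention with no positional encoding and no mask.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

P = np.eye(n)[rng.permutation(n)]  # random permutation matrix

# Attention(P X) == P Attention(X): permuting inputs permutes outputs.
lhs = attention(P @ X, Wq, Wk, Wv)
rhs = P @ attention(X, Wq, Wk, Wv)
print(np.allclose(lhs, rhs))  # True
```

Nothing in the computation depends on row order, which is exactly why position must be injected explicitly.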

Sinusoidal Positional Encoding

Definition

Sinusoidal Positional Encoding

Vaswani et al. (2017) proposed adding fixed sinusoidal vectors to the input embeddings. For position $\text{pos}$ and dimension $i$:

$$\text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d}}\right), \qquad \text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

The input to the transformer becomes $X + \text{PE}$, where each row of $\text{PE}$ is the positional encoding for that position.

Why sinusoids? Each pair of dimensions $(2i, 2i+1)$ oscillates at a different frequency $\omega_i = 1/10000^{2i/d}$. Low-frequency dimensions change slowly across positions (useful for representing coarse position), while high-frequency dimensions change rapidly (useful for fine-grained position distinctions). Together, the $d$ dimensions form a unique "fingerprint" for each position.
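A minimal NumPy sketch of the encoding table (the function name `sinusoidal_pe` is illustrative):

```python
import numpy as np

def sinusoidal_pe(n_positions, d):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2) frequency bands
    angles = pos / (10000 ** (2 * i / d))        # (n_positions, d/2)
    pe = np.empty((n_positions, d))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = sinusoidal_pe(16, 8)
print(pe[0])  # position 0: [0, 1, 0, 1, 0, 1, 0, 1]
```

Each row has squared norm $d/2$ (since $\sin^2 + \cos^2 = 1$ per band), so every position sits on the same sphere; only the direction encodes position.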

Proposition

Sinusoidal Encoding Enables Linear Position Offsets

Statement

For any fixed offset $k$, the positional encoding at position $\text{pos} + k$ is a linear function of the encoding at position $\text{pos}$. Specifically, for each frequency band $i$:

$$\begin{pmatrix} \sin(\omega_i(\text{pos}+k)) \\ \cos(\omega_i(\text{pos}+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \cdot \text{pos}) \\ \cos(\omega_i \cdot \text{pos}) \end{pmatrix}$$

where $\omega_i = 1/10000^{2i/d}$.

The matrix is a rotation matrix $R_{\omega_i k}$. So shifting position by $k$ is equivalent to rotating each 2D frequency component by $\omega_i k$.

Intuition

The sinusoidal encoding represents position as a point on a collection of circles (one per frequency band). Moving forward by $k$ positions rotates the point on each circle by a frequency-dependent angle. Since rotation is a linear operation, the model can learn to detect relative position offsets using linear projections; no nonlinear computation is needed.

Why It Matters

This linear relationship means that a linear attention head can, in principle, learn to attend based on relative position. The rotation structure of the sinusoidal encoding is the direct precursor to RoPE, which takes this idea further by applying rotations to queries and keys rather than adding vectors to embeddings.
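The rotation identity can be verified numerically. A small self-contained check, using the same $\omega_i$ and the rotation matrix from the proposition:

```python
import numpy as np

d, base = 8, 10000.0
omega = 1.0 / base ** (2 * np.arange(d // 2) / d)   # one frequency per band

def pe_bands(pos):
    # One (sin, cos) pair per frequency band at this position, shape (d/2, 2).
    return np.stack([np.sin(omega * pos), np.cos(omega * pos)], axis=-1)

pos, k = 7, 5
shifted = pe_bands(pos + k)

# Rotate each band of pe_bands(pos) by omega_i * k and compare.
rotated = np.empty_like(shifted)
for i, w in enumerate(omega):
    c, s = np.cos(w * k), np.sin(w * k)
    R = np.array([[c, s], [-s, c]])   # rotation matrix from the proposition
    rotated[i] = R @ pe_bands(pos)[i]

print(np.allclose(shifted, rotated))  # True
```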

Failure Mode

In practice, the additive sinusoidal encoding works poorly for long sequences. Because position information is mixed into the token embedding by addition, the model must disentangle content and position information. For long sequences, the position signal becomes a small perturbation on the content signal, making it hard for the model to use. Modern LLMs have abandoned sinusoidal encoding in favor of RoPE.

Learned Absolute Positions

Definition

Learned Positional Embedding

Instead of fixed sinusoidal vectors, learn a position embedding table $E_{\text{pos}} \in \mathbb{R}^{n_{\max} \times d}$ where $n_{\max}$ is the maximum sequence length. The input becomes:

$$X + E_{\text{pos}}[1{:}n]$$

where $E_{\text{pos}}[1{:}n]$ selects the first $n$ rows for a sequence of length $n$.

Learned positions are more flexible than sinusoidal (the model can learn any position pattern) but have a critical limitation: they cannot extrapolate beyond $n_{\max}$. Position 10001 has no embedding if the model was trained with $n_{\max} = 10000$. This makes context length extension impossible without retraining.
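The hard length limit is visible in a few lines. A minimal sketch with a random table standing in for learned weights:

```python
import numpy as np

n_max, d = 16, 8
rng = np.random.default_rng(0)
E_pos = rng.normal(size=(n_max, d))  # stands in for a learned embedding table

def add_positions(X):
    n = X.shape[0]
    if n > n_max:
        # No row exists for positions beyond the training maximum.
        raise ValueError(f"sequence length {n} exceeds n_max={n_max}")
    return X + E_pos[:n]

X = rng.normal(size=(10, d))
out = add_positions(X)                          # fine: 10 <= 16
# add_positions(rng.normal(size=(20, d)))       # would raise: no extrapolation
```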

GPT-2 and the original GPT-3 used learned absolute positions. BERT also uses learned positions. Modern LLMs have moved away from this approach.

Rotary Position Embedding (RoPE)

Definition

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating the query and key vectors rather than adding to the input embeddings.

For a query vector $q$ at position $m$ and key vector $k$ at position $n$, RoPE applies position-dependent rotations:

$$\tilde{q}_m = R_m q, \qquad \tilde{k}_n = R_n k$$

where $R_m$ is a block-diagonal rotation matrix. In each 2D block $(2i, 2i+1)$:

$$R_m^{(i)} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$

with frequencies $\theta_i = 10000^{-2i/d}$ (the same base frequencies as sinusoidal encoding).

The full rotation matrix $R_m$ is block-diagonal with $d/2$ such $2 \times 2$ blocks.
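A minimal NumPy sketch of the rotation, assuming the interleaved even/odd pairing of dimensions described above (`rope_rotate` is an illustrative name, not a library function):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Rotate each (2i, 2i+1) pair of x by angle pos * theta_i.
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]   # first row of each 2x2 block
    out[1::2] = s * x[0::2] + c * x[1::2]   # second row of each 2x2 block
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
q5, k3 = rope_rotate(q, 5), rope_rotate(k, 3)
print(q5 @ k3)  # attention logit; depends only on the offset 5 - 3 = 2
```

Because rotations are norm-preserving, RoPE changes only the phase of the query-key dot product, never the magnitude of $q$ or $k$.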

Theorem

RoPE Encodes Relative Position in Attention Scores

Statement

The attention logit between position $m$ (query) and position $n$ (key) under RoPE depends only on the relative distance $m - n$:

$$\tilde{q}_m^\top \tilde{k}_n = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$

This follows from the orthogonality of rotation matrices: $R_m^\top R_n = R_{n-m}$.

Explicitly, the attention logit decomposes into $d/2$ terms, one per frequency band:

$$\tilde{q}_m^\top \tilde{k}_n = \sum_{i=0}^{d/2-1} \left[(q_{2i} k_{2i} + q_{2i+1} k_{2i+1})\cos((m-n)\theta_i) + (q_{2i} k_{2i+1} - q_{2i+1} k_{2i})\sin((m-n)\theta_i)\right]$$

Each term depends on the relative position $m - n$ and the frequency $\theta_i$.

Intuition

Rotating the query by angle $m\theta$ and the key by angle $n\theta$, then taking their dot product, gives a result that depends on the angle difference $(m-n)\theta$. This is exactly how relative position should work: the attention between tokens 5 and 3 should be the same as between tokens 105 and 103, because the relative offset is the same.

Each frequency band captures a different scale of relative position: high-frequency bands distinguish nearby positions, low-frequency bands distinguish distant positions. Together, the $d/2$ frequency bands provide a rich encoding of relative distance.

Proof Sketch

The rotation matrix $R_m$ is orthogonal ($R_m^\top R_m = I$) and satisfies the group property $R_m R_n = R_{m+n}$.

Therefore: $R_m^\top R_n = R_{-m} R_n = R_{n-m}$.

The dot product: $\tilde{q}_m^\top \tilde{k}_n = (R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$.

This depends on $m$ and $n$ only through $n - m$.

Expanding the 2D rotation for block $i$: $R_{n-m}^{(i)}$ rotates the 2D sub-vector by angle $(n-m)\theta_i$. The dot product of the rotated sub-vectors gives the cosine and sine terms in the explicit formula.
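The identity is easy to confirm numerically. A self-contained check, re-implementing the block rotation from the definition (interleaved even/odd pairing assumed):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Apply the block-diagonal rotation R_pos to x, pair by pair.
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]
    out[1::2] = s * x[0::2] + c * x[1::2]
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same offset (2), different absolute positions: identical logits.
a = rope_rotate(q, 5)   @ rope_rotate(k, 3)
b = rope_rotate(q, 105) @ rope_rotate(k, 103)
print(np.allclose(a, b))  # True
```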

Why It Matters

RoPE is used in Llama, Mistral, Qwen, and virtually all modern open-source LLMs. GPT-NeoX-20B (Black et al. 2022, arXiv 2204.06745) was the first large-scale decoder-only LM to use RoPE, predating Llama. Its success comes from three properties: (1) it naturally encodes relative position without any additive modification to the embeddings, (2) base RoPE adds zero trainable parameters (variants like YaRN and learned RoPE do add small numbers of trainable parameters, so this property is not universal across the RoPE family), and (3) it composes cleanly with the attention mechanism because the position information lives in the phase of the query-key dot product rather than in the amplitude of the embeddings.

Production models often increase the RoPE base to improve long-context behavior: Llama 3 (Dubey et al. 2024) uses base $b = 500{,}000$, and Mistral-7B uses $b = 1{,}000{,}000$. A larger base stretches the low-frequency wavelengths so that longer absolute positions still correspond to angles seen in training.

Failure Mode

RoPE's context length extrapolation is limited. The base frequencies $\theta_i = 10000^{-2i/d}$ span a wide range: $\theta_0 = 1$ is the highest frequency (shortest period $2\pi$), and $\theta_{d/2-1} = 10000^{-(d-2)/d}$ is the lowest frequency (longest period $2\pi/\theta_{d/2-1}$, which for $d = 128$ is on the order of $6 \times 10^4$ positions). The real failure at test-time lengths beyond training is not literal wrap-around of angles past $2\pi$, but distribution shift: the model encounters angle combinations $(m\theta_0, m\theta_1, \ldots)$ that never appeared during training, and attention heads trained on one angular regime behave unpredictably in another. Techniques like NTK-aware scaling, YaRN (Peng et al. 2023, arXiv 2309.00071), and dynamic NTK interpolation modify the frequency base to keep test-time angles inside the training distribution without full retraining.

ALiBi: Attention with Linear Biases

Proposition

ALiBi Provides Position via Attention Bias

Statement

ALiBi (Press et al., 2022) does not modify the input embeddings or the Q/K projections at all. Instead, it adds a position-dependent bias directly to the attention logits:

$$\text{Attention}_{\text{ALiBi}} = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} - \lambda \cdot |m - n|\right)V$$

where $\lambda > 0$ is a head-specific slope and $|m - n|$ is the absolute distance between positions $m$ (query) and $n$ (key). Press et al. 2022 set the slopes as a geometric sequence starting at $2^{-8/H}$ with common ratio $2^{-8/H}$: for head $h \in \{1, \ldots, H\}$, the slope is $m_h = 2^{-8h/H}$.

The linear bias penalizes distant tokens, making attention prefer nearby positions. Different heads use different slopes, so some heads attend locally and others more globally.
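A minimal sketch of the bias tensor with the Press et al. slope schedule $m_h = 2^{-8h/H}$ (the function name `alibi_bias` is illustrative):

```python
import numpy as np

def alibi_bias(n, num_heads):
    # Head-specific slopes m_h = 2^(-8h/H) for h = 1..H (Press et al. 2022).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(n)
    dist = np.abs(pos[None, :] - pos[:, None])   # |m - n| distance matrix
    # Shape (H, n, n); added to the attention logits before softmax.
    return -slopes[:, None, None] * dist[None]

bias = alibi_bias(n=5, num_heads=8)
print(bias[0])  # head 1 has the steepest slope (0.5): strongest locality bias
```

Later heads get geometrically smaller slopes, so they attend more globally; the diagonal is always zero (no penalty for attending to the current position).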

Intuition

ALiBi says: "All else being equal, prefer to attend to nearby tokens." The linear penalty $\lambda|m-n|$ acts as a soft window. Tokens that are far away must have very high content similarity (large $q^\top k$) to overcome the distance penalty. This is a reasonable inductive bias: in natural language, nearby words are usually more relevant than distant ones.

Why It Matters

ALiBi was designed specifically for context length extrapolation. Because the position information is a simple additive bias rather than a modification to the embeddings, the scheme generalizes smoothly to positions never seen during training: position 20000 just gets a proportionally larger penalty. In the original perplexity experiments of Press et al. 2022, ALiBi extrapolated better than sinusoidal and learned positional encodings. Kazemnejad et al. 2023 (arXiv 2305.19466) later showed that the picture is task-dependent: on some length-generalization benchmarks (algorithmic tasks, certain reasoning tasks), ALiBi does not generalize better than other schemes, and in some cases no positional encoding at all outperforms ALiBi. Treat the "extrapolates" claim as "extrapolates on perplexity for natural language, sometimes fails elsewhere."

Failure Mode

ALiBi assumes that attention should decay with distance, which is not always true. Tasks requiring long-range dependencies (e.g., matching an opening bracket with a closing bracket 10000 tokens later) are penalized by the linear bias. RoPE does not have this bias: tokens at any distance can attend to each other based purely on content similarity. This is one reason RoPE has become more popular than ALiBi for general-purpose LLMs.

Context Length Extrapolation

A critical practical question: can a model trained on sequences of length $L_{\text{train}}$ perform well on sequences of length $L_{\text{test}} > L_{\text{train}}$?

  • Learned positions: No extrapolation. Positions beyond $L_{\text{train}}$ have no embedding.
  • Sinusoidal: Theoretically extrapolates (the formula works for any position) but performs poorly in practice.
  • ALiBi: Good extrapolation for moderate extensions (2-4x) because the linear bias generalizes smoothly.
  • RoPE (vanilla, i.e., the original formulation without frequency scaling): Degrades beyond training length due to unseen rotation angles.
  • RoPE + frequency scaling (YaRN, NTK-aware): Extends RoPE by modifying the frequency base. NTK-aware scaling (bloc97, 2023) replaces the base $b = 10000$ with $b' = b \cdot s^{d/(d-2)}$ for context-length scaling factor $s$, so $\theta_i = (b')^{-2i/d}$. The $d/(d-2)$ exponent is chosen so that the highest-frequency band (smallest $i$) is nearly unchanged while the lowest-frequency band stretches by exactly the target factor $s$, keeping short-range position resolution intact while extending long-range reach. The naive choice $b' = b \cdot s$ stretches the lowest band only by $s^{(d-2)/d} < s$, leaving the longest-range angles slightly outside the trained range. YaRN (Peng et al. 2023) refines NTK-aware scaling further by splitting the frequency spectrum and scaling low-frequency bands while leaving high-frequency bands alone.

Modern long-context models (128K+ tokens) typically use RoPE with some form of frequency scaling, sometimes combined with continued pretraining on longer sequences.
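The NTK-aware base rescaling can be checked in a few lines: the per-band wavelength stretch works out to $\theta_i^{\text{old}}/\theta_i^{\text{new}} = s^{2i/(d-2)}$, which runs from $1$ at the highest-frequency band to exactly $s$ at the lowest.

```python
import numpy as np

d, base, s = 128, 10000.0, 8
base_scaled = base * s ** (d / (d - 2))      # NTK-aware: b' = b * s^(d/(d-2))

i = np.arange(d // 2)
theta_old = base ** (-2 * i / d)
theta_new = base_scaled ** (-2 * i / d)

ratio = theta_old / theta_new                # wavelength stretch per band
print(ratio[0])    # highest-frequency band: 1.0 (untouched)
print(ratio[-1])   # lowest-frequency band: ~8 (stretched by the full factor s)
```

This is the desired behavior: local position resolution (high-frequency bands) is preserved while long-range reach grows by the target factor.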

Common Confusions

Watch Out

RoPE modifies Q and K, not the input embeddings

Sinusoidal and learned positional encodings are added to the input $X$ before any projections. RoPE is applied after the Q and K projections but before the dot product. This is a crucial distinction: RoPE does not pollute the residual stream with position information. The values $V$ are completely position-free in RoPE, which means the information flowing through residual connections is purely semantic.

Watch Out

RoPE and sinusoidal use the same frequencies for different purposes

Both use $\theta_i = 10000^{-2i/d}$. But sinusoidal encoding adds $[\sin(m\theta_i), \cos(m\theta_i)]$ to the embedding, while RoPE rotates the Q and K vectors by angle $m\theta_i$ in each 2D subspace. The rotation approach is strictly better because it encodes relative position in the attention logit directly, without contaminating the embedding space.

Watch Out

ALiBi and RoPE are not easily comparable

ALiBi modifies attention logits; RoPE modifies Q and K vectors. They encode different inductive biases: ALiBi assumes attention should decay with distance; RoPE assumes only that attention should depend on relative position, not absolute position. Neither is strictly better; the choice depends on the use case. In practice, RoPE dominates because it is more expressive (the model can learn to attend to distant tokens when needed).

Summary

  • Attention is permutation-equivariant: without position encoding, transformers cannot represent word order
  • Sinusoidal: add $[\sin, \cos]$ vectors at different frequencies to embeddings; linear offset relationship
  • Learned positions: flexible but cannot extrapolate beyond training length
  • RoPE: rotate Q and K by position-dependent angles; dot product depends on relative position $m - n$
  • RoPE key identity: $R_m^\top R_n = R_{n-m}$ (orthogonal group property)
  • ALiBi: add linear distance penalty to attention logits with head slopes $m_h = 2^{-8h/H}$; good perplexity extrapolation on language modeling but length generalization is task-dependent (Kazemnejad et al. 2023)
  • RoPE is the default for modern LLMs (Llama, Mistral, Qwen)
  • Context length extension: modify RoPE frequency base (NTK-aware, YaRN) to stretch wavelengths

Exercises

ExerciseCore

Problem

For sinusoidal positional encoding with $d = 4$ (so two frequency bands), compute the encoding vectors for positions 0, 1, and 2. Verify that the encoding at position 2 can be obtained from the encoding at position 0 by a linear transformation (rotation in each 2D subspace).

ExerciseAdvanced

Problem

Prove that RoPE attention scores depend only on relative position. Starting from $\tilde{q}_m = R_m q$ and $\tilde{k}_n = R_n k$ where $R_m$ is a block-diagonal rotation matrix, show that $\tilde{q}_m^\top \tilde{k}_n = q^\top R_{n-m} k$ and that this depends on $m$ and $n$ only through their difference.

ExerciseResearch

Problem

A model is trained with RoPE using base frequency $b = 10000$ and context length 4096. You want to extend it to 32768 tokens (8x) without retraining. The NTK-aware scaling approach (bloc97 2023) replaces $b$ with $b' = b \cdot s^{d/(d-2)}$ for context-extension factor $s$. For $d = 128$ and $s = 8$, compute $b'$ and show the effect on the frequency bands. Which frequency bands are barely changed and which are stretched most, and why is this the right direction (in contrast to the naive choice $b' = b \cdot s$)?


References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (2017). sinusoidal positional encoding
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). RoPE

Current:

  • Press, Smith, Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (2022), arXiv 2108.12409. ALiBi. Head slopes $m_h = 2^{-8h/H}$.
  • Kazemnejad et al., "The Impact of Positional Encoding on Length Generalization in Transformers" (2023), arXiv 2305.19466. Shows ALiBi length generalization is task-dependent.
  • Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models" (2023), arXiv 2309.00071.
  • bloc97, "NTK-Aware Scaled RoPE" (2023). NTK-aware interpolation. Base rescaling $b' = b \cdot s^{d/(d-2)}$.
  • Black et al., "GPT-NeoX-20B: An Open-Source Autoregressive Language Model" (2022), arXiv 2204.06745. First large-scale decoder-only LM to use RoPE.
  • Dubey et al., "The Llama 3 Herd of Models" (2024). RoPE base $500{,}000$ for long-context training.
  • Jiang et al., "Mistral 7B" (2023), arXiv 2310.06825. RoPE base $1{,}000{,}000$.

Next Topics

Positional encoding connects to:

  • Attention mechanism theory: the attention operation that positional encoding modifies
  • KV cache: how position encoding interacts with cached key-value pairs during generation

Last reviewed: April 2026
