What Each Does
All three methods inject position information into transformers that are otherwise permutation-invariant. They differ in where and how position enters the computation.
Sinusoidal (Vaswani et al., 2017) adds a fixed, non-learned vector to each token embedding before attention. For position $pos$ and dimension index $i$:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
These vectors are added to the token embeddings: $x_{pos} = e_{pos} + PE_{pos}$. The design ensures that $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, encoding relative position implicitly.
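The fixed encoding can be sketched in a few lines of plain Python (the function name `sinusoidal_pe` is mine, not from the paper):

```python
import math

def sinusoidal_pe(pos: int, d_model: int) -> list[float]:
    """Fixed sinusoidal encoding for one position (Vaswani et al., 2017, Sec. 3.5).
    Even dimensions get sin, odd dimensions get cos, with geometrically
    spaced frequencies from 1 down to 1/10000."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):            # i plays the role of 2i in the paper
        freq = 1.0 / (10000.0 ** (i / d_model))
        pe[i] = math.sin(pos * freq)
        pe[i + 1] = math.cos(pos * freq)
    return pe

# Added once to the input embeddings, before the first attention layer:
#   x[t] = token_embedding[t] + sinusoidal_pe(t, d_model)
```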
RoPE (Su et al., 2021) applies a rotation matrix to the query and key vectors rather than adding to embeddings. For a pair of dimensions $(2i, 2i+1)$ at position $m$:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$
where $\theta_i = 10000^{-2i/d}$. The same rotation is applied to the keys. The attention score then depends only on the relative position because the rotation matrices combine: $(R_m q)^\top (R_n k) = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k$.
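A minimal sketch of the rotation, with a numerical check of the relative-position property (the names `rope_rotate` and `dot` are mine):

```python
import math

def rope_rotate(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive dimension pairs (2i, 2i+1) by angle pos * theta_i."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = base ** (-i / d)              # theta_i = 10000^{-2i/d}
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# The Q-K dot product depends only on relative position: shifting both
# positions by the same offset leaves the attention score unchanged.
q, k = [1.0, 0.5, -0.3, 0.8], [0.2, -0.1, 0.7, 0.4]
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 7))
score_b = dot(rope_rotate(q, 103), rope_rotate(k, 107))  # same offset of 4
```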
ALiBi (Press et al., 2022) adds no positional information to embeddings or to queries/keys. Instead, it adds a linear bias directly to the attention logits:

$$\text{softmax}\!\left(q_i^\top k_j - m \cdot (i - j)\right), \quad j \le i$$
where $m$ is a head-specific slope. Each attention head uses a different slope from a geometric sequence (for 8 heads: $\tfrac{1}{2}, \tfrac{1}{4}, \ldots, \tfrac{1}{256}$), so different heads attend at different distance scales. The penalty increases linearly with distance, creating a soft locality prior.
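The slope schedule and bias matrix are short enough to write out; a sketch following the paper's recipe for power-of-two head counts (function names are mine):

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Head-specific slopes: geometric sequence starting at 2^(-8/n_heads)
    (Press et al., 2022). For 8 heads: 1/2, 1/4, ..., 1/256."""
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Lower-triangular bias added to causal attention logits:
    -slope * (i - j) for each query i attending to key j <= i."""
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]
```

The bias is added to the logits before softmax; nothing else in the model changes, which is what makes ALiBi trivial to implement.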
Side-by-Side Comparison
| Property | Sinusoidal | RoPE | ALiBi |
|---|---|---|---|
| Where position enters | Added to token embeddings | Applied to Q, K via rotation | Added to attention logits |
| Position type | Absolute (with implicit relative) | Relative (via rotation algebra) | Relative (via distance penalty) |
| Learned parameters | None (a learned variant exists) | None (base frequency is fixed) | None (slopes are fixed) |
| Effect on attention | Indirect (through modified embeddings) | Direct (Q/K dot product encodes distance) | Direct (linear penalty on distance) |
| Length extrapolation | Poor beyond training length | Good with NTK-aware interpolation | Good natively |
| Computational overhead | Negligible (one-time addition) | Small (rotation per layer) | Negligible (bias addition) |
| Used in | Original Transformer, BERT | LLaMA, Gemma, Mistral, Qwen, GPT-NeoX | BLOOM, MPT |
| Year introduced | 2017 | 2021 | 2022 |
Why RoPE Dominates Modern LLMs
RoPE encodes relative position through the geometry of the dot product rather than through additive embeddings. This has two main advantages. First, relative position information is injected at every layer (rotations are applied per-layer to Q and K), not just at the input. Second, the rotation formulation extends naturally to longer sequences through frequency interpolation techniques.
The original RoPE struggles beyond its training context length because high-frequency components oscillate too rapidly at unseen positions. Position interpolation (Chen et al., 2023) and NTK-aware scaling (bloc97, 2023) address this by rescaling the frequency basis. YaRN (Peng et al., 2023) combines interpolation with attention temperature scaling, enabling LLaMA-2 to extrapolate from 4K to 128K tokens. These extensions are why virtually all modern open-weight LLMs (LLaMA 3, Gemma 2, Mistral, Qwen 2) use RoPE.
Why ALiBi Is Simpler but Less Common
ALiBi requires no modification to the embedding layer or to the Q/K computation. It adds a single bias term to the attention matrix, making implementation trivial. The linear distance penalty acts as a soft window: nearby tokens receive higher attention, distant tokens are penalized. Different heads with different slopes create a multi-scale attention pattern.
ALiBi extrapolates well because the linear penalty is defined for all distances, not just those seen during training. However, ALiBi's fixed linear decay cannot capture the complex position-dependent patterns that RoPE's frequency decomposition enables. In practice, models using ALiBi (BLOOM, MPT) have been largely superseded by RoPE-based models in both scale and performance.
Length Extrapolation
The central practical question is: can the model handle sequences longer than those seen during training?
Sinusoidal encodings are defined for all positions, but the model has never seen the high-position vectors during training. Attention patterns learned at positions 0-512 do not transfer reliably to position 8192. Extrapolation fails.
RoPE with raw frequencies also degrades beyond the training length. But the rotation structure enables systematic fixes. Interpolation methods compress the effective position indices to fit within the trained range while preserving relative position information. This has been extensively validated: LLaMA-based models routinely extend from 4K to 100K+ tokens using RoPE interpolation plus continued fine-tuning.
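A sketch of linear position interpolation applied inside the RoPE rotation, assuming a model trained at 4K extended 4x to 16K (the function name and default lengths are illustrative; real deployments also fine-tune after rescaling):

```python
import math

def rope_rotate_interpolated(vec: list[float], pos: int,
                             trained_len: int = 4096,
                             target_len: int = 16384,
                             base: float = 10000.0) -> list[float]:
    """Position interpolation (Chen et al., 2023): compress position indices
    by scale = trained_len / target_len, so every position in the extended
    window maps to an angle the model saw during training."""
    scale = trained_len / target_len
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        angle = (pos * scale) * theta     # rescaled position index
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

NTK-aware scaling instead rescales the base frequency rather than the index, preserving more high-frequency detail, but the structure of the change is the same: only the angles fed to the rotation move.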
ALiBi extrapolates natively because the linear bias is defined for all distances and the model never sees absolute position information. The penalty at distance $k$ is always $m \cdot k$, whether or not $k$ exceeds the training length. This works well for moderate extrapolation, but the linear assumption limits flexibility.
Common Confusions
Sinusoidal encoding does not directly encode relative position
The original paper noted that $PE_{pos+k}$ is a linear function of $PE_{pos}$, suggesting the model could learn relative position. In practice, this linear relationship must be extracted by the attention layers, and learned positional embeddings (BERT, GPT-2) often outperform fixed sinusoidal encodings. RoPE makes relative position explicit in the attention computation.
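The linearity is easy to verify for a single frequency: a fixed 2x2 rotation by angle $k\omega$ maps the (sin, cos) pair at position $p$ to the pair at position $p + k$, via the angle-addition identities (all numbers below are illustrative):

```python
import math

# For one frequency w, the encoding at position p is (sin(p*w), cos(p*w)).
# Applying a fixed rotation by angle k*w -- independent of p -- yields the
# encoding at position p + k. This is the linear map the paper refers to.
w, p, k = 0.01, 42, 7
s, c = math.sin(p * w), math.cos(p * w)
ks, kc = math.sin(k * w), math.cos(k * w)
shifted = (s * kc + c * ks,    # sin((p+k)*w) by angle addition
           c * kc - s * ks)    # cos((p+k)*w) by angle addition
```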
RoPE is not applied to the value vectors
RoPE rotates only the query and key vectors. The value vectors are not modified by position. This is by design: the attention weights (computed from Q and K) determine how values are mixed. Position affects which tokens attend to which, not what information is extracted.
ALiBi does not prevent long-range attention
The linear penalty reduces but does not eliminate attention to distant tokens. A strong content-based match (a large $q_i^\top k_j$) can overcome the distance penalty. ALiBi introduces a bias toward locality, not a hard cutoff. This is conceptually similar to a learned sliding window but with soft boundaries.
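A toy two-token example makes the softness concrete: with a slope of 0.1, a strong match 50 tokens away still out-weighs a weak match one token away (all numbers here are illustrative, not from the paper):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

slope = 0.1
# Two candidate keys for one query: content logit minus ALiBi distance penalty.
logits = [8.0 - slope * 50,    # distant token (distance 50), strong q.k match
          1.0 - slope * 1]     # adjacent token (distance 1), weak q.k match
weights = softmax(logits)
# The distant token still receives the larger attention weight.
```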
Learned positional embeddings are a fourth option
BERT and GPT-2 use learned absolute positional embeddings (a lookup table of vectors added to the input). These are conceptually simpler than sinusoidal encodings but cannot extrapolate at all because unseen positions have no embedding. They are largely historical for modern autoregressive models but still common in encoder-only architectures.
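A sketch of the lookup-table approach, with the failure mode made explicit (the class name is mine, and the random init stands in for trained weights; real models use a trained embedding matrix such as PyTorch's `nn.Embedding`):

```python
import random

class LearnedPositionalEmbedding:
    """Learned absolute positions as a plain lookup table (BERT/GPT-2 style).
    Positions beyond max_len simply have no row -- hence zero extrapolation."""

    def __init__(self, max_len: int, d_model: int, seed: int = 0):
        rng = random.Random(seed)
        # Stand-in for trained parameters: small Gaussian init.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                      for _ in range(max_len)]

    def __call__(self, pos: int) -> list[float]:
        if pos >= len(self.table):
            raise IndexError(
                f"position {pos} beyond max_len {len(self.table)}: "
                "learned absolute embeddings cannot extrapolate")
        return self.table[pos]
```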
References
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. Section 3.5 (Sinusoidal positional encoding).
- Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. (RoPE derivation and properties.)
- Press, O. et al. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR 2022. (ALiBi.)
- Chen, S. et al. (2023). "Extending Context Window of Large Language Models via Positional Interpolation." arXiv:2306.15595. (Position interpolation for RoPE.)
- Peng, B. et al. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071. (NTK-aware interpolation and attention scaling for RoPE.)
- Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Section 2 (RoPE usage in LLaMA architecture.)
- BigScience Workshop et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv:2211.05100. (ALiBi usage in BLOOM.)