What Each Does
All three methods inject position information into transformers that are otherwise permutation-invariant. They differ in where and how position enters the computation.
Sinusoidal (Vaswani et al., 2017) adds a fixed, non-learned vector to each token embedding before attention. For position $pos$ and dimension index $i$:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
These vectors are added to the token embeddings: $x_{pos} = e_{pos} + PE_{pos}$. The design ensures that $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, encoding relative position implicitly.
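The fixed encoding can be sketched in a few lines of plain Python (the function name `sinusoidal_pe` is mine, not from the paper):

```python
import math

def sinusoidal_pe(pos: int, d_model: int) -> list[float]:
    """Fixed sinusoidal encoding for one position (Vaswani et al., 2017, Sec. 3.5).
    Even dimensions get sin, odd dimensions get cos, with geometrically
    spaced frequencies from 1 down to 1/10000."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):            # i plays the role of 2i in the paper
        freq = 1.0 / (10000.0 ** (i / d_model))
        pe[i] = math.sin(pos * freq)
        pe[i + 1] = math.cos(pos * freq)
    return pe

# Added once to the input embeddings, before the first attention layer:
#   x[t] = token_embedding[t] + sinusoidal_pe(t, d_model)
```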
RoPE (Su et al., 2021) applies a rotation matrix to the query and key vectors rather than adding to embeddings. For a pair of dimensions $(2i, 2i+1)$ at position $m$:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$
where $\theta_i = 10000^{-2i/d}$. The same rotation is applied to the keys. The attention score then depends only on the relative position because the rotation matrices combine: $(R_m q)^\top (R_n k) = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k$.
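A minimal sketch of the rotation, with a numerical check of the relative-position property (the names `rope_rotate` and `dot` are mine):

```python
import math

def rope_rotate(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive dimension pairs (2i, 2i+1) by angle pos * theta_i."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = base ** (-i / d)              # theta_i = 10000^{-2i/d}
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# The Q-K dot product depends only on relative position: shifting both
# positions by the same offset leaves the attention score unchanged.
q, k = [1.0, 0.5, -0.3, 0.8], [0.2, -0.1, 0.7, 0.4]
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 7))
score_b = dot(rope_rotate(q, 103), rope_rotate(k, 107))  # same offset of 4
```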
ALiBi (Press et al., 2022) adds no positional information to embeddings or to queries/keys. Instead, it adds a linear bias directly to the attention logits:

$$\text{softmax}\!\left(q_i^\top k_j - m \cdot (i - j)\right), \quad j \le i$$
where $m$ is a head-specific slope. Each attention head uses a different slope from a geometric sequence (for 8 heads: $\tfrac{1}{2}, \tfrac{1}{4}, \ldots, \tfrac{1}{256}$), so different heads attend at different distance scales. The penalty increases linearly with distance, creating a soft locality prior.
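The slope schedule and bias matrix are short enough to write out; a sketch following the paper's recipe for power-of-two head counts (function names are mine):

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Head-specific slopes: geometric sequence starting at 2^(-8/n_heads)
    (Press et al., 2022). For 8 heads: 1/2, 1/4, ..., 1/256."""
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Lower-triangular bias added to causal attention logits:
    -slope * (i - j) for each query i attending to key j <= i."""
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]
```

The bias is added to the logits before softmax; nothing else in the model changes, which is what makes ALiBi trivial to implement.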
Side-by-Side Comparison
| Property | Sinusoidal | RoPE | ALiBi |
|---|---|---|---|
| Where position enters | Added to token embeddings | Applied to Q, K via rotation | Added to attention logits |
| Position type | Absolute (with implicit relative) | Relative (via rotation algebra) | Relative (via distance penalty) |
| Learned parameters | None (a learned variant exists) | None (base frequency is fixed) | None (slopes are fixed) |
| Effect on attention | Indirect (through modified embeddings) | Direct (Q/K dot product encodes distance) | Direct (linear penalty on distance) |
| Length extrapolation | Poor beyond training length | Good with NTK-aware interpolation | Good natively |
| Computational overhead | Negligible (one-time addition) | Small (rotation per layer) | Negligible (bias addition) |
| Used in | Original Transformer, BERT | LLaMA, Gemma, Mistral, Qwen, GPT-NeoX | BLOOM, MPT |
| Year introduced | 2017 | 2021 | 2022 |
Why RoPE Dominates Modern LLMs
RoPE encodes relative position through the geometry of the dot product rather than through additive embeddings. This has two main advantages. First, relative position information is injected at every layer (rotations are applied per-layer to Q and K), not just at the input. Second, the rotation formulation extends naturally to longer sequences through frequency interpolation techniques.
The original RoPE struggles beyond its training context length because high-frequency components oscillate too rapidly at unseen positions. Position interpolation (Chen et al., 2023) and NTK-aware scaling (bloc97, 2023) address this by rescaling the frequency basis. YaRN (Peng et al., 2023) combines interpolation with attention temperature scaling, enabling LLaMA-2 to extrapolate from 4K to 128K tokens. These extensions are why virtually all modern open-weight LLMs (LLaMA 3, Gemma 2, Mistral, Qwen 2) use RoPE.
Why ALiBi Is Simpler but Less Common
ALiBi requires no modification to the embedding layer or to the Q/K computation. It adds a single bias term to the attention matrix, making implementation trivial. The linear distance penalty acts as a soft window: nearby tokens receive higher attention, distant tokens are penalized. Different heads with different slopes create a multi-scale attention pattern.
ALiBi extrapolates well because the linear penalty is defined for all distances, not just those seen during training. However, ALiBi's fixed linear decay cannot capture the complex position-dependent patterns that RoPE's frequency decomposition enables. In practice, models using ALiBi (BLOOM, MPT) have been largely superseded by RoPE-based models in both scale and performance.
Length Extrapolation
The central practical question is: can the model handle sequences longer than those seen during training?
Sinusoidal encodings are defined for all positions, but the model has never seen the high-position vectors during training. Attention patterns learned at positions 0-512 do not transfer reliably to position 8192. Extrapolation fails.
RoPE with raw frequencies also degrades beyond the training length. But the rotation structure enables systematic fixes. Interpolation methods compress the effective position indices to fit within the trained range while preserving relative position information. This has been extensively validated: LLaMA-based models routinely extend from 4K to 100K+ tokens using RoPE interpolation plus continued fine-tuning.
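A sketch of linear position interpolation applied inside the RoPE rotation, assuming a model trained at 4K extended 4x to 16K (the function name and default lengths are illustrative; real deployments also fine-tune after rescaling):

```python
import math

def rope_rotate_interpolated(vec: list[float], pos: int,
                             trained_len: int = 4096,
                             target_len: int = 16384,
                             base: float = 10000.0) -> list[float]:
    """Position interpolation (Chen et al., 2023): compress position indices
    by scale = trained_len / target_len, so every position in the extended
    window maps to an angle the model saw during training."""
    scale = trained_len / target_len
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        angle = (pos * scale) * theta     # rescaled position index
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

NTK-aware scaling instead rescales the base frequency rather than the index, preserving more high-frequency detail, but the structure of the change is the same: only the angles fed to the rotation move.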
ALiBi extrapolates natively because the linear bias is defined for all distances and the model never sees absolute position information. The penalty at distance $k$ is always $m \cdot k$, whether or not $k$ exceeds the training length. This works well for moderate extrapolation, but the linear assumption limits flexibility.
Common Confusions
Sinusoidal encoding does not directly encode relative position
The original paper noted that $PE_{pos+k}$ is a linear function of $PE_{pos}$, suggesting the model could learn relative position. In practice, this linear relationship must be extracted by the attention layers, and learned positional embeddings (BERT, GPT-2) often outperform fixed sinusoidal encodings. RoPE makes relative position explicit in the attention computation.
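The linearity is easy to verify for a single frequency: a fixed 2x2 rotation by angle $k\omega$ maps the (sin, cos) pair at position $p$ to the pair at position $p + k$, via the angle-addition identities (all numbers below are illustrative):

```python
import math

# For one frequency w, the encoding at position p is (sin(p*w), cos(p*w)).
# Applying a fixed rotation by angle k*w -- independent of p -- yields the
# encoding at position p + k. This is the linear map the paper refers to.
w, p, k = 0.01, 42, 7
s, c = math.sin(p * w), math.cos(p * w)
ks, kc = math.sin(k * w), math.cos(k * w)
shifted = (s * kc + c * ks,    # sin((p+k)*w) by angle addition
           c * kc - s * ks)    # cos((p+k)*w) by angle addition
```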
RoPE is not applied to the value vectors
RoPE rotates only the query and key vectors. The value vectors are not modified by position. This is by design: the attention weights (computed from Q and K) determine how values are mixed. Position affects which tokens attend to which, not what information is extracted.
ALiBi does not prevent long-range attention
The linear penalty reduces but does not eliminate attention to distant tokens. A strong content-based match (a large $q_i^\top k_j$) can overcome the distance penalty. ALiBi introduces a bias toward locality, not a hard cutoff. This is conceptually similar to a learned sliding window but with soft boundaries.
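A toy two-token example makes the softness concrete: with a slope of 0.1, a strong match 50 tokens away still out-weighs a weak match one token away (all numbers here are illustrative, not from the paper):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

slope = 0.1
# Two candidate keys for one query: content logit minus ALiBi distance penalty.
logits = [8.0 - slope * 50,    # distant token (distance 50), strong q.k match
          1.0 - slope * 1]     # adjacent token (distance 1), weak q.k match
weights = softmax(logits)
# The distant token still receives the larger attention weight.
```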
Learned positional embeddings are a fourth option
BERT and GPT-2 use learned absolute positional embeddings (a lookup table of vectors added to the input). These are conceptually simpler than sinusoidal encodings but cannot extrapolate at all because unseen positions have no embedding. They are largely historical for modern autoregressive models but still common in encoder-only architectures.
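A sketch of the lookup-table approach, with the failure mode made explicit (the class name is mine, and the random init stands in for trained weights; real models use a trained embedding matrix such as PyTorch's `nn.Embedding`):

```python
import random

class LearnedPositionalEmbedding:
    """Learned absolute positions as a plain lookup table (BERT/GPT-2 style).
    Positions beyond max_len simply have no row -- hence zero extrapolation."""

    def __init__(self, max_len: int, d_model: int, seed: int = 0):
        rng = random.Random(seed)
        # Stand-in for trained parameters: small Gaussian init.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                      for _ in range(max_len)]

    def __call__(self, pos: int) -> list[float]:
        if pos >= len(self.table):
            raise IndexError(
                f"position {pos} beyond max_len {len(self.table)}: "
                "learned absolute embeddings cannot extrapolate")
        return self.table[pos]
```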
References
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. Section 3.5 (Sinusoidal positional encoding).
- Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. (RoPE derivation and properties.)
- Press, O. et al. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR 2022. (ALiBi.)
- Chen, S. et al. (2023). "Extending Context Window of Large Language Models via Positional Interpolation." arXiv:2306.15595. (Position interpolation for RoPE.)
- Peng, B. et al. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071. (NTK-aware interpolation and attention scaling for RoPE.)
- Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Section 2 (RoPE usage in LLaMA architecture.)
- BigScience Workshop et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv:2211.05100. (ALiBi usage in BLOOM.)