LLM Construction
Forgetting Transformer (FoX)
FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation.
Why This Matters
Standard softmax attention has no built-in notion of recency. Position is injected externally through positional embeddings (sinusoidal, learned, RoPE, ALiBi), and the attention weights themselves are computed from content alone. Recurrent models do the opposite: LSTMs and state-space models carry a learned forget gate that decides, at each step, how much of the past to keep.
FoX (Forgetting Transformer) asks whether softmax attention can borrow that mechanism. The answer is yes, with a small and clean modification: compute a scalar forget gate $f_t$ per token and per head, take a running product of these gates between positions $j$ and $i$, and add the log of this product to the pre-softmax attention logits. That is it. No positional embeddings are required. The resulting model trains at the same speed as a standard transformer because the modification is compatible with the FlashAttention kernel. On long-context language modeling and length extrapolation, FoX beats a tuned RoPE baseline. On short-context downstream tasks it performs competitively.
The paper is Lin, Nikishin, He, and Courville, Forgetting Transformer: Softmax Attention with a Forget Gate, ICLR 2025 (arXiv:2503.02130). The contribution is narrow but instructive: a single data-dependent scalar per token, injected in the right place, gives transformers the recency inductive bias of recurrent models without giving up parallel training or the FlashAttention fast path.
Mental Model
Think of softmax attention as a content-based lookup with no decay: token $i$ can attend equally to any earlier token $j$, and nothing in the attention weight itself punishes distance. Positional embeddings fix this indirectly by encoding position as a content signal.
FoX adds a direct, multiplicative decay. Each token $t$ emits a forget value $f_t \in (0, 1)$ per attention head. The "keep mass" from token $j$ to token $i$ is the product

$$F_{ij} = \prod_{l=j+1}^{i} f_l.$$

If every intermediate $f_l$ is near 1, the keep mass is near 1 and FoX recovers ordinary attention. If one intermediate $f_l$ is near 0, it closes the gate and the contribution from all tokens before it is suppressed for all future queries. The gate is data-dependent, so the model can decide, based on the current input, when to wipe context.

In log space this decay becomes additive. That is the key to efficient implementation: adding $\log F_{ij} = \sum_{l=j+1}^{i} \log f_l$ to the attention logits is the same kind of operation ALiBi already performs, except the slope is data-dependent instead of a fixed head-specific constant.
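A minimal numerical sketch of this equivalence (names and shapes are illustrative, not from the paper): the multiplicative keep mass between two positions equals the exponential of the summed log gates, so the decay can be applied as an additive logit bias.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.uniform(0.5, 1.0, size=8)  # per-token forget gates, one head

i, j = 7, 2
keep_mass = np.prod(f[j + 1 : i + 1])        # f_{j+1} * ... * f_i
log_bias = np.sum(np.log(f[j + 1 : i + 1]))  # additive form, sum of log gates

# multiplicative decay == exp of the additive log-space bias
assert np.isclose(keep_mass, np.exp(log_bias))
```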
Formal Setup
Forget Gate (FoX)
For each attention head and each token position $t$, the forget gate $f_t$ is a scalar in $(0, 1)$ computed from the current hidden state $x_t$:

$$f_t = \sigma(w_f^\top x_t + b_f),$$

where $\sigma$ is the sigmoid, and $w_f$, $b_f$ are per-head learnable parameters. The gate is a scalar per head, not a vector, so its parameter cost is negligible.
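A sketch of the gate computation for all heads at once, assuming hidden size `d` and `H` heads, with `w_f` of shape `(H, d)` and `b_f` of shape `(H,)` as the per-head learnable parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gates(x, w_f, b_f):
    """x: (L, d) hidden states -> gates of shape (L, H), each in (0, 1)."""
    return sigmoid(x @ w_f.T + b_f)

rng = np.random.default_rng(0)
L, d, H = 6, 16, 4
x = rng.standard_normal((L, d))
w_f = rng.standard_normal((H, d)) * 0.1  # small init keeps gates near sigmoid(b_f)
b_f = np.zeros(H)

f = forget_gates(x, w_f, b_f)
assert f.shape == (L, H) and np.all((f > 0) & (f < 1))
```

One scalar per head per position is the entire parameter cost: `H * (d + 1)` extra weights per layer.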
Forgetting Attention
Let $q_i, k_j, v_j \in \mathbb{R}^d$ denote the query, key, and value vectors at positions $i$ and $j$ for a fixed head. Define the cumulative log forget gate

$$D_{ij} = \sum_{l=j+1}^{i} \log f_l, \qquad j \le i.$$

Forgetting Attention is the causal attention

$$o_i = \sum_{j \le i} \operatorname{softmax}_j\!\left(\frac{q_i^\top k_j}{\sqrt{d}} + D_{ij}\right) v_j.$$

Equivalently, letting $F_{ij} = \exp(D_{ij}) = \prod_{l=j+1}^{i} f_l$,

$$o_i = \frac{\sum_{j \le i} F_{ij} \exp(q_i^\top k_j / \sqrt{d})\, v_j}{\sum_{j \le i} F_{ij} \exp(q_i^\top k_j / \sqrt{d})}.$$

The factor $F_{ij} \in (0, 1]$ down-weights the unnormalized attention score between positions $j$ and $i$, with no change to the softmax denominator structure.

The modification lives entirely inside the attention logits. Adding a bias $D_{ij}$ to the logit $q_i^\top k_j / \sqrt{d}$ is the same access pattern FlashAttention already handles for causal masks and ALiBi, so FoX reuses the FlashAttention kernel without a new fused implementation.
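A single-head reference sketch (illustrative names, not the paper's code): the cumulative log bias is built from one prefix sum over the log gates, then added to the logits before an otherwise standard causal softmax. Setting all gates to 1 zeroes the bias.

```python
import numpy as np

def forgetting_attention(q, k, v, f):
    """q, k, v: (L, d); f: (L,) forget gates in (0, 1]. Single head."""
    L, d = q.shape
    c = np.cumsum(np.log(f))         # prefix sums c_t = sum_{l<=t} log f_l
    D = c[:, None] - c[None, :]      # D_ij = sum_{l=j+1}^{i} log f_l
    logits = q @ k.T / np.sqrt(d) + D
    mask = np.tril(np.ones((L, L), dtype=bool))
    logits = np.where(mask, logits, -np.inf)     # causal mask
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
L, d = 5, 8
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
f = rng.uniform(0.6, 1.0, size=L)

out = forgetting_attention(q, k, v, f)
base = forgetting_attention(q, k, v, np.ones(L))  # f = 1: plain causal attention
assert out.shape == (L, d)
```

The only extra work versus standard attention is the length-`L` prefix sum and the `(L, L)` bias; a fused kernel computes the bias on the fly instead of materializing `D`.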
Main Theorems
Forgetting Attention is Attention With Log-Domain Decay
Statement
Let $D_{ij} = \sum_{l=j+1}^{i} \log f_l$ with $D_{ii} = 0$. Then Forgetting Attention can be written as standard softmax attention on modified logits

$$\tilde{s}_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + D_{ij}.$$

In particular, when $f_l = 1$ for all $l$ the bias vanishes and FoX reduces to standard causal softmax attention.
Intuition
The forget gate never enters the softmax denominator separately. It is folded into the logits as a position-dependent bias. This is the structural reason FoX keeps FlashAttention compatibility: the kernel only needs one extra bias term per $(i, j)$ pair, which can be computed on the fly from a prefix sum of $\log f_l$.
Proof Sketch
Start from the Forgetting Attention definition and factor $F_{ij} = \exp(D_{ij})$. Then $F_{ij} \exp(q_i^\top k_j / \sqrt{d}) = \exp(q_i^\top k_j / \sqrt{d} + D_{ij})$. Both numerator and denominator share the same exponential form, so the ratio is exactly the softmax of the shifted logits $q_i^\top k_j / \sqrt{d} + D_{ij}$. When $f_l = 1$ for all $l$, $D_{ij} = 0$, recovering standard attention.
Why It Matters
This reformulation is what makes FoX trainable at transformer speed. Prefix sums of scalars are cheap, and the shifted logits drop into any attention kernel that supports causal masking plus a bias. Compare to linear attention variants where the decay must be baked into a custom recurrence: FoX keeps the softmax and the associativity of the kernel untouched.
Failure Mode
If all gates saturate near 1, FoX becomes a plain transformer with wasted parameters. If all gates saturate near 0, only the immediate previous token contributes, collapsing the model to a very short effective context. Initialization that pushes $b_f$ to a positive value (analogous to LSTM forget-gate bias tricks) is important to keep gates open early in training.
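A tiny sketch of the bias trick (the specific value 4.0 is a hypothetical choice, not from the paper): with small weights, the pre-activation is near zero at initialization, so a positive bias puts the gate close to 1 and keeps context flowing early in training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b_f_init = 4.0       # hypothetical positive forget-gate bias
pre_activation = 0.0  # w_f^T x_t ~ 0 at init when w_f is small
gate_at_init = sigmoid(pre_activation + b_f_init)

assert gate_at_init > 0.98  # gates start nearly open
```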
Recency Bias of Forgetting Attention
Statement
Fix a head and a query position $i$. Suppose the forget gates satisfy $f_l \le \bar{f} < 1$ for all $l$. Then the normalized attention weight on the key at position $j \le i$ is at most

$$\alpha_{ij} \le \frac{\bar{f}^{\,i-j} \exp(q_i^\top k_j / \sqrt{d})}{Z_i}.$$

In particular, the effective attention weight decays at least geometrically in the gap $i - j$, uniformly in the content similarity $q_i^\top k_j$, where $Z_i$ is the softmax normalizer.
Intuition
Once the gate stays strictly below 1, FoX imposes a data-independent upper bound on how much a far-away token can contribute, regardless of how well its key matches the query. Content similarity can still push attention toward older tokens, but only up to the ceiling set by the cumulative product. This is a provable recency bias that standard attention lacks.
Proof Sketch
Each factor in $F_{ij} = \prod_{l=j+1}^{i} f_l$ is at most $\bar{f}$, and there are $i - j$ factors, giving $F_{ij} \le \bar{f}^{\,i-j}$. The numerator contribution of key $j$ is $F_{ij} \exp(q_i^\top k_j / \sqrt{d})$ and is divided by the normalizer $Z_i$, so the normalized weight on $j$ is at most $\bar{f}^{\,i-j} \exp(q_i^\top k_j / \sqrt{d}) / Z_i$.
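The bound can be checked numerically (single head, illustrative names): draw gates below a ceiling, compute the exact attention weights for the last query, and verify each weight sits under its geometric envelope.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 8, 4
q = rng.standard_normal((L, d))
k = rng.standard_normal((L, d))
fbar = 0.9
f = rng.uniform(0.5, fbar, size=L)  # all gates <= fbar < 1

i = L - 1                            # last query position: all keys are causal
c = np.cumsum(np.log(f))
s = q[i] @ k.T / np.sqrt(d)          # content scores s_{ij} for j = 0..i
F = np.exp(c[i] - c)                 # cumulative products F_{ij}
num = F * np.exp(s)
Z = num.sum()                        # normalizer Z_i
w = num / Z                          # exact attention weights

gaps = i - np.arange(L)
bound = fbar**gaps * np.exp(s) / Z   # geometric ceiling from the theorem
assert np.all(w <= bound + 1e-12)
```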
Why It Matters
Length extrapolation and stable long-context behavior require the model to refuse to attend to arbitrarily old tokens with arbitrary confidence. FoX enforces this structurally. This is one reason FoX generalizes past its training length without ALiBi-style slope tuning or RoPE tricks.
Failure Mode
If the gate is not well-regulated, a single near-zero $f_t$ kills all information from before position $t$. For tasks that require a specific long-range lookup (e.g., needle-in-a-haystack retrieval), this can be harmful. The paper reports FoX holds up on retrieval tasks, but the theoretical risk is real: the gate must learn to stay open for task-relevant anchors.
The Pro Block
The paper also introduces a "Pro" block design that layers several small components from recurrent and efficient-attention literature around the Forgetting Attention core. The components are:
- Output gate: a sigmoid gate applied to the attention output before the residual, similar to gated linear attention variants.
- Output normalization: an RMSNorm on the attention output prior to the output projection.
- QK-norm: RMSNorm applied to queries and keys before the dot product, which stabilizes logits at long context.
- KV-shift: a simplified, learned, data-dependent token-shift on keys and values, borrowed from the short-convolution tradition in RWKV and Mamba-like designs.
Each piece is cheap in parameters and compatible with standard attention kernels. The paper reports that Pro blocks improve both FoX and the ordinary transformer, but the improvement for FoX is larger. The takeaway: the forget gate is the architectural change of interest, and the Pro block is a collection of well-trodden stabilizers that happen to pair well with it.
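Two of the stabilizers can be sketched in a few lines (names, shapes, and weight scales are illustrative assumptions, not the paper's exact implementation): QK-norm applies RMSNorm to queries and keys so logit magnitudes stay bounded, and the output gate multiplies the attention output by a sigmoid of the current hidden state before the residual.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, for brevity."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
L, d = 4, 8
q = rng.standard_normal((L, d))
k = rng.standard_normal((L, d))
attn_out = rng.standard_normal((L, d))  # stand-in for the attention output
x = rng.standard_normal((L, d))         # current hidden states
w_g = rng.standard_normal((d, d)) * 0.1  # hypothetical output-gate weights

qn, kn = rms_norm(q), rms_norm(k)        # QK-norm: unit-RMS vectors bound logits
gated = sigmoid(x @ w_g) * attn_out      # output gate before the residual add

assert np.allclose(np.sqrt(np.mean(qn**2, axis=-1)), 1.0, atol=1e-3)
assert gated.shape == (L, d)
```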
Historical Context
Multiplicative gating in sequence models goes back to LSTM (Hochreiter and Schmidhuber, 1997), which introduced input, forget, and output gates to stabilize gradient flow through long sequences. Highway Networks (Srivastava et al., 2015) carried the idea into feedforward depth. GRU (Cho et al., 2014) compressed LSTM gating to a single update gate.
In attention, ALiBi (Press et al., 2022) added a fixed, non-learned linear bias in the attention logits that decays with distance. FoX can be read as a data-dependent generalization of ALiBi: instead of a constant per-head slope, each head gets a sequence of gates whose cumulative log acts as the bias. Linear-attention variants like RetNet, GLA, and Mamba reached similar conclusions from a different direction, building recurrent state-space models with data-dependent decays. FoX keeps softmax attention and inherits its expressivity while borrowing the decay primitive.
Common Confusions
FoX does not gate the FFN
An earlier draft of this page (and some secondhand descriptions online) claim FoX adds a forget gate to the feed-forward block. That is wrong. The forget gate modifies the unnormalized attention scores inside the softmax, not the FFN output. The FFN block in FoX is the standard MLP or SwiGLU that the baseline uses.
The gate is a scalar per head, not a vector per dimension
Unlike LSTM forget gates, which are vectors that gate each hidden dimension independently, the FoX forget gate is a single scalar per attention head at each position. The scalar multiplies the whole attention contribution between positions $j$ and $i$. This is what makes the cumulative product well-defined and FlashAttention-friendly.
FoX replaces positional embeddings, not positional information
FoX is trained and evaluated without RoPE, ALiBi, or sinusoidal positional embeddings. It is not the case that FoX is position-agnostic: the cumulative product $F_{ij}$ depends on the gap $i - j$ and on the intermediate content, which gives the model an implicit, data-shaped sense of distance. The paper's claim is that this implicit signal is strong enough to make explicit positional embeddings unnecessary.
FoX is not a linear-attention method
FoX keeps softmax over the attention logits. Its decay enters as an additive bias in log space, not as a state update in a recurrence. That means FoX retains the full training cost of attention, but also retains its expressivity. Compare to Mamba or RWKV, which use recurrent state to achieve linear-time processing at the cost of changing the computation.
Exercises
Problem
Show that when every forget gate $f_t = 1$, Forgetting Attention is exactly equal to standard causal softmax attention. Then show that when every $f_t = \bar{f}$ is a head-specific constant, Forgetting Attention reduces to attention with an ALiBi-like linear bias. Give the exact slope.
Problem
Let $f_1, \dots, f_L$ be the per-head forget gates, and let $D_{ij} = \sum_{l=j+1}^{i} \log f_l$. Suppose your hardware only supports causal attention kernels with a single additive bias term per $(i, j)$ pair. Describe an $O(L)$ preprocessing step that allows Forgetting Attention to be computed by such a kernel with no change to the kernel itself. Why does this preprocessing not apply to a FFN-level forget gate?
References
Canonical:
- Lin, Nikishin, He, and Courville, Forgetting Transformer: Softmax Attention with a Forget Gate (2025), arXiv:2503.02130. Published at ICLR 2025. Primary source for FoX, Forgetting Attention, and the Pro block.
- Hochreiter and Schmidhuber, Long Short-Term Memory (1997), Section 2. Origin of the forget gate in recurrent networks.
Context for attention-level decay:
- Press, Smith, and Lewis, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ICLR 2022). ALiBi: fixed, non-data-dependent linear bias on attention logits. FoX generalizes this.
- Su, Lu, Pan, et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021), Sections 3 and 4. RoPE is the standard baseline FoX drops.
Context for recurrent decay:
- Gu and Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023), Sections 3.2 and 3.5. Data-dependent selective decay in a state-space model, the closest non-attention counterpart to the FoX gate.
- Peng et al., RWKV: Reinventing RNNs for the Transformer Era (EMNLP Findings 2023), Sections 4 and 5. Data-dependent decay in a recurrent formulation.
Infrastructure:
- Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023). The kernel FoX reuses via its additive-bias structure.
Further reading:
- Jurafsky and Martin, Speech and Language Processing (3rd ed. draft), Chapters 9 and 10. Background on attention and long-context modeling.
Next Topics
- Attention sinks and retrieval decay: empirical position-dependent failure modes that FoX partially addresses through data-dependent decay.
- Residual stream and transformer internals: how FoX's attention output composes with the residual stream and where the Pro block's output gate and output norm sit.
- Recurrent neural networks: the LSTM forget-gate lineage that FoX adapts to attention logits.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Recurrent Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Transformer Architecture (Layer 4)