What Each Does
Transformers process sequences by letting every token attend to every other token. The KV cache stores all past tokens explicitly. Cost per token: $O(nd)$, where $n$ is the sequence length and $d$ is the model dimension. Memory: $O(nd)$, growing linearly with context.
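To make the linear-in-$n$ per-token cost concrete, here is a minimal single-head attention step with a KV cache. Shapes, the identity key/value projections, and the toy dimension are illustrative assumptions, not any real model's configuration:

```python
# Toy single-head attention step with a KV cache (illustrative shapes).
# Each new token attends over every cached token: O(n*d) work, O(n*d) memory.
import numpy as np

d = 8                      # model dimension (toy)
rng = np.random.default_rng(0)
K_cache, V_cache = [], []  # grows by one entry per token

def attend(x):
    """Append x's key/value to the cache, then attend over all n entries."""
    K_cache.append(x)      # identity K/V projections, for brevity
    V_cache.append(x)
    K = np.stack(K_cache)            # (n, d)
    V = np.stack(V_cache)            # (n, d)
    scores = K @ x / np.sqrt(d)      # (n,) -- work grows with n
    w = np.exp(scores - scores.max())
    w /= w.sum()                     # softmax over all past tokens
    return w @ V                     # weighted sum: exact retrieval possible

for t in range(5):
    y = attend(rng.standard_normal(d))
print(len(K_cache))  # cache holds all 5 past tokens
```

Note that nothing is ever discarded: any past token can be retrieved exactly, which is precisely what the fixed-state architectures below give up.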
Mamba (SSMs) processes sequences through a learned linear recurrence. The state is a fixed-size matrix (typically $d \times N$, with $N \approx 16$). Cost per token: $O(dN)$. Memory: constant regardless of context length.
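The fixed-size recurrence can be sketched as a diagonal linear state-space update. This is the generic SSM idea, not Mamba's selective (input-dependent) parameterization; all values here are illustrative:

```python
# Minimal diagonal linear state-space recurrence (generic SSM sketch,
# not Mamba's selective parameterization). State is a fixed d x N matrix.
import numpy as np

d, N = 8, 16               # channels x state dim -- constant memory
rng = np.random.default_rng(0)
A = np.full((d, N), 0.9)   # decay per (channel, state) pair
B = rng.standard_normal((d, N)) * 0.1
C = rng.standard_normal((d, N)) * 0.1
h = np.zeros((d, N))       # the entire memory of the sequence so far

def step(x):
    """O(d*N) per token, independent of how many tokens came before."""
    global h
    h = A * h + B * x[:, None]   # linear recurrence: old info decays
    return (C * h).sum(axis=1)   # readout: (d,)

for t in range(1000):            # 1000 tokens; memory never grows
    y = step(rng.standard_normal(d))
print(h.shape)  # (8, 16) regardless of sequence length
```

Because every token is compressed into the same $d \times N$ state, retrieval is lossy: old information decays rather than being stored verbatim.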
TTT (Test-Time Training) processes sequences by updating a weight matrix via gradient descent on a self-supervised loss at each token. Cost per token: $O(d^2)$. Memory: constant (the $d \times d$ weight matrix).
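A per-token TTT-style update can be sketched with a linear inner model and one SGD step per token. The corruption scheme, learning rate, and shapes are illustrative assumptions, not the paper's exact objective:

```python
# Sketch of a TTT-style update: linear inner model, one SGD step per token.
# Self-supervised loss here (an assumption): reconstruct the token from a
# noisy view of itself, ||W x_corrupt - x||^2.
import numpy as np

d = 8
rng = np.random.default_rng(0)
W = np.zeros((d, d))       # fixed-size state: the inner model's weights
lr = 0.05                  # illustrative step size

def ttt_step(x):
    """One gradient step on the reconstruction loss, then read out W x."""
    global W
    x_corrupt = x + 0.1 * rng.standard_normal(d)
    err = W @ x_corrupt - x             # residual of the squared loss
    W -= lr * np.outer(err, x_corrupt)  # O(d^2) per token, constant memory
    return W @ x                        # output uses the updated weights

for t in range(100):
    y = ttt_step(rng.standard_normal(d))
print(W.shape)  # (8, 8): state size is fixed no matter the context length
```

The state is the same size as Mamba's in spirit (constant), but it is written by gradient descent rather than a fixed linear recurrence, which is what makes the compression "learned".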
Side-by-Side Comparison
| Property | Transformer | Mamba | TTT |
|---|---|---|---|
| State type | KV cache (all past tokens) | Fixed-size vector/matrix | Weight matrix |
| State size | $O(nd)$ | $O(dN)$ | $O(d^2)$ |
| Per-token cost | $O(nd)$ | $O(dN)$ | $O(d^2)$ |
| Retrieval ability | Exact (any past token) | Approximate (compressed) | Learned (gradient-based) |
| Information capacity | Unbounded (grows with $n$) | $O(dN)$ bits | $O(d^2)$ bits |
| Long-context scaling | Quadratic cost | Linear cost | Linear cost |
| Parallelizable (training) | Yes | Yes (parallel scan) | Partially |
| In-context learning | Strong (induction heads) | Weak (no content-addressable memory) | Moderate to strong |
When Each Wins
Transformers win when:
- Precise retrieval matters (copying, lookup, in-context few-shot learning)
- Sequence length is moderate (< 8K tokens)
- You need the strongest possible language modeling quality
- Hardware supports efficient attention (FlashAttention on modern GPUs)
Mamba wins when:
- Sequence length is very long (> 32K tokens) and cost must be linear
- The task is primarily about aggregation, not retrieval (audio, genomics, time series)
- Inference latency per token must be constant regardless of context
TTT wins when:
- Contexts are very long AND retrieval of specific information is required
- The input distribution changes within the sequence (domain shift within a document)
- You want the model to adapt its processing to the specific input at inference time
The Hybrid Trend
As of 2025-2026, the industry is converging on hybrid architectures:
- Jamba (AI21): interleaves Mamba layers with attention layers
- Mamba-2: shows that attention and SSMs are instances of a unified mathematical framework (structured state-space duality)
- TTT layers inside transformers: replacing some attention layers with TTT layers
The likely endpoint: different layers in the same model use different mechanisms depending on what that layer needs to do. Early layers use SSMs for cheap long-range mixing. Middle layers use TTT or attention for precise retrieval. Final layers use attention for generation quality.
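The layer-by-layer division of labor described above can be written down as a simple schedule. The split points and the alternation are hypothetical choices for illustration, not any shipped model's configuration:

```python
# Hypothetical hybrid layer schedule in the spirit described above:
# cheap SSM mixing early, retrieval-capable layers in the middle,
# attention near the output. Proportions are illustrative assumptions.
def layer_schedule(n_layers=24):
    schedule = []
    for i in range(n_layers):
        if i < n_layers // 3:
            schedule.append("ssm")         # cheap long-range mixing
        elif i < 2 * n_layers // 3:
            # alternate TTT and attention for precise retrieval
            schedule.append("ttt" if i % 2 else "attention")
        else:
            schedule.append("attention")   # generation quality
    return schedule

print(layer_schedule())
```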
Common Confusions
Mamba is not strictly worse than attention
Mamba processes each token in $O(dN)$ vs attention's $O(nd)$. For context lengths where $N < n$ (which is almost always, since $N \approx 16$), Mamba is cheaper per token. The tradeoff is retrieval precision, not compute. On tasks that do not require exact retrieval (audio classification, time-series prediction), Mamba matches transformers at much lower cost.
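A back-of-envelope check of this ratio, using a representative model dimension and the typical SSM state size (real kernels differ by constant factors):

```python
# Per-token cost ratio: attention O(n*d) vs SSM-style O(d*N).
# d and N are representative values; actual kernels add constant factors.
d, N = 4096, 16            # model dim, SSM state dim

def attn_cost(n):          # attend over n cached tokens
    return n * d

def ssm_cost(n):           # independent of n
    return d * N

for n in (16, 1024, 65536):
    # the d's cancel: attention is n/N times costlier per token
    print(n, attn_cost(n) // ssm_cost(n))
```

At $n = 65{,}536$ the per-token gap is $n/N = 4096\times$, which is why linear-cost architectures dominate at very long contexts even when their quality per token is slightly lower.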
TTT is not just fine-tuning during inference
TTT updates a single small layer's weights using a self-supervised loss. It does not fine-tune the whole model. The update is fast (one gradient step), local (one layer), and unsupervised (no labels needed). It is closer to online learning than to fine-tuning.
References
- Vaswani et al., "Attention Is All You Need" (2017). The transformer.
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
- Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (ICML 2024)
- Dao & Gu, "Transformers are SSMs" (2024). The Mamba-2 / unification paper.