
Comparison

Transformer vs. Mamba vs. TTT

Three competing sequence architectures: attention (exact retrieval, quadratic cost), state-space models (linear cost, compressed state), and test-time training (gradient-based state updates, rich memory). Each makes different tradeoffs between memory, compute, and retrieval ability.

What Each Does

Transformers process sequences by letting every token attend to every other token. The KV cache stores all past tokens explicitly. Cost per token: O(nd), where n is sequence length. Memory: grows linearly with context.
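A minimal sketch of one decoding step with a KV cache, showing where the O(nd) per-token cost comes from. The function name and dimensions are illustrative, not from any library:

```python
import numpy as np

def attend_one_token(q, K_cache, V_cache):
    """One decoding step: the new query attends over all n cached
    keys/values, so both dot products and the weighted sum are O(n*d)."""
    scores = K_cache @ q / np.sqrt(q.shape[0])  # (n,) similarities: O(n*d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over all past tokens
    return weights @ V_cache                    # weighted value sum: O(n*d)

d, n = 64, 1000
rng = np.random.default_rng(0)
K_cache = rng.normal(size=(n, d))  # grows one row per token -> O(n*d) memory
V_cache = rng.normal(size=(n, d))
q = rng.normal(size=d)
out = attend_one_token(q, K_cache, V_cache)
print(out.shape)
```

Note that both compute and memory scale with n: every generated token re-reads the entire cache.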

Mamba (SSMs) processes sequences through a learned linear recurrence. The state is a fixed-size matrix h ∈ ℝ^{d×N} (typically N = 16). Cost per token: O(dN). Memory: constant regardless of context length.
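A schematic version of the recurrence, assuming a diagonal decay for A; real Mamba makes A, B, C input-dependent ("selective"), which this sketch omits:

```python
import numpy as np

def ssm_step(h, x, A, B, C):
    """One SSM-style step: the fixed-size state h (d x N) is updated by a
    linear recurrence, O(d*N) per token regardless of sequence length."""
    h = A * h + np.outer(x, B)  # decay old state, write new token: O(d*N)
    y = h @ C                   # read out a d-dim output: O(d*N)
    return h, y

d, N = 8, 16
A = np.full((d, N), 0.9)  # per-channel decay (diagonal A, as a toy stand-in)
B = np.ones(N) * 0.1
C = np.ones(N) / N
h = np.zeros((d, N))
for x in np.eye(d)[:5]:   # feed 5 one-hot tokens
    h, y = ssm_step(h, x, A, B, C)
print(h.shape, y.shape)
```

The state never grows: every token is folded into the same d×N matrix, which is exactly why retrieval is approximate.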

TTT (Test-Time Training) processes sequences by updating a weight matrix W ∈ ℝ^{d_inner×d_inner} via gradient descent on a self-supervised loss at each token. Cost per token: O(d_inner²). Memory: constant (the weight matrix).
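A toy version of the per-token update, using a plain reconstruction loss as the self-supervised objective. This is a sketch of the mechanism, not the exact TTT-Linear objective or learning rate:

```python
import numpy as np

def ttt_step(W, x, lr=0.1):
    """One TTT-style update: treat the hidden state as weights W and take
    one gradient step on a self-supervised loss for the current token."""
    pred = W @ x              # forward pass: O(d_inner^2)
    err = pred - x            # self-supervised target: reconstruct x
    grad = np.outer(err, x)   # gradient of 0.5 * ||W x - x||^2 w.r.t. W
    return W - lr * grad, pred  # one gradient step per token

d_inner = 16
W = np.zeros((d_inner, d_inner))
x = np.ones(d_inner) / 4.0    # ||x||^2 = 1 keeps the step size stable
losses = []
for _ in range(5):
    losses.append(float(np.linalg.norm(W @ x - x)))
    W, _ = ttt_step(W, x)
print(losses)  # strictly decreasing: W absorbs the token
```

The "memory" here is literally the weight matrix: what the state stores is whatever one gradient step per token writes into W.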

Side-by-Side Comparison

| Property | Transformer | Mamba | TTT |
| --- | --- | --- | --- |
| State type | KV cache (all past tokens) | Fixed-size vector/matrix | Weight matrix |
| State size | O(n·d) | O(d·N), N ≈ 16 | O(d_inner²) |
| Per-token cost | O(n·d) | O(d·N) | O(d_inner²) |
| Retrieval ability | Exact (any past token) | Approximate (compressed) | Learned (gradient-based) |
| Information capacity | Unbounded (grows with n) | O(dN) bits | O(d_inner²) bits |
| Long-context scaling | Quadratic cost | Linear cost | Linear cost |
| Parallelizable (training) | Yes | Yes (parallel scan) | Partially |
| In-context learning | Strong (induction heads) | Weak (no content-addressable memory) | Moderate to strong |

When Each Wins

Transformers win when:

- The task needs exact retrieval of arbitrary past tokens.
- In-context learning matters (induction heads give strong ICL).
- Contexts are short enough that quadratic cost is acceptable.

Mamba wins when:

- Contexts are very long and linear per-token cost dominates the budget.
- The task does not require exact retrieval (e.g., audio classification, time-series prediction).
- Constant memory regardless of context length is required.

TTT wins when:

- Contexts are long but the task needs richer memory than a fixed linear recurrence can compress.
- A constant-size state is required alongside moderate-to-strong in-context learning.

The Hybrid Trend

As of 2025-2026, the industry is converging on hybrid architectures that mix these mechanisms within a single model.

The likely endpoint: different layers in the same model use different mechanisms depending on what that layer needs to do. Early layers use SSMs for cheap long-range mixing. Middle layers use TTT or attention for precise retrieval. Final layers use attention for generation quality.
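The layer mix described above can be written as a toy stack; the counts and ordering are illustrative, taken from the paragraph rather than from any shipped model:

```python
# Hypothetical hybrid stack mirroring the paragraph above:
# cheap SSM mixing early, precise retrieval in the middle, attention last.
HYBRID_STACK = (
    ["ssm"] * 4                  # early: cheap long-range mixing
    + ["ttt", "attention"] * 2   # middle: learned / exact retrieval
    + ["attention"] * 2          # final: generation quality
)
print(HYBRID_STACK)
```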

Common Confusions

Watch Out

Mamba is not strictly worse than attention

Mamba processes each token in O(dN) vs attention's O(nd). For context lengths where n > N (which is almost always, since N ≈ 16), Mamba is cheaper per token. The tradeoff is retrieval precision, not compute. On tasks that do not require exact retrieval (audio classification, time-series prediction), Mamba matches transformers at much lower cost.
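A back-of-envelope comparison of the two per-token costs, using the O(nd) and O(dN) figures above (the d and N values are illustrative):

```python
# Per-token cost ratio: attention's O(n*d) vs Mamba's O(d*N).
d, N = 4096, 16
for n in (16, 1024, 128_000):
    attn, mamba = n * d, d * N
    print(f"n={n:>7}: attention ~{attn:,} vs mamba ~{mamba:,} "
          f"({attn / mamba:.0f}x)")
```

The ratio is simply n/N, so attention only matches Mamba's per-token cost when the context is as short as the SSM state (n ≈ 16).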

Watch Out

TTT is not just fine-tuning during inference

TTT updates a single small layer's weights using a self-supervised loss. It does not fine-tune the whole model. The update is fast (one gradient step), local (one layer), and unsupervised (no labels needed). It is closer to online learning than to fine-tuning.
