
Comparison

Transformer vs. Mamba vs. TTT

Three competing sequence architectures: attention (exact retrieval, quadratic cost), state-space models (linear cost, compressed state), and test-time training (gradient-based state updates, rich memory). Each makes different tradeoffs between memory, compute, and retrieval ability.

What Each Does

Transformers process sequences by letting every token attend to every other token. The KV cache stores all past tokens explicitly. Cost per token: O(nd), where n is sequence length. Memory: grows linearly with context.
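A minimal sketch of one decoding step with a KV cache, showing where the O(nd) per-token cost comes from. The function name and dimensions are illustrative, not from any library:

```python
import numpy as np

def attend_one_token(q, K_cache, V_cache):
    """One decoding step: the new query attends over all n cached
    keys/values, so both dot products and the weighted sum are O(n*d)."""
    scores = K_cache @ q / np.sqrt(q.shape[0])  # (n,) similarities: O(n*d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over all past tokens
    return weights @ V_cache                    # weighted value sum: O(n*d)

d, n = 64, 1000
rng = np.random.default_rng(0)
K_cache = rng.normal(size=(n, d))  # grows one row per token -> O(n*d) memory
V_cache = rng.normal(size=(n, d))
q = rng.normal(size=d)
out = attend_one_token(q, K_cache, V_cache)
print(out.shape)
```

Note that both compute and memory scale with n: every generated token re-reads the entire cache.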

Mamba (SSMs) processes sequences through a learned linear recurrence. The state is a fixed-size matrix h ∈ ℝ^{d×N} (typically N = 16). Cost per token: O(dN). Memory: constant regardless of context length.
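A schematic version of the recurrence, assuming a diagonal decay for A; real Mamba makes A, B, C input-dependent ("selective"), which this sketch omits:

```python
import numpy as np

def ssm_step(h, x, A, B, C):
    """One SSM-style step: the fixed-size state h (d x N) is updated by a
    linear recurrence, O(d*N) per token regardless of sequence length."""
    h = A * h + np.outer(x, B)  # decay old state, write new token: O(d*N)
    y = h @ C                   # read out a d-dim output: O(d*N)
    return h, y

d, N = 8, 16
A = np.full((d, N), 0.9)  # per-channel decay (diagonal A, as a toy stand-in)
B = np.ones(N) * 0.1
C = np.ones(N) / N
h = np.zeros((d, N))
for x in np.eye(d)[:5]:   # feed 5 one-hot tokens
    h, y = ssm_step(h, x, A, B, C)
print(h.shape, y.shape)
```

The state never grows: every token is folded into the same d×N matrix, which is exactly why retrieval is approximate.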

TTT (Test-Time Training) processes sequences by updating a weight matrix W ∈ ℝ^{d_inner×d_inner} via gradient descent on a self-supervised loss at each token. Cost per token: O(d_inner²). Memory: constant (the weight matrix).
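A toy version of the per-token update, using a plain reconstruction loss as the self-supervised objective. This is a sketch of the mechanism, not the exact TTT-Linear objective or learning rate:

```python
import numpy as np

def ttt_step(W, x, lr=0.1):
    """One TTT-style update: treat the hidden state as weights W and take
    one gradient step on a self-supervised loss for the current token."""
    pred = W @ x              # forward pass: O(d_inner^2)
    err = pred - x            # self-supervised target: reconstruct x
    grad = np.outer(err, x)   # gradient of 0.5 * ||W x - x||^2 w.r.t. W
    return W - lr * grad, pred  # one gradient step per token

d_inner = 16
W = np.zeros((d_inner, d_inner))
x = np.ones(d_inner) / 4.0    # ||x||^2 = 1 keeps the step size stable
losses = []
for _ in range(5):
    losses.append(float(np.linalg.norm(W @ x - x)))
    W, _ = ttt_step(W, x)
print(losses)  # strictly decreasing: W absorbs the token
```

The "memory" here is literally the weight matrix: what the state stores is whatever one gradient step per token writes into W.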

Side-by-Side Comparison

| Property | Transformer | Mamba | TTT |
| --- | --- | --- | --- |
| State type | KV cache (all past tokens) | Fixed-size vector/matrix | Weight matrix |
| State size | O(n·d) | O(d·N), N ≈ 16 | O(d_inner²) |
| Per-token cost | O(n·d) | O(d·N) | O(d_inner²) |
| Retrieval ability | Exact (any past token) | Approximate (compressed) | Learned (gradient-based) |
| Information capacity | Unbounded (grows with n) | O(dN) bits | O(d_inner²) bits |
| Long-context scaling | Quadratic cost | Linear cost | Linear cost |
| Parallelizable (training) | Yes | Yes (parallel scan) | Partially |
| In-context learning | Strong (induction heads) | Weak (no content-addressable memory) | Moderate to strong |

When Each Wins

Transformers win when:

- The task needs exact retrieval of arbitrary past tokens.
- In-context learning matters (induction heads give strong ICL).
- Contexts are short enough that quadratic cost is acceptable.

Mamba wins when:

- Contexts are very long and linear per-token cost dominates the budget.
- The task does not require exact retrieval (e.g., audio classification, time-series prediction).
- Constant memory regardless of context length is required.

TTT wins when:

- Contexts are long but the task needs richer memory than a fixed linear recurrence can compress.
- A constant-size state is required alongside moderate-to-strong in-context learning.

The Hybrid Trend

As of 2025-2026, the industry is converging on hybrid architectures that mix these mechanisms within a single model.

The likely endpoint: different layers in the same model use different mechanisms depending on what that layer needs to do. Early layers use SSMs for cheap long-range mixing. Middle layers use TTT or attention for precise retrieval. Final layers use attention for generation quality.
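The layer mix described above can be written as a toy stack; the counts and ordering are illustrative, taken from the paragraph rather than from any shipped model:

```python
# Hypothetical hybrid stack mirroring the paragraph above:
# cheap SSM mixing early, precise retrieval in the middle, attention last.
HYBRID_STACK = (
    ["ssm"] * 4                  # early: cheap long-range mixing
    + ["ttt", "attention"] * 2   # middle: learned / exact retrieval
    + ["attention"] * 2          # final: generation quality
)
print(HYBRID_STACK)
```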

Common Confusions

Watch Out

Mamba is not strictly worse than attention

Mamba processes each token in O(dN) vs attention's O(nd). For context lengths where n > N (which is almost always, since N ≈ 16), Mamba is cheaper per token. The tradeoff is retrieval precision, not compute. On tasks that do not require exact retrieval (audio classification, time-series prediction), Mamba matches transformers at much lower cost.
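A back-of-envelope comparison of the two per-token costs, using the O(nd) and O(dN) figures above (the d and N values are illustrative):

```python
# Per-token cost ratio: attention's O(n*d) vs Mamba's O(d*N).
d, N = 4096, 16
for n in (16, 1024, 128_000):
    attn, mamba = n * d, d * N
    print(f"n={n:>7}: attention ~{attn:,} vs mamba ~{mamba:,} "
          f"({attn / mamba:.0f}x)")
```

The ratio is simply n/N, so attention only matches Mamba's per-token cost when the context is as short as the SSM state (n ≈ 16).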

Watch Out

TTT is not just fine-tuning during inference

TTT updates a single small layer's weights using a self-supervised loss. It does not fine-tune the whole model. The update is fast (one gradient step), local (one layer), and unsupervised (no labels needed). It is closer to online learning than to fine-tuning.
