Paper breakdown
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao · 2023 · COLM 2024
Makes the state-space sequence model's transition matrices input-dependent, restoring the content-based selectivity that linear-time-invariant SSMs lack. Introduces a hardware-aware parallel scan that keeps training fast on GPUs even though the recurrence is now non-LTI. Reaches transformer-quality language modelling at $O(n)$ inference, on the same scaling-law slope as a Pythia-style transformer up to ~3B parameters.
Overview
Gu and Dao (2023) extended the structured state-space model (S4, S5, H3) line so it could match transformer language-modelling quality. The earlier S-models computed sequence outputs from a linear time-invariant (LTI) recurrence: a fixed transition matrix $A$ and a fixed input projection $B$, applied at every timestep. LTI structure makes a parallel "global convolution" reformulation possible, which is what made S4 trainable at GPU scale, but it costs the model the ability to select — to decide, based on the current token, which past information to remember.
Mamba's contribution is to make the SSM parameters $B$, $C$, and $\Delta$ (and effectively $\bar{A}$, after discretisation) input-dependent: each step computes them from the current input via a linear projection. This is the Selective State Space Model (S6). The recurrence is no longer time-invariant, so the global-convolution trick from S4 does not apply. The paper compensates with a hardware-aware parallel scan kernel that performs the recurrence in $O(n)$ work and $O(\log n)$ depth on the GPU, fused with the discretisation step so that intermediate states never leave SRAM.
The empirical headline is that a Mamba decoder reaches transformer parity on language modelling (The Pile, GPT-2-style training) up to 2.8B parameters, in the same compute-per-parameter regime, while inference is linear in sequence length — $O(n)$ instead of $O(n^2)$ for attention, with no KV cache. On long-context tasks where the LTI predecessors collapsed (Selective Copying, Induction Heads), Mamba succeeds where S4 fails. This is the first SSM that is genuinely competitive with the transformer architecturally, not just on speed.
The longer-term picture: Mamba did not displace transformers — production frontier models in 2025–2026 are still attention-dominant — but it broke the consensus that quadratic attention is the only way to reach LLM quality, and the resulting hybrid space (Jamba, Mamba-2 attention-mixed, Hymba) is now an active line of work.
Mathematical Contributions
The continuous SSM
A linear state-space model maps a 1D input $x(t)$ to a 1D output $y(t)$ via a hidden state $h(t) \in \mathbb{R}^N$:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

Here $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$. For language modelling this is applied per channel of the residual stream, so the network operates on $D$ parallel scalar SSMs.
Discretisation
Apply zero-order hold with step size $\Delta$:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B$$

The discrete recurrence is then

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$
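To make the discretisation concrete, here is a minimal NumPy sketch of the ZOH formulas and the sequential recurrence. The matrices and input are random stand-ins chosen for illustration, not the paper's parameterisation, and `scipy.linalg.expm` supplies the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """ZOH: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """h_t = A_bar h_{t-1} + B_bar x_t ; y_t = C h_t, run step by step."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + (B_bar * x_t).ravel()
        ys.append(float(C @ h))
    return np.array(ys)

rng = np.random.default_rng(0)
N, L = 4, 16
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # stable-ish toy transition
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y = ssm_recurrence(A_bar, B_bar, C, rng.standard_normal(L))
```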
S4 (Gu et al. 2022) makes $A$ a structured (HiPPO-initialised, diagonal-plus-low-rank) matrix so that $\bar{A}$ has fast matrix-vector multiplies; under LTI parameters this recurrence has a closed-form global convolution

$$y = x * \bar{K}, \qquad \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\big)$$

which trains in $O(L \log L)$ with the FFT.
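The equivalence is easy to check numerically. Continuing the sketch above (reusing `discretize_zoh`, `ssm_recurrence`, and the toy parameters), with the kernel unrolled naively here rather than via S4's structured fast algorithm:

```python
def lti_conv_kernel(A_bar, B_bar, C, L):
    """K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)."""
    K, v = [], B_bar.copy()
    for _ in range(L):
        K.append(float(C @ v))
        v = A_bar @ v
    return np.array(K)

K = lti_conv_kernel(A_bar, B_bar, C, L)
x = rng.standard_normal(L)
y_conv = np.convolve(x, K)[:L]   # causal convolution; done via FFT at scale
assert np.allclose(y_conv, ssm_recurrence(A_bar, B_bar, C, x))
```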
Selectivity: making $B$, $C$, $\Delta$ input-dependent
Mamba parameterises

$$B_t = \mathrm{Linear}_B(x_t), \qquad C_t = \mathrm{Linear}_C(x_t), \qquad \Delta_t = \mathrm{softplus}\big(\mathrm{Linear}_\Delta(x_t)\big)$$

with the softplus keeping step sizes positive. $A$ is kept input-independent (still HiPPO-initialised), but the effective discrete transition $\bar{A}_t = \exp(\Delta_t A)$ varies with the input through $\Delta_t$.
The recurrence is now

$$h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t\,h_t, \qquad \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t = \Delta_t B_t$$

This breaks LTI structure: $y = x * \bar{K}$ is no longer the right form because $\bar{A}_t$ changes per step. The convolutional trick is gone.
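A minimal sketch of the selective step, taking $A$ diagonal (as Mamba does in practice) and a single scalar channel. The weight vectors `W_B`, `W_C`, `w_delta`, `b_delta` are illustrative stand-ins for the paper's learned projections, which in the real model read the full $D$-dimensional token rather than one channel:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, a_diag, W_B, W_C, w_delta, b_delta):
    """Per step: B_t, C_t, Delta_t computed from x_t, then
    h_t = exp(Delta_t A) h_{t-1} + Delta_t B_t x_t ; y_t = C_t h_t."""
    h = np.zeros_like(a_diag)
    ys = []
    for x_t in x:
        B_t = W_B * x_t                              # input-dependent projections
        C_t = W_C * x_t
        d_t = softplus(w_delta * x_t + b_delta)      # positive scalar step size
        h = np.exp(d_t * a_diag) * h + d_t * B_t * x_t
        ys.append(float(C_t @ h))
    return np.array(ys)
```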
Why selectivity matters
Section 3 of the paper introduces two diagnostic synthetic tasks where LTI SSMs fail and selectivity recovers them.
Selective Copying. Tokens of one type are interleaved with random tokens; the model must copy the typed tokens in order. An LTI model has no mechanism to ignore a subset of tokens based on their content; an attention or selective-SSM model can. S4 fails this task; Mamba solves it.
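A toy generator for the task, under assumed conventions (token 1 as noise, values 2 and up as content); the paper's exact layout differs, this just shows the structure:

```python
import numpy as np

def selective_copying_example(rng, seq_len=32, n_copy=4, vocab=8):
    """Scatter n_copy content tokens through a noise sequence; the target is
    the content tokens in order of appearance. An LTI model cannot gate the
    noise tokens out based on their value; a selective model can."""
    seq = np.ones(seq_len, dtype=int)                              # 1 = noise
    pos = np.sort(rng.choice(seq_len, size=n_copy, replace=False))
    seq[pos] = rng.integers(2, vocab, size=n_copy)                 # content
    return seq, seq[pos]                                           # input, target
```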
Induction Heads. Given a previous occurrence of the bigram "A B" in the sequence, the model must complete a later "A ?" with "B". This needs content-based lookup; LTI SSMs cannot do it. Selective SSMs can, and the paper shows Mamba learns it cleanly.
These tasks isolate selectivity as the missing capability and make the architectural argument that any LM-grade sequence model must have it.
The hardware-aware parallel scan
The non-LTI recurrence still has the form $h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t$, which is a linear first-order recurrence in $h$ with input-varying coefficients. Linear recurrences over an associative operator can be evaluated with a parallel scan (Blelloch 1990) in $O(n)$ work and $O(\log n)$ depth. The associative combiner here is

$$(A_2, b_2) \bullet (A_1, b_1) = (A_2 A_1,\; A_2 b_1 + b_2)$$

which represents the composition of two affine updates $h \mapsto A_1 h + b_1$ then $h \mapsto A_2 h + b_2$. Mamba's scan kernel computes the recurrence in this style.
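A sketch of the idea: each timestep becomes the affine element $(\bar{A}_t, \bar{B}_t x_t)$, an inclusive scan composes prefixes, and the second component of the $t$-th prefix is exactly $h_t$ (with $h_0 = 0$). The Hillis–Steele scan below is the simplest parallel form ($O(\log n)$ sweeps, $O(n \log n)$ work); the fused kernel uses a work-efficient Blelloch-style variant of the same associativity. Shapes are illustrative, with $A$ diagonal:

```python
import numpy as np

def combine(later, earlier):
    """Compose affine updates h -> A h + b ('earlier' applied first).
    A is diagonal, so products are elementwise."""
    A2, b2 = later
    A1, b1 = earlier
    return (A2 * A1, A2 * b1 + b2)

def inclusive_scan(elems):
    """Hillis-Steele inclusive scan: log n sweeps, each fully parallelisable."""
    n, d = len(elems), 1
    while d < n:
        elems = [combine(elems[i], elems[i - d]) if i >= d else elems[i]
                 for i in range(n)]
        d *= 2
    return elems

rng = np.random.default_rng(0)
L, N = 8, 4
A_bars = np.exp(-rng.random((L, N)))   # stand-ins for exp(Delta_t A)
bs = rng.standard_normal((L, N))       # stand-ins for B_bar_t * x_t

hs_scan = [b for _, b in inclusive_scan(list(zip(A_bars, bs)))]

# check against the sequential recurrence
h, hs_seq = np.zeros(N), []
for A_bar, b in zip(A_bars, bs):
    h = A_bar * h + b
    hs_seq.append(h)
assert all(np.allclose(u, v) for u, v in zip(hs_scan, hs_seq))
```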
The "hardware-aware" part is that the kernel fuses the discretisation step (), the scan, and the output projection, so the per-token state is materialised only in SRAM. HBM traffic is instead of the naive — the same kind of IO-aware redesign that FlashAttention applies to attention. Without this, the wall-clock time of a non-LTI scan kills the architecture's competitiveness.
Inference complexity
| Operation | Transformer | Mamba |
|---|---|---|
| Training time per sequence | $O(n^2)$ (attention) | $O(n)$ (scan) |
| Sequential operations | $O(1)$ (with attention) | $O(\log n)$ (with parallel scan) |
| Inference state per token | KV cache: grows as $O(n)$ | Fixed-size hidden state, $O(1)$ |
| Generation step | $O(n)$ for token $n$ | $O(1)$, independent of $n$ |
For large $n$ the inference-cost gap is substantial. The paper reports 4–5× higher throughput than a 6.7B GPT-style model at long sequence length and matched quality.
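At decode time the whole context lives in the fixed-size state. A sketch of the per-token step, reusing the simplified projections from the selective-SSM sketch above (illustrative, not the fused kernel):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def decode_step(h, x_t, a_diag, W_B, W_C, w_delta, b_delta):
    """One O(1) generation step: update the fixed-size state, emit one output.
    No KV cache, so per-token cost and memory do not grow with position."""
    d_t = softplus(w_delta * x_t + b_delta)
    h = np.exp(d_t * a_diag) * h + d_t * (W_B * x_t) * x_t
    return h, float((W_C * x_t) @ h)
```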
The Mamba block
The architectural unit replaces a transformer block (attention + MLP) with an SSM-flavoured analog: an input projection that doubles the channel dimension, a 1-D causal convolution over the sequence axis (kernel size 4), a SiLU nonlinearity, the selective SSM, and an output projection. Two of these blocks are roughly comparable in parameter count to one attention + MLP transformer block, and the paper uses the doubled-block-count layout in its language-model architecture.
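A shape-level NumPy sketch of that data flow. The weights are random stand-ins, `E` is the doubled inner dimension, and `selective_ssm_channels` is reduced to an identity placeholder standing in for the per-channel S6 scan sketched earlier, so this runs but shows only the plumbing:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def selective_ssm_channels(u, params):
    """Placeholder for the per-channel selective scan (see the S6 sketch above)."""
    return u

def mamba_block(X, params):
    """X: (L, D). Project to 2E channels, causal depthwise conv (k=4) + SiLU
    on one branch, selective SSM, gate by the other branch, project back."""
    L, _ = X.shape
    u, gate = np.split(X @ params["W_in"], 2, axis=-1)       # (L, E) each
    k = params["conv"]                                       # (4, E) depthwise kernel
    u_pad = np.concatenate([np.zeros((3, u.shape[1])), u])   # left-pad: causal
    u = np.stack([(u_pad[t:t + 4] * k).sum(axis=0) for t in range(L)])
    y = selective_ssm_channels(silu(u), params)
    y = y * silu(gate)                                       # gated branch
    return X + y @ params["W_out"]                           # back to residual stream

rng = np.random.default_rng(0)
L, D, E = 16, 8, 16
params = {"W_in": 0.1 * rng.standard_normal((D, 2 * E)),
          "conv": 0.1 * rng.standard_normal((4, E)),
          "W_out": 0.1 * rng.standard_normal((E, D))}
out = mamba_block(rng.standard_normal((L, D)), params)       # (L, D)
```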
What It Gets Right
The first thing is the diagnostic. Selective Copying and Induction Heads are not just benchmarks but identifications of the specific capability gap that LTI models had. The paper does not just claim "selectivity helps"; it constructs minimal tasks that an LTI model cannot solve, because the required gating is content-dependent, and shows the new model solves them. Subsequent ablations and follow-ups have used these tasks as standard probes.
The second is the systems work. Without the parallel-scan kernel the architecture would not have been competitive on wall-clock time, and the paper would have read as a theory note rather than a transformer alternative. The fused-discretise-scan-project layout, plus the careful state-in-SRAM design, is most of what makes the result usable.
The third is the disciplined comparison. Mamba is matched against Pythia, GPT-Neo, and the H3 SSM family at controlled FLOP and parameter budgets, with the same data and tokenizer where possible. The conclusion ("matches transformer perplexity at the same scale up to 2.8B parameters") is reported with numbers and uncertainty, not asserted by vibes.
Common Misconceptions
Mamba is not a transformer replacement at frontier scale. The paper's largest model is 2.8B parameters; transformer frontier models are 1–2 orders of magnitude larger. Subsequent work (Mamba-2, Jamba, Zamba) shows the gap narrowing but not closing across all task families, and as of 2026 most frontier LLM training still uses attention-dominant architectures, often hybridised with state-space layers rather than replaced by them.
Mamba is not "RNN-like only at inference time". The training-time recurrence is identical to the inference-time recurrence — there is no separate parallelisable training mode. What makes training tractable is the parallel scan, which evaluates the same recurrence in depth using associativity, not a separate convolutional reformulation as in S4.
Mamba's hidden state is not "compressed memory of all past tokens" in the way the paper's promotional summaries sometimes suggest. The state has fixed size regardless of $n$, which means there is a strict information bottleneck — recent work (Arora et al. 2024 on associative recall, Park et al. 2024 on retrieval) shows Mamba underperforms attention on long-context tasks that require precise content-addressed retrieval. The hidden state is a learned summary, not a lossless compression.
Selectivity is not "soft attention". The selective parameters depend on the current input alone, not on a query-key dot product over the full sequence. This is what keeps the model linear in but is also why content-based recall over long ranges is harder for Mamba than for attention.
Connections to TheoremPath Topics
- State-space models — the LTI predecessors (S4, S5, H3) and the discretisation framework Mamba builds on.
- Mamba and state-space models — modern treatment including Mamba-2 and the duality with linear attention.
- Recurrent neural networks — the broader class of fixed-state-size sequence models.
- Attention mechanism theory — the architecture Mamba is positioned against on quality.
- FlashAttention — the same "fuse the kernel, keep state in SRAM" systems philosophy applied to attention.
- KV cache optimization — what Mamba avoids by maintaining a fixed-size hidden state instead of a growing cache.
- Sparse attention and long context — the alternative path to subquadratic attention through structural sparsity.
Further Reading
- Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML. arXiv:2405.21060. Mamba-2. Establishes a duality between linear attention and structured SSMs and gives a faster GPU-friendly parameterisation.
- Gu, A., Goel, K., & Ré, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. arXiv:2111.00396. S4. The LTI predecessor whose convolutional trick Mamba sacrifices in exchange for selectivity.
- Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887. 52B-parameter Mamba-attention hybrid; demonstrates the production tradeoff curve.
- Arora, S. et al. (2024). "Simple linear attention language models balance the recall-throughput tradeoff." arXiv:2402.18668. Quantifies the recall gap between linear-time architectures and attention.
References
Canonical:
- Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." COLM 2024. arXiv:2312.00752.
Direct precursors:
- Gu, A. et al. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS. arXiv:2008.07669. The structured-$A$ initialisation used by S4 and inherited by Mamba.
- Gu, A., Goel, K., & Ré, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. arXiv:2111.00396. S4.
- Smith, J. T. H., Warrington, A., & Linderman, S. W. (2023). "Simplified State Space Layers for Sequence Modeling." ICLR. arXiv:2208.04933. S5; demonstrates that the parallel scan also works for the LTI variant.
- Fu, D. Y. et al. (2023). "Hungry Hungry Hippos: Towards Language Modeling with State Space Models." ICLR. arXiv:2212.14052. H3 — the first SSM to come close to transformer LM quality at small scale.
- Blelloch, G. E. (1990). "Prefix Sums and Their Applications." Carnegie Mellon technical report CMU-CS-90-190. Source of the parallel-scan algorithm used in the kernel.
Follow-on work:
- Dao, T., & Gu, A. (2024). "Transformers are SSMs." ICML. arXiv:2405.21060. Mamba-2.
- Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887.
- Glorioso, P. et al. (2024). "Zamba: A Compact 7B SSM Hybrid Model." arXiv:2405.16712. Production hybrid; Mamba blocks plus a single shared attention block.
- Dong, X. et al. (2024). "Hymba: A Hybrid-head Architecture for Small Language Models." arXiv:2411.13676. Per-head mix of attention and SSM at NVIDIA.
Limitations and critiques:
- Arora, S. et al. (2024). "Simple linear attention language models balance the recall-throughput tradeoff." arXiv:2402.18668.
- Park, J. et al. (2024). "Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks." ICML. arXiv:2402.04248. Mamba underperforms attention on tasks requiring precise long-range copy.
Last reviewed: May 7, 2026