Paper breakdown
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao · 2023 · COLM 2024
Makes the state-space sequence model's transition matrices input-dependent, restoring the content-based selectivity that linear-time-invariant SSMs lack. Introduces a hardware-aware parallel scan that keeps training fast on GPUs even though the recurrence is now non-LTI. Reaches transformer-quality language modelling at $O(n)$ inference, on the same scaling-law slope as a Pythia-style transformer up to ~3B parameters.
Overview
Gu and Dao (2023) extended the structured state-space model (S4, S5, H3) line so it could match transformer language-modelling quality. The earlier S-models computed sequence outputs from a linear time-invariant (LTI) recurrence: a fixed transition matrix $A$ and a fixed input projection $B$, applied at every timestep. LTI structure makes a parallel "global convolution" reformulation possible, which is what made S4 trainable at GPU scale, but it costs the model the ability to select — to decide, based on the current token, which past information to remember.
Mamba's contribution is to make the SSM parameters $B$, $C$, and $\Delta$ (and effectively $\bar{A}$, after discretisation) input-dependent: each step computes them from the current input via a linear projection. This is the Selective State Space Model (S6). The recurrence is no longer time-invariant, so the global-convolution trick from S4 does not apply. The paper compensates with a hardware-aware parallel scan kernel that performs the recurrence in $O(n)$ work and $O(\log n)$ depth on the GPU, fused with the discretisation step so that intermediate states never leave SRAM.
The empirical headline is that a Mamba decoder reaches transformer parity on language modelling (The Pile, GPT-2-style training) up to 2.8B parameters, in the same compute-per-parameter regime, while inference is linear in sequence length — $O(n)$ instead of $O(n^2)$ for attention, with no KV cache. On long-context tasks where the LTI predecessors collapsed (Selective Copying, Induction Heads), Mamba succeeds where S4 fails. This is the first SSM that is genuinely competitive with the transformer architecturally, not just on speed.
The longer-term picture: Mamba did not displace transformers — production frontier models in 2025–2026 are still attention-dominant — but it broke the consensus that quadratic attention is the only way to reach LLM quality, and the resulting hybrid space (Jamba, Mamba-2 attention-mixed, Hymba) is now an active line of work.
Mathematical Contributions
The continuous SSM
A linear state-space model maps a 1D input $x(t)$ to a 1D output $y(t)$ via a hidden state $h(t) \in \mathbb{R}^N$:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

Here $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$. For language modelling this is applied per channel of the residual stream, so the network operates on $D$ parallel scalar SSMs.
Discretisation
Apply zero-order hold with step size $\Delta$:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B$$

The discrete recurrence is then

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$
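To make the discretisation concrete, here is a minimal NumPy sketch of the ZOH formulas and the sequential recurrence. The matrices and input are random stand-ins chosen for illustration, not the paper's parameterisation, and `scipy.linalg.expm` supplies the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """ZOH: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """h_t = A_bar h_{t-1} + B_bar x_t ; y_t = C h_t, run step by step."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + (B_bar * x_t).ravel()
        ys.append(float(C @ h))
    return np.array(ys)

rng = np.random.default_rng(0)
N, L = 4, 16
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # stable-ish toy transition
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y = ssm_recurrence(A_bar, B_bar, C, rng.standard_normal(L))
```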
S4 (Gu et al. 2022) makes $A$ a structured (HiPPO-initialised, diagonal-plus-low-rank) matrix so that $\bar{A}$ has fast matrix-vector multiplies; under LTI parameters this recurrence has a closed-form global convolution

$$y = x * \bar{K}, \qquad \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\big)$$

which trains in $O(L \log L)$ with the FFT.
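The equivalence is easy to check numerically. Continuing the sketch above (reusing `discretize_zoh`, `ssm_recurrence`, and the toy parameters), with the kernel unrolled naively here rather than via S4's structured fast algorithm:

```python
def lti_conv_kernel(A_bar, B_bar, C, L):
    """K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)."""
    K, v = [], B_bar.copy()
    for _ in range(L):
        K.append(float(C @ v))
        v = A_bar @ v
    return np.array(K)

K = lti_conv_kernel(A_bar, B_bar, C, L)
x = rng.standard_normal(L)
y_conv = np.convolve(x, K)[:L]   # causal convolution; done via FFT at scale
assert np.allclose(y_conv, ssm_recurrence(A_bar, B_bar, C, x))
```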
Selectivity: making $B$, $C$, $\Delta$ input-dependent
Mamba parameterises

$$B_t = \mathrm{Linear}_B(x_t), \qquad C_t = \mathrm{Linear}_C(x_t), \qquad \Delta_t = \mathrm{softplus}\big(\mathrm{Linear}_\Delta(x_t)\big)$$

with the softplus keeping step sizes positive. $A$ is kept input-independent (still HiPPO-initialised), but the effective discrete transition $\bar{A}_t = \exp(\Delta_t A)$ varies with the input through $\Delta_t$.
The recurrence is now

$$h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t\,h_t, \qquad \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t = \Delta_t B_t$$

This breaks LTI structure: $y = x * \bar{K}$ is no longer the right form because $\bar{A}_t$ changes per step. The convolutional trick is gone.
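A minimal sketch of the selective step, taking $A$ diagonal (as Mamba does in practice) and a single scalar channel. The weight vectors `W_B`, `W_C`, `w_delta`, `b_delta` are illustrative stand-ins for the paper's learned projections, which in the real model read the full $D$-dimensional token rather than one channel:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, a_diag, W_B, W_C, w_delta, b_delta):
    """Per step: B_t, C_t, Delta_t computed from x_t, then
    h_t = exp(Delta_t A) h_{t-1} + Delta_t B_t x_t ; y_t = C_t h_t."""
    h = np.zeros_like(a_diag)
    ys = []
    for x_t in x:
        B_t = W_B * x_t                              # input-dependent projections
        C_t = W_C * x_t
        d_t = softplus(w_delta * x_t + b_delta)      # positive scalar step size
        h = np.exp(d_t * a_diag) * h + d_t * B_t * x_t
        ys.append(float(C_t @ h))
    return np.array(ys)
```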
Why selectivity matters
Section 3 of the paper introduces two diagnostic synthetic tasks where LTI SSMs fail and selectivity recovers them.
Selective Copying. Tokens of one type are interleaved with random tokens; the model must copy the typed tokens in order. An LTI model has no mechanism to ignore a subset of tokens based on their content; an attention or selective-SSM model can. S4 fails this task; Mamba solves it.
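A toy generator for the task, under assumed conventions (token 1 as noise, values 2 and up as content); the paper's exact layout differs, this just shows the structure:

```python
import numpy as np

def selective_copying_example(rng, seq_len=32, n_copy=4, vocab=8):
    """Scatter n_copy content tokens through a noise sequence; the target is
    the content tokens in order of appearance. An LTI model cannot gate the
    noise tokens out based on their value; a selective model can."""
    seq = np.ones(seq_len, dtype=int)                              # 1 = noise
    pos = np.sort(rng.choice(seq_len, size=n_copy, replace=False))
    seq[pos] = rng.integers(2, vocab, size=n_copy)                 # content
    return seq, seq[pos]                                           # input, target
```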
Induction Heads. Given a previous occurrence of the bigram "A B" in the sequence, the model must complete a later "A ?" with "B". This needs content-based lookup; LTI SSMs cannot do it. Selective SSMs can, and the paper shows Mamba learns it cleanly.
These tasks isolate selectivity as the missing capability and make the architectural argument that any LM-grade sequence model must have it.
The hardware-aware parallel scan
The non-LTI recurrence still has the form $h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t$, which is a linear first-order recurrence in $h$ with input-varying coefficients. Linear recurrences over an associative operator can be evaluated with a parallel scan (Blelloch 1990) in $O(n)$ work and $O(\log n)$ depth. The associative combiner here is

$$(A_2, b_2) \bullet (A_1, b_1) = (A_2 A_1,\; A_2 b_1 + b_2)$$

which represents the composition of two affine updates $h \mapsto A_1 h + b_1$ then $h \mapsto A_2 h + b_2$. Mamba's scan kernel computes the recurrence in this style.
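A sketch of the idea: each timestep becomes the affine element $(\bar{A}_t, \bar{B}_t x_t)$, an inclusive scan composes prefixes, and the second component of the $t$-th prefix is exactly $h_t$ (with $h_0 = 0$). The Hillis–Steele scan below is the simplest parallel form ($O(\log n)$ sweeps, $O(n \log n)$ work); the fused kernel uses a work-efficient Blelloch-style variant of the same associativity. Shapes are illustrative, with $A$ diagonal:

```python
import numpy as np

def combine(later, earlier):
    """Compose affine updates h -> A h + b ('earlier' applied first).
    A is diagonal, so products are elementwise."""
    A2, b2 = later
    A1, b1 = earlier
    return (A2 * A1, A2 * b1 + b2)

def inclusive_scan(elems):
    """Hillis-Steele inclusive scan: log n sweeps, each fully parallelisable."""
    n, d = len(elems), 1
    while d < n:
        elems = [combine(elems[i], elems[i - d]) if i >= d else elems[i]
                 for i in range(n)]
        d *= 2
    return elems

rng = np.random.default_rng(0)
L, N = 8, 4
A_bars = np.exp(-rng.random((L, N)))   # stand-ins for exp(Delta_t A)
bs = rng.standard_normal((L, N))       # stand-ins for B_bar_t * x_t

hs_scan = [b for _, b in inclusive_scan(list(zip(A_bars, bs)))]

# check against the sequential recurrence
h, hs_seq = np.zeros(N), []
for A_bar, b in zip(A_bars, bs):
    h = A_bar * h + b
    hs_seq.append(h)
assert all(np.allclose(u, v) for u, v in zip(hs_scan, hs_seq))
```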
The "hardware-aware" part is that the kernel fuses the discretisation step (), the scan, and the output projection, so the per-token state is materialised only in SRAM. HBM traffic is instead of the naive — the same kind of IO-aware redesign that FlashAttention applies to attention. Without this, the wall-clock time of a non-LTI scan kills the architecture's competitiveness.
Inference complexity
| Operation | Transformer | Mamba |
|---|---|---|
| Training time per sequence | $O(n^2)$ (attention) | $O(n)$ (scan) |
| Sequential operations | $O(1)$ (with attention) | $O(\log n)$ (with parallel scan) |
| Inference state per token | KV cache: grows as $O(n)$ | Fixed-size hidden state, $O(1)$ |
| Generation step | $O(n)$ for token $n$ | $O(1)$, independent of $n$ |
For large $n$ the inference-cost gap is substantial. The paper reports 4–5× higher throughput than a 6.7B GPT-style model at long sequence length and matched quality.
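At decode time the whole context lives in the fixed-size state. A sketch of the per-token step, reusing the simplified projections from the selective-SSM sketch above (illustrative, not the fused kernel):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def decode_step(h, x_t, a_diag, W_B, W_C, w_delta, b_delta):
    """One O(1) generation step: update the fixed-size state, emit one output.
    No KV cache, so per-token cost and memory do not grow with position."""
    d_t = softplus(w_delta * x_t + b_delta)
    h = np.exp(d_t * a_diag) * h + d_t * (W_B * x_t) * x_t
    return h, float((W_C * x_t) @ h)
```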
The Mamba block
The architectural unit replaces a transformer block (attention + MLP) with an SSM-flavoured analog: an input projection that doubles the channel dimension, a 1-D causal convolution over the sequence axis (kernel size 4), a SiLU nonlinearity, the selective SSM, and an output projection. Two of these blocks are roughly comparable in parameter count to one attention + MLP transformer block, and the paper uses the doubled-block-count layout in its language-model architecture.
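A shape-level NumPy sketch of that data flow. The weights are random stand-ins, `E` is the doubled inner dimension, and `selective_ssm_channels` is reduced to an identity placeholder standing in for the per-channel S6 scan sketched earlier, so this runs but shows only the plumbing:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def selective_ssm_channels(u, params):
    """Placeholder for the per-channel selective scan (see the S6 sketch above)."""
    return u

def mamba_block(X, params):
    """X: (L, D). Project to 2E channels, causal depthwise conv (k=4) + SiLU
    on one branch, selective SSM, gate by the other branch, project back."""
    L, _ = X.shape
    u, gate = np.split(X @ params["W_in"], 2, axis=-1)       # (L, E) each
    k = params["conv"]                                       # (4, E) depthwise kernel
    u_pad = np.concatenate([np.zeros((3, u.shape[1])), u])   # left-pad: causal
    u = np.stack([(u_pad[t:t + 4] * k).sum(axis=0) for t in range(L)])
    y = selective_ssm_channels(silu(u), params)
    y = y * silu(gate)                                       # gated branch
    return X + y @ params["W_out"]                           # back to residual stream

rng = np.random.default_rng(0)
L, D, E = 16, 8, 16
params = {"W_in": 0.1 * rng.standard_normal((D, 2 * E)),
          "conv": 0.1 * rng.standard_normal((4, E)),
          "W_out": 0.1 * rng.standard_normal((E, D))}
out = mamba_block(rng.standard_normal((L, D)), params)       # (L, D)
```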
What It Gets Right
The first thing is the diagnostic. Selective Copying and Induction Heads are not just benchmarks but identifications of the specific capability gap that LTI models had. The paper does not just claim "selectivity helps"; it constructs minimal tasks that an LTI model cannot solve, because the required gating is content-dependent, and shows the new model solves them. Subsequent ablations and follow-ups have used these tasks as standard probes.
The second is the systems work. Without the parallel-scan kernel the architecture would not have been competitive on wall-clock time, and the paper would have read as a theory note rather than a transformer alternative. The fused-discretise-scan-project layout, plus the careful state-in-SRAM design, is most of what makes the result usable.
The third is the disciplined comparison. Mamba is matched against Pythia, GPT-Neo, and the H3 SSM family at controlled FLOP and parameter budgets, with the same data and tokenizer where possible. The conclusion ("matches transformer perplexity at the same scale up to 2.8B parameters") is reported with numbers and uncertainty, not asserted by vibes.
Common Misconceptions
Mamba is not a transformer replacement at frontier scale. The paper's largest model is 2.8B parameters; transformer frontier models are 1–2 orders of magnitude larger. Subsequent work (Mamba-2, Jamba, Zamba) shows the gap narrowing but not closing across all task families, and as of 2026 most frontier LLM training still uses attention-dominant architectures, often hybridised with state-space layers rather than replaced by them.
Mamba is not "RNN-like only at inference time". The training-time recurrence is identical to the inference-time recurrence — there is no separate parallelisable training mode. What makes training tractable is the parallel scan, which evaluates the same recurrence in depth using associativity, not a separate convolutional reformulation as in S4.
Mamba's hidden state is not "compressed memory of all past tokens" in the way the paper's promotional summaries sometimes suggest. The state has fixed size regardless of $n$, which means there is a strict information bottleneck — recent work (Arora et al. 2024 on associative recall, Park et al. 2024 on retrieval) shows Mamba underperforms attention on long-context tasks that require precise content-addressed retrieval. The hidden state is a learned summary, not a lossless compression.
Selectivity is not "soft attention". The selective parameters depend on the current input alone, not on a query-key dot product over the full sequence. This is what keeps the model linear in but is also why content-based recall over long ranges is harder for Mamba than for attention.
Connections to TheoremPath Topics
- State-space models — the LTI predecessors (S4, S5, H3) and the discretisation framework Mamba builds on.
- Mamba and state-space models — modern treatment including Mamba-2 and the duality with linear attention.
- Recurrent neural networks — the broader class of fixed-state-size sequence models.
- Attention mechanism theory — the architecture Mamba is positioned against on quality.
- FlashAttention — the same "fuse the kernel, keep state in SRAM" systems philosophy applied to attention.
- KV cache optimization — what Mamba avoids by maintaining a fixed-size hidden state instead of a growing cache.
- Sparse attention and long context — the alternative path to subquadratic attention through structural sparsity.
Further Reading
- Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML. arXiv:2405.21060. Mamba-2. Establishes a duality between linear attention and structured SSMs and gives a faster GPU-friendly parameterisation.
- Gu, A., Goel, K., & Ré, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. arXiv:2111.00396. S4. The LTI predecessor whose convolutional trick Mamba sacrifices in exchange for selectivity.
- Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887. 52B-parameter Mamba-attention hybrid; demonstrates the production tradeoff curve.
- Arora, S. et al. (2024). "Simple linear attention language models balance the recall-throughput tradeoff." arXiv:2402.18668. Quantifies the recall gap between linear-time architectures and attention.
References
Canonical:
- Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." COLM 2024. arXiv:2312.00752.
Direct precursors:
- Gu, A. et al. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS. arXiv:2008.07669. The structured-$A$ initialisation used by S4 and inherited by Mamba.
- Gu, A., Goel, K., & Ré, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. arXiv:2111.00396. S4.
- Smith, J. T. H., Warrington, A., & Linderman, S. W. (2023). "Simplified State Space Layers for Sequence Modeling." ICLR. arXiv:2208.04933. S5; demonstrates that the parallel scan also works for the LTI variant.
- Fu, D. Y. et al. (2023). "Hungry Hungry Hippos: Towards Language Modeling with State Space Models." ICLR. arXiv:2212.14052. H3 — the first SSM to come close to transformer LM quality at small scale.
- Blelloch, G. E. (1990). "Prefix Sums and Their Applications." Carnegie Mellon technical report CMU-CS-90-190. Source of the parallel-scan algorithm used in the kernel.
Follow-on work:
- Dao, T., & Gu, A. (2024). "Transformers are SSMs." ICML. arXiv:2405.21060. Mamba-2.
- Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887.
- Glorioso, P. et al. (2024). "Zamba: A Compact 7B SSM Hybrid Model." arXiv:2405.16712. Production hybrid; Mamba blocks plus a single shared attention block.
- Dong, X. et al. (2024). "Hymba: A Hybrid-head Architecture for Small Language Models." arXiv:2411.13676. Per-head mix of attention and SSM at NVIDIA.
Limitations and critiques:
- Arora, S. et al. (2024). "Simple linear attention language models balance the recall-throughput tradeoff." arXiv:2402.18668.
- Park, J. et al. (2024). "Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks." ICML. arXiv:2402.04248. Mamba underperforms attention on tasks requiring precise long-range copy.
Last reviewed: May 7, 2026