

Mamba and State-Space Models

Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.


Why This Matters

Transformers scale quadratically in sequence length: self-attention over n tokens costs O(n^2) time and memory. For long sequences (books, genomics data, high-resolution audio), this is prohibitive. State-space models (SSMs) achieve O(n) scaling by processing sequences through a structured linear recurrence, and Mamba made SSMs competitive with transformers on language tasks.

Mamba is the most compute-efficient alternative to date that reaches transformer-level performance across broad language benchmarks. Earlier alternatives in the same line include Linear Transformers (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020), RWKV (Peng et al., 2023, arXiv:2305.13048), and RetNet (Sun et al., 2023, arXiv:2307.08621), all of which preceded Mamba. Whether SSMs replace, complement, or merge with transformers is an open question, but understanding their mathematics is no longer optional.

Mental Model

Think of an SSM as a linear dynamical system with a hidden state that evolves over time. At each timestep, the input nudges the hidden state, and the output is a linear read-out of that state. The magic is that this linear recurrence can be computed either step-by-step (for autoregressive generation) or as a global convolution (for parallel training). S4 made this efficient; Mamba made it input-dependent.

Formal Setup

Definition

Continuous-Time State-Space Model

A continuous-time SSM maps input u(t) \in \mathbb{R} to output y(t) \in \mathbb{R} via a latent state \mathbf{x}(t) \in \mathbb{R}^N:

\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t) + \mathbf{B}u(t)
y(t) = \mathbf{C}\mathbf{x}(t) + Du(t)

where \mathbf{A} \in \mathbb{R}^{N \times N}, \mathbf{B} \in \mathbb{R}^{N \times 1}, \mathbf{C} \in \mathbb{R}^{1 \times N}, D \in \mathbb{R}.

Definition

Discretization

For discrete sequences with step size \Delta, the continuous SSM is discretized (e.g., via zero-order hold):

\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \quad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}(\exp(\Delta \mathbf{A}) - \mathbf{I}) \cdot \Delta \mathbf{B}

The discrete recurrence becomes:

\mathbf{x}_k = \bar{\mathbf{A}}\mathbf{x}_{k-1} + \bar{\mathbf{B}}u_k
y_k = \mathbf{C}\mathbf{x}_k + Du_k
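The discretization and recurrence above fit in a few lines of code. A minimal NumPy sketch, assuming a diagonal \mathbf{A} (as in S4D and Mamba) so the matrix exponential is elementwise; all function and variable names are illustrative:

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-order-hold discretization for diagonal A.
    A_bar = exp(dt*A); B_bar = (dt*A)^{-1}(exp(dt*A) - I) * dt*B,
    which reduces elementwise to (exp(dt*a) - 1)/a * b."""
    A_bar = np.exp(dt * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, u, D=0.0):
    """x_k = A_bar x_{k-1} + B_bar u_k ; y_k = C x_k + D u_k."""
    x = np.zeros_like(A_bar)
    y = np.zeros(len(u))
    for k, u_k in enumerate(u):
        x = A_bar * x + B_bar * u_k   # elementwise product: A is diagonal
        y[k] = C @ x + D * u_k
    return y
```

For stability, the entries of A_diag should be negative (more generally, have negative real part), so \bar{\mathbf{A}} has entries of magnitude below one and the state decays rather than blows up.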

Main Theorems

Theorem

SSM as Global Convolution

Statement

The output sequence of a discrete LTI state-space model can be computed as a global convolution:

y_k = \sum_{j=0}^{k} \mathbf{C}\bar{\mathbf{A}}^j\bar{\mathbf{B}}\, u_{k-j} = (\bar{K} * u)_k

where the convolution kernel is \bar{K} = (\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}^2\bar{\mathbf{B}},\; \ldots). (This formula assumes no direct feedthrough, i.e., D = 0, which is the standard S4 setup; the direct term Du_k would add separately in the general LTI case.)

Intuition

Unrolling the recurrence \mathbf{x}_k = \bar{\mathbf{A}}\mathbf{x}_{k-1} + \bar{\mathbf{B}}u_k shows that the output at time k is a weighted sum of all past inputs, with weights given by powers of \bar{\mathbf{A}}. This is exactly a convolution. During training, you can compute this convolution in O(n \log n) time via FFT. During inference, you use the recurrence directly at O(1) cost per step.
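The equivalence of the two views is easy to check numerically. A sketch with a diagonal \bar{\mathbf{A}} so matrix powers are elementwise (names are illustrative, and the causal convolution is done naively rather than via FFT):

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...)."""
    powers = A_bar[None, :] ** np.arange(L)[:, None]   # (L, N): row j holds A_bar^j
    return (powers * B_bar) @ C                        # K[j] = C . (A_bar^j B_bar)

def ssm_conv(A_bar, B_bar, C, u):
    """y_k = sum_{j<=k} K[j] u_{k-j}  (causal convolution; D = 0)."""
    K = ssm_kernel(A_bar, B_bar, C, len(u))
    return np.array([np.dot(K[:k + 1], u[k::-1]) for k in range(len(u))])

def ssm_recurrence(A_bar, B_bar, C, u):
    """Same output computed via the step-by-step recurrence."""
    x = np.zeros_like(A_bar)
    y = np.zeros(len(u))
    for k, u_k in enumerate(u):
        x = A_bar * x + B_bar * u_k
        y[k] = C @ x
    return y
```

The two paths agree to floating-point precision; in S4 the kernel is computed implicitly and the convolution is performed with FFTs in O(n \log n).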

Why It Matters

This dual view (recurrence for inference, convolution for training) is why SSMs are practical. Training via convolution is parallelizable (like transformers). Inference via recurrence costs O(1) per token (like RNNs, and unlike transformers, which must attend to all previous tokens).

Failure Mode

The convolution view requires the parameters (\mathbf{A}, \mathbf{B}, \mathbf{C}) to be time-invariant (the same for every timestep). Mamba breaks this assumption by making the parameters input-dependent, which is exactly what makes it powerful and what prevents direct use of the convolution view.

Theorem

HiPPO: Optimal Polynomial Memory

Statement

The HiPPO (High-Order Polynomial Projection Operator) framework derives the matrix \mathbf{A} such that the state \mathbf{x}(t) optimally represents the history of the input as a projection onto a polynomial basis. For HiPPO-LegS, the measure is uniform over [0, t] (scaled Legendre polynomials on the cumulative history; see Gu, Dao, Ermon, Rudra, Re, "HiPPO: Recurrent Memory with Optimal Polynomial Projections," NeurIPS 2020, Theorem 2 / eq. (5)). The alternative HiPPO-LegT uses a uniform sliding-window measure over [t - \theta, t] and yields a different matrix. For LegS:

A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}

With this \mathbf{A}, the state \mathbf{x}(t) contains the coefficients of the optimal degree-(N-1) polynomial approximation to the input history.
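The LegS matrix is straightforward to construct directly from the casework above. A short sketch (function name is illustrative):

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix: A_{nk} = -sqrt(2n+1)sqrt(2k+1) for n > k,
    -(n+1) on the diagonal, and 0 above the diagonal."""
    A = np.zeros((N, N))
    for n in range(N):
        A[n, n] = -(n + 1)
        for k in range(n):                       # strictly lower triangle
            A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
    return A
```

The diagonal is -(n + 1) for n = 0, ..., N-1, so every eigenvalue of this lower-triangular matrix is negative and the continuous-time dynamics are stable.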

Intuition

The hidden state of the SSM is a compressed summary of the input history. The HiPPO matrix is the provably optimal choice for preserving as much information about the history as possible in N dimensions, where "information" is measured as polynomial approximation quality under a specific measure.

Why It Matters

Before HiPPO, SSMs used random or heuristic initialization for \mathbf{A} and could not model long-range dependencies. HiPPO provided the principled initialization for SSMs. S4's contribution was the DPLR (diagonal-plus-low-rank) parameterization and the Cauchy-kernel FFT computation that made HiPPO-initialized SSMs practical at scale (Gu, Goel, Re, ICLR 2022, Section 3).

Proposition

Selective State-Space Model (Mamba)

Statement

Mamba modifies the standard SSM by making \mathbf{B}, \mathbf{C}, and \Delta functions of the input u_k (keeping \mathbf{x} for the state):

\mathbf{B}_k = \text{Linear}_B(u_k), \quad \mathbf{C}_k = \text{Linear}_C(u_k), \quad \Delta_k = \text{softplus}(\text{Linear}_\Delta(u_k))

This makes the model input-dependent (selective): the dynamics change at every timestep based on the input, allowing the model to decide what to remember and what to forget as a function of the current token.

Intuition

A standard SSM applies the same dynamics to every input. It cannot decide that a particular token is important and should be remembered, or that another is noise and should be forgotten. Mamba's selectivity is analogous to the gating mechanism in LSTMs, but operating within the SSM framework. The input-dependent \Delta controls the timescale: large \Delta means "pay attention to this input," small \Delta means "coast on the current state."
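A minimal sequential sketch of the selective recurrence, assuming a scalar input channel and a diagonal \mathbf{A}; in Mamba proper the projections act on the full d-dimensional token and the scan runs as a fused parallel GPU kernel. All weight names are illustrative:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(u, A_diag, w_B, w_C, w_dt):
    """Sequential selective SSM: B_k, C_k, dt_k depend on the input u_k.
    u: (L,) scalar inputs; A_diag, w_B, w_C: (N,); w_dt: scalar."""
    x = np.zeros_like(A_diag)
    y = np.zeros(len(u))
    for k, u_k in enumerate(u):
        dt_k = softplus(w_dt * u_k)           # input-dependent step size
        B_k = w_B * u_k                       # input-dependent B
        C_k = w_C * u_k                       # input-dependent C
        A_bar = np.exp(dt_k * A_diag)         # per-step ZOH discretization
        B_bar = (A_bar - 1.0) / A_diag * B_k
        x = A_bar * x + B_bar * u_k
        y[k] = C_k @ x
    return y
```

Because \bar{\mathbf{A}} now differs at every step, no single convolution kernel exists; this loop is exactly what Mamba's hardware-aware selective scan parallelizes.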

Why It Matters

Selectivity is what made SSMs competitive with transformers on language. Without it, SSMs could model continuous signals (audio, time series) but struggled with discrete, information-dense sequences (text) where the model must selectively attend to specific tokens. The selective scan is Mamba's core contribution.

Failure Mode

Input-dependent parameters break the convolution view. Mamba compensates with a hardware-aware "selective scan" algorithm that computes the recurrence efficiently on GPUs using parallel scan primitives. This is fast in practice but more complex to implement than the FFT-based convolution of S4.
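The reason a parallel scan applies at all is that first-order linear updates compose associatively: each step x_k = \bar{a}_k x_{k-1} + \bar{b}_k is an affine map, and composing two affine maps yields another, so the prefix composition can be evaluated under any parallel-scan schedule. A sketch of the combine operator (illustrative names; real implementations fuse this into a GPU kernel):

```python
import numpy as np

def combine(e1, e2):
    """Compose affine maps x -> a x + b: applying e1 then e2 gives
    x -> a2 (a1 x + b1) + b2 = (a1 a2) x + (a2 b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def scan_states(a, b):
    """States x_k of x_k = a_k x_{k-1} + b_k with x_0 = 0, via a left fold.
    Associativity of `combine` lets a parallel scan compute the same prefixes."""
    acc = (np.ones_like(a[0]), np.zeros_like(b[0]))   # identity map
    states = []
    for a_k, b_k in zip(a, b):
        acc = combine(acc, (a_k, b_k))
        states.append(acc[1])                         # applying acc to x_0 = 0
    return states
```

Applying the accumulated map to the zero initial state is just its b-component, which is why `scan_states` reads off `acc[1]`.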

Complexity Comparison

| Operation | Transformer | S4 (LTI SSM) | Mamba (selective SSM) |
|---|---|---|---|
| Training (length n) | O(n^2 d) | O(n \log n \cdot d) | O(n \cdot d \cdot N) |
| Inference per token | O(n \cdot d) | O(d \cdot N) | O(d \cdot N) |
| State size per layer | O(n \cdot d) (KV cache) | O(d \cdot N) | O(d \cdot N) |

where d is the model dimension and N is the SSM state dimension (typically N = 16). The O(n) vs. O(n^2) gap matters enormously for long sequences.

Limitations

SSMs, including Mamba, have known weaknesses:

  1. In-context retrieval: Tasks requiring precise copying or retrieval of specific tokens from earlier in the sequence (e.g., "repeat the word at position 5000") are hard for SSMs because the hidden state is a compressed summary, not a content-addressable memory like the KV cache.

  2. Associative recall: Tasks like "what value was associated with key X?" at long range are limited by the state dimension N. Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Re (2023, "Zoology: Measuring and Improving Recall in Efficient Language Models," arXiv:2312.04927) show that SSMs with state dimension N need N \geq \Omega(K) to reliably retrieve K distinct (key, value) pairs.

  3. Induction heads: Certain attention patterns (induction heads) that are crucial for in-context learning in transformers have no obvious SSM analog.

What Aged Badly

Early claims (2023-2024) that SSMs would "replace transformers" were premature. As of 2025, the trajectory points toward hybrid architectures: models that combine SSM layers (for efficient long-range processing) with attention layers (for precise retrieval and in-context learning). Jamba (AI21), Zamba, and several research models follow this hybrid pattern. Pure SSM models have not matched frontier transformer performance on general language benchmarks.

Common Confusions

Watch Out

SSMs are not just 'fast RNNs'

While SSMs share the recurrent structure of RNNs, the key difference is the structured state matrix \mathbf{A} (HiPPO) and the dual recurrence/convolution view. Vanilla RNNs have unstructured hidden states that lose long-range information exponentially. SSMs with HiPPO initialization provably maintain polynomial approximations of the input history. The mathematical framework is structurally different.

Watch Out

Linear dynamics does not mean linear model

The state dynamics are linear (\mathbf{x}_k = \bar{\mathbf{A}}\mathbf{x}_{k-1} + \bar{\mathbf{B}}u_k), but the overall model is nonlinear: Mamba wraps the SSM in nonlinear activations, gating, and input-dependent parameterization. The linearity of the core dynamics enables efficient computation; the surrounding nonlinearity provides expressiveness.
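A toy illustration of this division of labor, with hypothetical names (loosely in the spirit of Mamba's gated block, not its actual architecture): the core is linear in its input sequence, and the wrapper supplies the nonlinearity via an input-dependent gate.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def gated_block(u, ssm_core, w_in, w_gate):
    """u: (L, d) tokens. The SSM core sees a linear projection of u and is
    itself a linear sequence-to-sequence map; the elementwise gate
    silu(u @ w_gate) is what makes the whole block nonlinear in u."""
    h = ssm_core(u @ w_in)       # linear-dynamics core, (L,)
    g = silu(u @ w_gate)         # nonlinear, input-dependent gate, (L,)
    return h * g
```

Any linear sequence map can stand in for `ssm_core` here; the point is that scaling its input scales its output, while the gated block as a whole has no such property.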

Summary

  • SSMs model sequences via linear state dynamics: recurrence for inference, convolution for training
  • HiPPO initialization provides provably optimal long-range memory
  • S4 = structured state spaces + HiPPO + efficient FFT computation
  • Mamba = S4 + input-dependent (selective) parameters: the model learns what to remember and what to forget
  • O(n) in sequence length vs. O(n^2) for standard attention
  • Weakness: precise in-context retrieval is harder without content-addressable memory
  • Hybrid SSM-attention architectures are the likely convergence point

Exercises

ExerciseCore

Problem

For an LTI SSM with state dimension N = 16 and sequence length n = 100{,}000, compare the per-token inference cost to a transformer with hidden dimension d = 4096. Which is cheaper, and by how much?

ExerciseAdvanced

Problem

Explain why making \mathbf{B} and \mathbf{C} input-dependent (as in Mamba) breaks the global convolution view of SSMs. What computational technique does Mamba use instead?

ExerciseResearch

Problem

Hybrid models (e.g., Jamba) interleave SSM layers and attention layers. From first principles, argue which layers in a deep network should be SSM layers and which should be attention layers. What tasks or capabilities does each layer type serve?

DeltaNet and the Delta Rule

DeltaNet (Schlag et al., 2021; Yang et al., 2024) applies the classical delta rule from associative memory to sequence modeling. Where Mamba uses a diagonal state matrix \bar{\mathbf{A}}, DeltaNet uses a dense state update that writes key-value associations into a matrix memory:

M_t = M_{t-1} + v_t k_t^\top - (M_{t-1} k_t) k_t^\top

The first term writes a new association. The second term erases the old value associated with key k_t before writing the new one. This is the delta rule: update by the error between the desired value v_t and the current retrieval M_{t-1} k_t. (Shown here with gate \beta_t = 1; the full DeltaNet of Yang et al. 2024, arXiv:2406.06484, includes a per-token gate \beta_t \in (0, 1], giving M_t = M_{t-1} + \beta_t (v_t - M_{t-1} k_t) k_t^\top, which interpolates between no update and full replacement.)
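The write-then-erase behavior is easy to verify numerically. A sketch of the gated delta-rule update (function name is illustrative):

```python
import numpy as np

def delta_update(M, k, v, beta=1.0):
    """M <- M + beta * (v - M k) k^T.
    With beta = 1 and a unit-norm key, reading back M k afterwards
    returns exactly v: the old association under k is fully replaced."""
    return M + beta * np.outer(v - M @ k, k)
```

With orthonormal keys, distinct associations do not interfere: writing under a second key leaves the value stored under the first untouched, which is precisely the content-addressable behavior that diagonal-state SSMs lack.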

DeltaNet can be seen as a linear attention variant where the state matrix M \in \mathbb{R}^{d \times d} serves as an associative memory. Unlike Mamba's diagonal state (which compresses information into N scalars per channel), DeltaNet's matrix state can store and retrieve key-value pairs, giving it stronger in-context retrieval ability.

The cost is O(d^2) state size per layer (vs. Mamba's O(dN) with N \ll d). DeltaNet occupies a middle ground between SSMs (cheap state, weak retrieval) and attention (expensive but perfect retrieval).


References

Canonical:

  • Gu, Goel, Re, "Efficiently Modeling Long Sequences with Structured State Spaces" (ICLR 2022) [S4]
  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)

Current:

  • Dao & Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality" (2024) [Mamba-2]
  • Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024)
  • Yang et al., "Parallelizing Linear Transformers with the Delta Rule over Sequence Length" (NeurIPS 2024) [DeltaNet]
  • Gu, Dao, Ermon, Rudra, Re, "HiPPO: Recurrent Memory with Optimal Polynomial Projections" (NeurIPS 2020).
  • Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Re, "Zoology: Measuring and Improving Recall in Efficient Language Models" (2023, arXiv:2312.04927).

Next Topics

The natural next steps from state-space models:

  • Mixture of experts: another approach to scaling efficiency, often combined with SSMs in hybrid architectures
  • Context engineering: managing what the model sees, especially relevant when the architecture has limited retrieval capacity

Last reviewed: April 2026
