

Mamba and State-Space Models

Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.


Why This Matters

Transformers scale quadratically in sequence length: self-attention over n tokens costs O(n^2) time and memory. For long sequences (books, genomics data, high-resolution audio), this is prohibitive. State-space models (SSMs) achieve O(n) scaling by processing sequences through a structured linear recurrence, and Mamba made SSMs competitive with transformers on language tasks.

Mamba is the most compute-efficient alternative to date that reaches transformer-level performance across broad language benchmarks. Earlier alternatives in the same line include Linear Transformers (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020), RWKV (Peng et al., 2023, arXiv:2305.13048), and RetNet (Sun et al., 2023, arXiv:2307.08621), all of which preceded Mamba. Whether SSMs replace, complement, or merge with transformers is an open question, but understanding their mathematics is no longer optional.

Mental Model

Think of an SSM as a linear dynamical system with a hidden state that evolves over time. At each timestep, the input nudges the hidden state, and the output is a linear read-out of that state. The magic is that this linear recurrence can be computed either step-by-step (for autoregressive generation) or as a global convolution (for parallel training). S4 made this efficient; Mamba made it input-dependent.

Formal Setup

Definition

Continuous-Time State-Space Model

A continuous-time SSM maps input u(t) \in \mathbb{R} to output y(t) \in \mathbb{R} via a latent state \mathbf{x}(t) \in \mathbb{R}^N:

\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t) + \mathbf{B}u(t)
y(t) = \mathbf{C}\mathbf{x}(t) + Du(t)

where \mathbf{A} \in \mathbb{R}^{N \times N}, \mathbf{B} \in \mathbb{R}^{N \times 1}, \mathbf{C} \in \mathbb{R}^{1 \times N}, D \in \mathbb{R}.

Definition

Discretization

For discrete sequences with step size \Delta, the continuous SSM is discretized (e.g., via zero-order hold):

\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \quad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}(\exp(\Delta \mathbf{A}) - \mathbf{I}) \cdot \Delta \mathbf{B}

The discrete recurrence becomes:

\mathbf{x}_k = \bar{\mathbf{A}}\mathbf{x}_{k-1} + \bar{\mathbf{B}}u_k
y_k = \mathbf{C}\mathbf{x}_k + Du_k
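The discretization and recurrence above fit in a few lines of code. A minimal NumPy sketch, assuming a diagonal \mathbf{A} (as in S4D and Mamba) so the matrix exponential is elementwise; all function and variable names are illustrative:

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-order-hold discretization for diagonal A.
    A_bar = exp(dt*A); B_bar = (dt*A)^{-1}(exp(dt*A) - I) * dt*B,
    which reduces elementwise to (exp(dt*a) - 1)/a * b."""
    A_bar = np.exp(dt * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, u, D=0.0):
    """x_k = A_bar x_{k-1} + B_bar u_k ; y_k = C x_k + D u_k."""
    x = np.zeros_like(A_bar)
    y = np.zeros(len(u))
    for k, u_k in enumerate(u):
        x = A_bar * x + B_bar * u_k   # elementwise product: A is diagonal
        y[k] = C @ x + D * u_k
    return y
```

For stability, the entries of A_diag should be negative (more generally, have negative real part), so \bar{\mathbf{A}} has entries of magnitude below one and the state decays rather than blows up.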

Main Theorems

Theorem

SSM as Global Convolution

Statement

The output sequence of a discrete LTI state-space model can be computed as a global convolution:

y_k = \sum_{j=0}^{k} \mathbf{C}\bar{\mathbf{A}}^j\bar{\mathbf{B}}\, u_{k-j} = (\bar{K} * u)_k

where the convolution kernel is \bar{K} = (\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}^2\bar{\mathbf{B}},\; \ldots). (This formula assumes no direct feedthrough, i.e., D = 0, which is the standard S4 setup; the direct term Du_k would add separately in the general LTI case.)

Intuition

Unrolling the recurrence \mathbf{x}_k = \bar{\mathbf{A}}\mathbf{x}_{k-1} + \bar{\mathbf{B}}u_k shows that the output at time k is a weighted sum of all past inputs, with weights given by powers of \bar{\mathbf{A}}. This is exactly a convolution. During training, you can compute this convolution in O(n \log n) time via FFT. During inference, you use the recurrence directly at O(1) cost per step.
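The equivalence of the two views is easy to check numerically. A sketch with a diagonal \bar{\mathbf{A}} so matrix powers are elementwise (names are illustrative, and the causal convolution is done naively rather than via FFT):

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...)."""
    powers = A_bar[None, :] ** np.arange(L)[:, None]   # (L, N): row j holds A_bar^j
    return (powers * B_bar) @ C                        # K[j] = C . (A_bar^j B_bar)

def ssm_conv(A_bar, B_bar, C, u):
    """y_k = sum_{j<=k} K[j] u_{k-j}  (causal convolution; D = 0)."""
    K = ssm_kernel(A_bar, B_bar, C, len(u))
    return np.array([np.dot(K[:k + 1], u[k::-1]) for k in range(len(u))])

def ssm_recurrence(A_bar, B_bar, C, u):
    """Same output computed via the step-by-step recurrence."""
    x = np.zeros_like(A_bar)
    y = np.zeros(len(u))
    for k, u_k in enumerate(u):
        x = A_bar * x + B_bar * u_k
        y[k] = C @ x
    return y
```

The two paths agree to floating-point precision; in S4 the kernel is computed implicitly and the convolution is performed with FFTs in O(n \log n).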

Why It Matters

This dual view (recurrence for inference, convolution for training) is why SSMs are practical. Training via convolution is parallelizable (like transformers). Inference via recurrence costs O(1) per token (like RNNs, and unlike transformers, which must attend to all previous tokens).

Failure Mode

The convolution view requires the parameters (\mathbf{A}, \mathbf{B}, \mathbf{C}) to be time-invariant (the same for every timestep). Mamba breaks this assumption by making the parameters input-dependent, which is exactly what makes it powerful and what prevents direct use of the convolution view.

Theorem

HiPPO: Optimal Polynomial Memory

Statement

The HiPPO (High-Order Polynomial Projection Operator) framework derives the matrix \mathbf{A} such that the state \mathbf{x}(t) optimally represents the history of the input as a projection onto a polynomial basis. For HiPPO-LegS, the measure is uniform over [0, t] (scaled Legendre polynomials on the cumulative history; see Gu, Dao, Ermon, Rudra, Re, "HiPPO: Recurrent Memory with Optimal Polynomial Projections," NeurIPS 2020, Theorem 2 / eq. (5)). The alternative HiPPO-LegT uses a uniform sliding-window measure over [t - \theta, t] and yields a different matrix. For LegS:

A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}

With this \mathbf{A}, the state \mathbf{x}(t) contains the coefficients of the optimal degree-(N-1) polynomial approximation to the input history.
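The LegS matrix is straightforward to construct directly from the casework above. A short sketch (function name is illustrative):

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix: A_{nk} = -sqrt(2n+1)sqrt(2k+1) for n > k,
    -(n+1) on the diagonal, and 0 above the diagonal."""
    A = np.zeros((N, N))
    for n in range(N):
        A[n, n] = -(n + 1)
        for k in range(n):                       # strictly lower triangle
            A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
    return A
```

The diagonal is -(n + 1) for n = 0, ..., N-1, so every eigenvalue of this lower-triangular matrix is negative and the continuous-time dynamics are stable.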

Intuition

The hidden state of the SSM is a compressed summary of the input history. The HiPPO matrix is the provably optimal choice for preserving as much information about the history as possible in N dimensions, where "information" is measured as polynomial approximation quality under a specific measure.

Why It Matters

Before HiPPO, SSMs used random or heuristic initialization for \mathbf{A} and could not model long-range dependencies. HiPPO provided the principled initialization for SSMs. S4's contribution was the DPLR (diagonal-plus-low-rank) parameterization and the Cauchy-kernel FFT computation that made HiPPO-initialized SSMs practical at scale (Gu, Goel, Re, ICLR 2022, Section 3).

Proposition

Selective State-Space Model (Mamba)

Statement

Mamba modifies the standard SSM by making \mathbf{B}, \mathbf{C}, and \Delta functions of the input u_k (keeping \mathbf{x} for the state):

\mathbf{B}_k = \text{Linear}_B(u_k), \quad \mathbf{C}_k = \text{Linear}_C(u_k), \quad \Delta_k = \text{softplus}(\text{Linear}_\Delta(u_k))

This makes the model input-dependent (selective): the dynamics change at every timestep based on the input, allowing the model to decide what to remember and what to forget as a function of the current token.

Intuition

A standard SSM applies the same dynamics to every input. It cannot decide that a particular token is important and should be remembered, or that another is noise and should be forgotten. Mamba's selectivity is analogous to the gating mechanism in LSTMs, but operating within the SSM framework. The input-dependent \Delta controls the timescale: large \Delta means "pay attention to this input," small \Delta means "coast on the current state."
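A minimal sequential sketch of the selective recurrence, assuming a scalar input channel and a diagonal \mathbf{A}; in Mamba proper the projections act on the full d-dimensional token and the scan runs as a fused parallel GPU kernel. All weight names are illustrative:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(u, A_diag, w_B, w_C, w_dt):
    """Sequential selective SSM: B_k, C_k, dt_k depend on the input u_k.
    u: (L,) scalar inputs; A_diag, w_B, w_C: (N,); w_dt: scalar."""
    x = np.zeros_like(A_diag)
    y = np.zeros(len(u))
    for k, u_k in enumerate(u):
        dt_k = softplus(w_dt * u_k)           # input-dependent step size
        B_k = w_B * u_k                       # input-dependent B
        C_k = w_C * u_k                       # input-dependent C
        A_bar = np.exp(dt_k * A_diag)         # per-step ZOH discretization
        B_bar = (A_bar - 1.0) / A_diag * B_k
        x = A_bar * x + B_bar * u_k
        y[k] = C_k @ x
    return y
```

Because \bar{\mathbf{A}} now differs at every step, no single convolution kernel exists; this loop is exactly what Mamba's hardware-aware selective scan parallelizes.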

Why It Matters

Selectivity is what made SSMs competitive with transformers on language. Without it, SSMs could model continuous signals (audio, time series) but struggled with discrete, information-dense sequences (text) where the model must selectively attend to specific tokens. The selective scan is Mamba's core contribution.

Failure Mode

Input-dependent parameters break the convolution view. Mamba compensates with a hardware-aware "selective scan" algorithm that computes the recurrence efficiently on GPUs using parallel scan primitives. This is fast in practice but more complex to implement than the FFT-based convolution of S4.
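The reason a parallel scan applies at all is that first-order linear updates compose associatively: each step x_k = \bar{a}_k x_{k-1} + \bar{b}_k is an affine map, and composing two affine maps yields another, so the prefix composition can be evaluated under any parallel-scan schedule. A sketch of the combine operator (illustrative names; real implementations fuse this into a GPU kernel):

```python
import numpy as np

def combine(e1, e2):
    """Compose affine maps x -> a x + b: applying e1 then e2 gives
    x -> a2 (a1 x + b1) + b2 = (a1 a2) x + (a2 b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def scan_states(a, b):
    """States x_k of x_k = a_k x_{k-1} + b_k with x_0 = 0, via a left fold.
    Associativity of `combine` lets a parallel scan compute the same prefixes."""
    acc = (np.ones_like(a[0]), np.zeros_like(b[0]))   # identity map
    states = []
    for a_k, b_k in zip(a, b):
        acc = combine(acc, (a_k, b_k))
        states.append(acc[1])                         # applying acc to x_0 = 0
    return states
```

Applying the accumulated map to the zero initial state is just its b-component, which is why `scan_states` reads off `acc[1]`.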

Complexity Comparison

| Operation | Transformer | S4 (LTI SSM) | Mamba (selective SSM) |
|---|---|---|---|
| Training (length n) | O(n^2 d) | O(n \log n \cdot d) | O(n \cdot d \cdot N) |
| Inference per token | O(n \cdot d) | O(d \cdot N) | O(d \cdot N) |
| State size per layer | O(n \cdot d) (KV cache) | O(d \cdot N) | O(d \cdot N) |

where d is the model dimension and N is the SSM state dimension (typically N = 16). The O(n) vs. O(n^2) gap matters enormously for long sequences.

Limitations

SSMs, including Mamba, have known weaknesses:

  1. In-context retrieval: Tasks requiring precise copying or retrieval of specific tokens from earlier in the sequence (e.g., "repeat the word at position 5000") are hard for SSMs because the hidden state is a compressed summary, not a content-addressable memory like the KV cache.

  2. Associative recall: Tasks like "what value was associated with key X?" at long range are limited by the state dimension N. Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Re (2023, "Zoology: Measuring and Improving Recall in Efficient Language Models," arXiv:2312.04927) show that SSMs with state dimension N need N \geq \Omega(K) to reliably retrieve K distinct (key, value) pairs.

  3. Induction heads: Certain attention patterns (induction heads) that are crucial for in-context learning in transformers have no obvious SSM analog.

What Aged Badly

Early claims (2023-2024) that SSMs would "replace transformers" were premature. As of 2025, the trajectory points toward hybrid architectures: models that combine SSM layers (for efficient long-range processing) with attention layers (for precise retrieval and in-context learning). Jamba (AI21), Zamba, and several research models follow this hybrid pattern. Pure SSM models have not matched frontier transformer performance on general language benchmarks.

Common Confusions

Watch Out

SSMs are not just 'fast RNNs'

While SSMs share the recurrent structure of RNNs, the key difference is the structured state matrix \mathbf{A} (HiPPO) and the dual recurrence/convolution view. Vanilla RNNs have unstructured hidden states that lose long-range information exponentially. SSMs with HiPPO initialization provably maintain polynomial approximations of the input history. The mathematical framework is structurally different.

Watch Out

Linear dynamics does not mean linear model

The state dynamics are linear (\mathbf{x}_k = \bar{\mathbf{A}}\mathbf{x}_{k-1} + \bar{\mathbf{B}}u_k), but the overall model is nonlinear: Mamba wraps the SSM in nonlinear activations, gating, and input-dependent parameterization. The linearity of the core dynamics enables efficient computation; the surrounding nonlinearity provides expressiveness.
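A toy illustration of this division of labor, with hypothetical names (loosely in the spirit of Mamba's gated block, not its actual architecture): the core is linear in its input sequence, and the wrapper supplies the nonlinearity via an input-dependent gate.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def gated_block(u, ssm_core, w_in, w_gate):
    """u: (L, d) tokens. The SSM core sees a linear projection of u and is
    itself a linear sequence-to-sequence map; the elementwise gate
    silu(u @ w_gate) is what makes the whole block nonlinear in u."""
    h = ssm_core(u @ w_in)       # linear-dynamics core, (L,)
    g = silu(u @ w_gate)         # nonlinear, input-dependent gate, (L,)
    return h * g
```

Any linear sequence map can stand in for `ssm_core` here; the point is that scaling its input scales its output, while the gated block as a whole has no such property.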

Summary

  • SSMs model sequences via linear state dynamics: recurrence for inference, convolution for training
  • HiPPO initialization provides provably optimal long-range memory
  • S4 = structured state spaces + HiPPO + efficient FFT computation
  • Mamba = S4 + input-dependent (selective) parameters: the model learns what to remember and what to forget
  • O(n) in sequence length vs. O(n^2) for standard attention
  • Weakness: precise in-context retrieval is harder without content-addressable memory
  • Hybrid SSM-attention architectures are the likely convergence point

Exercises

ExerciseCore

Problem

For an LTI SSM with state dimension N = 16 and sequence length n = 100{,}000, compare the per-token inference cost to a transformer with hidden dimension d = 4096. Which is cheaper, and by how much?

ExerciseAdvanced

Problem

Explain why making \mathbf{B} and \mathbf{C} input-dependent (as in Mamba) breaks the global convolution view of SSMs. What computational technique does Mamba use instead?

ExerciseResearch

Problem

Hybrid models (e.g., Jamba) interleave SSM layers and attention layers. From first principles, argue which layers in a deep network should be SSM layers and which should be attention layers. What tasks or capabilities does each layer type serve?

DeltaNet and the Delta Rule

DeltaNet (Schlag et al., 2021; Yang et al., 2024) applies the classical delta rule from associative memory to sequence modeling. Where Mamba uses a diagonal state matrix \bar{\mathbf{A}}, DeltaNet uses a dense state update that writes key-value associations into a matrix memory:

M_t = M_{t-1} + v_t k_t^\top - (M_{t-1} k_t) k_t^\top

The first term writes a new association. The second term erases the old value associated with key k_t before writing the new one. This is the delta rule: update by the error between the desired value v_t and the current retrieval M_{t-1} k_t. (Shown here with gate \beta_t = 1; the full DeltaNet of Yang et al. 2024, arXiv:2406.06484, includes a per-token gate \beta_t \in (0, 1], giving M_t = M_{t-1} + \beta_t (v_t - M_{t-1} k_t) k_t^\top, which interpolates between no update and full replacement.)
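The write-then-erase behavior is easy to verify numerically. A sketch of the gated delta-rule update (function name is illustrative):

```python
import numpy as np

def delta_update(M, k, v, beta=1.0):
    """M <- M + beta * (v - M k) k^T.
    With beta = 1 and a unit-norm key, reading back M k afterwards
    returns exactly v: the old association under k is fully replaced."""
    return M + beta * np.outer(v - M @ k, k)
```

With orthonormal keys, distinct associations do not interfere: writing under a second key leaves the value stored under the first untouched, which is precisely the content-addressable behavior that diagonal-state SSMs lack.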

DeltaNet can be seen as a linear attention variant where the state matrix M \in \mathbb{R}^{d \times d} serves as an associative memory. Unlike Mamba's diagonal state (which compresses information into N scalars per channel), DeltaNet's matrix state can store and retrieve key-value pairs, giving it stronger in-context retrieval ability.

The cost is O(d^2) state size per layer (vs. Mamba's O(dN) with N \ll d). DeltaNet occupies a middle ground between SSMs (cheap state, weak retrieval) and attention (expensive but perfect retrieval).


References

Canonical:

  • Gu, Goel, Re, "Efficiently Modeling Long Sequences with Structured State Spaces" (ICLR 2022) [S4]
  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)

Current:

  • Dao & Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality" (2024) [Mamba-2]
  • Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024)
  • Yang et al., "Parallelizing Linear Transformers with the Delta Rule over Sequence Length" (NeurIPS 2024) [DeltaNet]
  • Gu, Dao, Ermon, Rudra, Re, "HiPPO: Recurrent Memory with Optimal Polynomial Projections" (NeurIPS 2020).
  • Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Re, "Zoology: Measuring and Improving Recall in Efficient Language Models" (2023, arXiv:2312.04927).

Next Topics

The natural next steps from state-space models:

  • Mixture of experts: another approach to scaling efficiency, often combined with SSMs in hybrid architectures
  • Context engineering: managing what the model sees, especially relevant when the architecture has limited retrieval capacity

Last reviewed: April 2026
