Beyond LLMs
Mamba and State-Space Models
Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.
Why This Matters
Transformers scale quadratically in sequence length: self-attention over $L$ tokens costs $O(L^2)$ time and memory. For long sequences (books, genomics data, high-resolution audio), this is prohibitive. State-space models (SSMs) achieve $O(L)$ scaling by processing sequences through a structured linear recurrence, and Mamba made SSMs competitive with transformers on language tasks.
Mamba is the most compute-efficient alternative to date that reaches transformer-level performance across broad language benchmarks. Earlier alternatives in the same line include Linear Transformers (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020), RWKV (Peng et al., 2023, arXiv:2305.13048), and RetNet (Sun et al., 2023, arXiv:2307.08621), all of which preceded Mamba. Whether SSMs replace, complement, or merge with transformers is an open question, but understanding their mathematics is no longer optional.
Mental Model
Think of an SSM as a linear dynamical system with a hidden state that evolves over time. At each timestep, the input nudges the hidden state, and the output is a linear read-out of that state. The magic is that this linear recurrence can be computed either step-by-step (for autoregressive generation) or as a global convolution (for parallel training). S4 made this efficient; Mamba made it input-dependent.
Formal Setup
Continuous-Time State-Space Model
A continuous-time SSM maps input $u(t) \in \mathbb{R}$ to output $y(t) \in \mathbb{R}$ via a latent state $x(t) \in \mathbb{R}^N$:

$$x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, $D \in \mathbb{R}$.
Discretization
For discrete sequences with step size $\Delta$, the continuous SSM is discretized (e.g., via zero-order hold):

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

The discrete recurrence becomes:

$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k$$
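To make the discretization concrete, here is a minimal NumPy sketch. It uses a diagonal $A$ (as in S4D-style parameterizations, where $\exp(\Delta A)$ is elementwise); the system values are arbitrary stand-ins, not trained parameters:

```python
import numpy as np

N, L = 4, 32          # state dimension, sequence length
dt = 0.1              # step size Delta

# Diagonal A (stable: negative eigenvalues); B, C are random stand-ins.
A = -np.arange(1, N + 1, dtype=float)
B = np.ones(N)
C = np.random.default_rng(0).standard_normal(N)

# Zero-order hold: A_bar = exp(dt*A); B_bar = (dt*A)^{-1}(exp(dt*A)-I) dt*B,
# which for diagonal A simplifies elementwise to (A_bar - 1)/A * B.
A_bar = np.exp(dt * A)
B_bar = (A_bar - 1.0) / A * B

# Discrete recurrence: x_k = A_bar * x_{k-1} + B_bar * u_k ; y_k = C . x_k
u = np.sin(np.linspace(0, 3, L))
x = np.zeros(N)
ys = []
for u_k in u:
    x = A_bar * x + B_bar * u_k
    ys.append(C @ x)
```

Because $A$ is diagonal and stable, `A_bar` has entries in $(0, 1)$, so the state decays geometrically between input injections.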
Main Theorems
SSM as Global Convolution
Statement
The output sequence of a discrete LTI state-space model can be computed as a global convolution:

$$y = \bar{K} * u, \qquad y_k = \sum_{j=0}^{k} \bar{K}_j\, u_{k-j}$$

where the convolution kernel is $\bar{K} = \left(C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^2\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B}\right)$. (This formula assumes no direct feedthrough, i.e., $D = 0$, which is the standard S4 setup; the direct term $D u_k$ would add separately in the general LTI case.)
Intuition
Unrolling the recurrence shows that the output at time $k$ is a weighted sum of all past inputs, with weights given by powers of $\bar{A}$. This is exactly a convolution. During training, you can compute this convolution in $O(L \log L)$ time via FFT. During inference, you use the recurrence directly in $O(N)$ per step.
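The equivalence of the two views can be checked numerically on a toy diagonal system (all values below are arbitrary stand-ins; a direct causal convolution is used here rather than the FFT, since the point is the identity of the outputs):

```python
import numpy as np

N, L = 4, 16
rng = np.random.default_rng(1)
A_bar = np.exp(-0.1 * np.arange(1, N + 1))   # diagonal, stable
B_bar = rng.standard_normal(N)
C = rng.standard_normal(N)
u = rng.standard_normal(L)

# Recurrent view: O(N) work per step.
x, y_rec = np.zeros(N), []
for u_k in u:
    x = A_bar * x + B_bar * u_k
    y_rec.append(C @ x)

# Convolutional view: K_j = C A_bar^j B_bar, then y_k = sum_j K_j u_{k-j}.
K = np.array([C @ (A_bar ** j * B_bar) for j in range(L)])
y_conv = [K[:k + 1] @ u[k::-1] for k in range(L)]

assert np.allclose(y_rec, y_conv)            # identical outputs
```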
Why It Matters
This dual view (recurrence for inference, convolution for training) is why SSMs are practical. Training via convolution is parallelizable (like transformers). Inference via recurrence is $O(1)$ in sequence length per token (like RNNs, unlike transformers, which must attend to all previous tokens).
Failure Mode
The convolution view requires the parameters to be time-invariant (the same for every timestep). Mamba breaks this assumption by making parameters input-dependent, which is exactly what makes it powerful and exactly what prevents direct use of the convolution view.
HiPPO: Optimal Polynomial Memory
Statement
The HiPPO (High-Order Polynomial Projection Operator) framework derives the matrix $A$ such that the state $x(t)$ optimally represents the history of the input as a projection onto a polynomial basis. For HiPPO-LegS, the measure is uniform over $[0, t]$ (scaled Legendre polynomials on the cumulative history; see Gu, Dao, Ermon, Rudra, Re, "HiPPO: Recurrent Memory with Optimal Polynomial Projections," NeurIPS 2020, Theorem 2 / eq. (5)). The alternative HiPPO-LegT uses a uniform sliding-window measure over $[t - \theta, t]$ and yields a different matrix. For LegS:

$$A_{nk} = -\begin{cases} \sqrt{(2n+1)(2k+1)} & n > k \\ n+1 & n = k \\ 0 & n < k \end{cases}$$

With this $A$, the state $x(t) \in \mathbb{R}^N$ contains the coefficients of the optimal degree-$(N-1)$ polynomial approximation to the input history.
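As a sketch, the LegS matrix can be constructed directly from the piecewise formula (0-indexed):

```python
import numpy as np

def hippo_legs(N: int) -> np.ndarray:
    """HiPPO-LegS state matrix (0-indexed):
    A[n,k] = -sqrt((2n+1)(2k+1)) for n > k, -(n+1) for n = k, 0 for n < k."""
    n = np.arange(N)
    A = np.sqrt((2 * n[:, None] + 1) * (2 * n[None, :] + 1))
    A = -np.tril(A, k=-1)          # keep strictly lower-triangular part
    A -= np.diag(n + 1)            # diagonal entries -(n+1)
    return A

A = hippo_legs(4)
assert np.allclose(np.diag(A), [-1, -2, -3, -4])   # negative diagonal
assert np.allclose(np.triu(A, k=1), 0)             # zero above diagonal
```

The matrix is lower triangular with a negative diagonal, so the induced dynamics are stable; this structure is what S4's DPLR parameterization then exploits.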
Intuition
The hidden state of the SSM is a compressed summary of the input history. The HiPPO matrix is the provably optimal choice for preserving as much information about the history as possible in $N$ dimensions, where "information" is measured as polynomial approximation quality under a specific measure.
Why It Matters
Before HiPPO, SSMs used random or heuristic initialization for $A$ and could not model long-range dependencies. HiPPO provided the principled initialization for SSMs. S4's contribution was the DPLR (diagonal plus low-rank) parameterization and the Cauchy-kernel FFT computation that made HiPPO-initialized SSMs practical at scale (Gu, Goel, Re, ICLR 2022, Section 3).
Selective State-Space Model (Mamba)
Statement
Mamba modifies the standard SSM by making $\Delta$, $B$, and $C$ functions of the input:

$$\Delta_k = \mathrm{softplus}(W_\Delta u_k), \qquad B_k = W_B u_k, \qquad C_k = W_C u_k$$

$$x_k = \bar{A}_k\, x_{k-1} + \bar{B}_k\, u_k, \qquad y_k = C_k\, x_k$$

with per-step discretization $\bar{A}_k = \exp(\Delta_k A)$ and $\bar{B}_k = \Delta_k B_k$.
This makes the model input-dependent (selective): the dynamics change at every timestep based on the input, allowing the model to decide what to remember and what to forget as a function of the current token.
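A minimal single-channel sketch of the selective recurrence (the projections `W_B`, `W_C`, `w_dt` are random stand-ins for learned weights; the simplified Euler rule $\bar{B}_k = \Delta_k B_k$ follows Mamba's discretization, and only channel 0 of the input is scanned, whereas Mamba runs all $d$ channels in parallel):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

rng = np.random.default_rng(0)
N, d, L = 16, 8, 32                      # state dim, input dim, seq length
A = -np.arange(1, N + 1, dtype=float)    # fixed diagonal A (not selective)
W_B, W_C = rng.standard_normal((2, N, d)) * 0.1
w_dt = rng.standard_normal(d) * 0.1

u = rng.standard_normal((L, d))
x = np.zeros(N)
ys = []
for u_k in u:
    dt_k = softplus(w_dt @ u_k)          # Delta_k: input-dependent step size
    B_k = W_B @ u_k                      # input-dependent B_k
    C_k = W_C @ u_k                      # input-dependent C_k
    A_bar = np.exp(dt_k * A)             # per-step discretization of A
    x = A_bar * x + dt_k * B_k * u_k[0]  # scan over channel 0's scalar input
    ys.append(C_k @ x)
```

Because `dt_k`, `B_k`, and `C_k` change every step, no single convolution kernel $\bar{K}$ exists for this system, which is why Mamba computes the recurrence with a scan instead.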
Intuition
A standard SSM applies the same dynamics to every input. It cannot decide that a particular token is important and should be remembered, or that another is noise and should be forgotten. Mamba's selectivity is analogous to the gating mechanism in LSTMs, but operating within the SSM framework. The input-dependent $\Delta_k$ controls the timescale: large $\Delta_k$ means "pay attention to this input," small $\Delta_k$ means "coast on the current state."
Why It Matters
Selectivity is what made SSMs competitive with transformers on language. Without it, SSMs could model continuous signals (audio, time series) but struggled with discrete, information-dense sequences (text) where the model must selectively attend to specific tokens. The selective scan is Mamba's core contribution.
Failure Mode
Input-dependent parameters break the convolution view. Mamba compensates with a hardware-aware "selective scan" algorithm that computes the recurrence efficiently on GPUs using parallel scan primitives. This is fast in practice but more complex to implement than the FFT-based convolution of S4.
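The reason a parallel scan applies at all is that the recurrence $x_k = a_k x_{k-1} + b_k$ composes associatively: each step is an affine map, and affine maps compose as $(a_1, b_1) \circ (a_2, b_2) = (a_1 a_2,\; a_2 b_1 + b_2)$. A scalar sketch verifying that this combine rule reproduces the sequential recurrence (a real implementation applies `combine` in an $O(\log L)$-depth tree over the sequence):

```python
import numpy as np

def combine(p, q):
    """(a1,b1) o (a2,b2) = (a1*a2, a2*b1 + b2):
    composing the affine maps x -> a1*x+b1 then x -> a2*x+b2."""
    a1, b1 = p
    a2, b2 = q
    return (a1 * a2, a2 * b1 + b2)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, 8)     # per-step decay (input-dependent in Mamba)
b = rng.standard_normal(8)       # per-step input term

# Sequential recurrence: x_k = a_k x_{k-1} + b_k, starting from x = 0.
x, xs_seq = 0.0, []
for a_k, b_k in zip(a, b):
    x = a_k * x + b_k
    xs_seq.append(x)

# Prefix combination with the associative operator gives the same states.
acc, xs_scan = (1.0, 0.0), []    # identity affine map
for a_k, b_k in zip(a, b):
    acc = combine(acc, (a_k, b_k))
    xs_scan.append(acc[1])

assert np.allclose(xs_seq, xs_scan)
```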
Complexity Comparison
| Operation | Transformer | S4 (LTI-SSM) | Mamba (Selective SSM) |
|---|---|---|---|
| Training (length $L$) | $O(L^2 d)$ | $O(L \log L \cdot d)$ | $O(L\, d\, N)$ |
| Inference per token | $O(L d)$ | $O(d N)$ | $O(d N)$ |
| State size per layer | $O(L d)$ (KV cache) | $O(d N)$ | $O(d N)$ |
where $d$ is the model dimension and $N$ is the SSM state dimension (typically $N \approx 16$). The $O(L)$ vs $O(L^2)$ gap matters enormously for long sequences.
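Plugging in illustrative numbers (assumed for the sake of the estimate, not taken from any specific model) makes the state-size row concrete:

```python
# Back-of-envelope per-layer state sizes, assuming fp16 (2 bytes/value).
d, N, L = 4096, 16, 100_000

kv_cache = 2 * L * d * 2          # keys + values for L tokens, in bytes
ssm_state = d * N * 2             # fixed-size SSM state, independent of L

print(f"KV cache:  {kv_cache / 1e9:.1f} GB per layer")   # -> 1.6 GB
print(f"SSM state: {ssm_state / 1e3:.0f} KB per layer")  # -> 131 KB
```

The KV cache grows linearly with context length; the SSM state does not, which is the entire point of the comparison.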
Limitations
SSMs, including Mamba, have known weaknesses:
- In-context retrieval: Tasks requiring precise copying or retrieval of specific tokens from earlier in the sequence (e.g., "repeat the word at position 5000") are hard for SSMs because the hidden state is a compressed summary, not a content-addressable memory like the KV cache.
- Associative recall: Tasks like "what value was associated with key X?" at long range are limited by the state dimension $N$. Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Re (2023, "Zoology: Measuring and Improving Recall in Efficient Language Models," arXiv:2312.04927) show that SSMs need state dimension scaling linearly with the number of distinct (key, value) pairs to reliably retrieve them.
- Induction heads: Certain attention patterns (induction heads) that are crucial for in-context learning in transformers have no obvious SSM analog.
Early claims (2023-2024) that SSMs would "replace transformers" were premature. As of 2025, the trajectory points toward hybrid architectures: models that combine SSM layers (for efficient long-range processing) with attention layers (for precise retrieval and in-context learning). Jamba (AI21), Zamba, and several research models follow this hybrid pattern. Pure SSM models have not matched frontier transformer performance on general language benchmarks.
Common Confusions
SSMs are not just 'fast RNNs'
While SSMs share the recurrent structure of RNNs, the key difference is the structured state matrix (HiPPO) and the dual recurrence/convolution view. Vanilla RNNs have unstructured hidden states that lose long-range information exponentially. SSMs with HiPPO initialization provably maintain polynomial approximations of the input history. The mathematical framework is structurally different.
Linear dynamics does not mean linear model
The state dynamics are linear ($x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k$), but the overall model is nonlinear: Mamba wraps the SSM in nonlinear activations, gating, and input-dependent parameterization. The linearity of the core dynamics enables efficient computation; the surrounding nonlinearity provides expressiveness.
Summary
- SSMs model sequences via linear state dynamics: recurrence for inference, convolution for training
- HiPPO initialization provides provably optimal long-range memory
- S4 = structured state spaces + HiPPO + efficient FFT computation
- Mamba = S4 + input-dependent (selective) parameters; the model learns what to remember and what to forget
- $O(L)$ scaling in sequence length vs $O(L^2)$ for standard attention
- Weakness: precise in-context retrieval is harder without content-addressable memory
- Hybrid SSM-attention architectures are the likely convergence point
Exercises
Problem
For an LTI SSM with state dimension $N$ and sequence length $L$, compare the per-token inference cost to a transformer with hidden dimension $d$. Which is cheaper, and by how much?
Problem
Explain why making $\Delta$, $B$, and $C$ input-dependent (as in Mamba) breaks the global convolution view of SSMs. What computational technique does Mamba use instead?
Problem
Hybrid models (e.g., Jamba) interleave SSM layers and attention layers. From first principles, argue which layers in a deep network should be SSM layers and which should be attention layers. What tasks or capabilities does each layer type serve?
DeltaNet and the Delta Rule
DeltaNet (Schlag et al., 2021; Yang et al., 2024) applies the classical delta rule from associative memory to sequence modeling. Where Mamba uses a diagonal state matrix $A$, DeltaNet uses a dense state update that writes key-value associations into a matrix memory $S_t \in \mathbb{R}^{d \times d}$:

$$S_t = S_{t-1} + v_t k_t^\top - (S_{t-1} k_t)\, k_t^\top$$

The first term writes a new association $v_t k_t^\top$. The second term erases the old value $S_{t-1} k_t$ associated with key $k_t$ before writing the new one. This is the delta rule: update by the error between the desired value $v_t$ and the current retrieval $S_{t-1} k_t$. (Shown here with gate $\beta_t = 1$; the full DeltaNet of Yang et al. 2024, arXiv:2406.06484, includes a per-token gate $\beta_t \in [0, 1]$ giving $S_t = S_{t-1} + \beta_t (v_t - S_{t-1} k_t) k_t^\top$, which interpolates between no update and full replacement.)
DeltaNet can be seen as a linear attention variant where the state matrix serves as an associative memory. Unlike Mamba's diagonal state (which compresses information into $N$ scalars per channel), DeltaNet's matrix state can store and retrieve key-value pairs, giving it stronger in-context retrieval ability.
The cost is $O(d^2)$ state size per layer (vs Mamba's $O(dN)$ with $N \ll d$). DeltaNet occupies a middle ground between SSMs (cheap state, weak retrieval) and attention (expensive but perfect retrieval).
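A toy sketch of the delta-rule update (a unit-norm random key stands in for a learned projection), showing that rewriting a key replaces its stored value rather than accumulating it:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
S = np.zeros((d, d))                      # matrix associative memory

k = rng.standard_normal(d)
k /= np.linalg.norm(k)                    # unit-norm key
v1, v2 = rng.standard_normal((2, d))

def delta_write(S, k, v, beta=1.0):
    """Delta rule: S <- S + beta * (v - S k) k^T (erase old value, write new)."""
    return S + beta * np.outer(v - S @ k, k)

S = delta_write(S, k, v1)
assert np.allclose(S @ k, v1)             # key k now retrieves v1

S = delta_write(S, k, v2)
assert np.allclose(S @ k, v2)             # v2 replaced v1 (not v1 + v2)
```

A naive Hebbian update `S + outer(v, k)` would retrieve `v1 + v2` after the second write; the error-correcting term is what makes the memory overwrite cleanly.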
References
Canonical:
- Gu, Goel, Re, "Efficiently Modeling Long Sequences with Structured State Spaces" (ICLR 2022) [S4]
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
Current:
- Dao & Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality" (2024) [Mamba-2]
- Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (2024)
- Yang et al., "Parallelizing Linear Transformers with the Delta Rule over Sequence Length" (NeurIPS 2024). DeltaNet.
- Gu, Dao, Ermon, Rudra, Re, "HiPPO: Recurrent Memory with Optimal Polynomial Projections" (NeurIPS 2020).
- Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Re, "Zoology: Measuring and Improving Recall in Efficient Language Models" (2023, arXiv:2312.04927).
Next Topics
The natural next steps from state-space models:
- Mixture of experts: another approach to scaling efficiency, often combined with SSMs in hybrid architectures
- Context engineering: managing what the model sees, especially relevant when the architecture has limited retrieval capacity
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Recurrent Neural Networks (Layer 3)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Attention Mechanism Theory (Layer 4)
- Softmax and Numerical Stability (Layer 1)