

State Space Models

Linear state space form, the Kalman filter and RTS smoother, EM for parameter learning, and the ARIMA equivalence. The unifying framework behind classical filtering and modern Mamba/S4.

Advanced · Tier 2 · Stable · Core spine · ~65 min

Why This Matters

A state space model represents a sequence as the observation of a hidden, time-evolving state. The state carries everything you need from the past; given the state at time t, the future is conditionally independent of the past. This Markov factorization is what makes filtering, smoothing, and likelihood evaluation tractable.

Three reasons to care. First, every ARIMA model has a state space representation, and the Kalman filter computes the exact Gaussian likelihood in O(n) time, including with missing observations — try doing that with a moving-average regression. Second, state space is the language of control, robotics, finance, and target tracking; the Kalman filter is the workhorse of inertial navigation and GPS fusion. Third, the recent generation of sequence models (S4, Mamba) is literally a discretized linear state space model with learned dynamics, and the connection matters for understanding their inductive bias and computational properties.

State Space Form

Definition

Linear Gaussian State Space Model

A linear Gaussian state space model is a pair of equations:

x_t = A_t x_{t-1} + B_t u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q_t),

y_t = C_t x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R_t),

with initial state x_0 \sim \mathcal{N}(\mu_0, \Sigma_0). The first is the state equation (or transition equation); x_t \in \mathbb{R}^n is the hidden state, A_t is the transition matrix, u_t is a known control input, w_t is process noise. The second is the observation equation (or measurement equation); y_t \in \mathbb{R}^m is the observation, C_t is the observation matrix, v_t is measurement noise. The noise sequences \{w_t\}, \{v_t\} are independent of each other and of x_0.

The matrices A_t, B_t, C_t, Q_t, R_t may be time-varying; we write A, B, C, Q, R when they are time-invariant. The model has Markov structure: p(x_t | x_{0:t-1}, y_{1:t-1}) = p(x_t | x_{t-1}, u_t), and observations are conditionally independent given the state.
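To make the definition concrete, here is a minimal simulation sketch in NumPy, assuming a time-invariant model with no control input. The constant-velocity example at the bottom is illustrative only; the matrices are not taken from the text.

```python
import numpy as np

def simulate_lgssm(A, C, Q, R, mu0, Sigma0, T, seed=0):
    """Sample one trajectory from a time-invariant linear Gaussian SSM (no control input)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], C.shape[0]
    xs, ys = np.zeros((T + 1, n)), np.zeros((T, m))
    xs[0] = rng.multivariate_normal(mu0, Sigma0)                          # x_0 ~ N(mu0, Sigma0)
    for t in range(1, T + 1):
        xs[t] = A @ xs[t - 1] + rng.multivariate_normal(np.zeros(n), Q)   # state equation
        ys[t - 1] = C @ xs[t] + rng.multivariate_normal(np.zeros(m), R)   # observation equation
    return xs, ys

# Illustrative 2-state constant-velocity model, observed in position only.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
xs, ys = simulate_lgssm(A, C, Q, R, mu0=np.zeros(2), Sigma0=np.eye(2), T=100)
```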

The Filtering Problem

Filtering computes p(x_t | y_{1:t}), the posterior over the current state given observations up to now. Under the linear Gaussian assumptions this posterior is itself Gaussian; let \hat{x}_{t|t} = \mathbb{E}[x_t | y_{1:t}] and P_{t|t} = \text{Cov}(x_t | y_{1:t}).

The Kalman filter alternates two steps.

Prediction step (push the posterior forward through the dynamics):

\hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1} + B u_t,

P_{t|t-1} = A P_{t-1|t-1} A^\top + Q.

Update step (incorporate observation y_t):

y_t \text{ predicted: } \hat{y}_{t|t-1} = C \hat{x}_{t|t-1}, \qquad \text{innovation: } \nu_t = y_t - \hat{y}_{t|t-1},

S_t = C P_{t|t-1} C^\top + R, \qquad K_t = P_{t|t-1} C^\top S_t^{-1},

\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \nu_t, \qquad P_{t|t} = (I - K_t C) P_{t|t-1}.

K_t is the Kalman gain. The detailed Kalman filter page covers the recursion, examples, and the EKF/UKF extensions; here we focus on the gain derivation and the smoother, EM, and ARIMA equivalence.
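As a reference point, one prediction-plus-update iteration can be written directly from the equations above. This is an illustrative NumPy sketch with names of our choosing; a production implementation would solve against S_t rather than forming an explicit inverse.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y_t, A, C, Q, R, B=None, u_t=None):
    """One Kalman filter iteration: predict through the dynamics, then update on y_t."""
    # Prediction step
    x_pred = A @ x_prev + (B @ u_t if B is not None else 0.0)
    P_pred = A @ P_prev @ A.T + Q
    # Update step
    nu = y_t - C @ x_pred                    # innovation
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    x_filt = x_pred + K @ nu
    P_filt = (np.eye(len(x_prev)) - K @ C) @ P_pred
    return x_filt, P_filt, nu, S
```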

Deriving the Kalman Gain

The clean way to derive K_t is by minimizing the trace of P_{t|t}, which is the posterior MSE.

Theorem

Kalman Gain Minimizes Posterior MSE

Statement

Among all linear estimators of the form \hat{x}_{t|t} = \hat{x}_{t|t-1} + K(y_t - C \hat{x}_{t|t-1}), the unique gain matrix that minimizes \text{tr}(P_{t|t}) = \mathbb{E}[\| x_t - \hat{x}_{t|t} \|^2 | y_{1:t-1}] is K^\star = P_{t|t-1} C^\top (C P_{t|t-1} C^\top + R)^{-1}. The minimum posterior covariance is P_{t|t} = (I - K^\star C) P_{t|t-1}.

Intuition

The innovation \nu_t = C(x_t - \hat{x}_{t|t-1}) + v_t has covariance S_t = C P_{t|t-1} C^\top + R. The Kalman gain weights the innovation by the cross-covariance between state error and innovation, P_{t|t-1} C^\top, divided by the innovation's own covariance. This is the standard regression coefficient of state error on innovation.

Proof Sketch

Let e = x_t - \hat{x}_{t|t-1} be the prediction error, with \mathbb{E}[e | y_{1:t-1}] = 0 and \text{Cov}(e | y_{1:t-1}) = P_{t|t-1}. Then \nu_t = Ce + v_t where v_t is independent of e. After updating with gain K, the residual is e - K \nu_t = e - K(Ce + v_t) = (I - KC)e - K v_t, so

P_{t|t}(K) = (I - KC) P_{t|t-1} (I - KC)^\top + K R K^\top.

This is a quadratic form in K. Taking \partial \text{tr}(P_{t|t}) / \partial K = 0 gives -2(I - KC) P_{t|t-1} C^\top + 2 K R = 0, which rearranges to K (C P_{t|t-1} C^\top + R) = P_{t|t-1} C^\top, hence K^\star = P_{t|t-1} C^\top S_t^{-1}.

Substituting K^\star back: P_{t|t} = (I - K^\star C) P_{t|t-1} (I - K^\star C)^\top + K^\star R (K^\star)^\top = (I - K^\star C) P_{t|t-1}, where the last equality uses K^\star R = (I - K^\star C) P_{t|t-1} C^\top (read off from the gradient condition).

Why It Matters

The MSE-minimization derivation does not require Gaussianity. Among linear estimators of x_t from y_{1:t}, the Kalman filter is optimal regardless of the noise distribution. Under Gaussianity it is also the conditional expectation (the global MMSE estimator); under non-Gaussian noise it is still the best linear unbiased estimator (BLUE). This is why the filter is so robust in engineering practice.

Failure Mode

K_t is computed from model matrices, not from data. If Q or R is misspecified, the filter still runs and its state estimate remains linear and unbiased, but it is suboptimal: it weights innovations incorrectly, and its reported covariance no longer matches the actual error (overconfident when the noise is understated). If the dynamics are nonlinear or the noise is heavy-tailed, the linear estimator is no longer MMSE; the EKF/UKF/particle filter use linearization or sampling to approximate the true posterior. If P_{t|t-1} becomes near-singular numerically, use the Joseph form P_{t|t} = (I - KC) P_{t|t-1} (I - KC)^\top + K R K^\top, which is symmetric and positive semidefinite by construction.
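For completeness, a sketch of the Joseph-form update mentioned above. The function name is ours; it takes a gain K that has already been computed, and it stays symmetric and positive semidefinite even if that gain is suboptimal.

```python
import numpy as np

def joseph_update(P_pred, K, C, R):
    """Joseph-form covariance update: numerically safer than (I - KC) P_pred."""
    IKC = np.eye(P_pred.shape[0]) - K @ C
    return IKC @ P_pred @ IKC.T + K @ R @ K.T
```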

The RTS Smoother

Smoothing computes p(x_t | y_{1:T}) for t < T: the posterior over a past state given the entire trajectory of observations. Smoothed estimates are tighter than filtered ones because future observations carry information about past states.

The Rauch-Tung-Striebel (RTS) smoother runs the Kalman filter forward to obtain \{\hat{x}_{t|t}, P_{t|t}\}_{t=1}^T and \{\hat{x}_{t|t-1}, P_{t|t-1}\}_{t=1}^T, then sweeps backward.

Theorem

Rauch-Tung-Striebel Smoother

Statement

Define the smoothed mean \hat{x}_{t|T} and covariance P_{t|T} by the backward recursion, starting from \hat{x}_{T|T}, P_{T|T} (the final filter output):

J_t = P_{t|t} A^\top P_{t+1|t}^{-1},

\hat{x}_{t|T} = \hat{x}_{t|t} + J_t (\hat{x}_{t+1|T} - \hat{x}_{t+1|t}),

P_{t|T} = P_{t|t} + J_t (P_{t+1|T} - P_{t+1|t}) J_t^\top.

These are the mean and covariance of the Gaussian posterior p(x_t | y_{1:T}).

Intuition

J_t is a backward Kalman gain: the regression coefficient of x_t on x_{t+1} given y_{1:t}, computed from the joint Gaussian \begin{pmatrix} x_t \\ x_{t+1} \end{pmatrix} \Big| y_{1:t} \sim \mathcal{N}\left(\begin{pmatrix} \hat{x}_{t|t} \\ \hat{x}_{t+1|t} \end{pmatrix}, \begin{pmatrix} P_{t|t} & P_{t|t} A^\top \\ A P_{t|t} & P_{t+1|t} \end{pmatrix}\right). Conditioning on x_{t+1} gives the Gaussian regression formula; replacing x_{t+1} with its smoothed mean and absorbing its smoothed covariance produces the RTS update.

Proof Sketch

Use the Markov property of the smoothed posterior: p(x_t | y_{1:T}) = \int p(x_t | x_{t+1}, y_{1:t}) \, p(x_{t+1} | y_{1:T}) \, dx_{t+1}. The first factor is Gaussian (joint Gaussian conditioning) with mean \hat{x}_{t|t} + J_t(x_{t+1} - \hat{x}_{t+1|t}) and covariance P_{t|t} - J_t A P_{t|t}. Marginalizing over x_{t+1} \sim \mathcal{N}(\hat{x}_{t+1|T}, P_{t+1|T}) recovers the RTS formulas.

Why It Matters

Smoothing matters whenever the offline trajectory is what you care about: post-hoc trajectory reconstruction, EM-based parameter learning (which needs \mathbb{E}[x_t | y_{1:T}] in the E-step), and any setting where you can afford to wait for future observations before reporting.

Failure Mode

The RTS smoother inherits all model-misspecification issues from the Kalman filter. It also needs P_{t+1|t} invertible at every step; with a singular dynamics matrix A and a rank-deficient process noise covariance Q, P_{t+1|t} may be rank-deficient, and you should switch to the information-form smoother or a square-root implementation.
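Putting the forward filter and the backward sweep together, here is an illustrative NumPy sketch for a time-invariant model with no control input. Names are ours, and the explicit inverses are for readability rather than numerical robustness.

```python
import numpy as np

def rts_smoother(ys, A, C, Q, R, mu0, Sigma0):
    """Forward Kalman filter followed by the RTS backward sweep (time-invariant model)."""
    T, n = len(ys), A.shape[0]
    x_filt, P_filt = np.zeros((T, n)), np.zeros((T, n, n))
    x_pred, P_pred = np.zeros((T, n)), np.zeros((T, n, n))
    x, P = mu0, Sigma0
    for t in range(T):                                                   # forward pass
        xp, Pp = A @ x, A @ P @ A.T + Q                                  # predict
        S = C @ Pp @ C.T + R
        K = Pp @ C.T @ np.linalg.inv(S)
        x, P = xp + K @ (ys[t] - C @ xp), (np.eye(n) - K @ C) @ Pp       # update
        x_pred[t], P_pred[t], x_filt[t], P_filt[t] = xp, Pp, x, P
    x_sm, P_sm = x_filt.copy(), P_filt.copy()
    for t in range(T - 2, -1, -1):                                       # backward RTS sweep
        J = P_filt[t] @ A.T @ np.linalg.inv(P_pred[t + 1])
        x_sm[t] = x_filt[t] + J @ (x_sm[t + 1] - x_pred[t + 1])
        P_sm[t] = P_filt[t] + J @ (P_sm[t + 1] - P_pred[t + 1]) @ J.T
    return x_sm, P_sm
```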

EM for Parameter Estimation

When the model matrices \theta = (A, C, Q, R, \mu_0, \Sigma_0) are unknown, the EM algorithm of Shumway and Stoffer (1982) alternates filtering/smoothing with closed-form M-steps.

E-step. Given current \theta^{(k)}, run the Kalman filter and RTS smoother to compute \hat{x}_{t|T}, P_{t|T}, and the lag-one smoothed cross-covariance P_{t,t-1|T}.

M-step. Maximize the expected complete-data log-likelihood. The complete-data log-likelihood is \log p(x_{0:T}, y_{1:T} | \theta) = \log p(x_0) + \sum_{t=1}^T \log p(x_t | x_{t-1}) + \sum_{t=1}^T \log p(y_t | x_t), which decomposes into Gaussian likelihoods quadratic in the parameters. Closed-form updates:

A^{(k+1)} = \left(\sum_{t=1}^T \mathbb{E}[x_t x_{t-1}^\top]\right)\left(\sum_{t=1}^T \mathbb{E}[x_{t-1} x_{t-1}^\top]\right)^{-1},

with analogous updates for C, Q, R. The expectations \mathbb{E}[x_t x_{t-1}^\top | y_{1:T}] come from the smoother. EM monotonically increases the marginal likelihood p(y_{1:T} | \theta).
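To illustrate how the smoother outputs feed the M-step, here is a sketch of the A update, assuming the E-step has already produced smoothed means, smoothed covariances, and lag-one cross-covariances (the lag-one recursion itself is not shown); the array layout is an assumption of the sketch.

```python
import numpy as np

def m_step_A(x_smooth, P_smooth, P_lag1):
    """EM update for the transition matrix A from smoothed second moments.

    x_smooth: (T+1, n) smoothed means for t = 0..T
    P_smooth: (T+1, n, n) smoothed covariances for t = 0..T
    P_lag1:   (T, n, n) lag-one cross-covariances Cov(x_t, x_{t-1} | y_{1:T}) for t = 1..T
    """
    n = x_smooth.shape[1]
    S10, S00 = np.zeros((n, n)), np.zeros((n, n))
    for t in range(1, x_smooth.shape[0]):
        S10 += P_lag1[t - 1] + np.outer(x_smooth[t], x_smooth[t - 1])        # E[x_t x_{t-1}^T | y_{1:T}]
        S00 += P_smooth[t - 1] + np.outer(x_smooth[t - 1], x_smooth[t - 1])  # E[x_{t-1} x_{t-1}^T | y_{1:T}]
    return S10 @ np.linalg.inv(S00)
```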

In practice, EM for state space models converges slowly near the optimum and is prone to poor local optima when the state dimension is large. Direct gradient-based maximization of the marginal log-likelihood (computed via the prediction-error decomposition \log p(y_{1:T}) = \sum_t \log \mathcal{N}(y_t; \hat{y}_{t|t-1}, S_t)) often converges faster.
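A sketch of that prediction-error decomposition, accumulating innovation log-densities inside a forward Kalman pass; names are ours, and SciPy supplies the Gaussian log-density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kalman_loglik(ys, A, C, Q, R, mu0, Sigma0):
    """Marginal log-likelihood log p(y_{1:T}) via the prediction-error decomposition."""
    n = A.shape[0]
    x, P, ll = mu0, Sigma0, 0.0
    for y in ys:
        xp, Pp = A @ x, A @ P @ A.T + Q                        # predict
        yhat, S = C @ xp, C @ Pp @ C.T + R
        ll += multivariate_normal.logpdf(y, mean=yhat, cov=S)  # log N(y_t; yhat_{t|t-1}, S_t)
        K = Pp @ C.T @ np.linalg.inv(S)
        x, P = xp + K @ (y - yhat), (np.eye(n) - K @ C) @ Pp   # update
    return ll
```

This is the objective a gradient-based fit would differentiate with respect to (A, C, Q, R), for example after reparametrizing Q and R through their Cholesky factors to keep them positive definite.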

ARIMA as State Space

Theorem

ARIMA-State Space Equivalence

Statement

Every ARIMA(p, d, q) process admits a linear Gaussian state space representation with state dimension r = \max(p+d, q+1). Conversely, every linear time-invariant state space model with scalar observation has an ARMA representation in the observation series.

Intuition

The state space form gives ARMA's recurrence a Markov reformulation: pack enough lags into the state vector to make the next state computable from the current one. Conversely, eliminating the state from (y_t = Cx_t + v_t, x_t = Ax_{t-1} + w_t) via the lag operator gives a rational transfer function, which is exactly what ARMA encodes.

Proof Sketch

Forward direction (Harvey 1989, Chapter 4). Consider ARMA(p, q) with r = \max(p, q+1). Define the state x_t = (X_t, X_{t+1|t}, \ldots, X_{t+r-1|t})^\top where X_{t+k|t} is the k-step-ahead linear forecast. Then x_{t+1} = A x_t + b \epsilon_{t+1}, \quad y_t = c^\top x_t, where A has companion-matrix structure with the AR coefficients on the bottom row, b = (\psi_0, \psi_1, \ldots, \psi_{r-1})^\top collects the impulse-response (MA(\infty)) coefficients with \psi_0 = 1, and c = (1, 0, \ldots, 0)^\top. For ARIMA, prepend d accumulator states.

Reverse direction. Eliminate x_t from the state equations: C(I - AL)^{-1} B in the lag operator gives a rational function \Theta(L)/\Phi(L), which is the ARMA transfer function on the noise.
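A sketch of the forward-direction construction for a zero-mean ARMA(p, q), building the forecast-state form described above. The function name and array conventions are ours; this is one of several equivalent representations.

```python
import numpy as np

def arma_to_state_space(phi, theta, sigma2=1.0):
    """Forecast-state form of a zero-mean ARMA(p, q): x_{t+1} = A x_t + b eps_{t+1}, y_t = c @ x_t."""
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_full = np.concatenate([phi, np.zeros(r - p)])           # AR coefficients padded to length r
    # Impulse-response (MA-infinity) coefficients psi_0..psi_{r-1}
    psi = np.zeros(r)
    psi[0] = 1.0
    for j in range(1, r):
        psi[j] = theta[j - 1] if j <= q else 0.0
        psi[j] += sum(phi[i] * psi[j - 1 - i] for i in range(min(j, p)))
    A = np.zeros((r, r))
    A[:-1, 1:] = np.eye(r - 1)                                  # shift structure
    A[-1, :] = phi_full[::-1]                                   # AR coefficients along the bottom row
    b = psi                                                     # noise loading
    c = np.zeros(r); c[0] = 1.0                                 # observe the first state component
    Q = sigma2 * np.outer(b, b)                                 # process noise covariance
    return A, b, c, Q
```

With this (A, c, Q), observation noise R = 0, and the stationary distribution as the initial state, the kalman_loglik sketch above evaluates the exact Gaussian ARMA likelihood in O(n).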

Why It Matters

The state space form lets you compute the Gaussian likelihood of an ARIMA model in O(n) via the Kalman filter, handle missing observations natively (just skip the update step), and incorporate exogenous regressors as control inputs. It is also how arima() in R and statsmodels in Python actually fit ARIMA: they run a Kalman filter on the state space form.

Failure Mode

The state dimension r = \max(p+d, q+1) grows with p+d, which can make the filter expensive for high-order models. The state space representation is non-unique (any invertible linear transformation of the state gives an equivalent model); this redundancy is irrelevant for the likelihood but matters when interpreting the state directly.

Connection to Modern SSMs (S4, Mamba)

Continuous-time linear state space models \dot{x}(t) = Ax(t) + Bu(t),\ y(t) = Cx(t) + Du(t) are the language of control theory. Discretizing them gives a discrete linear recurrence x_t = \bar{A} x_{t-1} + \bar{B} u_t,\ y_t = \bar{C} x_t. This is exactly the form used by S4 (Gu et al. 2022), where A is parametrized via the HiPPO matrix to give long-range memory, and Mamba (Gu and Dao 2024), which makes \bar{A}, \bar{B}, \bar{C} input-dependent.

The connection is structural, not coincidental: a learned linear SSM is doing the same thing the Kalman filter does (recursively summarize past observations into a fixed-size state) with a different parameterization. The classical theory tells you what the inductive bias of these architectures is — they encode bandlimited or rational transfer functions, with limits on what kinds of dependencies they can capture without nonlinearity. See the Mamba and state space models page for the architectural detail.
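A minimal sketch of the discretization step using the zero-order-hold rule, assuming A is invertible. S4 and Mamba use closely related bilinear/ZOH rules with their own parameterizations, so treat this as illustrative only.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization of x' = Ax + Bu into x_t = Abar x_{t-1} + Bbar u_t."""
    Abar = expm(A * dt)                                         # matrix exponential e^{A dt}
    Bbar = np.linalg.solve(A, (Abar - np.eye(A.shape[0])) @ B)  # A^{-1} (e^{A dt} - I) B, A invertible
    return Abar, Bbar
```

The output matrix is unchanged by the discretization, and the resulting time-invariant recurrence can be unrolled into a convolution with kernel (C Bbar, C Abar Bbar, C Abar^2 Bbar, ...), which is how S4 trains in parallel over the sequence.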

Common Confusions

Watch Out

The state is not unique

Two different state space models with state vectors related by an invertible linear transformation \tilde{x}_t = T x_t produce identical observation distributions. The state matrices change accordingly: \tilde{A} = T A T^{-1},\ \tilde{C} = C T^{-1}. This nonidentifiability is harmless for prediction and likelihood evaluation but fatal for interpretation. If you want the state to mean something physical (position, velocity), parametrize directly; if you only care about y_t, fit any equivalent representation.
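A small numerical illustration of this equivalence: transform an arbitrary model by an invertible T and note that the implied distribution over the observations is unchanged. The matrices are made up for the example, and the likelihood check in the comment refers to the kalman_loglik sketch from the EM section.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
Q, R = 0.1 * np.eye(2), np.array([[0.2]])
mu0, Sigma0 = np.zeros(2), np.eye(2)

T = rng.normal(size=(2, 2))                  # invertible with probability one
Tinv = np.linalg.inv(T)
A2, C2 = T @ A @ Tinv, C @ Tinv              # A~ = T A T^{-1}, C~ = C T^{-1}
Q2 = T @ Q @ T.T                             # process noise transforms as T Q T^T
mu0_2, Sigma0_2 = T @ mu0, T @ Sigma0 @ T.T  # initial state transforms the same way
# For any observed series ys, kalman_loglik(ys, A, C, Q, R, mu0, Sigma0) equals
# kalman_loglik(ys, A2, C2, Q2, R, mu0_2, Sigma0_2) up to floating-point error.
```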

Watch Out

P_{t|t} does not depend on the data

The covariance recursion P_{t|t} = (I - K_t C) P_{t|t-1} depends only on A, C, Q, R. The actual observations y_t enter only through the mean update. Consequently you can precompute \{P_{t|t}, K_t\} before seeing any data; the filter's "uncertainty trajectory" is a function of the model alone. Steady-state Kalman filters exploit this by precomputing K_\infty from the discrete algebraic Riccati equation.
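A sketch of obtaining the steady-state quantities by iterating the data-independent covariance recursion to its fixed point, rather than solving the algebraic Riccati equation directly; the function name and tolerances are ours.

```python
import numpy as np

def steady_state_gain(A, C, Q, R, max_iter=10_000, tol=1e-12):
    """Iterate the covariance recursion until P_{t|t} converges; return (K_inf, P_inf)."""
    n = A.shape[0]
    P, K = Q.copy(), None
    for _ in range(max_iter):
        P_pred = A @ P @ A.T + Q
        S = C @ P_pred @ C.T + R
        K = P_pred @ C.T @ np.linalg.inv(S)
        P_new = (np.eye(n) - K @ C) @ P_pred
        if np.max(np.abs(P_new - P)) < tol:
            return K, P_new
        P = P_new
    return K, P
```

For the scalar local-level model with unit process and observation noise variances (A = C = Q = R = [[1.0]]), this converges to K_\infty \approx 0.618, matching the closed-form expression in the first exercise below.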

Watch Out

Smoothing is not just running the filter backward

The RTS smoother is a forward filter followed by a backward sweep that uses smoothed future estimates. It is not the Kalman filter applied to the time-reversed series, because reversing the dynamics requires a different state-transition matrix and noise covariance derived from the original. The two-filter form (information filter forward + information filter backward, then combine) is an alternative implementation, but again with carefully derived backward dynamics.

Summary

  • State space form has a hidden Markov state x_t and a noisy observation y_t = Cx_t + v_t. The state carries everything needed from the past.
  • Kalman gain K_t = P_{t|t-1} C^\top S_t^{-1} minimizes the trace of P_{t|t}. The derivation does not require Gaussianity; it gives BLUE in general.
  • The RTS smoother sweeps backward to compute p(x_t | y_{1:T}), tighter than the filtered p(x_t | y_{1:t}).
  • EM with E-step = filter + smoother and closed-form M-step learns (A, C, Q, R) from data; gradient-based optimization of the prediction-error log-likelihood is often faster.
  • Every ARIMA has a state space representation, enabling O(n) likelihood computation via the Kalman filter.
  • Modern sequence models S4 and Mamba are discretized linear SSMs with learned dynamics; the classical theory describes their inductive bias.

Exercises

ExerciseCore

Problem

A scalar local-level model has x_t = x_{t-1} + w_t, w_t \sim \mathcal{N}(0, q), and y_t = x_t + v_t, v_t \sim \mathcal{N}(0, r). Show that the steady-state Kalman gain is K_\infty = \frac{-q + \sqrt{q^2 + 4qr}}{2r}. Show that K_\infty \to 1 as q/r \to \infty and K_\infty \to 0 as q/r \to 0.

ExerciseAdvanced

Problem

Write down a state space representation of the AR(2) model X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \epsilon_t. Verify that the observation series y_t = X_t recovers the original AR(2) recursion.

ExerciseAdvanced

Problem

Suppose you fit a Kalman filter with the wrong process noise covariance \tilde Q = \alpha Q for some \alpha > 0, but the correct observation matrices. Show that the filter's state estimate remains linear and unbiased but is no longer minimum-variance, and that its reported covariance \tilde P_{t|t} is wrong. What does the filter's over- or under-confidence look like as a function of \alpha?

References

Canonical:

  • Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering, 82(1), 1960. The original.
  • Anderson, B. D. O., Moore, J. B. Optimal Filtering. Prentice-Hall, 1979, Chapters 3-7.
  • Harvey, A. C. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989, Chapters 3-4.
  • Hamilton, J. D. Time Series Analysis, Princeton University Press, 1994, Chapter 13.
  • Rauch, H. E., Tung, F., Striebel, C. T. "Maximum Likelihood Estimates of Linear Dynamic Systems." AIAA Journal, 3(8), 1965 (RTS smoother).

Current:

  • Sarkka, S. Bayesian Filtering and Smoothing. Cambridge University Press, 2013, Chapters 4-6.
  • Shumway, R. H., Stoffer, D. S. Time Series Analysis and Its Applications, 4th ed., Springer, 2017, Chapter 6.
  • Murphy, K. P. Probabilistic Machine Learning: Advanced Topics, MIT Press, 2023, Chapter 8 (state space models and SSMs as sequence models).
  • Gu, A., Goel, K., Re, C. "Efficiently Modeling Long Sequences with Structured State Spaces (S4)," arXiv:2111.00396, 2022.
  • Gu, A., Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752, 2024.


Last reviewed: May 6, 2026
