

State Space Models

Linear state space form, the Kalman filter and RTS smoother, EM for parameter learning, and the ARIMA equivalence. The unifying framework behind classical filtering and modern Mamba/S4.

Advanced · Tier 2 · Stable · Core spine · ~65 min

Why This Matters

A state space model represents a sequence as the observation of a hidden, time-evolving state. The state carries everything you need from the past; given the state at time t, the future is conditionally independent of the past. This Markov factorization is what makes filtering, smoothing, and likelihood evaluation tractable.

Three reasons to care. First, every ARIMA model has a state space representation, and the Kalman filter computes the exact Gaussian likelihood in O(n) time, including with missing observations — try doing that with a moving-average regression. Second, state space is the language of control, robotics, finance, and target tracking; the Kalman filter is the workhorse of inertial navigation and GPS fusion. Third, the recent generation of sequence models (S4, Mamba) is literally a discretized linear state space model with learned dynamics, and the connection matters for understanding their inductive bias and computational properties.

State Space Form

Definition

Linear Gaussian State Space Model

A linear Gaussian state space model is a pair of equations:

x_t = A_t x_{t-1} + B_t u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q_t),

y_t = C_t x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R_t),

with initial state x_0 \sim \mathcal{N}(\mu_0, \Sigma_0). The first is the state equation (or transition equation); x_t \in \mathbb{R}^n is the hidden state, A_t is the transition matrix, u_t is a known control input, w_t is process noise. The second is the observation equation (or measurement equation); y_t \in \mathbb{R}^m is the observation, C_t is the observation matrix, v_t is measurement noise. The noise sequences \{w_t\}, \{v_t\} are independent of each other and of x_0.

The matrices A_t, B_t, C_t, Q_t, R_t may be time-varying; we write A, B, C, Q, R when they are time-invariant. The model has Markov structure: p(x_t | x_{0:t-1}, y_{1:t-1}) = p(x_t | x_{t-1}, u_t), and observations are conditionally independent given the state.
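To make the definition concrete, here is a minimal simulation sketch in NumPy, assuming a time-invariant model with no control input. The constant-velocity example at the bottom is illustrative only; the matrices are not taken from the text.

```python
import numpy as np

def simulate_lgssm(A, C, Q, R, mu0, Sigma0, T, seed=0):
    """Sample one trajectory from a time-invariant linear Gaussian SSM (no control input)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], C.shape[0]
    xs, ys = np.zeros((T + 1, n)), np.zeros((T, m))
    xs[0] = rng.multivariate_normal(mu0, Sigma0)                          # x_0 ~ N(mu0, Sigma0)
    for t in range(1, T + 1):
        xs[t] = A @ xs[t - 1] + rng.multivariate_normal(np.zeros(n), Q)   # state equation
        ys[t - 1] = C @ xs[t] + rng.multivariate_normal(np.zeros(m), R)   # observation equation
    return xs, ys

# Illustrative 2-state constant-velocity model, observed in position only.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
xs, ys = simulate_lgssm(A, C, Q, R, mu0=np.zeros(2), Sigma0=np.eye(2), T=100)
```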

The Filtering Problem

Filtering computes p(x_t | y_{1:t}), the posterior over the current state given observations up to now. Under the linear Gaussian assumptions this posterior is itself Gaussian; let \hat{x}_{t|t} = \mathbb{E}[x_t | y_{1:t}] and P_{t|t} = \text{Cov}(x_t | y_{1:t}).

The Kalman filter alternates two steps.

Prediction step (push the posterior forward through the dynamics):

\hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1} + B u_t,

P_{t|t-1} = A P_{t-1|t-1} A^\top + Q.

Update step (incorporate observation y_t):

y_t \text{ predicted: } \hat{y}_{t|t-1} = C \hat{x}_{t|t-1}, \qquad \text{innovation: } \nu_t = y_t - \hat{y}_{t|t-1},

S_t = C P_{t|t-1} C^\top + R, \qquad K_t = P_{t|t-1} C^\top S_t^{-1},

\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \nu_t, \qquad P_{t|t} = (I - K_t C) P_{t|t-1}.

K_t is the Kalman gain. The detailed Kalman filter page covers the recursion, examples, and the EKF/UKF extensions; here we focus on the gain derivation and the smoother, EM, and ARIMA equivalence.
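As a reference point, one prediction-plus-update iteration can be written directly from the equations above. This is an illustrative NumPy sketch with names of our choosing; a production implementation would solve against S_t rather than forming an explicit inverse.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y_t, A, C, Q, R, B=None, u_t=None):
    """One Kalman filter iteration: predict through the dynamics, then update on y_t."""
    # Prediction step
    x_pred = A @ x_prev + (B @ u_t if B is not None else 0.0)
    P_pred = A @ P_prev @ A.T + Q
    # Update step
    nu = y_t - C @ x_pred                    # innovation
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    x_filt = x_pred + K @ nu
    P_filt = (np.eye(len(x_prev)) - K @ C) @ P_pred
    return x_filt, P_filt, nu, S
```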

Deriving the Kalman Gain

The clean way to derive K_t is by minimizing the trace of P_{t|t}, which is the posterior MSE.

Theorem

Kalman Gain Minimizes Posterior MSE

Statement

Among all linear estimators of the form \hat{x}_{t|t} = \hat{x}_{t|t-1} + K(y_t - C \hat{x}_{t|t-1}), the unique gain matrix that minimizes \text{tr}(P_{t|t}) = \mathbb{E}[\| x_t - \hat{x}_{t|t} \|^2 | y_{1:t-1}] is K^\star = P_{t|t-1} C^\top (C P_{t|t-1} C^\top + R)^{-1}. The minimum posterior covariance is P_{t|t} = (I - K^\star C) P_{t|t-1}.

Intuition

The innovation \nu_t = C(x_t - \hat{x}_{t|t-1}) + v_t has covariance S_t = C P_{t|t-1} C^\top + R. The Kalman gain weights the innovation by the cross-covariance between state error and innovation, P_{t|t-1} C^\top, divided by the innovation's own covariance. This is the standard regression coefficient of state error on innovation.

Proof Sketch

Let e = x_t - \hat{x}_{t|t-1} be the prediction error, with \mathbb{E}[e | y_{1:t-1}] = 0 and \text{Cov}(e | y_{1:t-1}) = P_{t|t-1}. Then \nu_t = Ce + v_t where v_t is independent of e. After updating with gain K, the residual is e - K \nu_t = e - K(Ce + v_t) = (I - KC)e - K v_t, so

P_{t|t}(K) = (I - KC) P_{t|t-1} (I - KC)^\top + K R K^\top.

This is a quadratic form in K. Taking \partial \text{tr}(P_{t|t}) / \partial K = 0 gives -2(I - KC) P_{t|t-1} C^\top + 2 K R = 0, which rearranges to K (C P_{t|t-1} C^\top + R) = P_{t|t-1} C^\top, hence K^\star = P_{t|t-1} C^\top S_t^{-1}.

Substituting K^\star back: P_{t|t} = (I - K^\star C) P_{t|t-1} (I - K^\star C)^\top + K^\star R (K^\star)^\top = (I - K^\star C) P_{t|t-1}, where the last equality uses K^\star R = (I - K^\star C) P_{t|t-1} C^\top (read off from the gradient condition).

Why It Matters

The MSE-minimization derivation does not require Gaussianity. Among linear estimators of x_t from y_{1:t}, the Kalman filter is optimal regardless of the noise distribution. Under Gaussianity it is also the conditional expectation (the global MMSE estimator); under non-Gaussian noise it is still the best linear unbiased estimator (BLUE). This is why the filter is so robust in engineering practice.

Failure Mode

K_t is computed from model matrices, not from data. If Q or R is misspecified, the filter still runs and its state estimate remains linear and unbiased, but it is suboptimal: it weights innovations incorrectly, and its reported covariance no longer matches the actual error (overconfident when the noise is understated). If the dynamics are nonlinear or the noise is heavy-tailed, the linear estimator is no longer MMSE; the EKF/UKF/particle filter use linearization or sampling to approximate the true posterior. If P_{t|t-1} becomes near-singular numerically, use the Joseph form P_{t|t} = (I - KC) P_{t|t-1} (I - KC)^\top + K R K^\top, which is symmetric and positive semidefinite by construction.
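For completeness, a sketch of the Joseph-form update mentioned above. The function name is ours; it takes a gain K that has already been computed, and it stays symmetric and positive semidefinite even if that gain is suboptimal.

```python
import numpy as np

def joseph_update(P_pred, K, C, R):
    """Joseph-form covariance update: numerically safer than (I - KC) P_pred."""
    IKC = np.eye(P_pred.shape[0]) - K @ C
    return IKC @ P_pred @ IKC.T + K @ R @ K.T
```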

The RTS Smoother

Smoothing computes p(x_t | y_{1:T}) for t < T: the posterior over a past state given the entire trajectory of observations. Smoothed estimates are tighter than filtered ones because future observations carry information about past states.

The Rauch-Tung-Striebel (RTS) smoother runs the Kalman filter forward to obtain \{\hat{x}_{t|t}, P_{t|t}\}_{t=1}^T and \{\hat{x}_{t|t-1}, P_{t|t-1}\}_{t=1}^T, then sweeps backward.

Theorem

Rauch-Tung-Striebel Smoother

Statement

Define the smoothed mean \hat{x}_{t|T} and covariance P_{t|T} by the backward recursion, starting from \hat{x}_{T|T}, P_{T|T} (the final filter output):

J_t = P_{t|t} A^\top P_{t+1|t}^{-1},

\hat{x}_{t|T} = \hat{x}_{t|t} + J_t (\hat{x}_{t+1|T} - \hat{x}_{t+1|t}),

P_{t|T} = P_{t|t} + J_t (P_{t+1|T} - P_{t+1|t}) J_t^\top.

These are the mean and covariance of the Gaussian posterior p(x_t | y_{1:T}).

Intuition

J_t is a backward Kalman gain: the regression coefficient of x_t on x_{t+1} given y_{1:t}, computed from the joint Gaussian \begin{pmatrix} x_t \\ x_{t+1} \end{pmatrix} \Big| y_{1:t} \sim \mathcal{N}\left(\begin{pmatrix} \hat{x}_{t|t} \\ \hat{x}_{t+1|t} \end{pmatrix}, \begin{pmatrix} P_{t|t} & P_{t|t} A^\top \\ A P_{t|t} & P_{t+1|t} \end{pmatrix}\right). Conditioning on x_{t+1} gives the Gaussian regression formula; replacing x_{t+1} with its smoothed mean and absorbing its smoothed covariance produces the RTS update.

Proof Sketch

Use the Markov property of the smoothed posterior: p(x_t | y_{1:T}) = \int p(x_t | x_{t+1}, y_{1:t}) \, p(x_{t+1} | y_{1:T}) \, dx_{t+1}. The first factor is Gaussian (joint Gaussian conditioning) with mean \hat{x}_{t|t} + J_t(x_{t+1} - \hat{x}_{t+1|t}) and covariance P_{t|t} - J_t A P_{t|t}. Marginalizing over x_{t+1} \sim \mathcal{N}(\hat{x}_{t+1|T}, P_{t+1|T}) recovers the RTS formulas.

Why It Matters

Smoothing matters whenever the offline trajectory is what you care about: post-hoc trajectory reconstruction, EM-based parameter learning (which needs \mathbb{E}[x_t | y_{1:T}] in the E-step), and any setting where you can afford to wait for future observations before reporting.

Failure Mode

The RTS smoother inherits all model-misspecification issues from the Kalman filter. It also needs P_{t+1|t} invertible at every step; with a singular dynamics matrix A and a rank-deficient process noise covariance Q, P_{t+1|t} may be rank-deficient, and you should switch to the information-form smoother or a square-root implementation.
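Putting the forward filter and the backward sweep together, here is an illustrative NumPy sketch for a time-invariant model with no control input. Names are ours, and the explicit inverses are for readability rather than numerical robustness.

```python
import numpy as np

def rts_smoother(ys, A, C, Q, R, mu0, Sigma0):
    """Forward Kalman filter followed by the RTS backward sweep (time-invariant model)."""
    T, n = len(ys), A.shape[0]
    x_filt, P_filt = np.zeros((T, n)), np.zeros((T, n, n))
    x_pred, P_pred = np.zeros((T, n)), np.zeros((T, n, n))
    x, P = mu0, Sigma0
    for t in range(T):                                                   # forward pass
        xp, Pp = A @ x, A @ P @ A.T + Q                                  # predict
        S = C @ Pp @ C.T + R
        K = Pp @ C.T @ np.linalg.inv(S)
        x, P = xp + K @ (ys[t] - C @ xp), (np.eye(n) - K @ C) @ Pp       # update
        x_pred[t], P_pred[t], x_filt[t], P_filt[t] = xp, Pp, x, P
    x_sm, P_sm = x_filt.copy(), P_filt.copy()
    for t in range(T - 2, -1, -1):                                       # backward RTS sweep
        J = P_filt[t] @ A.T @ np.linalg.inv(P_pred[t + 1])
        x_sm[t] = x_filt[t] + J @ (x_sm[t + 1] - x_pred[t + 1])
        P_sm[t] = P_filt[t] + J @ (P_sm[t + 1] - P_pred[t + 1]) @ J.T
    return x_sm, P_sm
```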

EM for Parameter Estimation

When the model matrices \theta = (A, C, Q, R, \mu_0, \Sigma_0) are unknown, the EM algorithm of Shumway and Stoffer (1982) alternates filtering/smoothing with closed-form M-steps.

E-step. Given current \theta^{(k)}, run the Kalman filter and RTS smoother to compute \hat{x}_{t|T}, P_{t|T}, and the lag-one smoothed cross-covariance P_{t,t-1|T}.

M-step. Maximize the expected complete-data log-likelihood. The complete-data log-likelihood is \log p(x_{0:T}, y_{1:T} | \theta) = \log p(x_0) + \sum_{t=1}^T \log p(x_t | x_{t-1}) + \sum_{t=1}^T \log p(y_t | x_t), which decomposes into Gaussian likelihoods quadratic in the parameters. Closed-form updates:

A^{(k+1)} = \left(\sum_{t=1}^T \mathbb{E}[x_t x_{t-1}^\top]\right)\left(\sum_{t=1}^T \mathbb{E}[x_{t-1} x_{t-1}^\top]\right)^{-1},

with analogous updates for C, Q, R. The expectations \mathbb{E}[x_t x_{t-1}^\top | y_{1:T}] come from the smoother. EM monotonically increases the marginal likelihood p(y_{1:T} | \theta).
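To illustrate how the smoother outputs feed the M-step, here is a sketch of the A update, assuming the E-step has already produced smoothed means, smoothed covariances, and lag-one cross-covariances (the lag-one recursion itself is not shown); the array layout is an assumption of the sketch.

```python
import numpy as np

def m_step_A(x_smooth, P_smooth, P_lag1):
    """EM update for the transition matrix A from smoothed second moments.

    x_smooth: (T+1, n) smoothed means for t = 0..T
    P_smooth: (T+1, n, n) smoothed covariances for t = 0..T
    P_lag1:   (T, n, n) lag-one cross-covariances Cov(x_t, x_{t-1} | y_{1:T}) for t = 1..T
    """
    n = x_smooth.shape[1]
    S10, S00 = np.zeros((n, n)), np.zeros((n, n))
    for t in range(1, x_smooth.shape[0]):
        S10 += P_lag1[t - 1] + np.outer(x_smooth[t], x_smooth[t - 1])        # E[x_t x_{t-1}^T | y_{1:T}]
        S00 += P_smooth[t - 1] + np.outer(x_smooth[t - 1], x_smooth[t - 1])  # E[x_{t-1} x_{t-1}^T | y_{1:T}]
    return S10 @ np.linalg.inv(S00)
```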

In practice, EM for state space models converges slowly near the optimum and is prone to poor local optima when the state dimension is large. Direct gradient-based maximization of the marginal log-likelihood (computed via the prediction-error decomposition \log p(y_{1:T}) = \sum_t \log \mathcal{N}(y_t; \hat{y}_{t|t-1}, S_t)) often converges faster.
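A sketch of that prediction-error decomposition, accumulating innovation log-densities inside a forward Kalman pass; names are ours, and SciPy supplies the Gaussian log-density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kalman_loglik(ys, A, C, Q, R, mu0, Sigma0):
    """Marginal log-likelihood log p(y_{1:T}) via the prediction-error decomposition."""
    n = A.shape[0]
    x, P, ll = mu0, Sigma0, 0.0
    for y in ys:
        xp, Pp = A @ x, A @ P @ A.T + Q                        # predict
        yhat, S = C @ xp, C @ Pp @ C.T + R
        ll += multivariate_normal.logpdf(y, mean=yhat, cov=S)  # log N(y_t; yhat_{t|t-1}, S_t)
        K = Pp @ C.T @ np.linalg.inv(S)
        x, P = xp + K @ (y - yhat), (np.eye(n) - K @ C) @ Pp   # update
    return ll
```

This is the objective a gradient-based fit would differentiate with respect to (A, C, Q, R), for example after reparametrizing Q and R through their Cholesky factors to keep them positive definite.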

ARIMA as State Space

Theorem

ARIMA-State Space Equivalence

Statement

Every ARIMA(p, d, q) process admits a linear Gaussian state space representation with state dimension r = \max(p+d, q+1). Conversely, every linear time-invariant state space model with scalar observation has an ARMA representation in the observation series.

Intuition

The state space form gives ARMA's recurrence a Markov reformulation: pack enough lags into the state vector to make the next state computable from the current one. Conversely, eliminating the state from (y_t = Cx_t + v_t, x_t = Ax_{t-1} + w_t) via the lag operator gives a rational transfer function, which is exactly what ARMA encodes.

Proof Sketch

Forward direction (Harvey 1989, Chapter 4). Consider ARMA(p, q) with r = \max(p, q+1). Define the state x_t = (X_t, X_{t+1|t}, \ldots, X_{t+r-1|t})^\top where X_{t+k|t} is the k-step-ahead linear forecast. Then x_{t+1} = A x_t + b \epsilon_{t+1}, \quad y_t = c^\top x_t, where A has companion-matrix structure with the AR coefficients on the bottom row, b = (\psi_0, \psi_1, \ldots, \psi_{r-1})^\top collects the impulse-response (MA(\infty)) coefficients with \psi_0 = 1, and c = (1, 0, \ldots, 0)^\top. For ARIMA, prepend d accumulator states.

Reverse direction. Eliminate x_t from the state equations: C(I - AL)^{-1} B in the lag operator gives a rational function \Theta(L)/\Phi(L), which is the ARMA transfer function on the noise.
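A sketch of the forward-direction construction for a zero-mean ARMA(p, q), building the forecast-state form described above. The function name and array conventions are ours; this is one of several equivalent representations.

```python
import numpy as np

def arma_to_state_space(phi, theta, sigma2=1.0):
    """Forecast-state form of a zero-mean ARMA(p, q): x_{t+1} = A x_t + b eps_{t+1}, y_t = c @ x_t."""
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_full = np.concatenate([phi, np.zeros(r - p)])           # AR coefficients padded to length r
    # Impulse-response (MA-infinity) coefficients psi_0..psi_{r-1}
    psi = np.zeros(r)
    psi[0] = 1.0
    for j in range(1, r):
        psi[j] = theta[j - 1] if j <= q else 0.0
        psi[j] += sum(phi[i] * psi[j - 1 - i] for i in range(min(j, p)))
    A = np.zeros((r, r))
    A[:-1, 1:] = np.eye(r - 1)                                  # shift structure
    A[-1, :] = phi_full[::-1]                                   # AR coefficients along the bottom row
    b = psi                                                     # noise loading
    c = np.zeros(r); c[0] = 1.0                                 # observe the first state component
    Q = sigma2 * np.outer(b, b)                                 # process noise covariance
    return A, b, c, Q
```

With this (A, c, Q), observation noise R = 0, and the stationary distribution as the initial state, the kalman_loglik sketch above evaluates the exact Gaussian ARMA likelihood in O(n).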

Why It Matters

The state space form lets you compute the Gaussian likelihood of an ARIMA model in O(n) via the Kalman filter, handle missing observations natively (just skip the update step), and incorporate exogenous regressors as control inputs. It is also how arima() in R and statsmodels in Python actually fit ARIMA: they run a Kalman filter on the state space form.

Failure Mode

The state dimension r = \max(p+d, q+1) grows with p+d, which can make the filter expensive for high-order models. The state space representation is non-unique (any invertible linear transformation of the state gives an equivalent model); this redundancy is irrelevant for the likelihood but matters when interpreting the state directly.

Connection to Modern SSMs (S4, Mamba)

Continuous-time linear state space models \dot{x}(t) = Ax(t) + Bu(t),\ y(t) = Cx(t) + Du(t) are the language of control theory. Discretizing them gives a discrete linear recurrence x_t = \bar{A} x_{t-1} + \bar{B} u_t,\ y_t = \bar{C} x_t. This is exactly the form used by S4 (Gu et al. 2022), where A is parametrized via the HiPPO matrix to give long-range memory, and Mamba (Gu and Dao 2024), which makes \bar{A}, \bar{B}, \bar{C} input-dependent.

The connection is structural, not coincidental: a learned linear SSM is doing the same thing the Kalman filter does (recursively summarize past observations into a fixed-size state) with a different parameterization. The classical theory tells you what the inductive bias of these architectures is — they encode bandlimited or rational transfer functions, with limits on what kinds of dependencies they can capture without nonlinearity. See the Mamba and state space models page for the architectural detail.
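A minimal sketch of the discretization step using the zero-order-hold rule, assuming A is invertible. S4 and Mamba use closely related bilinear/ZOH rules with their own parameterizations, so treat this as illustrative only.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization of x' = Ax + Bu into x_t = Abar x_{t-1} + Bbar u_t."""
    Abar = expm(A * dt)                                         # matrix exponential e^{A dt}
    Bbar = np.linalg.solve(A, (Abar - np.eye(A.shape[0])) @ B)  # A^{-1} (e^{A dt} - I) B, A invertible
    return Abar, Bbar
```

The output matrix is unchanged by the discretization, and the resulting time-invariant recurrence can be unrolled into a convolution with kernel (C Bbar, C Abar Bbar, C Abar^2 Bbar, ...), which is how S4 trains in parallel over the sequence.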

Common Confusions

Watch Out

The state is not unique

Two different state space models with state vectors related by an invertible linear transformation \tilde{x}_t = T x_t produce identical observation distributions. The state matrices change accordingly: \tilde{A} = T A T^{-1},\ \tilde{C} = C T^{-1}. This nonidentifiability is harmless for prediction and likelihood evaluation but fatal for interpretation. If you want the state to mean something physical (position, velocity), parametrize directly; if you only care about y_t, fit any equivalent representation.
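A small numerical illustration of this equivalence: transform an arbitrary model by an invertible T and note that the implied distribution over the observations is unchanged. The matrices are made up for the example, and the likelihood check in the comment refers to the kalman_loglik sketch from the EM section.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
Q, R = 0.1 * np.eye(2), np.array([[0.2]])
mu0, Sigma0 = np.zeros(2), np.eye(2)

T = rng.normal(size=(2, 2))                  # invertible with probability one
Tinv = np.linalg.inv(T)
A2, C2 = T @ A @ Tinv, C @ Tinv              # A~ = T A T^{-1}, C~ = C T^{-1}
Q2 = T @ Q @ T.T                             # process noise transforms as T Q T^T
mu0_2, Sigma0_2 = T @ mu0, T @ Sigma0 @ T.T  # initial state transforms the same way
# For any observed series ys, kalman_loglik(ys, A, C, Q, R, mu0, Sigma0) equals
# kalman_loglik(ys, A2, C2, Q2, R, mu0_2, Sigma0_2) up to floating-point error.
```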

Watch Out

P_{t|t} does not depend on the data

The covariance recursion P_{t|t} = (I - K_t C) P_{t|t-1} depends only on A, C, Q, R. The actual observations y_t enter only through the mean update. Consequently you can precompute \{P_{t|t}, K_t\} before seeing any data; the filter's "uncertainty trajectory" is a function of the model alone. Steady-state Kalman filters exploit this by precomputing K_\infty from the discrete algebraic Riccati equation.
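A sketch of obtaining the steady-state quantities by iterating the data-independent covariance recursion to its fixed point, rather than solving the algebraic Riccati equation directly; the function name and tolerances are ours.

```python
import numpy as np

def steady_state_gain(A, C, Q, R, max_iter=10_000, tol=1e-12):
    """Iterate the covariance recursion until P_{t|t} converges; return (K_inf, P_inf)."""
    n = A.shape[0]
    P, K = Q.copy(), None
    for _ in range(max_iter):
        P_pred = A @ P @ A.T + Q
        S = C @ P_pred @ C.T + R
        K = P_pred @ C.T @ np.linalg.inv(S)
        P_new = (np.eye(n) - K @ C) @ P_pred
        if np.max(np.abs(P_new - P)) < tol:
            return K, P_new
        P = P_new
    return K, P
```

For the scalar local-level model with unit process and observation noise variances (A = C = Q = R = [[1.0]]), this converges to K_\infty \approx 0.618, matching the closed-form expression in the first exercise below.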

Watch Out

Smoothing is not just running the filter backward

The RTS smoother is a forward filter followed by a backward sweep that uses smoothed future estimates. It is not the Kalman filter applied to the time-reversed series, because reversing the dynamics requires a different state-transition matrix and noise covariance derived from the original. The two-filter form (information filter forward + information filter backward, then combine) is an alternative implementation, but again with carefully derived backward dynamics.

Summary

  • State space form has a hidden Markov state x_t and a noisy observation y_t = Cx_t + v_t. The state carries everything needed from the past.
  • Kalman gain K_t = P_{t|t-1} C^\top S_t^{-1} minimizes the trace of P_{t|t}. The derivation does not require Gaussianity; it gives BLUE in general.
  • The RTS smoother sweeps backward to compute p(x_t | y_{1:T}), tighter than the filtered p(x_t | y_{1:t}).
  • EM with E-step = filter + smoother and closed-form M-step learns (A, C, Q, R) from data; gradient-based optimization of the prediction-error log-likelihood is often faster.
  • Every ARIMA has a state space representation, enabling O(n) likelihood computation via the Kalman filter.
  • Modern sequence models S4 and Mamba are discretized linear SSMs with learned dynamics; the classical theory describes their inductive bias.

Exercises

ExerciseCore

Problem

A scalar local-level model has x_t = x_{t-1} + w_t, w_t \sim \mathcal{N}(0, q), and y_t = x_t + v_t, v_t \sim \mathcal{N}(0, r). Show that the steady-state Kalman gain is K_\infty = \frac{-q + \sqrt{q^2 + 4qr}}{2r}. Show that K_\infty \to 1 as q/r \to \infty and K_\infty \to 0 as q/r \to 0.

ExerciseAdvanced

Problem

Write down a state space representation of the AR(2) model X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \epsilon_t. Verify that the observation series y_t = X_t recovers the original AR(2) recursion.

ExerciseAdvanced

Problem

Suppose you fit a Kalman filter with the wrong process noise covariance \tilde Q = \alpha Q for some \alpha > 0, but the correct observation matrices. Show that the filter's state estimate remains linear and unbiased but is no longer minimum-variance, and that its reported covariance \tilde P_{t|t} is wrong. What does the filter's over- or under-confidence look like as a function of \alpha?

References

Canonical:

  • Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering, 82(1), 1960. The original.
  • Anderson, B. D. O., Moore, J. B. Optimal Filtering. Prentice-Hall, 1979, Chapters 3-7.
  • Harvey, A. C. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989, Chapters 3-4.
  • Hamilton, J. D. Time Series Analysis, Princeton University Press, 1994, Chapter 13.
  • Rauch, H. E., Tung, F., Striebel, C. T. "Maximum Likelihood Estimates of Linear Dynamic Systems." AIAA Journal, 3(8), 1965 (RTS smoother).

Current:

  • Sarkka, S. Bayesian Filtering and Smoothing. Cambridge University Press, 2013, Chapters 4-6.
  • Shumway, R. H., Stoffer, D. S. Time Series Analysis and Its Applications, 4th ed., Springer, 2017, Chapter 6.
  • Murphy, K. P. Probabilistic Machine Learning: Advanced Topics, MIT Press, 2023, Chapter 8 (state space models and SSMs as sequence models).
  • Gu, A., Goel, K., Re, C. "Efficiently Modeling Long Sequences with Structured State Spaces (S4)," arXiv:2111.00396, 2022.
  • Gu, A., Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752, 2024.


Last reviewed: May 6, 2026
