Applied Math
State Space Models
Linear state space form, the Kalman filter and RTS smoother, EM for parameter learning, and the ARIMA equivalence. The unifying framework behind classical filtering and modern Mamba/S4.
Prerequisites
Why This Matters
A state space model represents a sequence as the observation of a hidden, time-evolving state. The state carries everything you need from the past; given the state at time $t$, the future is conditionally independent of the past. This Markov factorization is what makes filtering, smoothing, and likelihood evaluation tractable.
Three reasons to care. First, every ARIMA model has a state space representation, and the Kalman filter computes the exact Gaussian likelihood in $O(n)$ time, including with missing observations — try doing that with a moving-average regression. Second, state space is the language of control, robotics, finance, and target tracking; the Kalman filter is the workhorse of inertial navigation and GPS fusion. Third, the recent generation of sequence models (S4, Mamba) is literally a discretized linear state space model with learned dynamics, and the connection matters for understanding their inductive bias and computational properties.
State Space Form
Linear Gaussian State Space Model
A linear Gaussian state space model is a pair of equations:

$$x_t = A_t x_{t-1} + B_t u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q_t),$$
$$y_t = H_t x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R_t),$$

with initial state $x_0 \sim \mathcal{N}(\mu_0, \Sigma_0)$. The first is the state equation (or transition equation); $x_t \in \mathbb{R}^n$ is the hidden state, $A_t$ is the transition matrix, $B_t u_t$ is a known control input, and $w_t$ is process noise. The second is the observation equation (or measurement equation); $y_t \in \mathbb{R}^m$ is the observation, $H_t$ is the observation matrix, and $v_t$ is measurement noise. The noise sequences are independent of each other and of $x_0$.
The matrices may be time-varying; we write $(A, B, H, Q, R)$ when they are time-invariant. The model has Markov structure: $p(x_t \mid x_{0:t-1}, y_{1:t-1}) = p(x_t \mid x_{t-1})$, and observations are conditionally independent given the state.
The Filtering Problem
Filtering computes $p(x_t \mid y_{1:t})$, the posterior over the current state given observations up to now. Under the linear Gaussian assumptions this posterior is itself Gaussian; let $\hat{x}_{t|t} = \mathbb{E}[x_t \mid y_{1:t}]$ and $P_{t|t} = \operatorname{Cov}(x_t \mid y_{1:t})$.
The Kalman filter alternates two steps.
Prediction step (push the posterior forward through the dynamics):

$$\hat{x}_{t|t-1} = A_t \hat{x}_{t-1|t-1} + B_t u_t, \qquad P_{t|t-1} = A_t P_{t-1|t-1} A_t^\top + Q_t.$$
Update step (incorporate observation $y_t$):

$$K_t = P_{t|t-1} H_t^\top \big(H_t P_{t|t-1} H_t^\top + R_t\big)^{-1},$$
$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \big(y_t - H_t \hat{x}_{t|t-1}\big), \qquad P_{t|t} = (I - K_t H_t)\, P_{t|t-1}.$$
$K_t$ is the Kalman gain. The detailed Kalman filter page covers the recursion, examples, and the EKF/UKF extensions; here we focus on the gain derivation and the smoother, EM, and ARIMA equivalence.
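The two steps above translate directly into code. A minimal NumPy sketch, assuming a time-invariant model with no control input, transition matrix `A`, observation matrix `H`, and noise covariances `Q`, `R` (illustrative only — no Joseph form or square-root safeguards):

```python
import numpy as np

def kalman_filter(y, A, H, Q, R, x0, P0):
    """Filter for x_t = A x_{t-1} + w_t, y_t = H x_t + v_t.

    y: (T, m) observations. Returns filtered means (T, n) and covariances (T, n, n).
    """
    T, n = y.shape[0], x0.shape[0]
    xs = np.zeros((T, n))
    Ps = np.zeros((T, n, n))
    x, P = x0, P0
    for t in range(T):
        # Prediction: push the previous posterior through the dynamics.
        x_pred = A @ x
        P_pred = A @ P @ A.T + Q
        # Update: weight the innovation by the Kalman gain.
        S = H @ P_pred @ H.T + R              # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
        x = x_pred + K @ (y[t] - H @ x_pred)
        P = (np.eye(n) - K @ H) @ P_pred
        xs[t], Ps[t] = x, P
    return xs, Ps
```

On a scalar random-walk-plus-noise model the covariance sequence settles quickly to a steady state, which is the Riccati fixed point discussed below.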
Deriving the Kalman Gain
The clean way to derive $K_t$ is by minimizing the trace of $P_{t|t}$, which is the posterior MSE $\mathbb{E}\big[\lVert x_t - \hat{x}_{t|t} \rVert^2 \mid y_{1:t-1}\big]$.
Kalman Gain Minimizes Posterior MSE
Statement
Among all linear estimators of the form $\hat{x}_{t|t} = \hat{x}_{t|t-1} + K\big(y_t - H_t \hat{x}_{t|t-1}\big)$, the unique gain matrix that minimizes $\operatorname{tr}(P_{t|t})$ is
$$K_t = P_{t|t-1} H_t^\top \big(H_t P_{t|t-1} H_t^\top + R_t\big)^{-1}.$$
The minimum posterior covariance is $P_{t|t} = (I - K_t H_t)\, P_{t|t-1}$.
Intuition
The innovation $\tilde{y}_t = y_t - H_t \hat{x}_{t|t-1}$ has covariance $S_t = H_t P_{t|t-1} H_t^\top + R_t$. The Kalman gain weights the innovation by the cross-covariance between state error and innovation, $P_{t|t-1} H_t^\top$, divided by the innovation's own covariance $S_t$. This is the standard regression coefficient of state error on innovation.
Proof Sketch
Let $e_t = x_t - \hat{x}_{t|t-1}$ be the prediction error, with $\mathbb{E}[e_t] = 0$ and $\operatorname{Cov}(e_t) = P_{t|t-1}$. Then $\tilde{y}_t = H_t e_t + v_t$, where $v_t$ is independent of $e_t$. After updating with gain $K$, the residual is $x_t - \hat{x}_{t|t} = (I - K H_t)\, e_t - K v_t$, so
$$P_{t|t}(K) = (I - K H_t)\, P_{t|t-1}\, (I - K H_t)^\top + K R_t K^\top.$$
This is a quadratic form in $K$. Taking $\partial \operatorname{tr} P_{t|t}(K) / \partial K = 0$ gives
$$-2\,(I - K H_t)\, P_{t|t-1} H_t^\top + 2\, K R_t = 0,$$
which rearranges to $K \big(H_t P_{t|t-1} H_t^\top + R_t\big) = P_{t|t-1} H_t^\top$, hence $K_t = P_{t|t-1} H_t^\top S_t^{-1}$.
Substituting back: $P_{t|t} = (I - K_t H_t)\, P_{t|t-1}\, (I - K_t H_t)^\top + K_t R_t K_t^\top = (I - K_t H_t)\, P_{t|t-1}$, where the last equality uses $K_t S_t = P_{t|t-1} H_t^\top$ (read off from the gradient condition).
Why It Matters
The MSE-minimization derivation does not require Gaussianity. Among linear estimators of $x_t$ from $y_{1:t}$, the Kalman filter is optimal regardless of the noise distribution. Under Gaussianity it is also the conditional expectation (the global MMSE estimator); under non-Gaussian noise it is still the best linear unbiased estimator (BLUE). This is why the filter is so robust in engineering practice.
Failure Mode
$K_t$ is computed from model matrices, not from data. If $Q_t$ or $R_t$ is misspecified, the filter remains unbiased but suboptimal: it weights innovations incorrectly and reports a covariance that no longer matches the actual error, typically an overconfident one. If the dynamics are nonlinear or the noise is heavy-tailed, the linear estimator is no longer MMSE; the EKF/UKF/particle filter use linearization or sampling to approximate the true posterior. If $P_{t|t}$ loses symmetry or positive definiteness numerically, use the Joseph form
$$P_{t|t} = (I - K_t H_t)\, P_{t|t-1}\, (I - K_t H_t)^\top + K_t R_t K_t^\top,$$
which is symmetric and positive definite by construction.
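The difference between the short-form update and the Joseph form can be checked numerically. A toy sketch with made-up matrices: at the optimal gain the two agree, while at a perturbed (suboptimal) gain only the Joseph form remains a valid symmetric positive definite covariance.

```python
import numpy as np

# Posterior covariance two ways, with a deliberately perturbed gain. The Joseph
# form is the residual covariance for ANY gain; the short form (I - KH)P_pred
# is only correct at the optimal Kalman gain.
P_pred = np.array([[2.0, 0.9], [0.9, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
S = H @ P_pred @ H.T + R
K_opt = P_pred @ H.T @ np.linalg.inv(S)
K_bad = K_opt + 0.4                      # perturbed, suboptimal gain

def joseph(P, K, H, R):
    IKH = np.eye(2) - K @ H
    return IKH @ P @ IKH.T + K @ R @ K.T

def short_form(P, K, H):
    return (np.eye(2) - K @ H) @ P

# At the optimal gain the two forms agree.
assert np.allclose(joseph(P_pred, K_opt, H, R), short_form(P_pred, K_opt, H))
# At the perturbed gain the short form is not even symmetric, while the
# Joseph form is still a symmetric positive definite covariance.
P_j = joseph(P_pred, K_bad, H, R)
assert np.allclose(P_j, P_j.T)
assert np.all(np.linalg.eigvalsh(P_j) > 0)
assert not np.allclose(short_form(P_pred, K_bad, H), short_form(P_pred, K_bad, H).T)
```

This is why square-root and Joseph-form implementations are preferred when $R_t$ is small or $P$ is ill-conditioned.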
The RTS Smoother
Smoothing computes $p(x_t \mid y_{1:T})$ for $t < T$: the posterior over a past state given the entire trajectory of observations. Smoothed estimates are tighter than filtered ones because future observations carry information about past states.
The Rauch-Tung-Striebel (RTS) smoother runs the Kalman filter forward to obtain the filtered moments $\hat{x}_{t|t}, P_{t|t}$ and the one-step predictions $\hat{x}_{t+1|t}, P_{t+1|t}$, then sweeps backward.
Rauch-Tung-Striebel Smoother
Statement
Define the smoothed mean and covariance by the backward recursion, starting from $\hat{x}_{T|T}, P_{T|T}$ (the final filter output):
$$J_t = P_{t|t} A_{t+1}^\top P_{t+1|t}^{-1},$$
$$\hat{x}_{t|T} = \hat{x}_{t|t} + J_t \big(\hat{x}_{t+1|T} - \hat{x}_{t+1|t}\big), \qquad P_{t|T} = P_{t|t} + J_t \big(P_{t+1|T} - P_{t+1|t}\big) J_t^\top.$$
These equal the Gaussian posterior: $p(x_t \mid y_{1:T}) = \mathcal{N}\big(\hat{x}_{t|T}, P_{t|T}\big)$.
Intuition
$J_t$ is a backward Kalman gain: the regression coefficient of $x_t$ on $x_{t+1}$ given $y_{1:t}$, computed from the joint Gaussian
$$\begin{pmatrix} x_t \\ x_{t+1} \end{pmatrix} \,\Big|\, y_{1:t} \sim \mathcal{N}\!\left( \begin{pmatrix} \hat{x}_{t|t} \\ \hat{x}_{t+1|t} \end{pmatrix}, \begin{pmatrix} P_{t|t} & P_{t|t} A_{t+1}^\top \\ A_{t+1} P_{t|t} & P_{t+1|t} \end{pmatrix} \right).$$
Conditioning on $x_{t+1}$ gives the Gaussian regression formula; replacing $x_{t+1}$ with its smoothed mean and absorbing its smoothed covariance produces the RTS update.
Proof Sketch
Use the Markov property of the smoothed posterior: $p(x_t \mid x_{t+1}, y_{1:T}) = p(x_t \mid x_{t+1}, y_{1:t})$. The first factor is Gaussian (joint Gaussian conditioning) with mean $\hat{x}_{t|t} + J_t \big(x_{t+1} - \hat{x}_{t+1|t}\big)$ and covariance $P_{t|t} - J_t P_{t+1|t} J_t^\top$. Marginalizing over $x_{t+1} \sim \mathcal{N}\big(\hat{x}_{t+1|T}, P_{t+1|T}\big)$ recovers the RTS formulas.
Why It Matters
Smoothing matters whenever the offline trajectory is what you care about: post-hoc trajectory reconstruction, EM-based parameter learning (which needs $\hat{x}_{t|T}$ and $P_{t|T}$ in the E-step), and any setting where you can afford to wait for future observations before reporting.
Failure Mode
The RTS smoother inherits all model-misspecification issues from the Kalman filter. It also needs an invertible $P_{t+1|t}$ at every step; if the transition matrix is singular and the process noise covariance is rank-deficient, $P_{t+1|t}$ may be rank-deficient and you should switch to the information-form smoother or a square-root implementation.
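The forward filter and backward sweep fit in one compact function. A toy NumPy sketch for a time-invariant model (notation: transition `A`, observation `H`, noise covariances `Q`, `R`); the property to check is that smoothed variances never exceed filtered ones, with equality at the final time step:

```python
import numpy as np

def filter_and_smooth(y, A, H, Q, R, x0, P0):
    """Kalman filter forward pass followed by the RTS backward sweep."""
    T, n = y.shape[0], x0.shape[0]
    xp = np.zeros((T, n)); Pp = np.zeros((T, n, n))   # predicted moments
    xf = np.zeros((T, n)); Pf = np.zeros((T, n, n))   # filtered moments
    x, P = x0, P0
    for t in range(T):                                # forward filter
        xp[t] = A @ x
        Pp[t] = A @ P @ A.T + Q
        S = H @ Pp[t] @ H.T + R
        K = Pp[t] @ H.T @ np.linalg.inv(S)
        x = xp[t] + K @ (y[t] - H @ xp[t])
        P = (np.eye(n) - K @ H) @ Pp[t]
        xf[t], Pf[t] = x, P
    xs, Ps = xf.copy(), Pf.copy()                     # start from t = T
    for t in range(T - 2, -1, -1):                    # backward RTS sweep
        J = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])    # backward gain
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xs, Ps, xf, Pf
```

Because the smoothed next-step covariance satisfies $P_{t+1|T} \preceq P_{t+1|t}$, the backward update can only shrink the filtered covariance, which the assertions below check on a scalar local-level model.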
EM for Parameter Estimation
When the model matrices are unknown, the EM algorithm of Shumway and Stoffer (1982) alternates filtering/smoothing with closed-form M-steps.
E-step. Given current parameter estimates, run the Kalman filter and RTS smoother to compute $\hat{x}_{t|T}$, $P_{t|T}$, and the lag-one smoothed cross-covariance $P_{t,t-1|T} = \operatorname{Cov}(x_t, x_{t-1} \mid y_{1:T})$.
M-step. Maximize the expected complete-data log-likelihood. The complete-data log-likelihood is
$$\log p(x_{0:T}, y_{1:T}) = \log p(x_0) + \sum_{t=1}^{T} \log p(x_t \mid x_{t-1}) + \sum_{t=1}^{T} \log p(y_t \mid x_t),$$
which decomposes into Gaussian likelihoods quadratic in the parameters. Closed-form updates:
$$A^{\text{new}} = \left( \sum_{t=1}^{T} \mathbb{E}\big[x_t x_{t-1}^\top \mid y_{1:T}\big] \right) \left( \sum_{t=1}^{T} \mathbb{E}\big[x_{t-1} x_{t-1}^\top \mid y_{1:T}\big] \right)^{-1},$$
with analogous updates for $H$, $Q$, $R$, $\mu_0$, and $\Sigma_0$. The expectations come from the smoother, e.g. $\mathbb{E}\big[x_t x_{t-1}^\top \mid y_{1:T}\big] = P_{t,t-1|T} + \hat{x}_{t|T} \hat{x}_{t-1|T}^\top$. EM monotonically increases the marginal likelihood $p(y_{1:T})$.
In practice, EM for state space models converges slowly near the optimum and is sensitive to local minima with many states. Direct gradient descent on the marginal log-likelihood (computed via the prediction-error decomposition $\log p(y_{1:T}) = \sum_{t=1}^{T} \log \mathcal{N}\big(y_t;\, H_t \hat{x}_{t|t-1},\, S_t\big)$) often converges faster.
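The prediction-error decomposition is easy to verify numerically: for a scalar local-level model ($x_t = x_{t-1} + w_t$, $y_t = x_t + v_t$, $x_0 \sim \mathcal{N}(0, P_0)$), the innovations-based log-likelihood from the filter must equal the brute-force joint Gaussian log-density of $y_{1:T}$. A sketch with hypothetical parameter values:

```python
import numpy as np

def kalman_loglik(y, q, r, P0):
    """log p(y_1:T) via the prediction-error decomposition (scalar local level)."""
    x, P, ll = 0.0, P0, 0.0
    for yt in y:
        x_pred, P_pred = x, P + q       # prediction step
        S = P_pred + r                  # innovation variance
        innov = yt - x_pred
        ll += -0.5 * (np.log(2 * np.pi * S) + innov**2 / S)
        K = P_pred / S
        x = x_pred + K * innov          # update step
        P = (1 - K) * P_pred
    return ll

def joint_loglik(y, q, r, P0):
    """Brute force: y_1:T ~ N(0, G) with G[s,t] = P0 + q*min(s+1,t+1) + r*1{s=t}."""
    T = len(y)
    idx = np.arange(1, T + 1)
    G = P0 + q * np.minimum.outer(idx, idx) + r * np.eye(T)
    sign, logdet = np.linalg.slogdet(G)
    return -0.5 * (T * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(G, y))
```

The filter's version costs $O(T)$; the brute-force version costs $O(T^3)$, which is the whole point of the decomposition.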
ARIMA as State Space
ARIMA-State Space Equivalence
Statement
Every ARIMA($p, d, q$) process admits a linear Gaussian state space representation with state dimension $\max(p, q+1) + d$. Conversely, every linear time-invariant state space model with scalar observation has an ARMA representation in the observation series.
Intuition
The state space form gives ARMA's recurrence a Markov reformulation: pack enough lags into the state vector to make the next state computable from the current one. Conversely, eliminating the state from the pair of equations via the lag operator gives a rational transfer function, which is exactly what ARMA encodes.
Proof Sketch
Forward direction (Harvey 1989, Chapter 4). Consider ARMA($p, q$) with $y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$. Define the state $x_t = \big(y_t, \hat{y}_{t+1|t}, \dots, \hat{y}_{t+r-1|t}\big)^\top$, where $\hat{y}_{t+j|t}$ is the $j$-step-ahead linear forecast and $r = \max(p, q+1)$. Then
$$x_{t+1} = A x_t + g\, \varepsilon_{t+1}, \qquad y_t = (1, 0, \dots, 0)\, x_t,$$
where $A$ has companion-matrix structure with the AR coefficients on the bottom row, $g$ contains the MA coefficients (with $g_1 = 1$), and the observation equation is noiseless. For ARIMA, prepend $d$ accumulator states.
Reverse direction. Eliminate the state: $(I - A L)\, x_t = g\, \varepsilon_t$ in the lag operator $L$ gives
$$y_t = H (I - A L)^{-1} g\, \varepsilon_t,$$
a rational function of $L$ applied to the noise, which is exactly the ARMA transfer function $\theta(L)/\phi(L)$.
Why It Matters
The state space form lets you compute the Gaussian likelihood of an ARIMA model in $O(n)$ via the Kalman filter, handle missing observations natively (just skip the update step), and incorporate exogenous regressors as control inputs. It is also how arima() in R and statsmodels in Python actually fit ARIMA: they run a Kalman filter on the state space form.
Failure Mode
The state dimension $\max(p, q+1) + d$ grows with the model order, which can make the filter expensive for high-order models. The state space representation is non-unique (any invertible linear transformation of the state gives an equivalent model); this redundancy is irrelevant for the likelihood but matters when interpreting the state directly.
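The forward direction can be illustrated with the simplest nontrivial case. A toy sketch for AR(2), using the plain companion-form state $x_t = (y_t, y_{t-1})$ rather than Harvey's minimal forecast-state form: simulating through the state space equations reproduces the AR(2) recursion exactly.

```python
import numpy as np

# AR(2): y_t = p1*y_{t-1} + p2*y_{t-2} + e_t, written in state space form with
# state x_t = (y_t, y_{t-1}), companion transition matrix, noise entering only
# the first component, and an observation that reads off the first component.
rng = np.random.default_rng(3)
p1, p2 = 0.5, -0.3
A = np.array([[p1, p2],
              [1.0, 0.0]])          # companion matrix
g = np.array([1.0, 0.0])            # noise loading
H = np.array([1.0, 0.0])            # observation vector

x = np.zeros(2)
ys, es = [], []
for _ in range(500):
    e = rng.normal()
    x = A @ x + g * e               # state equation
    ys.append(H @ x)                # observation equation
    es.append(e)

# The observation series satisfies the AR(2) recursion exactly.
for t in range(2, 500):
    assert np.isclose(ys[t], p1 * ys[t - 1] + p2 * ys[t - 2] + es[t])
```

Running the Kalman filter on this representation (with the AR coefficients as unknowns) is exactly how statsmodels evaluates the ARMA likelihood.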
Connection to Modern SSMs (S4, Mamba)
Continuous-time linear state space models $\dot{x}(t) = A x(t) + B u(t)$, $y(t) = C x(t)$ are the language of control theory. Discretizing them gives a discrete linear recurrence $x_t = \bar{A} x_{t-1} + \bar{B} u_t$, $y_t = C x_t$. This is exactly the form used by S4 (Gu et al. 2022), where $A$ is parametrized via the HiPPO matrix to give long-range memory, and Mamba (Gu and Dao 2024), which makes the discretization step and the $B, C$ matrices input-dependent.
The connection is structural, not coincidental: a learned linear SSM is doing the same thing the Kalman filter does (recursively summarize past observations into a fixed-size state) with a different parameterization. The classical theory tells you what the inductive bias of these architectures is — they encode bandlimited or rational transfer functions, with limits on what kinds of dependencies they can capture without nonlinearity. See the Mamba and state space models page for the architectural detail.
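The discretization step, and the recurrence/convolution duality that makes these models fast to train, can be made concrete. A sketch with made-up small matrices, using the bilinear transform $\bar{A} = (I - \tfrac{\Delta}{2} A)^{-1}(I + \tfrac{\Delta}{2} A)$ (the scheme S4 uses): unrolling the recurrence gives the same output as a causal convolution with kernel $K_j = C \bar{A}^j \bar{B}$.

```python
import numpy as np

# Discretize a continuous SSM x' = Ax + Bu, y = Cx with the bilinear transform,
# then check that the linear recurrence equals a convolution with the kernel
# K_j = C Abar^j Bbar. Toy matrices, single input/output channel.
rng = np.random.default_rng(4)
n, T, dt = 4, 50, 0.1
A = -np.eye(n) + 0.1 * rng.normal(size=(n, n))   # stable-ish continuous dynamics
B = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))

I = np.eye(n)
Abar = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)   # bilinear transform
Bbar = np.linalg.solve(I - dt / 2 * A, dt * B)

u = rng.normal(size=T)

# 1) Run the recurrence x_t = Abar x_{t-1} + Bbar u_t, y_t = C x_t.
x = np.zeros((n, 1))
y_rec = []
for t in range(T):
    x = Abar @ x + Bbar * u[t]
    y_rec.append((C @ x).item())

# 2) Same output as a causal convolution with kernel K_j = C Abar^j Bbar.
K = [(C @ np.linalg.matrix_power(Abar, j) @ Bbar).item() for j in range(T)]
y_conv = [sum(K[j] * u[t - j] for j in range(t + 1)) for t in range(T)]

assert np.allclose(y_rec, y_conv)
```

S4 trains in the convolutional view (one long FFT-based convolution per layer) and deploys in the recurrent view (constant memory per step); the equality above is what licenses the switch.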
Common Confusions
The state is not unique
Two different state space models with state vectors related by an invertible linear transformation $z_t = T x_t$ produce identical observation distributions. The state matrices change accordingly: $A \to T A T^{-1}$, $H \to H T^{-1}$, $Q \to T Q T^\top$ (and $B \to T B$, $\mu_0 \to T \mu_0$, $\Sigma_0 \to T \Sigma_0 T^\top$). This nonidentifiability is harmless for prediction and likelihood evaluation but fatal for interpretation. If you want the state to mean something physical (position, velocity), parametrize directly; if you only care about $y_t$, fit any equivalent representation.
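A quick sketch of the invariance, with an arbitrary invertible transform `Tmat`: driving the transformed model with the same noise draws (mapped through `Tmat`) produces exactly the same observations, step by step.

```python
import numpy as np

# State nonidentifiability: transform the state by any invertible Tmat, with
# A -> Tmat A Tmat^-1 and H -> H Tmat^-1. Feeding the same noise (mapped
# through Tmat) into both models yields identical observation sequences.
rng = np.random.default_rng(6)
A = np.array([[0.9, 0.2], [0.0, 0.7]])
H = np.array([[1.0, 1.0]])
Tmat = rng.normal(size=(2, 2)) + 2 * np.eye(2)   # some invertible transform
Tinv = np.linalg.inv(Tmat)
A2, H2 = Tmat @ A @ Tinv, H @ Tinv

x, z = np.zeros(2), np.zeros(2)
for _ in range(100):
    w = rng.normal(size=2)
    x = A @ x + w                    # original state
    z = A2 @ z + Tmat @ w            # transformed state, noise mapped through Tmat
    assert np.allclose(H @ x, H2 @ z)  # identical observations at every step
```

By induction $z_t = T x_t$ throughout, so $H T^{-1} z_t = H x_t$; the mapped noise has covariance $T Q T^\top$, matching the transformation rule above.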
P_{t|t} does not depend on the data
The covariance recursion depends only on $(A_t, H_t, Q_t, R_t)$. The actual observations enter only through the mean update. Consequently you can precompute $P_{t|t-1}$, $K_t$, and $P_{t|t}$ before seeing any data; the filter's "uncertainty trajectory" is a function of the model alone. Steady-state Kalman filters exploit this by precomputing the limiting covariance (and hence the limiting gain) from the discrete algebraic Riccati equation.
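The data-independence is easy to demonstrate: run one filter on two completely different observation sequences and compare. A toy sketch with a constant-velocity model (hypothetical parameter values): the filtered means differ, the covariance trajectories are bitwise identical.

```python
import numpy as np

def run_filter(y, A, H, Q, R, x0, P0):
    """Plain Kalman filter returning filtered means and covariances."""
    x, P, xs, Ps = x0, P0, [], []
    n = len(x0)
    for t in range(len(y)):
        x_pred, P_pred = A @ x, A @ P @ A.T + Q
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x = x_pred + K @ (y[t] - H @ x_pred)
        P = (np.eye(n) - K @ H) @ P_pred
        xs.append(x.copy()); Ps.append(P.copy())
    return np.array(xs), np.array(Ps)

rng = np.random.default_rng(5)
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity model
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2); R = np.array([[1.0]])
y1 = rng.normal(size=(30, 1))
y2 = rng.normal(size=(30, 1)) + 100.0    # totally different data
x1, P1 = run_filter(y1, A, H, Q, R, np.zeros(2), np.eye(2))
x2, P2 = run_filter(y2, A, H, Q, R, np.zeros(2), np.eye(2))
assert not np.allclose(x1, x2)           # means track the data
assert np.allclose(P1, P2)               # covariances are data-independent
```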
Smoothing is not just running the filter backward
The RTS smoother is a forward filter followed by a backward sweep that uses smoothed future estimates. It is not the Kalman filter applied to the time-reversed series, because reversing the dynamics requires a different state-transition matrix and noise covariance derived from the original. The two-filter form (information filter forward + information filter backward, then combine) is an alternative implementation, but again with carefully derived backward dynamics.
Summary
- State space form has a hidden Markov state $x_t$ and a noisy observation $y_t = H_t x_t + v_t$. The state carries everything needed from the past.
- The Kalman gain $K_t = P_{t|t-1} H_t^\top S_t^{-1}$ minimizes the trace of $P_{t|t}$. The derivation does not require Gaussianity; it gives the BLUE in general.
- The RTS smoother sweeps backward to compute $p(x_t \mid y_{1:T})$, tighter than the filtered $p(x_t \mid y_{1:t})$.
- EM with E-step = filter + smoother and closed-form M-step learns $(A, H, Q, R)$ from data; gradient descent on the prediction-error decomposition is often faster.
- Every ARIMA has a state space representation, enabling likelihood computation via the Kalman filter.
- Modern sequence models S4 and Mamba are discretized linear SSMs with learned dynamics; the classical theory describes their inductive bias.
Exercises
Problem
A scalar local-level model has $x_t = x_{t-1} + w_t$, $y_t = x_t + v_t$, with $w_t \sim \mathcal{N}(0, q)$ and $v_t \sim \mathcal{N}(0, r)$. Show that the steady-state Kalman gain is $K = P/(P + r)$, where the steady-state predicted variance $P = \tfrac{1}{2}\big(q + \sqrt{q^2 + 4qr}\big)$ solves the Riccati fixed-point equation. Show that $K \to 1$ as $q/r \to \infty$ and $K \to 0$ as $q/r \to 0$.
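A numerical check of the fixed point (not a substitute for the derivation): iterate the scalar Riccati recursion to convergence and compare the resulting gain against the closed form and the two limits.

```python
import numpy as np

# Iterate the local-level Riccati recursion: predicted variance P maps to
# filtered variance (1-K)P, then back to predicted variance (1-K)P + q.
def steady_state_gain(q, r, iters=500):
    P = 1.0                       # predicted variance, any positive start
    for _ in range(iters):
        K = P / (P + r)
        P = (1 - K) * P + q       # filtered variance + q = next predicted variance
    return P / (P + r), P

q, r = 0.3, 1.0
K, P = steady_state_gain(q, r)
P_closed = (q + np.sqrt(q**2 + 4 * q * r)) / 2
assert np.isclose(P, P_closed)
assert np.isclose(K, P_closed / (P_closed + r))
# Limits: K -> 1 as q/r -> infinity, K -> 0 as q/r -> 0.
assert steady_state_gain(1e6, 1.0)[0] > 0.999
assert steady_state_gain(1e-6, 1.0)[0] < 0.01
```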
Problem
Write down a state space representation of the AR(2) model $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$. Verify that the observation series recovers the original AR(2) recursion.
Problem
Suppose you fit a Kalman filter with the wrong process noise covariance $\tilde{Q} = \alpha Q$ for some $\alpha \neq 1$, but the correct observation matrices. Show that the filter is still linear and unbiased, but no longer minimum-variance, and that its reported covariance is wrong. What does the filter overconfidence look like as a function of $\alpha$?
References
Canonical:
- Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering, 82(1), 1960. The original.
- Anderson, B. D. O., Moore, J. B. Optimal Filtering. Prentice-Hall, 1979, Chapters 3-7.
- Harvey, A. C. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989, Chapters 3-4.
- Hamilton, J. D. Time Series Analysis, Princeton University Press, 1994, Chapter 13.
- Rauch, H. E., Tung, F., Striebel, C. T. "Maximum Likelihood Estimates of Linear Dynamic Systems." AIAA Journal, 3(8), 1965 (RTS smoother).
Current:
- Sarkka, S. Bayesian Filtering and Smoothing. Cambridge University Press, 2013, Chapters 4-6.
- Shumway, R. H., Stoffer, D. S. Time Series Analysis and Its Applications, 4th ed., Springer, 2017, Chapter 6.
- Murphy, K. P. Probabilistic Machine Learning: Advanced Topics, MIT Press, 2023, Chapter 8 (state space models and SSMs as sequence models).
- Gu, A., Goel, K., Re, C. "Efficiently Modeling Long Sequences with Structured State Spaces (S4)," arXiv:2111.00396, 2022.
- Gu, A., Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752, 2024.
Next Topics
- Kalman filter: detailed treatment with EKF, UKF, and worked examples.
- Particle filters: sequential Monte Carlo for nonlinear, non-Gaussian SSMs.
- Mamba and state space models: how learned SSMs compete with attention.
- Deep learning for time series: where neural sequence models inherit (and depart from) the SSM framework.
- Time series forecasting basics: practical model selection across ARIMA, ETS, and modern alternatives.
Last reviewed: May 6, 2026
Canonical graph

Required prerequisites
- Expectation, Variance, Covariance, and Moments
- Kalman Filter
- Markov Chains and Steady State
- Time Series Foundations

Derived topics
- Deep Learning for Time Series