Applied Math
Time Series Foundations
Rigorous treatment of stationarity, the Wold decomposition, autocorrelation, unit roots, AR/MA/ARMA/ARIMA models, and spectral representation. The classical theory that every modern sequence model rests on.
Why This Matters
Time series carry temporal structure that i.i.d. methods cannot exploit and cannot ignore. A sample of stock returns, server latencies, ECG voltages, or climate temperatures is a single realization of a stochastic process indexed by time, not a random sample from a distribution. The dependence between $X_t$ and its past sets the statistical properties of every estimator built on top.
The classical theory developed by Yule, Wold, Box, Jenkins, and Hamilton answers four questions. When does a process have stable statistics over time (stationarity)? When can it be represented as an infinite linear combination of past shocks (Wold)? Which finite-parameter family approximates the linear structure (ARMA)? And what does the second-order structure look like in frequency space (spectral density)? Everything that follows — from Kalman filtering to PatchTST — assumes or relaxes one of these answers.
Stationarity
Two notions of stationarity appear in the literature. Most theory uses the weaker (covariance) form because it is what estimators actually need.
Strict Stationarity
A process $\{X_t\}$ is strictly stationary if and only if for every $n$, every set of time indices $t_1, \dots, t_n$, and every shift $h$, $(X_{t_1}, \dots, X_{t_n}) \overset{d}{=} (X_{t_1+h}, \dots, X_{t_n+h})$.
The full joint distribution is shift-invariant.
Weak (Covariance) Stationarity
A process $\{X_t\}$ with $E[X_t^2] < \infty$ is weakly stationary (or covariance stationary) if:
- $E[X_t] = \mu$ for all $t$,
- $\mathrm{Var}(X_t) = \gamma(0) < \infty$ for all $t$,
- $\mathrm{Cov}(X_t, X_{t-h}) = \gamma(h)$ depends only on the lag $h$, not on $t$.
The function $\gamma(\cdot)$ is the autocovariance function and is symmetric: $\gamma(-h) = \gamma(h)$.
Strict stationarity does not imply weak stationarity (a strictly stationary Cauchy process has no second moment). Weak stationarity does not imply strict stationarity. Under joint Gaussianity the two coincide because the joint distribution of a Gaussian vector is determined by its first two moments.
Stationarity is not the same as ergodicity
A stationary process can have time averages that fail to converge to ensemble averages. Ergodicity (for the mean) is the additional condition that $\frac{1}{T}\sum_{t=1}^{T} X_t \to \mu$ almost surely. For a stationary Gaussian process, $\gamma(h) \to 0$ as $h \to \infty$ suffices for ergodicity. Estimating $\mu$ and $\gamma(h)$ from one realization requires ergodicity, not just stationarity.
Autocorrelation
Autocorrelation Function
The autocorrelation function (ACF) of a covariance-stationary process is $\rho(h) = \gamma(h)/\gamma(0)$, the autocovariance normalized by the variance. $\rho(0) = 1$, $|\rho(h)| \le 1$, and $\rho(-h) = \rho(h)$.
Partial Autocorrelation Function
The partial autocorrelation at lag $k$ is the correlation between $X_t$ and $X_{t-k}$ after removing the linear effect of the intermediate lags $X_{t-1}, \dots, X_{t-k+1}$. Concretely, regress $X_t$ on $X_{t-1}, \dots, X_{t-k}$; the coefficient on $X_{t-k}$ is the PACF value $\alpha(k)$.
The PACF is computed from the Yule-Walker equations. For an AR($p$) process, the autocovariance satisfies the linear system $\gamma(h) = \sum_{j=1}^{p} \phi_j \gamma(h-j)$ for $h = 1, \dots, p$, which in matrix form is $\Gamma_p \phi = \gamma_p$, where $\Gamma_p = [\gamma(i-j)]_{i,j=1}^{p}$ is the Toeplitz autocovariance matrix. The PACF at lag $k$ is the last coefficient obtained by solving this system at order $k$.
ACF and PACF identify model order. For AR($p$), the PACF cuts off at lag $p$ and the ACF decays geometrically. For MA($q$), the ACF cuts off at lag $q$ and the PACF decays. The Box-Jenkins identification procedure reads $p$ and $q$ off the sample ACF and PACF plots.
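A minimal numerical sketch of these patterns, using only NumPy; the AR(2) parameters, sample size, and function names here are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) inside the stationarity triangle: X_t = 0.6 X_{t-1} - 0.3 X_{t-2} + eps_t
T, phi1, phi2 = 5000, 0.6, -0.3
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(2, T):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + eps[t]

def sample_acovf(x, max_lag):
    """Biased sample autocovariances gamma_hat(0..max_lag) (divide by T, not T-h)."""
    x = x - x.mean()
    n = len(x)
    return np.array([x[: n - h] @ x[h:] / n for h in range(max_lag + 1)])

def pacf_yule_walker(x, max_lag):
    """PACF at lag k = last coefficient of the order-k Yule-Walker solution."""
    g = sample_acovf(x, max_lag)
    alphas = []
    for k in range(1, max_lag + 1):
        Gamma = np.array([[g[abs(i - j)] for j in range(k)] for i in range(k)])  # Toeplitz matrix
        phi_k = np.linalg.solve(Gamma, g[1 : k + 1])
        alphas.append(phi_k[-1])
    return np.array(alphas)

g = sample_acovf(x, 10)
print("sample ACF :", np.round(g / g[0], 3))                  # geometric-looking decay
print("sample PACF:", np.round(pacf_yule_walker(x, 10), 3))   # roughly cuts off after lag 2
```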
AR, MA, ARMA, ARIMA
The lag operator $L$ acts on a sequence by $L X_t = X_{t-1}$. Polynomials in $L$ are the natural notation for linear time-series models.
AR(p), MA(q), ARMA(p,q)
Let $\varepsilon_t$ be white noise, $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$, meaning uncorrelated with mean zero and constant variance.
- AR($p$): $\phi(L) X_t = \varepsilon_t$, where $\phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$.
- MA($q$): $X_t = \theta(L) \varepsilon_t$, where $\theta(L) = 1 + \theta_1 L + \cdots + \theta_q L^q$.
- ARMA($p, q$): $\phi(L) X_t = \theta(L) \varepsilon_t$ (a minimal simulation sketch of these recursions follows after this list).
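A small simulation of the defining recursions, assuming Gaussian white noise and zero pre-sample values with the start-up transient discarded; all parameter values are illustrative:

```python
import numpy as np

def simulate_arma(phi, theta, sigma, n, burn=500, seed=0):
    """Simulate ARMA(p, q): X_t = sum_i phi_i X_{t-i} + eps_t + sum_j theta_j eps_{t-j}."""
    rng = np.random.default_rng(seed)
    p, q = len(phi), len(theta)
    total = n + burn
    eps = sigma * rng.standard_normal(total)
    x = np.zeros(total)
    for t in range(total):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x[t] = ar + eps[t] + ma
    return x[burn:]  # drop the burn-in so initial zeros do not contaminate the sample

ar1  = simulate_arma(phi=[0.8], theta=[], sigma=1.0, n=2000)           # AR(1)
ma1  = simulate_arma(phi=[], theta=[0.5], sigma=1.0, n=2000)           # MA(1)
arma = simulate_arma(phi=[0.6, -0.3], theta=[0.4], sigma=1.0, n=2000)  # ARMA(2,1)
print(ar1.var(), ma1.var(), arma.var())
```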
Stationarity Condition for AR(p)
Statement
The AR($p$) recursion $\phi(L) X_t = \varepsilon_t$ admits a unique covariance-stationary solution iff every root $z$ of $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ satisfies $|z| > 1$. The solution is causal: $X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$ with coefficients $\psi_j$ that decay geometrically.
Intuition
$\phi(z)^{-1}$ exists as a convergent power series in $z$ exactly when $\phi(z) \neq 0$ on a neighborhood of the closed unit disk. The coefficients $\psi_j$ come from the partial-fraction expansion of $\phi(z)^{-1}$; their geometric decay rate is the reciprocal of the modulus of the smallest root of $\phi$.
Proof Sketch
Factor $\phi(z) = \prod_{i=1}^{p} (1 - \lambda_i z)$ with $|\lambda_i| < 1$ and expand each factor as a geometric series $(1 - \lambda_i z)^{-1} = \sum_{k \ge 0} \lambda_i^k z^k$, valid for $|z| \le 1$. The product of these series gives $\phi(z)^{-1} = \sum_{j \ge 0} \psi_j z^j$ with $|\psi_j| \le C r^j$, where $r = \max_i |\lambda_i| < 1$. Substituting $L$ for $z$ and applying $\phi(L)^{-1}$ to $\varepsilon_t$ gives a series that converges in $L^2$. Uniqueness follows because any other stationary solution has the same Wold representation.
Why It Matters
This is the test you actually run before fitting an AR model. For AR(1), the condition reduces to $|\phi_1| < 1$. For AR(2) with parameters $(\phi_1, \phi_2)$, the stationarity region is the triangle $\phi_1 + \phi_2 < 1$, $\phi_2 - \phi_1 < 1$, $|\phi_2| < 1$. Outside this region, the recursion explodes or has a unit root.
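A root check along these lines takes a few lines of NumPy; the coefficient values below are illustrative:

```python
import numpy as np

def ar_is_stationary(phi):
    """Check that every root of phi(z) = 1 - phi_1 z - ... - phi_p z^p lies outside the unit circle."""
    # np.roots expects coefficients from the highest power down: [-phi_p, ..., -phi_1, 1]
    coeffs = np.r_[-np.asarray(phi, dtype=float)[::-1], 1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0)), roots

print(ar_is_stationary([0.5]))         # AR(1), |phi| < 1            -> stationary
print(ar_is_stationary([0.6, -0.3]))   # AR(2) inside the triangle   -> stationary
print(ar_is_stationary([1.2, -0.2]))   # phi(1) = 0, root at z = 1   -> unit root, not stationary
```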
Failure Mode
If $\phi(z)$ has a root on the unit circle ($|z| = 1$), the process has a unit root: shocks accumulate forever and $\mathrm{Var}(X_t)$ grows linearly in $t$. OLS fits to a unit-root series superconverge at rate $T$ instead of $\sqrt{T}$, and t-statistics follow the Dickey-Fuller distribution rather than Student's t. Naive inference is invalid.
MA($q$) is always covariance stationary (a finite linear combination of finite-variance white noise has finite variance). The dual condition for MA($q$) is invertibility: if every root of $\theta(z)$ lies outside the unit disk, you can write $\varepsilon_t$ as a convergent AR($\infty$) in past values of $X_t$.
Differencing
The first-difference operator is $\Delta = 1 - L$, so $\Delta X_t = X_t - X_{t-1}$. The $d$-fold difference is $\Delta^d = (1 - L)^d$.
ARIMA(p, d, q)
A process $X_t$ is ARIMA($p, d, q$) if and only if $\Delta^d X_t$ is ARMA($p, q$). The differencing order $d$ removes unit roots; the ARMA part models the stationary residual.
For seasonal data with period $s$, SARIMA($p, d, q$)($P, D, Q$)$_s$ adds seasonal AR/MA polynomials in $L^s$ and a seasonal difference $\Delta_s = 1 - L^s$.
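In practice these models are fitted with an existing library. A minimal sketch, assuming a recent statsmodels whose `ARIMA` class takes `order` (and, for SARIMA, `seasonal_order`) arguments; the simulated series and the chosen orders are illustrative:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)

# Build a series with one unit root: cumulative sum of a stationary ARMA(1,1).
T = 500
eps = rng.standard_normal(T)
z = np.zeros(T)
for t in range(1, T):
    z[t] = 0.5 * z[t - 1] + eps[t] + 0.3 * eps[t - 1]
y = np.cumsum(z)                      # integrating once makes the true model ARIMA(1, 1, 1)

res = ARIMA(y, order=(1, 1, 1)).fit() # p=1, d=1, q=1; seasonal data would add seasonal_order=(P, D, Q, s)
print(res.params)                     # AR, MA, and innovation-variance estimates
```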
Unit Roots and the ADF Test
A random walk $X_t = X_{t-1} + \varepsilon_t$ is the prototype non-stationary series. Its variance grows linearly: $\mathrm{Var}(X_t) = t \sigma^2$ (starting from $X_0 = 0$). Unit-root testing decides whether to model a series in levels or in differences.
The Augmented Dickey-Fuller (ADF) test fits the regression $\Delta X_t = \alpha + \beta t + \pi X_{t-1} + \sum_{i=1}^{k} \delta_i \Delta X_{t-i} + \varepsilon_t$ and tests $H_0: \pi = 0$ (unit root) against $H_1: \pi < 0$ (stationary). Under $H_0$, the t-statistic on $\hat{\pi}$ does not follow a standard normal; it follows the Dickey-Fuller distribution, computed numerically and tabulated in Hamilton (1994), Chapter 17.
The KPSS test reverses the null: $H_0$ is stationarity (around a level or trend), $H_1$ is a unit root. Combining ADF and KPSS provides robustness because each test has different size distortions under the other's null.
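A sketch of running both tests, assuming a recent statsmodels whose `adfuller` and `kpss` functions have the signatures used here; the simulated series are illustrative:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(2)
random_walk = np.cumsum(rng.standard_normal(1000))   # unit-root series
white_noise = rng.standard_normal(1000)              # stationary series

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    adf_stat, adf_p = adfuller(series, autolag="AIC")[:2]              # H0: unit root
    kpss_stat, kpss_p = kpss(series, regression="c", nlags="auto")[:2] # H0: level-stationary
    print(f"{name:12s}  ADF p={adf_p:.3f} (reject -> stationary)   "
          f"KPSS p={kpss_p:.3f} (reject -> unit root)")
```

Reading the two tests together: rejecting ADF and failing to reject KPSS points to stationarity; the reverse pattern points to a unit root; any other combination is a sign the series fits neither description cleanly.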
Differencing vs detrending
A unit-root series should be differenced. A trend-stationary series $X_t = a + bt + u_t$ with $u_t$ stationary should be detrended (subtract the fitted trend). Differencing a trend-stationary series leaves an error whose MA polynomial has a unit root (for white-noise $u_t$, the differenced error is MA(1) with $\theta = -1$), breaking invertibility. Detrending a true random walk leaves residuals that are still non-stationary. Test for unit roots first, then transform.
The Wold Decomposition
Every covariance-stationary process splits into a perfectly predictable deterministic part and a linear function of past shocks. This is the foundational result that justifies ARMA modeling.
Wold Decomposition (Wold 1938)
Statement
Let $\{X_t\}$ be covariance stationary with $E[X_t] = 0$. Then $X_t$ admits the unique decomposition $X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j} + \eta_t$, where:
- $\psi_0 = 1$, $\sum_{j=0}^{\infty} \psi_j^2 < \infty$,
- $\varepsilon_t$ is white noise: $E[\varepsilon_t] = 0$, $E[\varepsilon_t^2] = \sigma^2$, $E[\varepsilon_t \varepsilon_s] = 0$ for $t \neq s$,
- $\eta_t$ is deterministic: $\eta_t$ lies in $\bigcap_s \overline{\mathrm{sp}}\{X_u : u \le s\}$ in $L^2$, i.e. $\eta_t$ is a perfect linear function of the infinite past,
- $E[\varepsilon_t \eta_s] = 0$ for all $t$ and $s$.
With the normalization $\psi_0 = 1$, the decomposition is unique; the innovation variance $\sigma^2$ is pinned down as the one-step-ahead prediction error variance.
Intuition
Project $X_t$ onto the closed linear span of its own infinite past. Whatever is projected away — the residual $\varepsilon_t$ — is a fresh shock orthogonal to the past, by construction. Iterating the projection backward produces the MA($\infty$) representation. What remains in the projection itself is the deterministic part: a series whose value at time $t$ is exactly determined by past values.
Proof Sketch
Work in the Hilbert space $L^2$ of finite-variance random variables with inner product $\langle X, Y \rangle = E[XY]$. Define $\mathcal{H}_t = \overline{\mathrm{sp}}\{X_s : s \le t\}$, the closed linear span of the past up to time $t$.
Step 1 (innovations). The innovation $\varepsilon_t = X_t - P_{\mathcal{H}_{t-1}} X_t$ is the projection residual when predicting $X_t$ from its strict past. By construction $\varepsilon_t \perp \mathcal{H}_{t-1}$, so $E[\varepsilon_t \varepsilon_s] = 0$ for $s < t$, and stationarity makes $E[\varepsilon_t^2] = \sigma^2$ constant. So $\{\varepsilon_t\}$ is white noise.
Step 2 (MA expansion). Set $\psi_j = E[X_t \varepsilon_{t-j}] / \sigma^2$. Bessel's inequality gives $\sum_j \psi_j^2 \sigma^2 \le E[X_t^2] < \infty$. Define $\hat{X}_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$, which converges in $L^2$. By construction $\hat{X}_t$ is the projection of $X_t$ onto $\overline{\mathrm{sp}}\{\varepsilon_s : s \le t\}$.
Step 3 (deterministic remainder). Let $\eta_t = X_t - \hat{X}_t$. Then $\eta_t \perp \varepsilon_s$ for all $s \le t$. Since $\eta_t \in \mathcal{H}_t$ and $\{\varepsilon_s : s \le t\}$ spans the innovation directions inside $\mathcal{H}_t$, $\eta_t$ lies in the orthogonal complement, which is the remote past $\mathcal{H}_{-\infty} = \bigcap_s \mathcal{H}_s$. Elements of this tail subspace are perfectly predictable from any earlier subspace; in particular $\eta_t = P_{\mathcal{H}_{t-1}} \eta_t$, so $\eta_t$ is deterministic.
Step 4 (uniqueness). If $X_t = \sum_j \tilde{\psi}_j \tilde{\varepsilon}_{t-j} + \tilde{\eta}_t$ is another such decomposition, the orthogonality conditions force $\tilde{\varepsilon}_t$ to equal the same projection residual $X_t - P_{\mathcal{H}_{t-1}} X_t = \varepsilon_t$, hence $\tilde{\psi}_j = \psi_j$ and $\tilde{\eta}_t = \eta_t$.
Why It Matters
Wold says ARMA is not an arbitrary parametric family but the natural finite-parameter approximation to the universal MA($\infty$) representation. Any purely nondeterministic covariance-stationary process can be approximated arbitrarily well in $L^2$ by ARMA models. Combined with differencing for non-stationary inputs, this is the theoretical content of Box-Jenkins methodology.
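One way to see the MA($\infty$) content of a fitted ARMA model numerically is to expand its $\psi$-weights via the recursion $\psi_j = \theta_j + \sum_{i=1}^{\min(j,p)} \phi_i \psi_{j-i}$ with $\psi_0 = 1$. A small sketch in NumPy; the parameter values and function name are illustrative:

```python
import numpy as np

def arma_psi_weights(phi, theta, n_weights=20):
    """MA(infinity) coefficients psi_j of a causal ARMA(p, q): phi(L) X_t = theta(L) eps_t.
    Recursion: psi_0 = 1, psi_j = theta_j + sum_{i=1}^{min(j,p)} phi_i * psi_{j-i}."""
    p, q = len(phi), len(theta)
    psi = np.zeros(n_weights)
    psi[0] = 1.0
    for j in range(1, n_weights):
        th = theta[j - 1] if j <= q else 0.0
        ar = sum(phi[i - 1] * psi[j - i] for i in range(1, min(j, p) + 1))
        psi[j] = th + ar
    return psi

# ARMA(2,1) with phi = (0.6, -0.3), theta = (0.4): psi_j decays geometrically, so a short
# MA truncation already captures almost all of the linear (second-order) structure.
psi = arma_psi_weights(phi=[0.6, -0.3], theta=[0.4], n_weights=15)
print(np.round(psi, 4))
print("tail share of sum psi_j^2:", np.sum(psi[10:] ** 2) / np.sum(psi ** 2))
```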
Failure Mode
Wold is purely linear: it captures the second-order structure and nothing else. Nonlinear dependencies — volatility clustering (GARCH), regime switching (Hamilton 1989), threshold dynamics (TAR) — are invisible to the Wold representation. A series can be Wold-decomposable into white noise yet have strong predictive structure that ARMA cannot capture. The white-noise innovations are uncorrelated, not independent.
Spectral Representation
The autocovariance function and the spectral density are Fourier-pair descriptions of the same second-order structure. Spectral analysis is what you reach for when periodic or quasi-periodic structure dominates.
Spectral Density
For a covariance-stationary process with absolutely summable autocovariances ($\sum_h |\gamma(h)| < \infty$), the spectral density is the Fourier transform of $\gamma$:
$$f(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma(h)\, e^{-i\omega h}, \qquad \omega \in [-\pi, \pi].$$
The inverse relation is $\gamma(h) = \int_{-\pi}^{\pi} e^{i\omega h} f(\omega)\, d\omega$.
$f(\omega) \ge 0$ because it is the limit of expected periodograms; this is Bochner's theorem applied to $\gamma$. The variance decomposition $\gamma(0) = \int_{-\pi}^{\pi} f(\omega)\, d\omega$ shows that $f$ allocates total variance across frequencies.
Spectral Representation Theorem
Statement
Every mean-zero covariance-stationary process admits the representation $X_t = \int_{-\pi}^{\pi} e^{i\omega t}\, dZ(\omega)$, where $Z(\omega)$ is a complex-valued process with orthogonal increments: $E[dZ(\omega)\,\overline{dZ(\lambda)}] = 0$ for $\omega \neq \lambda$, and $E|dZ(\omega)|^2 = dF(\omega)$ for a non-decreasing spectral distribution $F$. When $F$ is absolutely continuous, $dF(\omega) = f(\omega)\, d\omega$.
Intuition
A stationary process is a continuous superposition of complex exponentials with random uncorrelated amplitudes. The amount of "energy" at frequency $\omega$ is $f(\omega)\, d\omega$. AR(2) processes with complex roots show pronounced peaks in $f$ at the resonance frequency; white noise has flat $f(\omega) = \sigma^2 / 2\pi$.
Proof Sketch
The key tool is Bochner's theorem: a function $\gamma$ is the autocovariance of some stationary process iff it is positive semidefinite, which by Bochner means $\gamma(h) = \int_{-\pi}^{\pi} e^{i\omega h}\, dF(\omega)$ for some non-decreasing $F$. Then construct $Z$ via the isometry between $\overline{\mathrm{sp}}\{X_t\}$ and $L^2(dF)$ that maps $X_t \mapsto e^{i\omega t}$, setting $Z(\omega)$ to the preimage of the indicator $\mathbf{1}_{[-\pi, \omega]}$; verify orthogonal increments using stationarity, and check that the inverse Fourier transform recovers $\gamma$.
Why It Matters
The spectral view gives closed-form expressions for ARMA spectra. For ARMA($p, q$), $f(\omega) = \frac{\sigma^2}{2\pi} \frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}$. AR roots near the unit circle produce sharp peaks; MA roots near the unit circle produce sharp dips. This is the basis of frequency-domain estimation, the Whittle likelihood, and bandpass filtering.
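The closed form can be evaluated directly. A sketch in NumPy for an AR(2) with complex roots, which exhibits the resonance peak mentioned above; the coefficients and function name are illustrative:

```python
import numpy as np

def arma_spectral_density(omega, phi, theta, sigma2=1.0):
    """f(omega) = sigma^2 / (2*pi) * |theta(e^{-i omega})|^2 / |phi(e^{-i omega})|^2."""
    z = np.exp(-1j * np.asarray(omega))
    phi_z = 1.0 - sum(p * z ** (k + 1) for k, p in enumerate(phi))
    theta_z = 1.0 + sum(t * z ** (k + 1) for k, t in enumerate(theta))
    return sigma2 / (2 * np.pi) * np.abs(theta_z) ** 2 / np.abs(phi_z) ** 2

omega = np.linspace(0, np.pi, 1000)
# AR(2) with complex roots: X_t = 1.4 X_{t-1} - 0.9 X_{t-2} + eps_t (roots have modulus > 1)
f = arma_spectral_density(omega, phi=[1.4, -0.9], theta=[])
print(f"spectral peak near omega = {omega[np.argmax(f)]:.3f} rad")  # pronounced resonance peak
```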
Failure Mode
The spectral density requires summability of $\gamma$. Long-memory processes ($\gamma(h) \sim c\, h^{2d-1}$ for $0 < d < 1/2$) have $f(\omega) \to \infty$ as $\omega \to 0$ at rate $\omega^{-2d}$. Standard ARMA spectral estimators are misspecified; fractional differencing (ARFIMA) is the right tool.
Worked Example: AR(1) Spectrum
For $X_t = \phi X_{t-1} + \varepsilon_t$ with $|\phi| < 1$, the autocovariance is $\gamma(h) = \sigma^2 \phi^{|h|} / (1 - \phi^2)$. The spectral density is
$$f(\omega) = \frac{\sigma^2}{2\pi} \cdot \frac{1}{1 - 2\phi \cos\omega + \phi^2}.$$
For $\phi > 0$, $f$ peaks at $\omega = 0$ (low-frequency dominance, slow drifts). For $\phi < 0$, $f$ peaks at $\omega = \pi$ (high-frequency dominance, oscillation). The integral $\int_{-\pi}^{\pi} f(\omega)\, d\omega = \gamma(0)$ confirms variance conservation.
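A numerical check of this worked example (the values of $\phi$ and $\sigma^2$ are illustrative):

```python
import numpy as np

phi, sigma2 = 0.7, 1.0

def f_ar1(omega):
    # AR(1) spectral density: sigma^2 / (2*pi) / (1 - 2*phi*cos(omega) + phi^2)
    return sigma2 / (2 * np.pi) / (1.0 - 2.0 * phi * np.cos(omega) + phi ** 2)

gamma0 = sigma2 / (1.0 - phi ** 2)            # closed-form variance gamma(0)

# Trapezoid-rule integral of f over [-pi, pi]; should match gamma(0) (variance conservation)
omega = np.linspace(-np.pi, np.pi, 200001)
vals = f_ar1(omega)
integral = np.sum((vals[:-1] + vals[1:]) / 2) * (omega[1] - omega[0])

print(f"gamma(0)      = {gamma0:.6f}")
print(f"integral of f = {integral:.6f}")
print(f"f(0) / f(pi)  = {f_ar1(0.0) / f_ar1(np.pi):.2f}")  # phi > 0: low-frequency peak
```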
Common Confusions
White noise is not Gaussian
White noise is uncorrelated with zero mean and constant variance. It need not be Gaussian, independent, or even strictly stationary. A sequence of independent draws from a Cauchy distribution is uncorrelated only formally (no second moment), but bounded non-Gaussian white noise such as a centered Bernoulli (Rademacher ±1) sequence is fine. The AR/MA theory works for any white-noise innovation; Gaussianity is required only for finite-sample distributions of estimators.
ACF on a single realization is a noisy estimator
The sample ACF has standard error roughly $1/\sqrt{T}$ for white noise, but it is biased toward zero in finite samples and the bias grows with the lag. The Bartlett confidence bands $\pm 1.96/\sqrt{T}$ assume white noise. Plotting raw ACF/PACF without these bands invites overfitting.
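A small NumPy sketch of the white-noise band; the sample size and number of lags are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 400
x = rng.standard_normal(T)            # genuine white noise
x = x - x.mean()

# Sample ACF at lags 1..20 (biased estimator, divide by T)
acf = np.array([x[: T - h] @ x[h:] / (T * x.var()) for h in range(1, 21)])
band = 1.96 / np.sqrt(T)              # Bartlett band under the white-noise null

outside = int(np.sum(np.abs(acf) > band))
print(f"band = ±{band:.3f}; lags outside the band: {outside} of 20")
# About 1 lag in 20 lands outside the band by chance even for true white noise,
# which is why isolated spikes in a raw ACF plot should not be over-interpreted.
```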
Stationarity is not a property you can confirm, only fail to reject
Tests for stationarity are tests of model assumptions, not of truth. ADF has low power to reject a unit root in small samples. KPSS fails to reject stationarity on series with subtle structural breaks. Real series are never strictly stationary; the question is whether the deviation matters for the model you are fitting.
Summary
- Two stationarity notions: strict (full distribution shift-invariant) and weak (mean, variance, autocovariance shift-invariant). Theory uses weak.
- ACF and PACF identify ARMA model order via Box-Jenkins. AR cuts off in PACF; MA cuts off in ACF.
- AR($p$) is stationary iff every root of $\phi(z)$ lies strictly outside the unit disk. Roots on the unit circle give unit roots, which require differencing.
- Wold decomposition: every covariance-stationary process is an MA($\infty$) plus a deterministic remainder. ARMA is the natural finite-parameter approximation.
- The spectral density $f(\omega)$ is the Fourier transform of $\gamma$. AR roots near the unit circle produce peaks; MA roots near the unit circle produce dips.
Exercises
Problem
Show that the autocovariance function of the MA(1) process $X_t = \varepsilon_t + \theta \varepsilon_{t-1}$ is $\gamma(0) = (1 + \theta^2)\sigma^2$, $\gamma(\pm 1) = \theta \sigma^2$, and $\gamma(h) = 0$ for $|h| \ge 2$. What is $\rho(1)$ in terms of $\theta$? For which $\theta$ is the process invertible?
Problem
Two MA(1) processes with parameters $\theta$ and $1/\theta$ (and noise variances $\sigma^2$ and $\theta^2 \sigma^2$ respectively) have the same autocovariance function. Verify this and explain why invertibility is an identifiability constraint, not a stationarity constraint.
Problem
Show that the Wold innovations $\varepsilon_t$ in the decomposition are uncorrelated but need not be independent. Construct a covariance-stationary process whose Wold innovations are dependent.
References
Canonical:
- Box, G. E. P., Jenkins, G. M., Reinsel, G. C., Ljung, G. M. Time Series Analysis: Forecasting and Control, 5th ed., Wiley, 2015 (originally 1970), Chapters 3-5.
- Hamilton, J. D. Time Series Analysis, Princeton University Press, 1994, Chapters 3, 4, 6, 17.
- Brockwell, P. J., Davis, R. A. Time Series: Theory and Methods, 2nd ed., Springer, 1991, Chapters 3, 4, 5.
- Wold, H. A Study in the Analysis of Stationary Time Series, Almqvist & Wiksell, Stockholm, 1938 (foundational; Wold decomposition).
- Dickey, D. A., Fuller, W. A. "Distribution of the Estimators for Autoregressive Time Series with a Unit Root." Journal of the American Statistical Association, 74(366), 1979.
Current:
- Shumway, R. H., Stoffer, D. S. Time Series Analysis and Its Applications: With R Examples, 4th ed., Springer, 2017, Chapters 1-4.
- Hyndman, R. J., Athanasopoulos, G. Forecasting: Principles and Practice, 3rd ed., OTexts, 2021, Chapters 8-9.
- Tsay, R. S. Analysis of Financial Time Series, 3rd ed., Wiley, 2010, Chapters 2-3.
- Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., Shin, Y. "Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root." Journal of Econometrics, 54(1-3), 1992.
Next Topics
- State space models: every ARIMA has a state space representation; Kalman filtering subsumes ARMA likelihood evaluation.
- Time series forecasting basics: practical model selection, exponential smoothing, modern foundation models.
- Deep learning for time series: how LSTMs, TCNs, and PatchTST relate to (and depart from) the ARMA framework.
- Stochastic processes for ML: continuous-time analogues, sample-path regularity, martingale methods.
Last reviewed: May 6, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Expectation, Variance, Covariance, and Moments (layer 0A · tier 1)
- Kolmogorov Probability Axioms (layer 0A · tier 1)
- Stochastic Processes for ML (layer 2 · tier 2)
Derived topics
- State Space Models (layer 2 · tier 2)
- Deep Learning for Time Series (layer 3 · tier 2)