Applied Math
Time Series Foundations
Rigorous treatment of stationarity, the Wold decomposition, autocorrelation, unit roots, AR/MA/ARMA/ARIMA models, and spectral representation. The classical theory that every modern sequence model rests on.
Why This Matters
Time series carry temporal structure that i.i.d. methods cannot exploit and cannot ignore. A sample of stock returns, server latencies, ECG voltages, or climate temperatures is a single realization of a stochastic process indexed by time, not a random sample from a distribution. The dependence between $X_t$ and its past sets the statistical properties of every estimator built on top.
The classical theory developed by Yule, Wold, Box, Jenkins, and Hamilton answers four questions. When does a process have stable statistics over time (stationarity)? When can it be represented as an infinite linear combination of past shocks (Wold)? Which finite-parameter family approximates the linear structure (ARMA)? And what does the second-order structure look like in frequency space (spectral density)? Everything that follows — from Kalman filtering to PatchTST — assumes or relaxes one of these answers.
Stationarity
Two notions of stationarity appear in the literature. Most theory uses the weaker (covariance) form because it is what estimators actually need.
Strict Stationarity
A process $\{X_t\}$ is strictly stationary if and only if for every $n$, every set of time indices $t_1, \dots, t_n$, and every shift $h$, $(X_{t_1}, \dots, X_{t_n}) \overset{d}{=} (X_{t_1+h}, \dots, X_{t_n+h})$.
The full joint distribution is shift-invariant.
Weak (Covariance) Stationarity
A process $\{X_t\}$ with $E[X_t^2] < \infty$ is weakly stationary (or covariance stationary) if:
- $E[X_t] = \mu$ for all $t$,
- $\mathrm{Var}(X_t) = \gamma(0) < \infty$ for all $t$,
- $\mathrm{Cov}(X_t, X_{t-h}) = \gamma(h)$ depends only on the lag $h$, not on $t$.
The function $\gamma(\cdot)$ is the autocovariance function and is symmetric: $\gamma(-h) = \gamma(h)$.
Strict stationarity does not imply weak stationarity (a strictly stationary Cauchy process has no second moment). Weak stationarity does not imply strict stationarity. Under joint Gaussianity the two coincide because the joint distribution of a Gaussian vector is determined by its first two moments.
Stationarity is not the same as ergodicity
A stationary process can have time averages that fail to converge to ensemble averages. Ergodicity (for the mean) is the additional condition that $\frac{1}{T}\sum_{t=1}^{T} X_t \to \mu$ almost surely. For a stationary Gaussian process, $\gamma(h) \to 0$ as $h \to \infty$ suffices for ergodicity. Estimating $\mu$ and $\gamma(h)$ from one realization requires ergodicity, not just stationarity.
Autocorrelation
Autocorrelation Function
The autocorrelation function (ACF) of a covariance-stationary process is $\rho(h) = \gamma(h)/\gamma(0)$, the autocovariance normalized by the variance. $\rho(0) = 1$, $|\rho(h)| \le 1$, and $\rho(-h) = \rho(h)$.
Partial Autocorrelation Function
The partial autocorrelation at lag $k$ is the correlation between $X_t$ and $X_{t-k}$ after removing the linear effect of the intermediate lags $X_{t-1}, \dots, X_{t-k+1}$. Concretely, regress $X_t$ on $X_{t-1}, \dots, X_{t-k}$; the coefficient on $X_{t-k}$ is the PACF value $\alpha(k)$.
The PACF is computed from the Yule-Walker equations. For an AR($p$) process, the autocovariance satisfies the linear system $\gamma(h) = \sum_{j=1}^{p} \phi_j \gamma(h-j)$ for $h = 1, \dots, p$, which in matrix form is $\Gamma_p \phi = \gamma_p$, where $\Gamma_p = [\gamma(i-j)]_{i,j=1}^{p}$ is the Toeplitz autocovariance matrix. The PACF at lag $k$ is the last coefficient obtained by solving this system at order $k$.
ACF and PACF identify model order. For AR($p$), the PACF cuts off at lag $p$ and the ACF decays geometrically. For MA($q$), the ACF cuts off at lag $q$ and the PACF decays. The Box-Jenkins identification procedure reads $p$ and $q$ off the sample ACF and PACF plots.
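A minimal numerical sketch of these patterns, using only NumPy; the AR(2) parameters, sample size, and function names here are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) inside the stationarity triangle: X_t = 0.6 X_{t-1} - 0.3 X_{t-2} + eps_t
T, phi1, phi2 = 5000, 0.6, -0.3
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(2, T):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + eps[t]

def sample_acovf(x, max_lag):
    """Biased sample autocovariances gamma_hat(0..max_lag) (divide by T, not T-h)."""
    x = x - x.mean()
    n = len(x)
    return np.array([x[: n - h] @ x[h:] / n for h in range(max_lag + 1)])

def pacf_yule_walker(x, max_lag):
    """PACF at lag k = last coefficient of the order-k Yule-Walker solution."""
    g = sample_acovf(x, max_lag)
    alphas = []
    for k in range(1, max_lag + 1):
        Gamma = np.array([[g[abs(i - j)] for j in range(k)] for i in range(k)])  # Toeplitz matrix
        phi_k = np.linalg.solve(Gamma, g[1 : k + 1])
        alphas.append(phi_k[-1])
    return np.array(alphas)

g = sample_acovf(x, 10)
print("sample ACF :", np.round(g / g[0], 3))                  # geometric-looking decay
print("sample PACF:", np.round(pacf_yule_walker(x, 10), 3))   # roughly cuts off after lag 2
```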
AR, MA, ARMA, ARIMA
The lag operator $L$ acts on a sequence by $L X_t = X_{t-1}$. Polynomials in $L$ are the natural notation for linear time-series models.
AR(p), MA(q), ARMA(p,q)
Let $\varepsilon_t$ be white noise, $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$, meaning uncorrelated with mean zero and constant variance.
- AR($p$): $\phi(L) X_t = \varepsilon_t$, where $\phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$.
- MA($q$): $X_t = \theta(L) \varepsilon_t$, where $\theta(L) = 1 + \theta_1 L + \cdots + \theta_q L^q$.
- ARMA($p, q$): $\phi(L) X_t = \theta(L) \varepsilon_t$ (a minimal simulation sketch of these recursions follows after this list).
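A small simulation of the defining recursions, assuming Gaussian white noise and zero pre-sample values with the start-up transient discarded; all parameter values are illustrative:

```python
import numpy as np

def simulate_arma(phi, theta, sigma, n, burn=500, seed=0):
    """Simulate ARMA(p, q): X_t = sum_i phi_i X_{t-i} + eps_t + sum_j theta_j eps_{t-j}."""
    rng = np.random.default_rng(seed)
    p, q = len(phi), len(theta)
    total = n + burn
    eps = sigma * rng.standard_normal(total)
    x = np.zeros(total)
    for t in range(total):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x[t] = ar + eps[t] + ma
    return x[burn:]  # drop the burn-in so initial zeros do not contaminate the sample

ar1  = simulate_arma(phi=[0.8], theta=[], sigma=1.0, n=2000)           # AR(1)
ma1  = simulate_arma(phi=[], theta=[0.5], sigma=1.0, n=2000)           # MA(1)
arma = simulate_arma(phi=[0.6, -0.3], theta=[0.4], sigma=1.0, n=2000)  # ARMA(2,1)
print(ar1.var(), ma1.var(), arma.var())
```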
Stationarity Condition for AR(p)
Statement
The AR($p$) recursion $\phi(L) X_t = \varepsilon_t$ admits a unique covariance-stationary solution iff every root $z$ of $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ satisfies $|z| > 1$. The solution is causal: $X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$ with coefficients $\psi_j$ that decay geometrically.
Intuition
$\phi(z)^{-1}$ exists as a convergent power series in $z$ exactly when $\phi(z) \neq 0$ on a neighborhood of the closed unit disk. The coefficients $\psi_j$ come from the partial-fraction expansion of $\phi(z)^{-1}$; their geometric decay rate is the reciprocal of the modulus of the smallest root of $\phi$.
Proof Sketch
Factor $\phi(z) = \prod_{i=1}^{p} (1 - \lambda_i z)$ with $|\lambda_i| < 1$ and expand each factor as a geometric series $(1 - \lambda_i z)^{-1} = \sum_{k \ge 0} \lambda_i^k z^k$, valid for $|z| \le 1$. The product of these series gives $\phi(z)^{-1} = \sum_{j \ge 0} \psi_j z^j$ with $|\psi_j| \le C r^j$, where $r = \max_i |\lambda_i| < 1$. Substituting $L$ for $z$ and applying $\phi(L)^{-1}$ to $\varepsilon_t$ gives a series that converges in $L^2$. Uniqueness follows because any other stationary solution has the same Wold representation.
Why It Matters
This is the test you actually run before fitting an AR model. For AR(1), the condition reduces to $|\phi_1| < 1$. For AR(2) with parameters $(\phi_1, \phi_2)$, the stationarity region is the triangle $\phi_1 + \phi_2 < 1$, $\phi_2 - \phi_1 < 1$, $|\phi_2| < 1$. Outside this region, the recursion explodes or has a unit root.
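A root check along these lines takes a few lines of NumPy; the coefficient values below are illustrative:

```python
import numpy as np

def ar_is_stationary(phi):
    """Check that every root of phi(z) = 1 - phi_1 z - ... - phi_p z^p lies outside the unit circle."""
    # np.roots expects coefficients from the highest power down: [-phi_p, ..., -phi_1, 1]
    coeffs = np.r_[-np.asarray(phi, dtype=float)[::-1], 1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0)), roots

print(ar_is_stationary([0.5]))         # AR(1), |phi| < 1            -> stationary
print(ar_is_stationary([0.6, -0.3]))   # AR(2) inside the triangle   -> stationary
print(ar_is_stationary([1.2, -0.2]))   # phi(1) = 0, root at z = 1   -> unit root, not stationary
```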
Failure Mode
If $\phi(z)$ has a root on the unit circle ($|z| = 1$), the process has a unit root: shocks accumulate forever and $\mathrm{Var}(X_t)$ grows linearly in $t$. OLS fits to a unit-root series superconverge at rate $T$ instead of $\sqrt{T}$, and t-statistics follow the Dickey-Fuller distribution rather than Student's t. Naive inference is invalid.
MA($q$) is always covariance stationary (a finite linear combination of finite-variance white noise has finite variance). The dual condition for MA($q$) is invertibility: if every root of $\theta(z)$ lies outside the unit disk, you can write $\varepsilon_t$ as a convergent AR($\infty$) in past values of $X_t$.
Differencing
The first-difference operator is $\Delta = 1 - L$, so $\Delta X_t = X_t - X_{t-1}$. The $d$-fold difference is $\Delta^d = (1 - L)^d$.
ARIMA(p, d, q)
A process $X_t$ is ARIMA($p, d, q$) if and only if $\Delta^d X_t$ is ARMA($p, q$). The differencing order $d$ removes unit roots; the ARMA part models the stationary residual.
For seasonal data with period $s$, SARIMA($p, d, q$)($P, D, Q$)$_s$ adds seasonal AR/MA polynomials in $L^s$ and a seasonal difference $\Delta_s = 1 - L^s$.
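In practice these models are fitted with an existing library. A minimal sketch, assuming a recent statsmodels whose `ARIMA` class takes `order` (and, for SARIMA, `seasonal_order`) arguments; the simulated series and the chosen orders are illustrative:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)

# Build a series with one unit root: cumulative sum of a stationary ARMA(1,1).
T = 500
eps = rng.standard_normal(T)
z = np.zeros(T)
for t in range(1, T):
    z[t] = 0.5 * z[t - 1] + eps[t] + 0.3 * eps[t - 1]
y = np.cumsum(z)                      # integrating once makes the true model ARIMA(1, 1, 1)

res = ARIMA(y, order=(1, 1, 1)).fit() # p=1, d=1, q=1; seasonal data would add seasonal_order=(P, D, Q, s)
print(res.params)                     # AR, MA, and innovation-variance estimates
```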
Unit Roots and the ADF Test
A random walk $X_t = X_{t-1} + \varepsilon_t$ is the prototype non-stationary series. Its variance grows linearly: $\mathrm{Var}(X_t) = t \sigma^2$ (starting from $X_0 = 0$). Unit-root testing decides whether to model a series in levels or in differences.
The Augmented Dickey-Fuller (ADF) test fits the regression $\Delta X_t = \alpha + \beta t + \pi X_{t-1} + \sum_{i=1}^{k} \delta_i \Delta X_{t-i} + \varepsilon_t$ and tests $H_0: \pi = 0$ (unit root) against $H_1: \pi < 0$ (stationary). Under $H_0$, the t-statistic on $\hat{\pi}$ does not follow a standard normal; it follows the Dickey-Fuller distribution, computed numerically and tabulated in Hamilton (1994), Chapter 17.
The KPSS test reverses the null: $H_0$ is stationarity (around a level or trend), $H_1$ is a unit root. Combining ADF and KPSS provides robustness because each test has different size distortions under the other's null.
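A sketch of running both tests, assuming a recent statsmodels whose `adfuller` and `kpss` functions have the signatures used here; the simulated series are illustrative:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(2)
random_walk = np.cumsum(rng.standard_normal(1000))   # unit-root series
white_noise = rng.standard_normal(1000)              # stationary series

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    adf_stat, adf_p = adfuller(series, autolag="AIC")[:2]              # H0: unit root
    kpss_stat, kpss_p = kpss(series, regression="c", nlags="auto")[:2] # H0: level-stationary
    print(f"{name:12s}  ADF p={adf_p:.3f} (reject -> stationary)   "
          f"KPSS p={kpss_p:.3f} (reject -> unit root)")
```

Reading the two tests together: rejecting ADF and failing to reject KPSS points to stationarity; the reverse pattern points to a unit root; any other combination is a sign the series fits neither description cleanly.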
Differencing vs detrending
A unit-root series should be differenced. A trend-stationary series $X_t = a + bt + u_t$ with $u_t$ stationary should be detrended (subtract the fitted trend). Differencing a trend-stationary series leaves an error whose MA polynomial has a unit root (for white-noise $u_t$, the differenced error is MA(1) with $\theta = -1$), breaking invertibility. Detrending a true random walk leaves residuals that are still non-stationary. Test for unit roots first, then transform.
The Wold Decomposition
Every covariance-stationary process splits into a perfectly predictable deterministic part and a linear function of past shocks. This is the foundational result that justifies ARMA modeling.
Wold Decomposition (Wold 1938)
Statement
Let $\{X_t\}$ be covariance stationary with $E[X_t] = 0$. Then $X_t$ admits the unique decomposition $X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j} + \eta_t$, where:
- $\psi_0 = 1$, $\sum_{j=0}^{\infty} \psi_j^2 < \infty$,
- $\varepsilon_t$ is white noise: $E[\varepsilon_t] = 0$, $E[\varepsilon_t^2] = \sigma^2$, $E[\varepsilon_t \varepsilon_s] = 0$ for $t \neq s$,
- $\eta_t$ is deterministic: $\eta_t$ lies in $\bigcap_s \overline{\mathrm{sp}}\{X_u : u \le s\}$ in $L^2$, i.e. $\eta_t$ is a perfect linear function of the infinite past,
- $E[\varepsilon_t \eta_s] = 0$ for all $t$ and $s$.
With the normalization $\psi_0 = 1$, the decomposition is unique; the innovation variance $\sigma^2$ is pinned down as the one-step-ahead prediction error variance.
Intuition
Project $X_t$ onto the closed linear span of its own infinite past. Whatever is projected away — the residual $\varepsilon_t$ — is a fresh shock orthogonal to the past, by construction. Iterating the projection backward produces the MA($\infty$) representation. What remains in the projection itself is the deterministic part: a series whose value at time $t$ is exactly determined by past values.
Proof Sketch
Work in the Hilbert space $L^2$ of finite-variance random variables with inner product $\langle X, Y \rangle = E[XY]$. Define $\mathcal{H}_t = \overline{\mathrm{sp}}\{X_s : s \le t\}$, the closed linear span of the past up to time $t$.
Step 1 (innovations). The innovation $\varepsilon_t = X_t - P_{\mathcal{H}_{t-1}} X_t$ is the projection residual when predicting $X_t$ from its strict past. By construction $\varepsilon_t \perp \mathcal{H}_{t-1}$, so $E[\varepsilon_t \varepsilon_s] = 0$ for $s < t$, and stationarity makes $E[\varepsilon_t^2] = \sigma^2$ constant. So $\{\varepsilon_t\}$ is white noise.
Step 2 (MA expansion). Set $\psi_j = E[X_t \varepsilon_{t-j}] / \sigma^2$. Bessel's inequality gives $\sum_j \psi_j^2 \sigma^2 \le E[X_t^2] < \infty$. Define $\hat{X}_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$, which converges in $L^2$. By construction $\hat{X}_t$ is the projection of $X_t$ onto $\overline{\mathrm{sp}}\{\varepsilon_s : s \le t\}$.
Step 3 (deterministic remainder). Let $\eta_t = X_t - \hat{X}_t$. Then $\eta_t \perp \varepsilon_s$ for all $s \le t$. Since $\eta_t \in \mathcal{H}_t$ and $\{\varepsilon_s : s \le t\}$ spans the innovation directions inside $\mathcal{H}_t$, $\eta_t$ lies in the orthogonal complement, which is the remote past $\mathcal{H}_{-\infty} = \bigcap_s \mathcal{H}_s$. Elements of this tail subspace are perfectly predictable from any earlier subspace; in particular $\eta_t = P_{\mathcal{H}_{t-1}} \eta_t$, so $\eta_t$ is deterministic.
Step 4 (uniqueness). If $X_t = \sum_j \tilde{\psi}_j \tilde{\varepsilon}_{t-j} + \tilde{\eta}_t$ is another such decomposition, the orthogonality conditions force $\tilde{\varepsilon}_t$ to equal the same projection residual $X_t - P_{\mathcal{H}_{t-1}} X_t = \varepsilon_t$, hence $\tilde{\psi}_j = \psi_j$ and $\tilde{\eta}_t = \eta_t$.
Why It Matters
Wold says ARMA is not an arbitrary parametric family but the natural finite-parameter approximation to the universal MA($\infty$) representation. Any purely nondeterministic covariance-stationary process can be approximated arbitrarily well in $L^2$ by ARMA models. Combined with differencing for non-stationary inputs, this is the theoretical content of Box-Jenkins methodology.
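One way to see the MA($\infty$) content of a fitted ARMA model numerically is to expand its $\psi$-weights via the recursion $\psi_j = \theta_j + \sum_{i=1}^{\min(j,p)} \phi_i \psi_{j-i}$ with $\psi_0 = 1$. A small sketch in NumPy; the parameter values and function name are illustrative:

```python
import numpy as np

def arma_psi_weights(phi, theta, n_weights=20):
    """MA(infinity) coefficients psi_j of a causal ARMA(p, q): phi(L) X_t = theta(L) eps_t.
    Recursion: psi_0 = 1, psi_j = theta_j + sum_{i=1}^{min(j,p)} phi_i * psi_{j-i}."""
    p, q = len(phi), len(theta)
    psi = np.zeros(n_weights)
    psi[0] = 1.0
    for j in range(1, n_weights):
        th = theta[j - 1] if j <= q else 0.0
        ar = sum(phi[i - 1] * psi[j - i] for i in range(1, min(j, p) + 1))
        psi[j] = th + ar
    return psi

# ARMA(2,1) with phi = (0.6, -0.3), theta = (0.4): psi_j decays geometrically, so a short
# MA truncation already captures almost all of the linear (second-order) structure.
psi = arma_psi_weights(phi=[0.6, -0.3], theta=[0.4], n_weights=15)
print(np.round(psi, 4))
print("tail share of sum psi_j^2:", np.sum(psi[10:] ** 2) / np.sum(psi ** 2))
```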
Failure Mode
Wold is purely linear: it captures the second-order structure and nothing else. Nonlinear dependencies — volatility clustering (GARCH), regime switching (Hamilton 1989), threshold dynamics (TAR) — are invisible to the Wold representation. A series can be Wold-decomposable into white noise yet have strong predictive structure that ARMA cannot capture. The white-noise innovations are uncorrelated, not independent.
Spectral Representation
The autocovariance function and the spectral density are Fourier-pair descriptions of the same second-order structure. Spectral analysis is what you reach for when periodic or quasi-periodic structure dominates.
Spectral Density
For a covariance-stationary process with absolutely summable autocovariances ($\sum_h |\gamma(h)| < \infty$), the spectral density is the Fourier transform of $\gamma$:
$$f(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma(h)\, e^{-i\omega h}, \qquad \omega \in [-\pi, \pi].$$
The inverse relation is $\gamma(h) = \int_{-\pi}^{\pi} e^{i\omega h} f(\omega)\, d\omega$.
$f(\omega) \ge 0$ because it is the limit of expected periodograms; this is Bochner's theorem applied to $\gamma$. The variance decomposition $\gamma(0) = \int_{-\pi}^{\pi} f(\omega)\, d\omega$ shows that $f$ allocates total variance across frequencies.
Spectral Representation Theorem
Statement
Every mean-zero covariance-stationary process admits the representation $X_t = \int_{-\pi}^{\pi} e^{i\omega t}\, dZ(\omega)$, where $Z(\omega)$ is a complex-valued process with orthogonal increments: $E[dZ(\omega)\,\overline{dZ(\lambda)}] = 0$ for $\omega \neq \lambda$, and $E|dZ(\omega)|^2 = dF(\omega)$ for a non-decreasing spectral distribution $F$. When $F$ is absolutely continuous, $dF(\omega) = f(\omega)\, d\omega$.
Intuition
A stationary process is a continuous superposition of complex exponentials with random uncorrelated amplitudes. The amount of "energy" at frequency $\omega$ is $f(\omega)\, d\omega$. AR(2) processes with complex roots show pronounced peaks in $f$ at the resonance frequency; white noise has flat $f(\omega) = \sigma^2 / 2\pi$.
Proof Sketch
The key tool is Bochner's theorem: a function $\gamma$ is the autocovariance of some stationary process iff it is positive semidefinite, which by Bochner means $\gamma(h) = \int_{-\pi}^{\pi} e^{i\omega h}\, dF(\omega)$ for some non-decreasing $F$. Then construct $Z$ via the isometry between $\overline{\mathrm{sp}}\{X_t\}$ and $L^2(dF)$ that maps $X_t \mapsto e^{i\omega t}$, setting $Z(\omega)$ to the preimage of the indicator $\mathbf{1}_{[-\pi, \omega]}$; verify orthogonal increments using stationarity, and check that the inverse Fourier transform recovers $\gamma$.
Why It Matters
The spectral view gives closed-form expressions for ARMA spectra. For ARMA($p, q$), $f(\omega) = \frac{\sigma^2}{2\pi} \frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}$. AR roots near the unit circle produce sharp peaks; MA roots near the unit circle produce sharp dips. This is the basis of frequency-domain estimation, the Whittle likelihood, and bandpass filtering.
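The closed form can be evaluated directly. A sketch in NumPy for an AR(2) with complex roots, which exhibits the resonance peak mentioned above; the coefficients and function name are illustrative:

```python
import numpy as np

def arma_spectral_density(omega, phi, theta, sigma2=1.0):
    """f(omega) = sigma^2 / (2*pi) * |theta(e^{-i omega})|^2 / |phi(e^{-i omega})|^2."""
    z = np.exp(-1j * np.asarray(omega))
    phi_z = 1.0 - sum(p * z ** (k + 1) for k, p in enumerate(phi))
    theta_z = 1.0 + sum(t * z ** (k + 1) for k, t in enumerate(theta))
    return sigma2 / (2 * np.pi) * np.abs(theta_z) ** 2 / np.abs(phi_z) ** 2

omega = np.linspace(0, np.pi, 1000)
# AR(2) with complex roots: X_t = 1.4 X_{t-1} - 0.9 X_{t-2} + eps_t (roots have modulus > 1)
f = arma_spectral_density(omega, phi=[1.4, -0.9], theta=[])
print(f"spectral peak near omega = {omega[np.argmax(f)]:.3f} rad")  # pronounced resonance peak
```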
Failure Mode
The spectral density requires summability of $\gamma$. Long-memory processes ($\gamma(h) \sim c\, h^{2d-1}$ for $0 < d < 1/2$) have $f(\omega) \to \infty$ as $\omega \to 0$ at rate $\omega^{-2d}$. Standard ARMA spectral estimators are misspecified; fractional differencing (ARFIMA) is the right tool.
Worked Example: AR(1) Spectrum
For $X_t = \phi X_{t-1} + \varepsilon_t$ with $|\phi| < 1$, the autocovariance is $\gamma(h) = \sigma^2 \phi^{|h|} / (1 - \phi^2)$. The spectral density is
$$f(\omega) = \frac{\sigma^2}{2\pi} \cdot \frac{1}{1 - 2\phi \cos\omega + \phi^2}.$$
For $\phi > 0$, $f$ peaks at $\omega = 0$ (low-frequency dominance, slow drifts). For $\phi < 0$, $f$ peaks at $\omega = \pi$ (high-frequency dominance, oscillation). The integral $\int_{-\pi}^{\pi} f(\omega)\, d\omega = \gamma(0)$ confirms variance conservation.
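A numerical check of this worked example (the values of $\phi$ and $\sigma^2$ are illustrative):

```python
import numpy as np

phi, sigma2 = 0.7, 1.0

def f_ar1(omega):
    # AR(1) spectral density: sigma^2 / (2*pi) / (1 - 2*phi*cos(omega) + phi^2)
    return sigma2 / (2 * np.pi) / (1.0 - 2.0 * phi * np.cos(omega) + phi ** 2)

gamma0 = sigma2 / (1.0 - phi ** 2)            # closed-form variance gamma(0)

# Trapezoid-rule integral of f over [-pi, pi]; should match gamma(0) (variance conservation)
omega = np.linspace(-np.pi, np.pi, 200001)
vals = f_ar1(omega)
integral = np.sum((vals[:-1] + vals[1:]) / 2) * (omega[1] - omega[0])

print(f"gamma(0)      = {gamma0:.6f}")
print(f"integral of f = {integral:.6f}")
print(f"f(0) / f(pi)  = {f_ar1(0.0) / f_ar1(np.pi):.2f}")  # phi > 0: low-frequency peak
```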
Common Confusions
White noise is not Gaussian
White noise is uncorrelated with zero mean and constant variance. It need not be Gaussian, independent, or even strictly stationary. A sequence of independent draws from a Cauchy distribution is uncorrelated only formally (no second moment), but bounded non-Gaussian white noise such as a centered Bernoulli (Rademacher ±1) sequence is fine. The AR/MA theory works for any white-noise innovation; Gaussianity is required only for finite-sample distributions of estimators.
ACF on a single realization is a noisy estimator
The sample ACF has standard error roughly $1/\sqrt{T}$ for white noise, but it is biased toward zero in finite samples and the bias grows with the lag. The Bartlett confidence bands $\pm 1.96/\sqrt{T}$ assume white noise. Plotting raw ACF/PACF without these bands invites overfitting.
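A small NumPy sketch of the white-noise band; the sample size and number of lags are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 400
x = rng.standard_normal(T)            # genuine white noise
x = x - x.mean()

# Sample ACF at lags 1..20 (biased estimator, divide by T)
acf = np.array([x[: T - h] @ x[h:] / (T * x.var()) for h in range(1, 21)])
band = 1.96 / np.sqrt(T)              # Bartlett band under the white-noise null

outside = int(np.sum(np.abs(acf) > band))
print(f"band = ±{band:.3f}; lags outside the band: {outside} of 20")
# About 1 lag in 20 lands outside the band by chance even for true white noise,
# which is why isolated spikes in a raw ACF plot should not be over-interpreted.
```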
Stationarity is not a property you can confirm, only fail to reject
Tests for stationarity are tests of model assumptions, not of truth. ADF has low power to reject a unit root in small samples. KPSS fails to reject stationarity on series with subtle structural breaks. Real series are never strictly stationary; the question is whether the deviation matters for the model you are fitting.
Summary
- Two stationarity notions: strict (full distribution shift-invariant) and weak (mean, variance, autocovariance shift-invariant). Theory uses weak.
- ACF and PACF identify ARMA model order via Box-Jenkins. AR cuts off in PACF; MA cuts off in ACF.
- AR($p$) is stationary iff every root of $\phi(z)$ lies strictly outside the unit disk. Roots on the unit circle give unit roots, which require differencing.
- Wold decomposition: every covariance-stationary process is an MA($\infty$) plus a deterministic remainder. ARMA is the natural finite-parameter approximation.
- The spectral density $f(\omega)$ is the Fourier transform of $\gamma$. AR roots near the unit circle produce peaks; MA roots near the unit circle produce dips.
Exercises
Problem
Show that the autocovariance function of the MA(1) process $X_t = \varepsilon_t + \theta \varepsilon_{t-1}$ is $\gamma(0) = (1 + \theta^2)\sigma^2$, $\gamma(\pm 1) = \theta \sigma^2$, and $\gamma(h) = 0$ for $|h| \ge 2$. What is $\rho(1)$ in terms of $\theta$? For which $\theta$ is the process invertible?
Problem
Two MA(1) processes with parameters $\theta$ and $1/\theta$ (and noise variances $\sigma^2$ and $\theta^2 \sigma^2$ respectively) have the same autocovariance function. Verify this and explain why invertibility is an identifiability constraint, not a stationarity constraint.
Problem
Show that the Wold innovations $\varepsilon_t$ in the decomposition are uncorrelated but need not be independent. Construct a covariance-stationary process whose Wold innovations are dependent.
References
Canonical:
- Box, G. E. P., Jenkins, G. M., Reinsel, G. C., Ljung, G. M. Time Series Analysis: Forecasting and Control, 5th ed., Wiley, 2015 (originally 1970), Chapters 3-5.
- Hamilton, J. D. Time Series Analysis, Princeton University Press, 1994, Chapters 3, 4, 6, 17.
- Brockwell, P. J., Davis, R. A. Time Series: Theory and Methods, 2nd ed., Springer, 1991, Chapters 3, 4, 5.
- Wold, H. A Study in the Analysis of Stationary Time Series, Almqvist & Wiksell, Stockholm, 1938 (foundational; Wold decomposition).
- Dickey, D. A., Fuller, W. A. "Distribution of the Estimators for Autoregressive Time Series with a Unit Root." Journal of the American Statistical Association, 74(366), 1979.
Current:
- Shumway, R. H., Stoffer, D. S. Time Series Analysis and Its Applications: With R Examples, 4th ed., Springer, 2017, Chapters 1-4.
- Hyndman, R. J., Athanasopoulos, G. Forecasting: Principles and Practice, 3rd ed., OTexts, 2021, Chapters 8-9.
- Tsay, R. S. Analysis of Financial Time Series, 3rd ed., Wiley, 2010, Chapters 2-3.
- Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., Shin, Y. "Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root." Journal of Econometrics, 54(1-3), 1992.
Next Topics
- State space models: every ARIMA has a state space representation; Kalman filtering subsumes ARMA likelihood evaluation.
- Time series forecasting basics: practical model selection, exponential smoothing, modern foundation models.
- Deep learning for time series: how LSTMs, TCNs, and PatchTST relate to (and depart from) the ARMA framework.
- Stochastic processes for ML: continuous-time analogues, sample-path regularity, martingale methods.
Last reviewed: May 6, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Expectation, Variance, Covariance, and Moments (layer 0A · tier 1)
- Kolmogorov Probability Axioms (layer 0A · tier 1)
- Stochastic Processes for ML (layer 2 · tier 2)
Derived topics
- State Space Models (layer 2 · tier 2)
- Deep Learning for Time Series (layer 3 · tier 2)