

Time Series Forecasting Basics

Stationarity, autocorrelation, AR, MA, ARIMA, exponential smoothing, and why classical methods still beat deep learning on many forecasting benchmarks.



Why This Matters

Time series data is everywhere: stock prices, server metrics, weather, sensor readings, demand forecasting. The temporal structure (ordering, autocorrelation, trends, seasonality) makes time series different in kind from i.i.d. tabular data. Methods that ignore this structure fail. Methods that exploit it, even simple classical ones, often outperform complex deep learning approaches on standard benchmarks.

Fundamental Concepts

Definition

Stationarity

A time series $\{X_t\}$ is (weakly) stationary if:

  1. $\mathbb{E}[X_t] = \mu$ for all $t$ (constant mean)
  2. $\text{Var}(X_t) = \sigma^2$ for all $t$ (constant variance)
  3. $\text{Cov}(X_t, X_{t+h}) = \gamma(h)$ depends only on the lag $h$, not on $t$

Stationarity means the statistical properties do not change over time. Most forecasting methods assume stationarity or require transforming the data to achieve it.

Definition

Autocorrelation Function

The autocorrelation function (ACF) at lag $h$ is:

$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \frac{\text{Cov}(X_t, X_{t+h})}{\text{Var}(X_t)}$$

The ACF measures the linear dependence between a time series and its lagged values. The partial autocorrelation function (PACF) at lag $h$ measures the correlation between $X_t$ and $X_{t+h}$ after removing the effect of the intermediate lags $X_{t+1}, \ldots, X_{t+h-1}$.
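To make the definition concrete, here is a minimal numpy sketch (the function name `sample_acf` is ours, not a library API) that estimates $\rho(h)$ from data; for white noise, every lag beyond 0 should sit near zero:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho(h) = gamma_hat(h) / gamma_hat(0)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    gamma0 = (x @ x) / n
    return np.array([(x[: n - h] @ x[h:]) / n / gamma0 for h in range(max_lag + 1)])

rng = np.random.default_rng(0)
white = rng.normal(size=5000)   # i.i.d. noise: no temporal structure
acf = sample_acf(white, 5)
# rho(0) = 1 by construction; all higher lags should be close to 0
```

In practice, `acf` and `pacf` from `statsmodels.tsa.stattools` compute the same quantities along with confidence bands.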

Autoregressive Models

Definition

AR(p) Model

An autoregressive model of order $p$ is:

$$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \epsilon_t$$

where $c$ is a constant, $\phi_1, \ldots, \phi_p$ are parameters, and $\epsilon_t \sim \text{WN}(0, \sigma^2)$ is white noise. The current value is a linear combination of the $p$ most recent values plus noise.
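A quick simulation (a sketch, not library code) checks the textbook fact that an AR(1) with coefficient $\phi$ has ACF $\rho(h) = \phi^h$:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n = 0.8, 50_000
eps = rng.normal(size=n)

x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):                 # AR(1): X_t = phi * X_{t-1} + eps_t
    x[t] = phi * x[t - 1] + eps[t]

# Theoretical ACF of AR(1): rho(h) = phi**h -> 0.8, 0.64, 0.512 at lags 1-3
rho_hat = {h: np.corrcoef(x[:-h], x[h:])[0, 1] for h in (1, 2, 3)}
```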

Theorem

AR(p) Stationarity Condition

Statement

An AR($p$) process $X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \epsilon_t$ is stationary if and only if all roots of the characteristic polynomial

$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$$

lie outside the unit circle in the complex plane ($|z_i| > 1$ for all roots $z_i$).

Intuition

For AR(1), the condition reduces to $|\phi_1| < 1$. If $|\phi_1| \geq 1$, shocks accumulate rather than decay, and the process drifts or explodes. For higher orders, the characteristic polynomial encodes how past values combine; roots inside the unit circle correspond to exponentially growing components.

Proof Sketch

Write the AR($p$) in lag-operator notation: $\Phi(L) X_t = \epsilon_t$ where $\Phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$. The process has a causal representation (depending only on current and past shocks), $X_t = \Phi(L)^{-1} \epsilon_t = \sum_{j=0}^\infty \psi_j \epsilon_{t-j}$, if and only if $\Phi(z) \neq 0$ for $|z| \leq 1$. The $\psi_j$ coefficients then decay geometrically, ensuring finite variance.

Why It Matters

Before fitting an AR model, you must check stationarity. If the series has a unit root ($z_i = 1$), the process behaves like a random walk and standard AR inference is invalid. Differencing the series (the "I" in ARIMA) addresses this. The Augmented Dickey-Fuller (ADF) test checks for unit roots.
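The unit-root problem can be seen directly by regressing $X_t$ on $X_{t-1}$: for a random walk the estimated coefficient sits essentially at 1. This numpy sketch shows the idea behind the Dickey-Fuller regression; in practice, use `adfuller` from `statsmodels.tsa.stattools`, which supplies the correct critical values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
eps = rng.normal(size=n)

walk = np.cumsum(eps)     # unit root: X_t = X_{t-1} + eps_t
ar = np.empty(n)          # stationary AR(1) with phi = 0.5
ar[0] = eps[0]
for t in range(1, n):
    ar[t] = 0.5 * ar[t - 1] + eps[t]

def ols_phi(x):
    """OLS slope of X_t on X_{t-1} (no intercept) -- the core of the DF regression."""
    return (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])

# The random walk's estimate clusters tightly around 1; the AR(1)'s around 0.5.
```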

Failure Mode

Fitting AR to a non-stationary series gives invalid inference rather than merely noisy estimates. Under a unit root, the OLS coefficient converges to its true value at rate $n$ instead of $\sqrt{n}$ (superconsistency), but confidence intervals and hypothesis tests break: the standard t-statistic does not follow the t-distribution, and you need the Dickey-Fuller distribution instead.

Moving Average Models

Definition

MA(q) Model

A moving average model of order $q$ is:

$$X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}$$

where $\epsilon_t \sim \text{WN}(0, \sigma^2)$. The current value depends on the $q$ most recent error terms. MA models are always stationary: a finite linear combination of white-noise terms automatically has constant mean, variance, and autocovariances.
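A simulation sketch confirms the defining ACF signature of an MA(1): $\rho(1) = \theta/(1+\theta^2)$ and $\rho(h) = 0$ for $h \geq 2$ (the "cut-off" used for order identification later):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n = 0.6, 50_000
eps = rng.normal(size=n + 1)

x = eps[1:] + theta * eps[:-1]          # MA(1): X_t = eps_t + theta * eps_{t-1}

r1 = np.corrcoef(x[:-1], x[1:])[0, 1]   # theory: 0.6 / 1.36 ~= 0.441
r2 = np.corrcoef(x[:-2], x[2:])[0, 1]   # theory: exactly 0
```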

The Wold Decomposition

Theorem

Wold Decomposition Theorem

Statement

Any covariance-stationary process $\{X_t\}$ can be written as:

$$X_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j} + D_t$$

where $\psi_0 = 1$, $\sum_{j=0}^\infty \psi_j^2 < \infty$, $\epsilon_t$ is white noise, and $D_t$ is a deterministic component (perfectly predictable from its own past). The MA($\infty$) part is unique.

Intuition

Every stationary process is, in a precise sense, an infinite moving average plus a deterministic component. This justifies ARMA modeling: the AR and MA terms are two complementary finite-parameter ways of approximating the general MA($\infty$) representation.

Proof Sketch

Project $X_t$ onto the closed linear span of its own past innovations. The projection gives the MA($\infty$) component; its coefficients $\psi_j$ are the Wold coefficients. The residual, being orthogonal to all past innovations, is the deterministic component $D_t$, perfectly predictable from its own past.

Why It Matters

The Wold theorem provides the theoretical foundation for ARMA modeling. It says that ARMA is not just a convenient parametric family but a natural finite-parameter approximation to the true data-generating process.

Failure Mode

The Wold decomposition assumes stationarity. Non-stationary processes (trending, unit root) must be transformed (differenced) first. Also, the decomposition is linear. Nonlinear dependencies (volatility clustering, regime switching) are invisible to Wold and require GARCH or regime-switching models.

ARIMA

ARIMA($p, d, q$) combines autoregression, differencing, and moving average:

  1. Difference $d$ times to achieve stationarity: $\Delta^d X_t = (1-L)^d X_t$
  2. Fit an ARMA($p, q$) to the differenced series

For seasonal data, SARIMA adds seasonal terms: SARIMA($p,d,q$)($P,D,Q$)$_s$ with seasonal period $s$.

Model selection: use the ACF to identify $q$ (the ACF cuts off after lag $q$ for an MA($q$)) and the PACF to identify $p$ (the PACF cuts off after lag $p$ for an AR($p$)). In practice, use AIC or BIC to select among candidate models.
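The AIC step can be sketched for the pure-AR case with ordinary least squares (a toy illustration; for full ARIMA fitting use `statsmodels.tsa.arima.model.ARIMA`, which also handles the MA terms and differencing). Here AIC should favor the true order $p = 2$, though AIC does occasionally overselect:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000
eps = rng.normal(size=n)
x = np.zeros(n)
for t in range(2, n):                 # true model: AR(2)
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + eps[t]

def ar_aic(x, p):
    """Fit AR(p) by OLS; AIC = n * log(sigma2_hat) + 2 * (p + 1)."""
    y = x[p:]
    X = np.column_stack([x[p - i : len(x) - i] for i in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = (resid @ resid) / len(y)
    return len(y) * np.log(sigma2) + 2 * (p + 1)

aics = {p: ar_aic(x, p) for p in (1, 2, 3, 4)}
best = min(aics, key=aics.get)        # expect 2 (AIC can occasionally pick higher)
```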

Exponential Smoothing

Simple exponential smoothing forecasts using a weighted average of all past observations with exponentially decaying weights:

$$\hat{X}_{t+1} = \alpha X_t + (1 - \alpha) \hat{X}_t, \quad 0 < \alpha < 1$$

Holt's method adds a trend component. Holt-Winters adds seasonality. The ETS (Error-Trend-Seasonal) framework encompasses all exponential smoothing variants with automatic model selection.

Exponential smoothing is competitive with ARIMA on many datasets and is computationally trivial. It has a state-space representation that provides prediction intervals.
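Simple exponential smoothing is short enough to write out in full (a minimal sketch with a name of our choosing; `statsmodels` provides `SimpleExpSmoothing` with a fitted $\alpha$ and prediction intervals):

```python
def ses_forecast(xs, alpha):
    """One-step-ahead forecast from simple exponential smoothing.

    Maintains a single level state: level <- alpha * x_t + (1 - alpha) * level,
    initialized at the first observation. The forecast for t+1 is the final level.
    """
    level = xs[0]
    for obs in xs[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

print(ses_forecast([10.0, 12.0, 11.0, 13.0], 0.5))   # → 12.0
print(ses_forecast([10.0, 12.0, 11.0, 13.0], 1.0))   # → 13.0 (alpha = 1: naive last-value forecast)
```

The single state variable is why the method is computationally trivial: each new observation costs one multiply-add, regardless of series length.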

Modern Approaches

Prophet (Taylor & Letham, 2018): decomposable model with trend, seasonality, and holidays. Designed for business forecasting with irregular holidays and missing data. Uses Stan for Bayesian inference.

N-BEATS: deep learning architecture with backward and forward residual links. Interpretable variant decomposes forecasts into trend and seasonality.

Temporal Fusion Transformer: attention-based model handling multiple time series with static covariates, known future inputs, and observed past inputs. State-of-the-art on several multi-horizon benchmarks.

Classical vs. Deep Learning

The Makridakis competitions (M3, M4, M5) and subsequent studies consistently show that simple methods (exponential smoothing, ARIMA, theta method) match or beat complex deep learning methods on univariate forecasting. Deep learning methods excel when: the dataset has many related time series (enabling cross-series learning), rich exogenous variables are available, or the series is long enough to train large models.

The failure of deep learning on short univariate series is not surprising: ARIMA has $O(p+q)$ parameters, while a transformer has millions. With 100 observations, the classical model wins by not overfitting.

Common Confusions

Watch Out

Stationarity does not mean constant

A stationary series fluctuates around a fixed mean with constant variance. It can have substantial variation. Stationarity means the statistics of the fluctuations do not change over time, not that the series itself is flat.

Watch Out

Differencing is not detrending

Differencing removes stochastic trends (unit roots). Detrending removes deterministic trends (fitting and subtracting a trend line). Applying the wrong one gives incorrect results: detrending a unit root process leaves residuals that are still non-stationary; differencing a trend-stationary process introduces an unnecessary MA unit root.
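The overdifferencing signature is easy to demonstrate: differencing a trend-stationary series $y_t = \beta t + \epsilon_t$ yields $\beta + \epsilon_t - \epsilon_{t-1}$, a non-invertible MA(1) with lag-1 autocorrelation $-0.5$, whereas detrending leaves approximately white residuals (a numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
t = np.arange(n, dtype=float)
y = 0.1 * t + rng.normal(size=n)     # trend-stationary: deterministic trend + noise

# Wrong treatment: differencing introduces an MA unit root (rho(1) -> -0.5)
d = np.diff(y)
r1_diff = np.corrcoef(d[:-1], d[1:])[0, 1]

# Right treatment: subtract a fitted line; residuals are ~white (rho(1) -> 0)
resid = y - np.polyval(np.polyfit(t, y, 1), t)
r1_detrend = np.corrcoef(resid[:-1], resid[1:])[0, 1]
```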

Watch Out

Good in-sample fit does not mean good forecasts

Overfitting is particularly dangerous in time series because the effective sample size is smaller than the number of observations (autocorrelation reduces information content). Always evaluate forecasts on a held-out future period, not on the training period.
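A correct evaluation split keeps time order: train on the past, test on the strictly later segment, and never shuffle. A toy sketch with a naive last-value baseline (all names and numbers ours):

```python
import numpy as np

series = np.arange(100.0)            # stand-in for a real univariate series
split = 80
train, test = series[:split], series[split:]   # test period strictly follows training

# Naive baseline: forecast every future point with the last training value.
forecasts = np.full(len(test), train[-1])
mae = np.mean(np.abs(test - forecasts))
# For this linear toy series the naive MAE is 10.5 (errors 1..20 averaged)
```

Any candidate model should beat such a naive baseline on the held-out period before it earns a place in production.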

Key Takeaways

  • Stationarity (constant mean, variance, autocovariance) is the core assumption; test for it before modeling
  • AR($p$) captures dependence on past values; MA($q$) captures dependence on past errors; ARIMA combines both with differencing
  • The Wold theorem justifies ARMA as a universal approximation for stationary processes
  • Exponential smoothing is simple, effective, and has a rigorous state-space formulation
  • Classical methods beat deep learning on many univariate forecasting benchmarks, especially with short series

Exercises

Exercise (Core)

Problem

An AR(1) model has $\phi_1 = 0.8$. What is the autocorrelation at lag 3? Is this process stationary?

Exercise (Advanced)

Problem

You observe a time series that appears non-stationary. You difference it once ($d=1$) and the resulting series passes the ADF test for stationarity. The ACF of the differenced series cuts off after lag 1, and the PACF decays gradually. What ARIMA model should you fit? Justify your choice.

References

Canonical:

  • Box, Jenkins, Reinsel, Time Series Analysis (5th ed.), Chapters 3-5
  • Hamilton, Time Series Analysis (1994), Chapters 3-4

Current:

  • Hyndman & Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), Chapters 8-9
  • Makridakis et al., "The M4 Competition" (2020)
  • Taylor & Letham, "Forecasting at Scale" (Prophet, 2018)


Last reviewed: April 2026
