Time Series Forecasting Basics
Stationarity, autocorrelation, AR, MA, ARIMA, exponential smoothing, and why classical methods still beat deep learning on many forecasting benchmarks.
Why This Matters
Time series data is everywhere: stock prices, server metrics, weather, sensor readings, demand forecasting. The temporal structure (ordering, autocorrelation, trends, seasonality) makes time series different in kind from i.i.d. tabular data. Methods that ignore this structure fail. Methods that exploit it, even simple classical ones, often outperform complex deep learning approaches on standard benchmarks.
Fundamental Concepts
Stationarity
A time series $\{X_t\}$ is (weakly) stationary if:
- $\mathbb{E}[X_t] = \mu$ for all $t$ (constant mean)
- $\mathrm{Var}(X_t) = \sigma^2 < \infty$ for all $t$ (constant variance)
- $\mathrm{Cov}(X_t, X_{t+k}) = \gamma(k)$ depends only on the lag $k$, not on $t$
Stationarity means the statistical properties do not change over time. Most forecasting methods assume stationarity or require transforming the data to achieve it.
Autocorrelation Function
The autocorrelation function (ACF) at lag $k$ is:
$$\rho(k) = \frac{\mathrm{Cov}(X_t, X_{t+k})}{\mathrm{Var}(X_t)} = \frac{\gamma(k)}{\gamma(0)}$$
The ACF measures the linear dependence between a time series and its lagged values. The partial autocorrelation function (PACF) at lag $k$ measures the correlation between $X_t$ and $X_{t-k}$ after removing the effect of the intermediate lags $X_{t-1}, \dots, X_{t-k+1}$.
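The sample ACF can be computed directly from its definition; a minimal numpy sketch (simulating an AR(1) series, whose theoretical ACF at lag $k$ is $\phi^k$, to check the estimate against):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho(k) = gamma(k) / gamma(0) for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    gamma0 = np.dot(x, x) / n
    return np.array([np.dot(x[:n - k], x[k:]) / (n * gamma0)
                     for k in range(max_lag + 1)])

# Simulate AR(1): X_t = phi * X_{t-1} + eps_t; theoretical ACF is phi**k.
rng = np.random.default_rng(0)
phi, n = 0.8, 5000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

acf = sample_acf(x, 3)
print(acf)  # should be near the theoretical [1.0, 0.8, 0.64, 0.512]
```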
Autoregressive Models
AR(p) Model
An autoregressive model of order $p$ is:
$$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \varepsilon_t$$
where $c$ is a constant, $\phi_1, \dots, \phi_p$ are parameters, and $\varepsilon_t$ is white noise. The current value is a linear combination of past values plus noise.
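An AR($p$) can be estimated by regressing $X_t$ on its own lags; a minimal numpy sketch using conditional least squares on a simulated AR(2):

```python
import numpy as np

def fit_ar_ols(x, p):
    """Fit AR(p) by conditional least squares:
    regress x_t on (1, x_{t-1}, ..., x_{t-p})."""
    x = np.asarray(x, dtype=float)
    y = x[p:]
    X = np.column_stack([np.ones(len(y))] +
                        [x[p - j: len(x) - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [c, phi_1, ..., phi_p]

# Simulate a stationary AR(2) with phi_1 = 0.5, phi_2 = 0.3.
rng = np.random.default_rng(1)
n = 20000
x = np.zeros(n)
eps = rng.standard_normal(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + eps[t]

c_hat, p1_hat, p2_hat = fit_ar_ols(x, 2)
print(c_hat, p1_hat, p2_hat)  # estimates should be near (0, 0.5, 0.3)
```

With 20,000 observations the least-squares estimates recover the true coefficients closely; exact maximum likelihood (as in statsmodels) differs only in how the first $p$ observations are treated.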
AR(p) Stationarity Condition
Statement
An AR($p$) process is stationary if and only if all roots of the characteristic polynomial
$$\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p$$
lie outside the unit circle in the complex plane ($|z_i| > 1$ for all roots $z_i$).
Intuition
For AR(1), the condition reduces to $|\phi_1| < 1$. If $|\phi_1| \ge 1$, shocks accumulate rather than decay, and the process drifts or explodes. For higher orders, the characteristic polynomial encodes how past values combine; roots inside the unit circle correspond to exponentially growing components.
Proof Sketch
Write the AR($p$) in lag operator notation: $\phi(L) X_t = c + \varepsilon_t$, where $L^k X_t = X_{t-k}$. The process has a causal representation $X_t = \mu + \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$, depending only on present and past shocks, if and only if $\phi(z) \neq 0$ for $|z| \le 1$. The coefficients $\psi_j$ decay geometrically, ensuring finite variance.
Why It Matters
Before fitting an AR model, you must check stationarity. If the series has a unit root ($\phi_1 = 1$ in the AR(1) case, i.e. $\phi(1) = 0$), the process is a random walk and standard AR inference is invalid. Differencing the series (ARIMA) addresses this. The Augmented Dickey-Fuller test checks for unit roots.
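The stationarity condition is easy to check numerically: find the roots of the characteristic polynomial and test whether they all have modulus greater than one. A minimal numpy sketch:

```python
import numpy as np

def ar_is_stationary(phis):
    """Check the AR(p) stationarity condition: all roots of
    phi(z) = 1 - phi_1 z - ... - phi_p z^p lie outside the unit circle."""
    # np.roots expects coefficients ordered from highest degree to constant.
    coeffs = [-p for p in phis[::-1]] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(ar_is_stationary([0.5]))        # True: AR(1) with |phi_1| < 1
print(ar_is_stationary([1.0]))        # False: unit root (random walk)
print(ar_is_stationary([0.5, 0.3]))   # True: stationary AR(2)
print(ar_is_stationary([1.2, -0.1]))  # False: explosive AR(2)
```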
Failure Mode
Fitting AR to a non-stationary series gives spurious parameter estimates. Under a unit root, the OLS coefficient converges to its true value at rate $T$ instead of $\sqrt{T}$ (superconsistency), but inference (confidence intervals, tests) is invalid. Standard t-statistics do not follow the t-distribution; you need the Dickey-Fuller distribution instead.
Moving Average Models
MA(q) Model
A moving average model of order $q$ is:
$$X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$
where $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$. The current value depends on past error terms. MA models are always stationary (a finite linear combination of white-noise terms is stationary).
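The defining feature of MA($q$) is that its ACF cuts off after lag $q$. For MA(1) the theoretical values are $\rho(1) = \theta / (1 + \theta^2)$ and $\rho(k) = 0$ for $k \ge 2$, which a quick numpy simulation confirms:

```python
import numpy as np

# Simulate MA(1): X_t = eps_t + theta * eps_{t-1}.
rng = np.random.default_rng(2)
theta, n = 0.6, 50000
eps = rng.standard_normal(n + 1)
x = eps[1:] + theta * eps[:-1]

def acf(x, k):
    """Sample autocorrelation at lag k >= 1."""
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

print(theta / (1 + theta**2))  # theoretical rho(1): ~0.441
print(acf(x, 1))               # sample rho(1): close to 0.441
print(acf(x, 2))               # sample rho(2): close to 0 (the cutoff)
```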
The Wold Decomposition
Wold Decomposition Theorem
Statement
Any covariance-stationary process $X_t$ can be written as:
$$X_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j} + \eta_t$$
where $\psi_0 = 1$, $\sum_{j=0}^{\infty} \psi_j^2 < \infty$, $\varepsilon_t$ is white noise, and $\eta_t$ is a deterministic component (perfectly predictable from its own past). The MA($\infty$) part is unique.
Intuition
Every stationary process is, in a precise sense, an infinite moving average plus a deterministic trend. This justifies the use of ARMA models: AR and MA are two different finite-parameter approximations to the general MA($\infty$) representation.
Proof Sketch
Project $X_t$ onto the closed linear span of its own past innovations. The projection gives the MA($\infty$) component. The residual, being orthogonal to all past innovations, is deterministic (perfectly predictable from past values). The projection coefficients $\psi_j$ are the Wold representation coefficients.
Why It Matters
The Wold theorem provides the theoretical foundation for ARMA modeling. It says that ARMA is not just a convenient parametric family but a natural finite-parameter approximation to the true data-generating process.
Failure Mode
The Wold decomposition assumes stationarity. Non-stationary processes (trending, unit root) must be transformed (differenced) first. Also, the decomposition is linear. Nonlinear dependencies (volatility clustering, regime switching) are invisible to Wold and require GARCH or regime-switching models.
ARIMA
ARIMA($p, d, q$) combines autoregression, differencing, and moving average:
- Difference the series $d$ times to achieve stationarity: $Y_t = (1 - L)^d X_t$
- Fit an ARMA($p, q$) to the differenced series
For seasonal data, SARIMA adds seasonal terms: SARIMA($p, d, q$)($P, D, Q$)$_s$ with seasonal period $s$.
Model selection: use the ACF to identify $q$ (the ACF cuts off after lag $q$ for MA($q$)) and the PACF to identify $p$ (the PACF cuts off after lag $p$ for AR($p$)). In practice, use AIC or BIC to select among candidate models.
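AIC-based order selection can be sketched without any forecasting library: fit candidate AR($p$) models by least squares and score each with $\mathrm{AIC} = n \log(\mathrm{RSS}/n) + 2k$. A minimal numpy sketch (pure AR candidates only, as a simplification of the full ARIMA search):

```python
import numpy as np

def ar_aic(x, p):
    """AIC for an AR(p) fit by conditional least squares.
    k = p + 2 parameters: intercept, p lag coefficients, noise variance.
    For simplicity each fit conditions on its first p observations."""
    y = x[p:]
    X = np.column_stack([np.ones(len(y))] +
                        [x[p - j: len(x) - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * (p + 2)

# Simulate a true AR(2) and search over candidate orders.
rng = np.random.default_rng(3)
n = 5000
x = np.zeros(n)
eps = rng.standard_normal(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + eps[t]

best_p = min(range(1, 7), key=lambda p: ar_aic(x, p))
print(best_p)  # typically selects the true order, 2
```

AIC occasionally overselects by a lag or two; BIC's heavier $k \log n$ penalty is more conservative.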
Exponential Smoothing
Simple exponential smoothing forecasts using a weighted average of all past observations with exponentially decaying weights:
$$\hat{X}_{t+1} = \alpha X_t + (1 - \alpha)\hat{X}_t = \alpha \sum_{j=0}^{\infty} (1 - \alpha)^j X_{t-j}$$
with smoothing parameter $0 < \alpha < 1$.
Holt's method adds a trend component. Holt-Winters adds seasonality. The ETS (Error-Trend-Seasonal) framework encompasses all exponential smoothing variants with automatic model selection.
Exponential smoothing is competitive with ARIMA on many datasets and is computationally trivial. It has a state-space representation that provides prediction intervals.
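The "computationally trivial" claim is literal: simple exponential smoothing is a one-line recursion. A minimal sketch (the level is initialized to the first observation, one common convention among several):

```python
import numpy as np

def ses_forecast(x, alpha):
    """Simple exponential smoothing: level l_t = alpha * x_t + (1 - alpha) * l_{t-1};
    the one-step-ahead forecast is the final level."""
    level = x[0]  # initialize level at the first observation
    for obs in x[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
print(ses_forecast(x, 0.5))  # 12.0
```

With $\alpha = 1$ the forecast collapses to the naive "last observation" forecast; small $\alpha$ averages over a long history. In practice $\alpha$ is chosen by minimizing one-step-ahead squared error on the training data.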
Modern Approaches
Prophet (Taylor & Letham, 2018): decomposable model with trend, seasonality, and holidays. Designed for business forecasting with irregular holidays and missing data. Uses Stan for Bayesian inference.
N-BEATS: deep learning architecture with backward and forward residual links. Interpretable variant decomposes forecasts into trend and seasonality.
Temporal Fusion Transformer: attention-based model handling multiple time series with static covariates, known future inputs, and observed past inputs. State-of-the-art on several multi-horizon benchmarks.
Classical vs. Deep Learning
The Makridakis competitions (M3, M4, M5) and subsequent studies consistently show that simple methods (exponential smoothing, ARIMA, theta method) match or beat complex deep learning methods on univariate forecasting. Deep learning methods excel when: the dataset has many related time series (enabling cross-series learning), rich exogenous variables are available, or the series is long enough to train large models.
The failure of deep learning on short univariate series is not surprising: an ARIMA($p, d, q$) has on the order of $p + q + 1$ parameters, while a transformer has millions. With 100 observations, the classical model wins by not overfitting.
Common Confusions
Stationarity does not mean constant
A stationary series fluctuates around a fixed mean with constant variance. It can have substantial variation. Stationarity means the statistics of the fluctuations do not change over time, not that the series itself is flat.
Differencing is not detrending
Differencing removes stochastic trends (unit roots). Detrending removes deterministic trends (fitting and subtracting a trend line). Applying the wrong one gives incorrect results: detrending a unit root process leaves residuals that are still non-stationary; differencing a trend-stationary process introduces an unnecessary MA unit root.
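The distinction is easy to see by simulation: detrending works on a trend-stationary series, while a random walk needs differencing. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
t = np.arange(n)

# Trend-stationary: deterministic line plus stationary unit-variance noise.
trend_stat = 0.05 * t + rng.standard_normal(n)
# Difference-stationary: random walk (unit root).
random_walk = np.cumsum(rng.standard_normal(n))

# Detrending (fit and subtract a line) is right for the first series:
resid = trend_stat - np.polyval(np.polyfit(t, trend_stat, 1), t)
print(resid.var())  # near 1: stationary residuals

# The random walk's variance grows with the sample; its first
# difference recovers the white-noise increments.
dw = np.diff(random_walk)
print(random_walk.var(), dw.var())  # large vs. near 1
```

Detrending the random walk instead would leave residuals that still wander, which is exactly the "wrong transformation" failure described above.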
Good in-sample fit does not mean good forecasts
Overfitting is particularly dangerous in time series because the effective sample size is smaller than the number of observations (autocorrelation reduces information content). Always evaluate forecasts on a held-out future period, not on the training period.
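A held-out evaluation can be sketched as a rolling-origin loop: forecast each test point using only the data before it, and average the errors. A minimal numpy sketch comparing two baseline forecasters (the function and lambda names here are illustrative, not from any library):

```python
import numpy as np

def rolling_origin_mae(x, forecast_fn, n_test):
    """One-step-ahead MAE on the last n_test points only, refitting on an
    expanding window - the training period is never scored."""
    errors = []
    for i in range(len(x) - n_test, len(x)):
        pred = forecast_fn(x[:i])  # history strictly before the target point
        errors.append(abs(x[i] - pred))
    return float(np.mean(errors))

# Simulate an AR(1) series to evaluate on.
rng = np.random.default_rng(5)
n = 500
x = np.zeros(n)
eps = rng.standard_normal(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + eps[t]

naive = lambda hist: hist[-1]       # random-walk forecast
mean_fc = lambda hist: hist.mean()  # unconditional mean forecast
print(rolling_origin_mae(x, naive, 100))
print(rolling_origin_mae(x, mean_fc, 100))
```

For strongly autocorrelated series the naive forecast usually beats the unconditional mean; the point is that this comparison is only meaningful on the held-out window.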
Key Takeaways
- Stationarity (constant mean, variance, autocovariance) is the core assumption; test for it before modeling
- AR($p$) captures dependence on past values; MA($q$) captures dependence on past errors; ARIMA combines both with differencing
- The Wold theorem justifies ARMA as a universal approximation for stationary processes
- Exponential smoothing is simple, effective, and has a rigorous state-space formulation
- Classical methods beat deep learning on many univariate forecasting benchmarks, especially with short series
Exercises
Problem
An AR(1) model has autoregressive parameter $\phi_1$. What is the autocorrelation at lag 3, in terms of $\phi_1$? Under what condition on $\phi_1$ is the process stationary?
Problem
You observe a time series that appears non-stationary. You difference it once ($Y_t = X_t - X_{t-1}$) and the resulting series passes the ADF test for stationarity. The ACF of the differenced series cuts off after lag 1, and the PACF decays gradually. What ARIMA model should you fit? Justify your choice.
References
Canonical:
- Box, Jenkins, Reinsel, Time Series Analysis (5th ed.), Chapters 3-5
- Hamilton, Time Series Analysis (1994), Chapters 3-4
Current:
- Hyndman & Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), Chapters 8-9
- Makridakis et al., "The M4 Competition" (2020)
- Taylor & Letham, "Forecasting at Scale" (Prophet, 2018)
Next Topics
- Gaussian processes for ML: a nonparametric approach to time series and regression with uncertainty
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)