

Deep Learning for Time Series

RNN/LSTM, Temporal Convolutional Networks, and Transformers (PatchTST, Informer, Autoformer) for time series forecasting, with N-BEATS basis decomposition and the Zeng et al. linear-baseline controversy.

Advanced · Tier 2 · Current · Supporting · ~60 min

Why This Matters

Classical time series methods (ARIMA, ETS) handle individual univariate series well and remain hard to beat on short, simple data. Deep learning enters the picture when one of three conditions holds: many related series share structure (electricity demand across thousands of meters, retail sales across thousands of SKUs); the input is high-dimensional (multivariate sensor streams, exogenous covariates); or the series is long enough to support a high-capacity model. The interesting question is which deep architecture matches which data regime.

This page surveys the four main families: recurrent (LSTM, GRU), convolutional (TCN), attention-based (PatchTST, Informer, Autoformer), and basis-expansion (N-BEATS, N-HiTS). It ends with the Zeng et al. (2022) result that a one-layer linear baseline matched or beat then-state-of-the-art Transformers across the standard benchmarks. The deep-learning answer to time series is more nuanced than the deep-learning answer to vision or language, and it is worth understanding why.

Recurrent Models for Time Series

The LSTM was the workhorse of sequence modeling from roughly 2014 to 2018 and remains a reasonable baseline. For time series specifically:

  • Forget gate as learned differencing. The forget gate $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ controls how much of the cell state $c_{t-1}$ persists. With $f_t \approx 1$ the cell integrates inputs (a learned random walk); with $f_t \approx 0$ the cell resets (a learned change-point detector). LSTMs implicitly learn the differencing order an ARIMA modeler would set by hand.
  • Nonlinear updates. Unlike ARIMA, the LSTM update is nonlinear; it can fit volatility clustering, regime switches, and amplitude-dependent dynamics that linear models miss.
  • Vanishing-gradient mitigation, not elimination. The cell-state additive path lets gradients flow further than in vanilla RNNs, but the gate Jacobians still multiply along the sequence. For sequences over a few hundred steps, gradient flow degrades; very long-range dependencies stay hard.

The empirical record on time series benchmarks is mixed. LSTMs do well on data with rich short-to-medium-range structure (energy load, traffic) and poorly on long-horizon univariate forecasting (M3, M4, M5 competitions), where the inductive bias of state space and ETS models typically wins.
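
For concreteness, a minimal LSTM forecaster can be sketched as below (PyTorch is assumed; the lookback of 168, hidden size of 64, and horizon of 24 are illustrative choices, not values taken from any cited paper):

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Minimal sequence-to-horizon LSTM: encode the lookback window,
    project the final hidden state to the forecast horizon."""
    def __init__(self, n_features: int = 1, hidden: int = 64, horizon: int = 24):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback, n_features)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden)
        return self.head(h_n[-1])       # (batch, horizon)

model = LSTMForecaster()
window = torch.randn(32, 168, 1)        # e.g. a week of hourly values
print(model(window).shape)              # torch.Size([32, 24])
```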

Temporal Convolutional Networks

A TCN (Bai, Kolter, Koltun 2018) replaces recurrence with stacked dilated causal 1D convolutions. Three structural choices define the architecture:

  1. Causal: the output at time $t$ depends only on inputs at times $\leq t$. Achieved by left-padding the input by $(k-1)d$ for kernel size $k$ and dilation $d$ (so $(k-1)$ for an undilated layer).
  2. Dilated: layer $l$ uses dilation $d_l = 2^{l-1}$. Skipping inputs at exponentially growing rates lets stack depth scale logarithmically with receptive field.
  3. Residual: each block is a residual connection around two dilated convolutions, stabilizing training of deep stacks.
Proposition

TCN Receptive Field

Statement

A TCN with kernel size $k$, dilation $d_l = 2^{l-1}$ at layer $l$, and $L$ layers has receptive field

$$R = 1 + (k-1)\sum_{l=1}^{L} d_l = 1 + (k-1)(2^L - 1).$$

The receptive field grows exponentially in depth and linearly in $(k-1)$.

Intuition

A single layer with kernel $k$ sees $k$ inputs. Stacking another layer with dilation $d_l$ extends the view by $(k-1)d_l$ inputs: the new layer's kernel spans $k$ outputs of the previous layer spaced $d_l$ apart, so it adds $(k-1)d_l$ earlier timesteps to the span. Summing the geometric series $\sum_{l=1}^{L} 2^{l-1}$ gives $2^L - 1$.

Proof Sketch

By induction on $L$. For $L = 1$: receptive field $= k = 1 + (k-1)$. Inductive step: a layer with kernel $k$ and dilation $d$ on top of a stack with receptive field $R'$ has receptive field $R' + (k-1)d$. Summing the recurrence with $d_l = 2^{l-1}$ telescopes via the geometric sum.

Why It Matters

With $k = 3$ and $L = 10$ layers you get $R = 1 + 2(2^{10} - 1) = 2047$. Eleven layers cover roughly four thousand timesteps. Fewer layers, much larger receptive field than a stacked LSTM at comparable cost. This is why TCNs often beat LSTMs on long-range benchmarks. The architecture removes the sequential bottleneck that limits gradient flow in recurrent stacks.

Failure Mode

Receptive field is necessary but not sufficient. The effective receptive field (the region the model actually uses, weighted by gradient magnitudes) is typically a Gaussian around the present, much smaller than the theoretical receptive field (Luo et al. 2016 for CNNs; the same pattern holds for TCNs). Doubling the layer count multiplies parameters without proportionally improving long-range coverage in practice.

TCNs train with the parallelism of CNNs (no sequential bottleneck) and have stable, well-understood gradient flow. They tend to outperform LSTMs on long-range tasks at comparable parameter counts and are a strong baseline for forecasting.
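
A pared-down sketch of such a stack, again assuming PyTorch (the full Bai et al. block uses two convolutions plus weight normalization and dropout, omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated 1D convolution made causal by left-padding (k-1)*d zeros."""
    def __init__(self, channels: int, k: int, d: int):
        super().__init__()
        self.pad = (k - 1) * d
        self.conv = nn.Conv1d(channels, channels, kernel_size=k, dilation=d)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class TCNBlock(nn.Module):
    """Residual block around one dilated causal conv (simplified)."""
    def __init__(self, channels: int, k: int, d: int):
        super().__init__()
        self.conv = CausalConv1d(channels, k, d)

    def forward(self, x):
        return x + torch.relu(self.conv(x))

# L = 10 blocks with kernel k = 3 and dilations 1, 2, ..., 512:
# receptive field 1 + (k-1)(2^L - 1) = 2047 timesteps, as in the proposition.
tcn = nn.Sequential(*[TCNBlock(channels=32, k=3, d=2 ** l) for l in range(10)])
print(tcn(torch.randn(4, 32, 2048)).shape)       # torch.Size([4, 32, 2048])
```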

Transformers for Time Series

Self-attention computes pairwise interactions between all positions in a sequence at $O(T^2)$ time and memory. For long forecasting horizons this is the dominant cost, and most of the architectural variation in time-series Transformers is about reducing it.

Vanilla attention is wrong for time series

Stacked attention over per-timestep tokens has two problems beyond the $O(T^2)$ cost. First, time series are typically smooth: adjacent timesteps carry highly correlated information, and the attention mechanism wastes capacity computing trivial pairs. Second, attention is permutation-equivariant up to position embeddings; the locality and ordering that linear models exploit by construction must be re-learned by the network.

PatchTST: subsampling via patches

Proposition

PatchTST Patching Reduces Attention Cost

Statement

Splitting a length-$T$ series into $\lceil T/P \rceil$ non-overlapping patches of length $P$ and treating each patch as a token reduces attention complexity from $O(T^2)$ to $O((T/P)^2)$, a factor of $P^2$ improvement. Each token now contains $P$ raw values, preserving local information while shortening the sequence the attention layer sees.

Intuition

Time series have strong local autocorrelation. Adjacent timesteps don't need to attend to each other through the global attention mechanism; they can be packed into the same token, with MLP layers handling local mixing. PatchTST (Nie et al. 2023) found $P = 16$ works well in practice across the standard benchmarks.

Proof Sketch

The attention layer sees $\lceil T/P \rceil$ tokens; pairwise interactions cost $(T/P)^2$. The patch-internal information is preserved in the per-patch embedding (a linear projection from $\mathbb{R}^P$ to $\mathbb{R}^d$).

Why It Matters

Patching is the trick that finally let Transformers compete on time series benchmarks. PatchTST also adopted channel independence (fitting a separate model per variate), which sidesteps the curse of dimensionality in multivariate datasets. Combined, these two tricks gave the best Transformer-based numbers on long-horizon forecasting at the time of publication.

Failure Mode

Patching loses fine-grained timing information at boundaries: a patch boundary at the wrong place can hide a sharp transition. Channel independence loses cross-variate dependencies, which matter when variables genuinely couple (e.g., correlated assets in finance). iTransformer (Liu et al. 2024) reverses the choice (attending over variates and treating time as the channel) to recover cross-variate structure.
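
A minimal sketch of the patch-and-embed step for a single channel, assuming PyTorch (non-overlapping patches for simplicity; PatchTST itself uses a stride and instance normalization, which are omitted):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a univariate series into non-overlapping length-P patches and
    project each patch to a d-dimensional token."""
    def __init__(self, patch_len: int = 16, d_model: int = 128):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T); truncate so T is a multiple of the patch length
        B, T = x.shape
        n = T // self.patch_len
        patches = x[:, : n * self.patch_len].reshape(B, n, self.patch_len)
        return self.proj(patches)            # (batch, T/P, d_model)

embed = PatchEmbed()
tokens = embed(torch.randn(8, 4096))
print(tokens.shape)                          # torch.Size([8, 256, 128])
# Attention over 256 tokens instead of 4096 raw timesteps: a 16^2 = 256x
# reduction in the quadratic term.
```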

Informer: ProbSparse attention

Informer (Zhou et al. 2021) reduces attention cost by a different route. The attention pattern in a trained model is empirically sparse: only a few queries attend strongly to a few keys. ProbSparse selects the top-$u$ queries by a KL-divergence-based score (measuring how far each query's attention distribution is from uniform), computes full attention only for them, and treats the remaining queries as if they attended uniformly. Cost drops to $O(T \log T)$.
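
A sketch of the query-scoring step, assuming PyTorch. The max-minus-mean score below is the paper's cheap proxy for the KL-based measurement; Informer additionally estimates it on a sampled subset of keys so that the scoring itself stays sub-quadratic, a step skipped here:

```python
import torch

def probsparse_topu(Q: torch.Tensor, K: torch.Tensor, u: int) -> torch.Tensor:
    """Return indices of the u 'most active' queries, scored by how far each
    query's attention distribution is from uniform (max score minus mean score).
    Sketch only: here the score is computed against all keys."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5       # (T_q, T_k)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)
    return sparsity.topk(u).indices                    # queries that get full attention

Q, K = torch.randn(512, 64), torch.randn(512, 64)
active = probsparse_topu(Q, K, u=32)
print(active.shape)                                    # torch.Size([32])
```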

Autoformer: auto-correlation decomposition

Autoformer (Wu et al. 2021) replaces self-attention with an auto-correlation mechanism that uses the Fourier transform to find similar sub-series and aggregate them. The architecture also performs an explicit trend-seasonality decomposition between blocks, reflecting the classical structural-time-series view that a series factors into level + trend + seasonality + residual.
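
The lag-discovery step can be sketched with the Wiener-Khinchin identity (PyTorch assumed); Autoformer then rolls the series by the discovered lags and aggregates the rolled copies, weighted by their autocorrelation values, which is omitted here:

```python
import torch

def autocorrelation_lags(x: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    """Estimate the top-k lags at which a series is most self-similar, using
    the Wiener-Khinchin identity: autocorrelation = IFFT(FFT(x) * conj(FFT(x)))."""
    xf = torch.fft.rfft(x, dim=-1)
    acf = torch.fft.irfft(xf * torch.conj(xf), n=x.shape[-1], dim=-1)
    return acf.topk(top_k, dim=-1).indices        # candidate period lengths

x = torch.sin(torch.arange(0, 1024) * 2 * torch.pi / 24) + 0.1 * torch.randn(1024)
print(autocorrelation_lags(x))   # lag 0 plus lags tied to the 24-step period (and circular mirrors)
```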

N-BEATS and N-HiTS: basis expansion

N-BEATS (Oreshkin et al. 2020) is structurally different: it learns a stack of fully-connected blocks that produce backward (reconstruction) and forward (forecast) basis coefficients. Each block subtracts its backward output from the input and passes the residual to the next block. The interpretable variant constrains some blocks to polynomial trend bases and others to Fourier seasonality bases, recovering a learned analogue of classical decomposition.
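
A sketch of the generic block and the doubly residual stacking, assuming PyTorch (the interpretable variant replaces the free backcast/forecast heads with fixed polynomial and Fourier basis matrices; widths and depths here are illustrative):

```python
import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    """Generic N-BEATS block: an MLP emits a backcast (reconstruction of the
    input window) and a forecast; the caller subtracts the backcast and passes
    the residual to the next block."""
    def __init__(self, lookback: int, horizon: int, width: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(lookback, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.backcast_head = nn.Linear(width, lookback)
        self.forecast_head = nn.Linear(width, horizon)

    def forward(self, x):
        h = self.mlp(x)
        return self.backcast_head(h), self.forecast_head(h)

def nbeats_forward(blocks, x):
    """Doubly residual stacking: residual input stream, additive forecast stream."""
    forecast = 0.0
    for block in blocks:
        backcast, block_forecast = block(x)
        x = x - backcast
        forecast = forecast + block_forecast
    return forecast

blocks = [NBeatsBlock(lookback=96, horizon=24) for _ in range(3)]
print(nbeats_forward(blocks, torch.randn(8, 96)).shape)   # torch.Size([8, 24])
```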

N-HiTS (Challu et al. 2023) extends this with multi-rate signal sampling and hierarchical interpolation, addressing N-BEATS's tendency to overfit on long horizons.

The basis-expansion view connects deep learning to classical decomposition: an ETS model is a state space model with trend, level, and seasonal components; N-BEATS learns analogous components without prespecifying them.

The Linear-Baseline Controversy

In 2022, Zeng, Chen, Zhang, and Xu posted Are Transformers Effective for Time Series Forecasting? (arXiv:2205.13504). They proposed three simple linear baselines (Linear, NLinear, DLinear), each built from single linear layers that map the lookback window directly to the forecast horizon, and showed that they matched or beat all then-published time series Transformers on the standard ETT, Electricity, Traffic, Weather, ILI, and Exchange benchmarks. The paper drew a strong response.
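
A DLinear-style baseline is short enough to write out in full (PyTorch assumed). The moving-average kernel of 25 is a commonly used default rather than something this page fixes, and the paper's replicate-padding at the window edges is simplified to average-pool padding here:

```python
import torch
import torch.nn as nn

class DLinear(nn.Module):
    """DLinear-style baseline: split the lookback into a moving-average trend
    and a remainder, map each to the horizon with its own single linear layer,
    and sum the two forecasts."""
    def __init__(self, lookback: int, horizon: int, kernel: int = 25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                count_include_pad=False)
        self.trend = nn.Linear(lookback, horizon)
        self.seasonal = nn.Linear(lookback, horizon)

    def forward(self, x):                          # x: (batch, lookback)
        trend = self.avg(x.unsqueeze(1)).squeeze(1)
        return self.trend(trend) + self.seasonal(x - trend)

model = DLinear(lookback=336, horizon=96)
print(model(torch.randn(32, 336)).shape)           # torch.Size([32, 96])
```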

The structural argument is straightforward: a linear forecaster with a long lookback window captures the same low-frequency, autocorrelated structure that drives most of the variance in standard benchmarks, and standard benchmarks have relatively short forecast horizons compared to the dominant timescales. A model with millions of parameters fitting this structure has a much bigger overfitting surface than a single linear layer; the inductive bias of a transformer with quadratic attention is misaligned with the low-frequency, locally smooth signal.

PatchTST (also 2022) acknowledged the result and showed that with patching and channel independence, Transformers regain a small but real edge on the same benchmarks. iTransformer (2023) went further. The current consensus is roughly:

  • For univariate, low-frequency, mostly-stationary forecasting: simple methods (linear baselines, ETS, ARIMA) are very hard to beat with deep models.
  • For multivariate forecasting with cross-variate structure or rich exogenous covariates: modern Transformer variants (PatchTST, iTransformer) and TCN-style models can win, often modestly.
  • For very long sequences with long-range dependencies: state space-based models (S4, Mamba) and TCNs have a structural advantage, both in efficiency and in capturing slowly-decaying autocorrelation.
  • For zero-shot / cross-series transfer: time-series foundation models (TimesFM, Chronos, Moirai, trained on diverse series) operate in a regime classical methods cannot reach.

The Zeng et al. result is a useful epistemic caution. It says: when a benchmark is dominated by structure that a linear model captures, a small linear model is the right tool, and a more complex model that does not beat it is providing zero scientific signal. The same caution applies more broadly. The field's benchmark culture has been slow to design tasks where deep models clearly dominate, and the absence of a benchmark advantage often means the absence of a real advantage.

Generalization for Sequence Predictors

The classical learning-theory bounds (Rademacher complexity, VC dimension) give risk bounds in the i.i.d. setting. Time series violate this: training samples are correlated, and the natural notion of "sample size" is the effective sample size, which depends on the mixing rate of the process.

Two adaptations are commonly used. Beta-mixing bounds (Yu 1994; Mohri and Rostamizadeh 2010) replace independence with $\beta$-mixing: a measure of how fast the process forgets its past. The generalization bound has the same shape as the i.i.d. bound but with $n$ replaced by an effective sample size $n / (\text{mixing time})$. Block-bootstrap bounds apply concentration inequalities to non-overlapping blocks rather than individual observations, which works as long as inter-block dependence is weak.
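
Schematically, the blocking argument yields a bound of roughly the following shape (shape only, not the precise Mohri and Rostamizadeh statement): split the $n$ dependent observations into $\mu$ blocks of length $a$, treat the blocks as nearly independent, and pay for the approximation with a $\beta$-mixing penalty:

$$R(h) \;\le\; \widehat{R}_n(h) + O\!\left(\sqrt{\frac{\mathrm{complexity}(\mathcal{H})}{\mu}}\right) + O\big(\mu\,\beta(a)\big), \qquad n = \mu a.$$

The effective sample size is the number of blocks $\mu$, not $n$; faster mixing (smaller $\beta(a)$ for a given block length) allows shorter blocks and recovers more of $n$.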

These bounds are loose in practice; tight finite-sample analysis of deep sequence models on time series is an open area. The empirical processes and Rademacher complexity machinery from learning theory adapts but does not transfer directly.

Common Confusions

Watch Out

Long context window is not the same as long memory

A model with a 4096-token context window and quadratic attention can see 4096 tokens but does not necessarily use most of them. The effective receptive field of a Transformer is concentrated near the prediction position; the tail of the lookback contributes little. For forecasting, this means doubling the lookback often gives diminishing returns even though the model is "looking at" twice the data.

Watch Out

Channel independence is not channel ablation

PatchTST processes each variate independently (channels never attend to one another) but runs every variate through the same shared backbone weights. This is channel independence with parameter sharing. It is not the same as fitting one model per variate from scratch (which would share nothing across variates), and not the same as throwing away cross-variate structure entirely. Cross-variate structure is recoverable by mixing channels at the loss or output stage; PatchTST chose not to.

Watch Out

Beating ARIMA on a long lookback is not the same as beating it on a short one

M5-style benchmarks use lookback windows of hundreds of timesteps. ARIMA shines on short lookbacks because its strong inductive bias compensates for limited data. As lookback grows, deep models gain ground. Reporting only one operating point misrepresents the comparison; the proper benchmark is the full lookback-vs-error curve.

Summary

  • LSTMs treat time series with learned differencing and nonlinear updates; they handle short-to-medium-range structure well but not long-range.
  • TCNs use stacked dilated causal convolutions; receptive field grows exponentially in depth, with parallel training and stable gradients.
  • Transformer variants for time series (PatchTST, Informer, Autoformer) reduce the $O(T^2)$ attention cost via patching, sparse attention, or auto-correlation; PatchTST + channel independence is the strongest pure-Transformer baseline.
  • N-BEATS / N-HiTS use learned basis decomposition, recovering an analogue of classical level-trend-seasonality.
  • Zeng et al. (2022) showed simple linear models match or beat then-state-of-the-art Transformers on standard benchmarks. The field updated, but the general lesson holds: match model complexity to the scale of the actual signal.
  • Generalization theory for sequences uses $\beta$-mixing and effective sample sizes; the bounds are loose and the area is open.

Exercises

Exercise (Core)

Problem

A TCN uses kernel size $k = 3$ with dilations $d_l = 2^{l-1}$ at $L = 8$ layers. What is the receptive field? How many layers $L$ are needed to cover a receptive field of $\geq 8000$?

Exercise (Advanced)

Problem

Argue informally why a single linear layer with a long lookback can match a Transformer with patching on a benchmark whose forecast horizon is short relative to the dominant autocorrelation timescale of the series. Where would you expect the Transformer to actually win?

Exercise (Advanced)

Problem

PatchTST attention has cost $O((T/P)^2)$ per layer. Suppose your data has $T = 4096$ and you use patch length $P = 16$; compute the speedup over $O(T^2)$ vanilla attention. What does this buy you in practice: wall-clock time, memory, or both? Discuss when the patch length $P$ should be small versus large.

References

Canonical:

  • Hochreiter, S., Schmidhuber, J. "Long Short-Term Memory." Neural Computation, 9(8), 1997.
  • Bai, S., Kolter, J. Z., Koltun, V. "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling," arXiv:1803.01271, 2018 (TCN).
  • Vaswani, A. et al. "Attention Is All You Need." NeurIPS 2017.

Current:

  • Nie, Y., Nguyen, N. H., Sinthong, P., Kalagnanam, J. "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (PatchTST)," arXiv:2211.14730, 2023.
  • Zeng, A., Chen, M., Zhang, L., Xu, Q. "Are Transformers Effective for Time Series Forecasting?" arXiv:2205.13504, 2022.
  • Liu, Y. et al. "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting," arXiv:2310.06625, 2024.
  • Zhou, H. et al. "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting." AAAI 2021.
  • Wu, H., Xu, J., Wang, J., Long, M. "Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting." NeurIPS 2021.
  • Oreshkin, B. N., Carpov, D., Chapados, N., Bengio, Y. "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting," ICLR 2020.
  • Challu, C. et al. "N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting." AAAI 2023.
  • Mohri, M., Rostamizadeh, A. "Stability Bounds for Stationary $\varphi$-mixing and $\beta$-mixing Processes." JMLR 11, 2010.

Last reviewed: May 6, 2026
