
ML Methods

Recurrent Neural Networks

Sequential processing via hidden state recurrence: the simple RNN, vanishing and exploding gradients, LSTM gating mechanisms, and why transformers have largely replaced RNNs.

Advanced · Tier 2 · Stable · ~55 min

Why This Matters

RNNs are the conceptual foundation for sequence modeling in deep learning. Even though transformers have largely replaced them in practice, understanding RNNs is essential because: (1) the vanishing gradient problem they suffer from motivates the design of LSTMs, GRUs, and ultimately attention mechanisms; (2) gating mechanisms invented for LSTMs appear throughout modern architectures; (3) RNN-like state machines are resurgent in efficient inference (state space models, linear attention).

[Figure: RNN unrolled through time — inputs x_0, …, x_4 feed hidden states h_0, …, h_4, each producing an output y_t, with the same weights W shared across all steps.]

Mental Model

A feedforward network processes each input independently. An RNN maintains a hidden state h_t that summarizes everything seen so far. At each time step, the new hidden state is a function of the previous state and the current input. This is a learned dynamical system, and the hidden state is the system's memory.

The Simple RNN

Definition

Simple (Elman) RNN

The simple RNN update rule is:

h_t = \tanh(W_h h_{t-1} + W_x x_t + b)

where h_t \in \mathbb{R}^d is the hidden state, x_t is the input at time t, W_h \in \mathbb{R}^{d \times d} is the recurrent weight matrix, W_x is the input weight matrix, and b is the bias. The output at each step is typically o_t = W_o h_t.

The same parameters (W_h, W_x, b) are shared across all time steps: this is weight sharing over time, analogous to weight sharing over space in CNNs.
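The update rule can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions and random weights (all sizes and the `scale` value below are arbitrary choices, not from the text), showing how the same (W_h, W_x, b) are reused at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in, T = 4, 3, 5  # hidden size, input size, sequence length (toy values)

# Shared parameters, applied at every time step
W_h = rng.normal(scale=0.5, size=(d, d))
W_x = rng.normal(scale=0.5, size=(d, d_in))
b = np.zeros(d)

def rnn_forward(xs, h0):
    """Run the Elman update h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a sequence."""
    h = h0
    states = []
    for x in xs:                      # sequential: step t needs step t-1
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states

xs = rng.normal(size=(T, d_in))
states = rnn_forward(xs, np.zeros(d))  # T hidden states, each in (-1, 1)^d
```

Note the loop: each step depends on the previous one, which is exactly the sequential bottleneck discussed later.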

Backpropagation Through Time (BPTT)

Training an RNN requires computing gradients through the entire sequence. Unrolling the recurrence for T steps produces a very deep feedforward network with shared weights. The gradient of the loss at time T with respect to the hidden state at time t involves a product of Jacobians:

\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \text{diag}(\tanh'(z_k)) \cdot W_h

where z_k = W_h h_{k-1} + W_x x_k + b. This product of matrices is the source of both vanishing and exploding gradients.

The Vanishing/Exploding Gradient Problem

Proposition

Gradient Scaling via Eigenvalues of W_h

Statement

For the simple RNN, the norm of the gradient \|\partial h_T / \partial h_t\| scales approximately as \|W_h\|^{T-t} (ignoring the tanh derivative, which is bounded by 1). Specifically:

  • If the largest singular value \sigma_{\max}(W_h) < 1, gradients vanish exponentially: \|\partial h_T / \partial h_t\| \to 0 as T - t \to \infty
  • If \sigma_{\max}(W_h) > 1, gradients can explode exponentially

With the tanh derivative included (which is in [0, 1]), vanishing is the dominant failure mode.
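The two regimes can be observed numerically. The sketch below is an idealized setup chosen so the theory applies cleanly: W_h is a scaled orthogonal matrix (so every singular value equals sigma), and pre-activations are kept small so \tanh'(z) \approx 1. All dimensions and scales here are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps = 8, 50

def bptt_gradient_norm(sigma, steps, rng):
    """2-norm of the BPTT Jacobian product prod_k diag(tanh'(z_k)) W_h,
    where W_h is orthogonal scaled so all its singular values equal sigma."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    W_h = sigma * Q
    J = np.eye(d)
    for _ in range(steps):
        z = 0.05 * rng.normal(size=d)          # small pre-activations: tanh'(z) ~ 1
        J = np.diag(1.0 - np.tanh(z) ** 2) @ W_h @ J
    return np.linalg.norm(J, 2)

vanishing = bptt_gradient_norm(0.9, steps, rng)  # sigma_max < 1: decays like 0.9^50
exploding = bptt_gradient_norm(1.2, steps, rng)  # sigma_max > 1: grows like 1.2^50
```

With sigma = 0.9 the gradient norm is bounded above by 0.9^50 ≈ 0.005; with sigma = 1.2 it grows by orders of magnitude over the same 50 steps.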

Intuition

Multiplying a matrix by itself many times amplifies its dominant eigenvalue. If that eigenvalue is less than 1, repeated multiplication drives the product to zero. If greater than 1, it explodes. Since \tanh'(x) \leq 1, the tanh activation only makes vanishing worse, never better.

Proof Sketch

Bound \|\prod_{k=t+1}^T \text{diag}(\tanh'(z_k)) W_h\| \leq \prod_{k=t+1}^T \|\text{diag}(\tanh'(z_k))\| \cdot \|W_h\| \leq \|W_h\|^{T-t} since |\tanh'(z)| \leq 1. If \|W_h\| < 1 (as an operator norm), this bound decays exponentially with T - t.

Why It Matters

Vanishing gradients mean the RNN cannot learn long-range dependencies: the gradient signal from a target at time T cannot reach the hidden state at time t \ll T. This is the fundamental limitation that motivated LSTMs, GRUs, and eventually transformers.

Failure Mode

This analysis uses worst-case norm bounds. In practice, gradients can be well-behaved even when \|W_h\| > 1 if the network operates in a regime where the effective Jacobian spectral radius stays near 1 (the "edge of chaos"). But simple RNNs are fragile: slight changes push them into vanishing or exploding regimes.

Gradient clipping is a common remedy for exploding gradients: if \|g\| > \theta, rescale g \leftarrow \theta \cdot g / \|g\|. This prevents catastrophic parameter updates but does not solve vanishing gradients.
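The clipping rule is a one-liner in practice. A minimal sketch (the function name and threshold are illustrative):

```python
import numpy as np

def clip_gradient(g, theta):
    """If ||g|| > theta, rescale g to have norm theta; direction is preserved."""
    norm = np.linalg.norm(g)
    if norm > theta:
        return theta * g / norm
    return g

g = np.array([3.0, 4.0])         # ||g|| = 5
clipped = clip_gradient(g, 1.0)  # rescaled to norm 1, same direction
```

Gradients below the threshold pass through unchanged, which is why clipping bounds explosions without helping vanishing.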

LSTM: Long Short-Term Memory

Definition

LSTM Cell

The LSTM introduces a separate cell state c_t that acts as an information highway, controlled by three gates:

Forget gate (what to erase from cell state): f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)

Input gate (what new information to write): i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) and \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)

Cell state update (highway with gated read/write): c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

Output gate (what to expose as hidden state): o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) and h_t = o_t \odot \tanh(c_t)

Here \sigma is the sigmoid function and \odot is element-wise multiplication. All gates output values in [0, 1].

The cell state c_t is the critical innovation. When f_t \approx 1 and i_t \approx 0, the cell state passes through unchanged, so gradients flow backward without decay. This is the "highway" that solves vanishing gradients: the gradient through the cell state is \partial c_t / \partial c_{t-1} = \text{diag}(f_t), and when forget gates are near 1, this is approximately the identity.
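A single LSTM step, written out gate by gate, is a direct transcription of the four equations above. This is a minimal NumPy sketch with toy dimensions and random weights (one weight matrix per gate acting on the concatenated [h_{t-1}, x_t]; sizes and scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_in = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on the concatenated [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (rng.normal(scale=0.3, size=(d, d + d_in)) for _ in range(4))
b_f = np.ones(d)               # positive forget bias: start by remembering
b_i = b_c = b_o = np.zeros(d)

def lstm_step(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx + b_f)        # forget gate: what to erase
    i = sigmoid(W_i @ hx + b_i)        # input gate: what to write
    c_tilde = np.tanh(W_c @ hx + b_c)  # candidate content
    c = f * c_prev + i * c_tilde       # gated highway update
    o = sigmoid(W_o @ hx + b_o)        # output gate: what to expose
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(np.zeros(d), np.zeros(d), rng.normal(size=d_in))
```

Note that the cell update `c = f * c_prev + i * c_tilde` is element-wise addition, not a matrix multiply — that is why gradients through c can avoid the repeated-Jacobian decay.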

GRU: Gated Recurrent Unit

Definition

GRU Cell

The GRU simplifies the LSTM by merging the cell state and hidden state, using two gates instead of three:

Reset gate: r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)

Update gate: z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)

Candidate state: \tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t] + b)

Hidden state update: h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

The update gate z_t interpolates between keeping the old state and adopting the new candidate. It simultaneously plays the roles of the LSTM forget and input gates.

GRUs have fewer parameters than LSTMs (3d^2 vs. 4d^2 for recurrent weights) and often perform comparably. Neither consistently outperforms the other across all tasks.
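A GRU step mirrors the LSTM sketch but with three weight matrices instead of four, which is where the 3d^2 vs. 4d^2 count comes from. Again a toy NumPy sketch with illustrative dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_in = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three weight matrices (vs. four for the LSTM), each on the concatenated input
W_r, W_z, W = (rng.normal(scale=0.3, size=(d, d + d_in)) for _ in range(3))
b_r = b_z = b = np.zeros(d)

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx + b_r)   # reset gate: how much old state feeds the candidate
    z = sigmoid(W_z @ hx + b_z)   # update gate: interpolation weight
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]) + b)  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # blend old state and candidate

h = gru_step(np.zeros(d), rng.normal(size=d_in))
```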

Why Transformers Replaced RNNs

RNNs process sequences step by step. The computation at time t depends on time t-1. This creates two problems:

  1. Sequential bottleneck: cannot parallelize across time steps during training. Transformers process all positions simultaneously.
  2. Long-range dependencies: even LSTMs struggle with very long sequences (hundreds to thousands of steps). Transformers connect every position to every other via self-attention with O(1) path length.

However, RNNs retain advantages for online/streaming settings and when memory is limited (constant state size vs. growing KV cache). State space models (Mamba, etc.) combine RNN-like recurrence with transformer-like performance.

Canonical Examples

Example

Simple RNN language model

Given a vocabulary of V words, embed each word as x_t \in \mathbb{R}^d, update h_t = \tanh(W_h h_{t-1} + W_x x_t), and predict the next word via p(w_{t+1} | w_{\leq t}) = \text{softmax}(W_o h_t). Parameters: W_h \in \mathbb{R}^{d \times d}, W_x \in \mathbb{R}^{d \times d}, W_o \in \mathbb{R}^{V \times d}, plus embeddings in \mathbb{R}^{V \times d}. Total \approx 2d^2 + 2Vd parameters.
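The count can be checked directly. Taking d = 256 and V = 10,000 as example values (hypothetical, chosen here for illustration):

```python
d, V = 256, 10_000  # toy hidden size and vocabulary size

embeddings = V * d  # word embedding table
W_h = d * d         # recurrent weights
W_x = d * d         # input weights
W_o = V * d         # output projection
total = embeddings + W_h + W_x + W_o

assert total == 2 * d * d + 2 * V * d  # matches the 2d^2 + 2Vd count
```

For these sizes the vocabulary-dependent terms (embeddings and output projection) dominate: 2Vd is about forty times larger than 2d^2.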

Example

LSTM forget gate initialization

A practical trick: initialize the forget gate bias b_f to a positive value (e.g., 1 or 2). This makes f_t \approx 1 at initialization, allowing information to flow through the cell state from the start. Without this, the network may learn to forget everything before it learns what to remember.
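The effect is easy to quantify. At initialization the gate pre-activations are near zero, so f_t ≈ sigmoid(b_f); the sketch below compares how much of the initial cell state survives 20 steps under two bias choices (the step count is an illustrative assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# At initialization, W_f [h, x] is near zero, so f_t ~ sigmoid(b_f)
f_zero_bias = sigmoid(0.0)  # = 0.5: half the cell state decays every step
f_bias_two = sigmoid(2.0)   # ~ 0.88: most information survives each step

# Fraction of c_0 surviving after 20 steps (f_t^20, gates held fixed)
survive_zero = f_zero_bias ** 20  # ~ 1e-6: memory is gone
survive_two = f_bias_two ** 20    # ~ 0.08: a usable signal remains
```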

Common Confusions

Watch Out

RNN hidden state is not memory in the human sense

The hidden state h_t is a fixed-size vector that must compress the entire history into d dimensions. It is not a content-addressable memory: it cannot selectively recall specific past inputs. This fundamental bottleneck is what attention mechanisms solve by allowing the model to look back at all previous hidden states.

Watch Out

Vanishing gradients do not mean zero gradients

Vanishing gradients mean that the gradient contribution from distant time steps becomes negligible compared to recent ones. The total gradient is not zero. It is dominated by short-range dependencies. The RNN can still learn, but it effectively ignores long-range patterns.

Summary

  • Simple RNN: h_t = \tanh(W_h h_{t-1} + W_x x_t); same weights at every time step
  • Gradient scales as \|W_h\|^{T-t}: vanishing if \sigma_{\max} < 1, exploding if \sigma_{\max} > 1
  • LSTM adds a cell state highway with forget/input/output gates to preserve long-range gradient flow
  • GRU simplifies LSTM to two gates (reset and update) with comparable performance
  • Transformers replaced RNNs due to parallelism and superior long-range modeling, but RNN-like models are resurging via state space models

Exercises

ExerciseCore

Problem

An LSTM has hidden dimension d = 256. How many parameters are in the recurrent weight matrices (ignore biases and input weights)? Compare to a simple RNN with the same hidden dimension.

ExerciseAdvanced

Problem

Show that when all forget gates f_t = 1 and all input gates i_t = 0, the LSTM cell state c_t = c_0 for all t. What does this imply about the gradient \partial c_T / \partial c_0?

References

Canonical:

  • Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997)
  • Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" (2014)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 10

Current:

  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

The natural next steps from RNNs:

  • Transformers: the attention-based architecture that replaced RNNs
  • Attention mechanisms: the key innovation enabling long-range dependencies

Last reviewed: April 2026
