
ML Methods

Recurrent Neural Networks

Sequential processing via hidden state recurrence: the simple RNN, vanishing and exploding gradients, LSTM gating mechanisms, and why transformers have largely replaced RNNs.

Advanced · Tier 2 · Stable · ~55 min

Why This Matters

RNNs are the conceptual foundation for sequence modeling in deep learning. Even though transformers have largely replaced them in practice, understanding RNNs is essential because: (1) the vanishing gradient problem they suffer from motivates the design of LSTMs, GRUs, and ultimately attention mechanisms; (2) gating mechanisms invented for LSTMs appear throughout modern architectures; (3) RNN-like state machines are resurgent in efficient inference (state space models, linear attention).

[Figure: RNN unrolled through time — inputs x_0, …, x_4 feed hidden states h_0, …, h_4, each producing an output y_t, with the same weights W shared across all steps.]

Mental Model

A feedforward network processes each input independently. An RNN maintains a hidden state h_t that summarizes everything seen so far. At each time step, the new hidden state is a function of the previous state and the current input. This is a learned dynamical system, and the hidden state is the system's memory.

The Simple RNN

Definition

Simple (Elman) RNN

The simple RNN update rule is:

h_t = \tanh(W_h h_{t-1} + W_x x_t + b)

where h_t \in \mathbb{R}^d is the hidden state, x_t is the input at time t, W_h \in \mathbb{R}^{d \times d} is the recurrent weight matrix, W_x is the input weight matrix, and b is the bias. The output at each step is typically o_t = W_o h_t.

The same parameters (W_h, W_x, b) are shared across all time steps: this is weight sharing over time, analogous to weight sharing over space in CNNs.
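The update rule can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions and random weights (all sizes and the `scale` value below are arbitrary choices, not from the text), showing how the same (W_h, W_x, b) are reused at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in, T = 4, 3, 5  # hidden size, input size, sequence length (toy values)

# Shared parameters, applied at every time step
W_h = rng.normal(scale=0.5, size=(d, d))
W_x = rng.normal(scale=0.5, size=(d, d_in))
b = np.zeros(d)

def rnn_forward(xs, h0):
    """Run the Elman update h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a sequence."""
    h = h0
    states = []
    for x in xs:                      # sequential: step t needs step t-1
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states

xs = rng.normal(size=(T, d_in))
states = rnn_forward(xs, np.zeros(d))  # T hidden states, each in (-1, 1)^d
```

Note the loop: each step depends on the previous one, which is exactly the sequential bottleneck discussed later.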

Backpropagation Through Time (BPTT)

Training an RNN requires computing gradients through the entire sequence. Unrolling the recurrence for T steps produces a very deep feedforward network with shared weights. The gradient of the loss at time T with respect to the hidden state at time t involves a product of Jacobians:

\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \text{diag}(\tanh'(z_k)) \cdot W_h

where z_k = W_h h_{k-1} + W_x x_k + b. This product of matrices is the source of both vanishing and exploding gradients.

The Vanishing/Exploding Gradient Problem

Proposition

Gradient Scaling via Eigenvalues of W_h

Statement

For the simple RNN, the norm of the gradient \|\partial h_T / \partial h_t\| scales approximately as \|W_h\|^{T-t} (ignoring the tanh derivative, which is bounded by 1). Specifically:

  • If the largest singular value \sigma_{\max}(W_h) < 1, gradients vanish exponentially: \|\partial h_T / \partial h_t\| \to 0 as T - t \to \infty
  • If \sigma_{\max}(W_h) > 1, gradients can explode exponentially

With the tanh derivative included (which is in [0, 1]), vanishing is the dominant failure mode.
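The two regimes can be observed numerically. The sketch below is an idealized setup chosen so the theory applies cleanly: W_h is a scaled orthogonal matrix (so every singular value equals sigma), and pre-activations are kept small so \tanh'(z) \approx 1. All dimensions and scales here are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps = 8, 50

def bptt_gradient_norm(sigma, steps, rng):
    """2-norm of the BPTT Jacobian product prod_k diag(tanh'(z_k)) W_h,
    where W_h is orthogonal scaled so all its singular values equal sigma."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    W_h = sigma * Q
    J = np.eye(d)
    for _ in range(steps):
        z = 0.05 * rng.normal(size=d)          # small pre-activations: tanh'(z) ~ 1
        J = np.diag(1.0 - np.tanh(z) ** 2) @ W_h @ J
    return np.linalg.norm(J, 2)

vanishing = bptt_gradient_norm(0.9, steps, rng)  # sigma_max < 1: decays like 0.9^50
exploding = bptt_gradient_norm(1.2, steps, rng)  # sigma_max > 1: grows like 1.2^50
```

With sigma = 0.9 the gradient norm is bounded above by 0.9^50 ≈ 0.005; with sigma = 1.2 it grows by orders of magnitude over the same 50 steps.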

Intuition

Multiplying a matrix by itself many times amplifies its dominant eigenvalue. If that eigenvalue is less than 1, repeated multiplication drives the product to zero. If greater than 1, it explodes. Since \tanh'(x) \leq 1, the tanh activation only makes vanishing worse, never better.

Proof Sketch

Bound \|\prod_{k=t+1}^T \text{diag}(\tanh'(z_k)) W_h\| \leq \prod_{k=t+1}^T \|\text{diag}(\tanh'(z_k))\| \cdot \|W_h\| \leq \|W_h\|^{T-t} since |\tanh'(z)| \leq 1. If \|W_h\| < 1 (as an operator norm), this bound decays exponentially with T - t.

Why It Matters

Vanishing gradients mean the RNN cannot learn long-range dependencies: the gradient signal from a target at time T cannot reach the hidden state at time t \ll T. This is the fundamental limitation that motivated LSTMs, GRUs, and eventually transformers.

Failure Mode

This analysis uses worst-case norm bounds. In practice, gradients can be well-behaved even when \|W_h\| > 1 if the network operates in a regime where the effective Jacobian spectral radius stays near 1 (the "edge of chaos"). But simple RNNs are fragile: slight changes push them into vanishing or exploding regimes.

Gradient clipping is a common remedy for exploding gradients: if \|g\| > \theta, rescale g \leftarrow \theta \cdot g / \|g\|. This prevents catastrophic parameter updates but does not solve vanishing gradients.
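The clipping rule is a one-liner in practice. A minimal sketch (the function name and threshold are illustrative):

```python
import numpy as np

def clip_gradient(g, theta):
    """If ||g|| > theta, rescale g to have norm theta; direction is preserved."""
    norm = np.linalg.norm(g)
    if norm > theta:
        return theta * g / norm
    return g

g = np.array([3.0, 4.0])         # ||g|| = 5
clipped = clip_gradient(g, 1.0)  # rescaled to norm 1, same direction
```

Gradients below the threshold pass through unchanged, which is why clipping bounds explosions without helping vanishing.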

LSTM: Long Short-Term Memory

Definition

LSTM Cell

The LSTM introduces a separate cell state c_t that acts as an information highway, controlled by three gates:

Forget gate (what to erase from cell state): f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)

Input gate (what new information to write): i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) and \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)

Cell state update (highway with gated read/write): c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

Output gate (what to expose as hidden state): o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) and h_t = o_t \odot \tanh(c_t)

Here \sigma is the sigmoid function and \odot is element-wise multiplication. All gates output values in [0, 1].

The cell state c_t is the critical innovation. When f_t \approx 1 and i_t \approx 0, the cell state passes through unchanged, so gradients flow backward without decay. This is the "highway" that solves vanishing gradients: the gradient through the cell state is \partial c_t / \partial c_{t-1} = \text{diag}(f_t), and when forget gates are near 1, this is approximately the identity.
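A single LSTM step, written out gate by gate, is a direct transcription of the four equations above. This is a minimal NumPy sketch with toy dimensions and random weights (one weight matrix per gate acting on the concatenated [h_{t-1}, x_t]; sizes and scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_in = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on the concatenated [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (rng.normal(scale=0.3, size=(d, d + d_in)) for _ in range(4))
b_f = np.ones(d)               # positive forget bias: start by remembering
b_i = b_c = b_o = np.zeros(d)

def lstm_step(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx + b_f)        # forget gate: what to erase
    i = sigmoid(W_i @ hx + b_i)        # input gate: what to write
    c_tilde = np.tanh(W_c @ hx + b_c)  # candidate content
    c = f * c_prev + i * c_tilde       # gated highway update
    o = sigmoid(W_o @ hx + b_o)        # output gate: what to expose
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(np.zeros(d), np.zeros(d), rng.normal(size=d_in))
```

Note that the cell update `c = f * c_prev + i * c_tilde` is element-wise addition, not a matrix multiply — that is why gradients through c can avoid the repeated-Jacobian decay.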

GRU: Gated Recurrent Unit

Definition

GRU Cell

The GRU simplifies the LSTM by merging the cell state and hidden state, using two gates instead of three:

Reset gate: r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)

Update gate: z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)

Candidate state: \tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t] + b)

Hidden state update: h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

The update gate z_t interpolates between keeping the old state and adopting the new candidate. It simultaneously plays the roles of the LSTM forget and input gates.

GRUs have fewer parameters than LSTMs (3d^2 vs. 4d^2 for recurrent weights) and often perform comparably. Neither consistently outperforms the other across all tasks.
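A GRU step mirrors the LSTM sketch but with three weight matrices instead of four, which is where the 3d^2 vs. 4d^2 count comes from. Again a toy NumPy sketch with illustrative dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_in = 4, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three weight matrices (vs. four for the LSTM), each on the concatenated input
W_r, W_z, W = (rng.normal(scale=0.3, size=(d, d + d_in)) for _ in range(3))
b_r = b_z = b = np.zeros(d)

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx + b_r)   # reset gate: how much old state feeds the candidate
    z = sigmoid(W_z @ hx + b_z)   # update gate: interpolation weight
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]) + b)  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # blend old state and candidate

h = gru_step(np.zeros(d), rng.normal(size=d_in))
```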

Why Transformers Replaced RNNs

RNNs process sequences step by step. The computation at time t depends on time t-1. This creates two problems:

  1. Sequential bottleneck: cannot parallelize across time steps during training. Transformers process all positions simultaneously.
  2. Long-range dependencies: even LSTMs struggle with very long sequences (hundreds to thousands of steps). Transformers connect every position to every other via self-attention with O(1) path length.

However, RNNs retain advantages for online/streaming settings and when memory is limited (constant state size vs. growing KV cache). State space models (Mamba, etc.) combine RNN-like recurrence with transformer-like performance.

Canonical Examples

Example

Simple RNN language model

Given a vocabulary of V words, embed each word as x_t \in \mathbb{R}^d, update h_t = \tanh(W_h h_{t-1} + W_x x_t), and predict the next word via p(w_{t+1} | w_{\leq t}) = \text{softmax}(W_o h_t). Parameters: W_h \in \mathbb{R}^{d \times d}, W_x \in \mathbb{R}^{d \times d}, W_o \in \mathbb{R}^{V \times d}, plus embeddings in \mathbb{R}^{V \times d}. Total \approx 2d^2 + 2Vd parameters.
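The count can be checked directly. Taking d = 256 and V = 10,000 as example values (hypothetical, chosen here for illustration):

```python
d, V = 256, 10_000  # toy hidden size and vocabulary size

embeddings = V * d  # word embedding table
W_h = d * d         # recurrent weights
W_x = d * d         # input weights
W_o = V * d         # output projection
total = embeddings + W_h + W_x + W_o

assert total == 2 * d * d + 2 * V * d  # matches the 2d^2 + 2Vd count
```

For these sizes the vocabulary-dependent terms (embeddings and output projection) dominate: 2Vd is about forty times larger than 2d^2.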

Example

LSTM forget gate initialization

A practical trick: initialize the forget gate bias b_f to a positive value (e.g., 1 or 2). This makes f_t \approx 1 at initialization, allowing information to flow through the cell state from the start. Without this, the network may learn to forget everything before it learns what to remember.
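The effect is easy to quantify. At initialization the gate pre-activations are near zero, so f_t ≈ sigmoid(b_f); the sketch below compares how much of the initial cell state survives 20 steps under two bias choices (the step count is an illustrative assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# At initialization, W_f [h, x] is near zero, so f_t ~ sigmoid(b_f)
f_zero_bias = sigmoid(0.0)  # = 0.5: half the cell state decays every step
f_bias_two = sigmoid(2.0)   # ~ 0.88: most information survives each step

# Fraction of c_0 surviving after 20 steps (f_t^20, gates held fixed)
survive_zero = f_zero_bias ** 20  # ~ 1e-6: memory is gone
survive_two = f_bias_two ** 20    # ~ 0.08: a usable signal remains
```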

Common Confusions

Watch Out

RNN hidden state is not memory in the human sense

The hidden state h_t is a fixed-size vector that must compress the entire history into d dimensions. It is not a content-addressable memory: it cannot selectively recall specific past inputs. This fundamental bottleneck is what attention mechanisms solve by allowing the model to look back at all previous hidden states.

Watch Out

Vanishing gradients do not mean zero gradients

Vanishing gradients mean that the gradient contribution from distant time steps becomes negligible compared to recent ones. The total gradient is not zero. It is dominated by short-range dependencies. The RNN can still learn, but it effectively ignores long-range patterns.

Summary

  • Simple RNN: h_t = \tanh(W_h h_{t-1} + W_x x_t); same weights at every time step
  • Gradient scales as \|W_h\|^{T-t}: vanishing if \sigma_{\max} < 1, exploding if \sigma_{\max} > 1
  • LSTM adds a cell state highway with forget/input/output gates to preserve long-range gradient flow
  • GRU simplifies LSTM to two gates (reset and update) with comparable performance
  • Transformers replaced RNNs due to parallelism and superior long-range modeling, but RNN-like models are resurging via state space models

Exercises

ExerciseCore

Problem

An LSTM has hidden dimension d = 256. How many parameters are in the recurrent weight matrices (ignore biases and input weights)? Compare to a simple RNN with the same hidden dimension.

ExerciseAdvanced

Problem

Show that when all forget gates f_t = 1 and all input gates i_t = 0, the LSTM cell state c_t = c_0 for all t. What does this imply about the gradient \partial c_T / \partial c_0?

References

Canonical:

  • Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997)
  • Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" (2014)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 10

Current:

  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Next Topics

The natural next steps from RNNs:

  • Transformers: the attention-based architecture that replaced RNNs
  • Attention mechanisms: the key innovation enabling long-range dependencies

Last reviewed: April 2026
