
RL Theory

Temporal Difference Learning

Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. This topic covers TD(0), SARSA, and TD(λ) with eligibility traces.


Why This Matters

Temporal difference learning is the central idea in reinforcement learning. It combines two insights:

  1. From Monte Carlo: learn from experience, without a model of the environment.
  2. From dynamic programming: bootstrap, using current value estimates to update themselves, without waiting for the final outcome.

TD methods can learn online (update after each step), which Monte Carlo cannot do mid-episode. TD methods do not require a model of the environment, which dynamic programming does. This combination makes TD practical for large, complex environments.

Q-learning is a TD method. SARSA is a TD method. Every deep RL algorithm that learns a value function uses some form of TD update.

The TD Error

The core quantity in TD learning is the TD error (or temporal difference error):

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

where $R_{t+1}$ is the reward received, $\gamma$ is the discount factor, $V(S_{t+1})$ is the current estimate of the next state's value, and $V(S_t)$ is the current estimate of the current state's value.

If $V = V^\pi$ (the true value function), then $\mathbb{E}[\delta_t \mid S_t = s] = 0$. The TD error has zero mean under the true value function. This is a direct consequence of the Bellman equation:

$$V^\pi(s) = \mathbb{E}[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$$
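The zero-mean property can be checked numerically. The sketch below uses a hypothetical two-state Markov reward process (the states, rewards, and transitions are invented for illustration) where the true values are known in closed form, and verifies that the empirical mean of the TD error at $V^\pi$ is near zero:

```python
import random

gamma = 0.9
# Two-state cycle: state 0 -> state 1 (random reward with mean 1),
# state 1 -> state 0 (reward 0). Solving the Bellman equations
# V(0) = 1 + gamma*V(1) and V(1) = gamma*V(0) gives the true values.
V_true = {0: 1 / (1 - gamma**2), 1: gamma / (1 - gamma**2)}

rng = random.Random(0)
deltas = []
for _ in range(100_000):
    r = rng.choice([0.0, 2.0])                 # reward from state 0: mean is 1
    delta = r + gamma * V_true[1] - V_true[0]  # TD error at the true values
    deltas.append(delta)

mean_delta = sum(deltas) / len(deltas)
print(abs(mean_delta) < 0.05)  # prints True: the TD error averages to ~0
```

Each individual TD error here is $\pm 1$, so the error is far from zero sample by sample; only its expectation vanishes.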

Core Algorithms

Definition

TD(0) Update

The TD(0) update rule for state-value estimation under policy $\pi$ is:

$$V(S_t) \leftarrow V(S_t) + \alpha \delta_t = V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

where $\alpha > 0$ is the step size (learning rate). After each transition $(S_t, A_t, R_{t+1}, S_{t+1})$, update $V(S_t)$ using the observed reward and the bootstrapped estimate $V(S_{t+1})$.
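As a sketch, the update can be wrapped in a small tabular prediction loop. The environment interface (`env_step` returning `(reward, next_state, done)`) and the toy chain below are assumptions for illustration, not from the text:

```python
from collections import defaultdict

def td0_prediction(env_step, start_state, alpha=0.1, gamma=1.0, num_episodes=1000):
    """Tabular TD(0) prediction for a fixed policy baked into env_step."""
    V = defaultdict(float)                  # value estimates initialized to 0
    for _ in range(num_episodes):
        s, done = start_state, False
        while not done:
            r, s_next, done = env_step(s)
            # Bootstrapped target; terminal states have value 0.
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # TD(0) update
            s = s_next
    return V

# Toy deterministic chain 0 -> 1 -> 2 -> 3 -> terminal, reward 1 on the last step.
def chain_step(s):
    return (1.0, 4, True) if s == 3 else (0.0, s + 1, False)

V = td0_prediction(chain_step, 0)
# With gamma = 1, every state's true value is 1; the estimates converge there.
```

Note that the update happens online, inside the episode, using only the current transition.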

Definition

SARSA

SARSA is TD(0) for action-value functions. After observing $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, update:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

The name comes from the quintuple $(S, A, R, S, A)$ used in each update. SARSA is on-policy: it evaluates and improves the policy it is currently following.
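A minimal tabular SARSA sketch with ε-greedy exploration. The 4-state corridor environment and all hyperparameters below are invented for illustration:

```python
import random

def sarsa_corridor(num_episodes=2000, alpha=0.2, gamma=0.9, eps=0.3, seed=0):
    """SARSA on a hypothetical corridor: states 0..3, moving right from
    state 3 ends the episode with reward +1; all other rewards are 0."""
    rng = random.Random(seed)
    actions = ('L', 'R')
    Q = {(s, a): 0.0 for s in range(4) for a in actions}

    def step(s, a):                       # returns (reward, next_state or None)
        if a == 'R':
            return (1.0, None) if s == 3 else (0.0, s + 1)
        return (0.0, max(s - 1, 0))

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, a = 0, eps_greedy(0)
        while True:
            r, s_next = step(s, a)
            if s_next is None:            # terminal: bootstrap value is 0
                Q[(s, a)] += alpha * (r - Q[(s, a)])
                break
            a_next = eps_greedy(s_next)   # the action actually taken: on-policy
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

Q = sarsa_corridor()
# Under this seed, the learned greedy policy moves right in every state.
```

The key on-policy detail is that `a_next` comes from the behavior policy (ε-greedy), not from a max over actions.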

Comparison: TD vs Monte Carlo vs DP

Monte Carlo waits until the end of an episode to compute $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ and updates $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$. No bootstrapping. High variance (depends on the entire trajectory), zero bias.

TD(0) updates immediately using $R_{t+1} + \gamma V(S_{t+1})$ as a target. Bootstrapping introduces bias (the target uses the current, possibly wrong, estimate $V(S_{t+1})$) but reduces variance (the target depends on a single transition, not the whole trajectory).

Dynamic programming computes $V(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma V(s')]$ exactly. Requires a model $p(s', r \mid s, a)$. No sampling error, but needs the full transition dynamics.

TD sits between MC and DP. It samples like MC but bootstraps like DP.

Main Theorems

Theorem

TD(0) Convergence

Statement

Under a fixed policy $\pi$, if the step sizes satisfy $\sum_{t=0}^\infty \alpha_t = \infty$ and $\sum_{t=0}^\infty \alpha_t^2 < \infty$, and every state is visited infinitely often, then the TD(0) iterates $V_t(s)$ converge to $V^\pi(s)$ with probability 1 for all $s$.

Intuition

TD(0) is a stochastic approximation algorithm for solving the Bellman equation $V = T^\pi V$, where $T^\pi$ is the Bellman operator. The TD error is a noisy sample of $(T^\pi V - V)(s)$. Standard stochastic approximation theory guarantees convergence when the operator is a contraction (which $T^\pi$ is for $\gamma < 1$) and the noise conditions are met.

Proof Sketch

Reformulate TD(0) as a stochastic approximation: $V_{t+1} = V_t + \alpha_t (T^\pi V_t - V_t + w_t)$, where $w_t$ is zero-mean noise. The Bellman operator $T^\pi$ is a $\gamma$-contraction in the weighted max-norm. By the ODE method of Borkar and Meyn, or the Robbins-Monro theorem applied to contractive operators, convergence follows.

Why It Matters

This guarantees that TD(0) finds the correct value function. The step size conditions are the standard ones from stochastic approximation: the sum diverges (ensuring the iterates can reach any target) while the sum of squares converges (ensuring the noise averages out). In practice, constant step sizes are used, which gives tracking ability rather than convergence.

Failure Mode

The convergence is for tabular TD(0) with a fixed policy. With function approximation (e.g., neural networks), TD can diverge. The "deadly triad" of function approximation + bootstrapping + off-policy training can cause instability. Baird's counterexample demonstrates this failure explicitly.

TD(λ): Eligibility Traces

TD(0) uses a one-step target: $R_{t+1} + \gamma V(S_{t+1})$. The $n$-step target uses $n$ steps of actual rewards:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

TD(λ) takes an exponentially weighted average of all $n$-step returns:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

When $\lambda = 0$, this reduces to TD(0). When $\lambda = 1$, this equals the Monte Carlo return $G_t$ (assuming the episode terminates). Values of $\lambda$ between 0 and 1 interpolate between bootstrapping and full returns.

Eligibility traces implement TD(λ) efficiently. Maintain a trace $e_t(s)$ for each state:

$$e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}(S_t = s)$$

Then update all states simultaneously: $V(s) \leftarrow V(s) + \alpha \delta_t e_t(s)$. States visited recently get credit for the current TD error; the trace decays by $\gamma \lambda$ per step.
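A sketch of this backward-view update with accumulating traces, processing one episode of transitions from a fixed policy. The `(state, reward, next_state, done)` tuple format and the default hyperparameters are assumptions for illustration:

```python
from collections import defaultdict

def td_lambda_episode(transitions, V, alpha=0.1, gamma=1.0, lam=0.8):
    """One episode of TD(lambda) with accumulating eligibility traces.
    `transitions` is a list of (s, r, s_next, done) tuples."""
    e = defaultdict(float)                   # eligibility trace per state
    for s, r, s_next, done in transitions:
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] += 1.0                          # accumulate trace for visited state
        for state in list(e):
            V[state] += alpha * delta * e[state]  # credit all traced states
            e[state] *= gamma * lam               # decay every trace
    return V
```

On a three-step chain ending with reward 1 (zero-initialized values, $\gamma = 1$, $\lambda = 0.8$, $\alpha = 0.1$), a single episode already moves every visited state: the final TD error of 1 is credited as $0.1$, $0.08$, and $0.064$ going backward, decaying by $\gamma\lambda$ per step of recency.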

Common Confusions

Watch Out

TD converges to a different solution than Monte Carlo with function approximation

With linear function approximation, TD(0) converges to the fixed point of the projected Bellman operator, which minimizes the mean-squared projected Bellman error. Monte Carlo converges to the function that minimizes mean-squared error against the true returns. These are different objectives and give different solutions. Neither is strictly better.

Watch Out

Bootstrapping introduces bias, not error

Bootstrapping means the update target $R + \gamma V(S')$ uses the current estimate $V(S')$, which is wrong early in learning. This makes the target a biased estimate of $V^\pi(S)$. But the bias shrinks as $V$ improves. The benefit is variance reduction: a one-step target has much lower variance than a full Monte Carlo return in long episodes.

Watch Out

SARSA is on-policy, Q-learning is off-policy

SARSA updates $Q(S_t, A_t)$ toward $R + \gamma Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action actually taken. Q-learning updates toward $R + \gamma \max_a Q(S_{t+1}, a)$, which is the value under the greedy policy, regardless of which action was taken. This distinction matters for convergence guarantees under function approximation.
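The difference is just the backup target. A toy numerical contrast (the Q-values and the transition below are made up):

```python
gamma = 0.9
# Hypothetical action values at the next state, and one observed transition.
Q_next = {'a1': 0.2, 'a2': 0.7}
r = 1.0
a_taken = 'a1'                  # the action the behavior policy actually took

sarsa_target = r + gamma * Q_next[a_taken]            # on-policy: uses a_taken
q_learning_target = r + gamma * max(Q_next.values())  # off-policy: greedy max

print(round(sarsa_target, 2))       # 1.18
print(round(q_learning_target, 2))  # 1.63
```

Whenever the behavior policy explores (takes a non-greedy action), the two targets disagree, as they do here.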

Canonical Examples

Example

TD(0) in a random walk

Consider a 5-state random walk: states A, B, C, D, E with terminal states at each end. The agent moves left or right with equal probability; reaching the right terminal gives reward 1, all other transitions give reward 0. True values under no discounting are $V(A) = 1/6$, $V(B) = 2/6$, $V(C) = 3/6$, $V(D) = 4/6$, $V(E) = 5/6$. Initialize $V(s) = 0.5$ for all $s$. After the first transition from C to D with reward 0: $\delta = 0 + 1 \cdot V(D) - V(C) = 0.5 - 0.5 = 0$. No update. After a transition from D to the right terminal with reward 1: $\delta = 1 + 0 - 0.5 = 0.5$. With $\alpha = 0.1$: $V(D) \leftarrow 0.5 + 0.05 = 0.55$.
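The arithmetic in the example can be replayed directly (a sketch; the state names follow the example):

```python
alpha, gamma = 0.1, 1.0
V = {s: 0.5 for s in 'ABCDE'}    # all estimates initialized to 0.5

# First transition C -> D, reward 0: TD error is zero, so no update.
delta = 0.0 + gamma * V['D'] - V['C']
print(delta)                     # 0.0

# Transition D -> right terminal (value 0), reward 1.
delta = 1.0 + gamma * 0.0 - V['D']
V['D'] += alpha * delta
print(delta, round(V['D'], 2))   # 0.5 0.55
```

Note how the terminal state contributes a bootstrap value of 0, which is what lets the reward signal start propagating backward through the chain.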

Exercises

ExerciseCore

Problem

In TD(0), why does the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ have zero mean under the true value function $V^\pi$? State the precise condition.

ExerciseAdvanced

Problem

Show that the discounted sum of TD errors along a trajectory equals the Monte Carlo error. That is, for an episode ending at time $T$, prove:

$$\sum_{t=0}^{T-1} \gamma^t \delta_t = G_0 - V(S_0)$$

where $G_0 = \sum_{t=0}^{T-1} \gamma^t R_{t+1}$ is the discounted return, $V(S_T) = 0$, and $V$ is held fixed during the episode.

References

Canonical:

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 6-7, 12
  • Bertsekas & Tsitsiklis, Neuro-Dynamic Programming (1996), Chapter 5

Current:

  • Szepesvári, Algorithms for Reinforcement Learning (2010), Chapter 3
  • Dann, Lattimore, Brunskill, "Unifying PAC and Regret" (2017) for finite-time TD bounds

Last reviewed: April 2026
