RL Theory
Temporal Difference Learning
Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. Covers TD(0), SARSA, and TD($\lambda$) with eligibility traces.
Why This Matters
Temporal difference learning is the central idea in reinforcement learning. It combines two insights:
- From Monte Carlo: learn from experience, without a model of the environment.
- From dynamic programming: bootstrap, using current value estimates to update themselves, without waiting for the final outcome.
TD methods can learn online (update after each step), which Monte Carlo cannot do mid-episode. TD methods do not require a model of the environment, which dynamic programming does. This combination makes TD practical for large, complex environments.
Q-learning is a TD method. SARSA is a TD method. Every deep RL algorithm that learns a value function uses some form of TD update.
The TD Error
The core quantity in TD learning is the TD error (or temporal difference error):

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

where $R_{t+1}$ is the reward received, $\gamma \in [0, 1)$ is the discount factor, $V(S_{t+1})$ is the current estimate of the next state's value, and $V(S_t)$ is the current estimate of the current state's value.

If $V = v_\pi$ (the true value function), then $\mathbb{E}_\pi[\delta_t \mid S_t = s] = 0$ for every state $s$. The TD error has zero mean under the true value function. This is a direct consequence of the Bellman equation:

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right].$$
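The zero-mean property can be checked numerically. A minimal sketch on a hypothetical two-state chain (the states, rewards, and probabilities below are illustrative assumptions, not from the text):

```python
# Hypothetical chain: from s0 the process moves to terminal t1 (reward 1)
# or terminal t2 (reward 0) with equal probability; terminals have value 0.
gamma = 0.9
v = {"s0": 0.5, "t1": 0.0, "t2": 0.0}  # v(s0) = 0.5*1 + 0.5*0 is the true value

transitions = [(0.5, 1.0, "t1"),  # (probability, reward, next state)
               (0.5, 0.0, "t2")]

# Expected TD error at s0 under the true value function: exactly zero.
expected_delta = sum(p * (r + gamma * v[s_next] - v["s0"])
                     for p, r, s_next in transitions)
```

Each branch contributes a nonzero TD error ($+0.5$ and $-0.5$ here), but they cancel in expectation, which is exactly what the Bellman equation guarantees.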
Core Algorithms
TD(0) Update
The TD(0) update rule for state-value estimation under policy $\pi$ is:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

where $\alpha \in (0, 1]$ is the step size (learning rate). After each transition $(S_t, R_{t+1}, S_{t+1})$, update $V(S_t)$ using the observed reward $R_{t+1}$ and the bootstrapped estimate $\gamma V(S_{t+1})$.
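The update rule fits in a few lines of tabular code. A minimal sketch, assuming an episode interface that yields `(state, reward, next_state)` transitions (the chain used in the example is made up for illustration):

```python
def td0(sample_episode, alpha=0.1, gamma=1.0, episodes=1000):
    """Tabular TD(0) policy evaluation.

    sample_episode() returns a list of (state, reward, next_state)
    transitions; unseen states (including terminals) are valued 0.
    """
    V = {}
    for _ in range(episodes):
        for s, r, s_next in sample_episode():
            # TD error: reward plus discounted bootstrap minus current estimate
            delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * delta  # TD(0) update
    return V

# Deterministic two-step chain: s0 -(r=0)-> s1 -(r=1)-> terminal.
# Both true values are 1.0 under gamma = 1; TD(0) converges to them.
V = td0(lambda: [("s0", 0.0, "s1"), ("s1", 1.0, "terminal")])
```

Note that each state is updated as soon as its transition is observed, mid-episode, which is the online property Monte Carlo lacks.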
SARSA
SARSA is TD(0) for action-value functions. After observing $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, update:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

The name comes from the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in each update. SARSA is on-policy: it evaluates and improves the policy it is currently following.
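A sketch of the SARSA update paired with an $\epsilon$-greedy behavior policy (the table layout and state/action names are assumptions for illustration):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA update from the quintuple (s, a, r, s_next, a_next)."""
    target = r + gamma * Q[(s_next, a_next)]  # bootstraps on the action taken
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One update on an empty table: target = 1.0 + 0.99 * 0, so Q grows by alpha.
Q = defaultdict(float)
sarsa_update(Q, "s", "a", 1.0, "t", "b")  # Q[("s", "a")] becomes 0.1
```

Because `a_next` is drawn from the same $\epsilon$-greedy policy being evaluated, the algorithm is on-policy.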
Comparison: TD vs Monte Carlo vs DP
Monte Carlo waits until the end of an episode to compute the return $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}$ and updates $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$. No bootstrapping. High variance (the target depends on the entire trajectory), zero bias.
TD(0) updates immediately using $R_{t+1} + \gamma V(S_{t+1})$ as a target. Bootstrapping introduces bias (the target uses the current, possibly wrong, estimate $V(S_{t+1})$) but reduces variance (the target depends on a single transition, not the whole trajectory).
Dynamic programming computes $v_\pi$ exactly via expected updates. Requires a model $p(s', r \mid s, a)$. No sampling error, but needs the full transition dynamics.
TD sits between MC and DP. It samples like MC but bootstraps like DP.
Main Theorems
TD(0) Convergence
Statement
Under a fixed policy $\pi$, if the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, and every state is visited infinitely often, then the TD(0) iterates $V_t(s)$ converge to $v_\pi(s)$ with probability 1 for all $s$.
Intuition
TD(0) is a stochastic approximation algorithm for solving the Bellman equation $V = T^\pi V$, where $T^\pi$ is the Bellman operator. The TD error is a noisy sample of $(T^\pi V)(s) - V(s)$. Standard stochastic approximation theory guarantees convergence when the operator is a contraction (which $T^\pi$ is under $\gamma < 1$) and the noise conditions are met.
Proof Sketch
Reformulate TD(0) as a stochastic approximation: $V_{t+1}(S_t) = V_t(S_t) + \alpha_t \left[(T^\pi V_t)(S_t) - V_t(S_t) + \eta_t\right]$, where $\eta_t$ is zero-mean noise. The Bellman operator $T^\pi$ is a $\gamma$-contraction in the weighted max-norm. By the ODE method of Borkar and Meyn, or the Robbins-Monro theorem applied to contractive operators, convergence follows.
Why It Matters
This guarantees that TD(0) finds the correct value function. The step size conditions are the standard ones from stochastic approximation: $\sum_t \alpha_t$ diverges (ensuring the iterates can reach any target) while $\sum_t \alpha_t^2$ converges (ensuring the noise averages out). In practice, constant step sizes are used, which gives tracking ability rather than convergence.
Failure Mode
The convergence is for tabular TD(0) with a fixed policy. With function approximation (e.g., neural networks), TD can diverge. The "deadly triad" of function approximation + bootstrapping + off-policy training can cause instability. Baird's counterexample demonstrates this failure explicitly.
TD($\lambda$): Eligibility Traces
TD(0) uses a one-step target: $R_{t+1} + \gamma V(S_{t+1})$. The $n$-step target uses $n$ steps of actual rewards:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

TD($\lambda$) takes an exponentially weighted average of all $n$-step returns:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

When $\lambda = 0$, this reduces to TD(0). When $\lambda = 1$, this equals the Monte Carlo return (assuming the episode terminates). Values of $\lambda$ between 0 and 1 interpolate between bootstrapping and full returns.
Eligibility traces implement TD($\lambda$) efficiently. Maintain a trace $e_t(s)$ for each state:

$$e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbf{1}[S_t = s]$$

Then update all states simultaneously: $V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s)$. States visited recently get credit for the current TD error; the trace decays by $\gamma\lambda$ per step.
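The backward view can be sketched directly from the trace recurrence (the episode interface and state names below are assumptions for illustration):

```python
def td_lambda_episode(transitions, V, alpha=0.1, gamma=1.0, lam=0.8):
    """One episode of TD(lambda), backward view, accumulating traces.

    transitions: list of (state, reward, next_state); unseen states
    (including terminals) are valued 0.
    """
    e = {}  # eligibility trace e(s) per visited state
    for s, r, s_next in transitions:
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # TD error
        e[s] = e.get(s, 0.0) + 1.0            # accumulate trace on visit
        for state in e:                       # every traced state shares delta
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] = gamma * lam * e[state]  # decay by gamma * lambda
    return V

# Two-step chain s0 -(r=0)-> s1 -(r=1)-> terminal with lambda = 0.8:
# the terminal TD error (delta = 1) is credited back to s0 via its trace.
V = td_lambda_episode([("s0", 0.0, "s1"), ("s1", 1.0, "terminal")], {})
# V["s1"] = 0.1 (direct update); V["s0"] = 0.08 (alpha * delta * trace 0.8)
```

With `lam=0.0` the trace on earlier states is wiped out each step and the function reduces to plain TD(0), matching the $\lambda = 0$ case above.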
Common Confusions
TD converges to a different solution than Monte Carlo with function approximation
With linear function approximation, TD(0) converges to the fixed point of the projected Bellman operator; equivalently, it minimizes the mean-squared projected Bellman error. Monte Carlo converges to the function that minimizes mean-squared error against the true returns. These are different objectives and generally give different solutions. Neither is strictly better.
Bootstrapping introduces bias, not error
Bootstrapping means the update target $R_{t+1} + \gamma V(S_{t+1})$ uses the current estimate $V(S_{t+1})$, which is wrong early in learning. This makes the target a biased estimate of $v_\pi(S_t)$. But the bias shrinks as $V$ improves. The benefit is variance reduction: a one-step target has much lower variance than a full Monte Carlo return in long episodes.
SARSA is on-policy, Q-learning is off-policy
SARSA updates toward $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action actually taken. Q-learning updates toward $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$, which is the value under the greedy policy, regardless of which action was taken. This distinction matters for convergence guarantees under function approximation.
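The difference between the two targets is a single line of code. A toy illustration (the Q-values below are made-up numbers, not from the text):

```python
# Hypothetical Q-values for one next state s' with two actions; the behavior
# policy happened to pick the non-greedy action "left" as A_{t+1}.
Q = {("s'", "left"): 0.2, ("s'", "right"): 0.9}
r, gamma = 1.0, 0.9

# SARSA (on-policy): bootstrap on the action actually taken next.
sarsa_target = r + gamma * Q[("s'", "left")]  # 1 + 0.9 * 0.2 = 1.18

# Q-learning (off-policy): bootstrap on the greedy action, whatever was taken.
q_learning_target = r + gamma * max(Q[("s'", a)] for a in ("left", "right"))
# 1 + 0.9 * 0.9 = 1.81
```

Whenever the behavior policy explores a non-greedy action, the two targets diverge, which is exactly the on-policy versus off-policy distinction.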
Canonical Examples
TD(0) in a random walk
Consider a 5-state random walk: states A, B, C, D, E with terminal states at each end. The agent moves left or right with equal probability; a reward of 1 is received on reaching the right terminal and 0 otherwise. True values under no discounting are $v(A), \ldots, v(E) = \tfrac{1}{6}, \tfrac{2}{6}, \tfrac{3}{6}, \tfrac{4}{6}, \tfrac{5}{6}$. Initialize $V(s) = 0.5$ for all $s$. After the first transition from C to D with reward 0: $\delta = 0 + V(D) - V(C) = 0$. No update. After a transition from D to terminal-right with reward 1: $\delta = 1 + 0 - V(D) = 0.5$. With $\alpha = 0.1$: $V(D) \leftarrow 0.5 + 0.1 \times 0.5 = 0.55$.
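The worked numbers above can be checked in a few lines (a minimal sketch of just the two updates described, not a full simulation of the walk):

```python
# Random-walk check: V initialized to 0.5 everywhere, gamma = 1, alpha = 0.1.
V = {s: 0.5 for s in "ABCDE"}
alpha, gamma = 0.1, 1.0

def td0_update(V, s, r, v_next):
    """Apply one TD(0) update to V[s]; return the TD error."""
    delta = r + gamma * v_next - V[s]
    V[s] += alpha * delta
    return delta

# Transition C -> D, reward 0: delta = 0 + 0.5 - 0.5 = 0, so no update.
d1 = td0_update(V, "C", 0.0, V["D"])
# Transition D -> right terminal, reward 1 (terminal value 0): delta = 0.5,
# so V["D"] moves from 0.5 to 0.55.
d2 = td0_update(V, "D", 1.0, 0.0)
```

Running many episodes of this update drives the table toward the true values $\tfrac{1}{6}, \ldots, \tfrac{5}{6}$.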
Exercises
Problem
In TD(0), why does the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ have zero mean under the true value function $v_\pi$? State the precise condition.
Problem
Show that the sum of discounted TD errors along a trajectory equals the Monte Carlo error. That is, for an episode ending at time $T$, prove:

$$G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$$

where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}$ is the return and $\delta_k = R_{k+1} + \gamma V(S_{k+1}) - V(S_k)$. Assume $V$ is held fixed throughout the episode.
References
Canonical:
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 6-7, 12
- Bertsekas & Tsitsiklis, Neuro-Dynamic Programming (1996), Chapter 5
Current:
- Szepesvari, Algorithms for Reinforcement Learning (2010), Chapter 3
- Dann, Lattimore, Brunskill, "Unifying PAC and Regret" (2017) for finite-time TD bounds
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Value Iteration and Policy Iteration (Layer 2)