
RL Theory

Temporal Difference Learning

Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. This topic covers TD(0), SARSA, and TD(λ) with eligibility traces.


Why This Matters

Temporal difference learning is the central idea in reinforcement learning. It combines two insights:

  1. From Monte Carlo: learn from experience, without a model of the environment.
  2. From dynamic programming: bootstrap, using current value estimates to update themselves, without waiting for the final outcome.

TD methods can learn online (update after each step), which Monte Carlo cannot do mid-episode. TD methods do not require a model of the environment, which dynamic programming does. This combination makes TD practical for large, complex environments.

Q-learning is a TD method. SARSA is a TD method. Every deep RL algorithm that learns a value function uses some form of TD update.

The TD Error

The core quantity in TD learning is the TD error (or temporal difference error):

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

where $R_{t+1}$ is the reward received, $\gamma$ is the discount factor, $V(S_{t+1})$ is the current estimate of the next state's value, and $V(S_t)$ is the current estimate of the current state's value.

If $V = V^\pi$ (the true value function), then $\mathbb{E}[\delta_t \mid S_t = s] = 0$. The TD error has zero mean under the true value function. This is a direct consequence of the Bellman equation:

$$V^\pi(s) = \mathbb{E}[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$$
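The zero-mean property can be checked numerically. The sketch below uses a hypothetical two-state Markov reward process (the states, rewards, and transitions are invented for illustration) where the true values are known in closed form, and verifies that the empirical mean of the TD error at $V^\pi$ is near zero:

```python
import random

gamma = 0.9
# Two-state cycle: state 0 -> state 1 (random reward with mean 1),
# state 1 -> state 0 (reward 0). Solving the Bellman equations
# V(0) = 1 + gamma*V(1) and V(1) = gamma*V(0) gives the true values.
V_true = {0: 1 / (1 - gamma**2), 1: gamma / (1 - gamma**2)}

rng = random.Random(0)
deltas = []
for _ in range(100_000):
    r = rng.choice([0.0, 2.0])                 # reward from state 0: mean is 1
    delta = r + gamma * V_true[1] - V_true[0]  # TD error at the true values
    deltas.append(delta)

mean_delta = sum(deltas) / len(deltas)
print(abs(mean_delta) < 0.05)  # prints True: the TD error averages to ~0
```

Each individual TD error here is $\pm 1$, so the error is far from zero sample by sample; only its expectation vanishes.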

Core Algorithms

Definition

TD(0) Update

The TD(0) update rule for state-value estimation under policy $\pi$ is:

$$V(S_t) \leftarrow V(S_t) + \alpha \delta_t = V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

where $\alpha > 0$ is the step size (learning rate). After each transition $(S_t, A_t, R_{t+1}, S_{t+1})$, update $V(S_t)$ using the observed reward and the bootstrapped estimate $V(S_{t+1})$.
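As a sketch, the update can be wrapped in a small tabular prediction loop. The environment interface (`env_step` returning `(reward, next_state, done)`) and the toy chain below are assumptions for illustration, not from the text:

```python
from collections import defaultdict

def td0_prediction(env_step, start_state, alpha=0.1, gamma=1.0, num_episodes=1000):
    """Tabular TD(0) prediction for a fixed policy baked into env_step."""
    V = defaultdict(float)                  # value estimates initialized to 0
    for _ in range(num_episodes):
        s, done = start_state, False
        while not done:
            r, s_next, done = env_step(s)
            # Bootstrapped target; terminal states have value 0.
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # TD(0) update
            s = s_next
    return V

# Toy deterministic chain 0 -> 1 -> 2 -> 3 -> terminal, reward 1 on the last step.
def chain_step(s):
    return (1.0, 4, True) if s == 3 else (0.0, s + 1, False)

V = td0_prediction(chain_step, 0)
# With gamma = 1, every state's true value is 1; the estimates converge there.
```

Note that the update happens online, inside the episode, using only the current transition.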

Definition

SARSA

SARSA is TD(0) for action-value functions. After observing $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, update:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

The name comes from the quintuple $(S, A, R, S, A)$ used in each update. SARSA is on-policy: it evaluates and improves the policy it is currently following.
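A minimal tabular SARSA sketch with ε-greedy exploration. The 4-state corridor environment and all hyperparameters below are invented for illustration:

```python
import random

def sarsa_corridor(num_episodes=2000, alpha=0.2, gamma=0.9, eps=0.3, seed=0):
    """SARSA on a hypothetical corridor: states 0..3, moving right from
    state 3 ends the episode with reward +1; all other rewards are 0."""
    rng = random.Random(seed)
    actions = ('L', 'R')
    Q = {(s, a): 0.0 for s in range(4) for a in actions}

    def step(s, a):                       # returns (reward, next_state or None)
        if a == 'R':
            return (1.0, None) if s == 3 else (0.0, s + 1)
        return (0.0, max(s - 1, 0))

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, a = 0, eps_greedy(0)
        while True:
            r, s_next = step(s, a)
            if s_next is None:            # terminal: bootstrap value is 0
                Q[(s, a)] += alpha * (r - Q[(s, a)])
                break
            a_next = eps_greedy(s_next)   # the action actually taken: on-policy
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

Q = sarsa_corridor()
# Under this seed, the learned greedy policy moves right in every state.
```

The key on-policy detail is that `a_next` comes from the behavior policy (ε-greedy), not from a max over actions.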

Comparison: TD vs Monte Carlo vs DP

Monte Carlo waits until the end of an episode to compute $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ and updates $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$. No bootstrapping. High variance (depends on the entire trajectory), zero bias.

TD(0) updates immediately using $R_{t+1} + \gamma V(S_{t+1})$ as a target. Bootstrapping introduces bias (the target uses the current, possibly wrong, estimate $V(S_{t+1})$) but reduces variance (the target depends on a single transition, not the whole trajectory).

Dynamic programming computes $V(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma V(s')]$ exactly. Requires a model $p(s', r \mid s, a)$. No sampling error, but needs the full transition dynamics.

TD sits between MC and DP. It samples like MC but bootstraps like DP.

Main Theorems

Theorem

TD(0) Convergence

Statement

Under a fixed policy $\pi$, if the step sizes satisfy $\sum_{t=0}^\infty \alpha_t = \infty$ and $\sum_{t=0}^\infty \alpha_t^2 < \infty$, and every state is visited infinitely often, then the TD(0) iterates $V_t(s)$ converge to $V^\pi(s)$ with probability 1 for all $s$.

Intuition

TD(0) is a stochastic approximation algorithm for solving the Bellman equation $V = T^\pi V$, where $T^\pi$ is the Bellman operator. The TD error is a noisy sample of $(T^\pi V - V)(s)$. Standard stochastic approximation theory guarantees convergence when the operator is a contraction (which $T^\pi$ is for $\gamma < 1$) and the noise conditions are met.

Proof Sketch

Reformulate TD(0) as a stochastic approximation: $V_{t+1} = V_t + \alpha_t (T^\pi V_t - V_t + w_t)$, where $w_t$ is zero-mean noise. The Bellman operator $T^\pi$ is a $\gamma$-contraction in the weighted max-norm. By the ODE method of Borkar and Meyn, or the Robbins-Monro theorem applied to contractive operators, convergence follows.

Why It Matters

This guarantees that TD(0) finds the correct value function. The step size conditions are the standard ones from stochastic approximation: the sum diverges (ensuring the iterates can reach any target) while the sum of squares converges (ensuring the noise averages out). In practice, constant step sizes are used, which gives tracking ability rather than convergence.

Failure Mode

The convergence is for tabular TD(0) with a fixed policy. With function approximation (e.g., neural networks), TD can diverge. The "deadly triad" of function approximation + bootstrapping + off-policy training can cause instability. Baird's counterexample demonstrates this failure explicitly.

TD(λ): Eligibility Traces

TD(0) uses a one-step target: $R_{t+1} + \gamma V(S_{t+1})$. The $n$-step target uses $n$ steps of actual rewards:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

TD(λ) takes an exponentially weighted average of all $n$-step returns:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

When $\lambda = 0$, this reduces to TD(0). When $\lambda = 1$, this equals the Monte Carlo return $G_t$ (assuming the episode terminates). Values of $\lambda$ between 0 and 1 interpolate between bootstrapping and full returns.

Eligibility traces implement TD(λ) efficiently. Maintain a trace $e_t(s)$ for each state:

$$e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}(S_t = s)$$

Then update all states simultaneously: $V(s) \leftarrow V(s) + \alpha \delta_t e_t(s)$. States visited recently get credit for the current TD error; the trace decays by $\gamma \lambda$ per step.
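A sketch of this backward-view update with accumulating traces, processing one episode of transitions from a fixed policy. The `(state, reward, next_state, done)` tuple format and the default hyperparameters are assumptions for illustration:

```python
from collections import defaultdict

def td_lambda_episode(transitions, V, alpha=0.1, gamma=1.0, lam=0.8):
    """One episode of TD(lambda) with accumulating eligibility traces.
    `transitions` is a list of (s, r, s_next, done) tuples."""
    e = defaultdict(float)                   # eligibility trace per state
    for s, r, s_next, done in transitions:
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] += 1.0                          # accumulate trace for visited state
        for state in list(e):
            V[state] += alpha * delta * e[state]  # credit all traced states
            e[state] *= gamma * lam               # decay every trace
    return V
```

On a three-step chain ending with reward 1 (zero-initialized values, $\gamma = 1$, $\lambda = 0.8$, $\alpha = 0.1$), a single episode already moves every visited state: the final TD error of 1 is credited as $0.1$, $0.08$, and $0.064$ going backward, decaying by $\gamma\lambda$ per step of recency.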

Common Confusions

Watch Out

TD converges to a different solution than Monte Carlo with function approximation

With linear function approximation, TD(0) converges to the fixed point of the projected Bellman operator, which minimizes the mean-squared projected Bellman error. Monte Carlo converges to the function that minimizes mean-squared error against the true returns. These are different objectives and give different solutions. Neither is strictly better.

Watch Out

Bootstrapping introduces bias, not error

Bootstrapping means the update target $R + \gamma V(S')$ uses the current estimate $V(S')$, which is wrong early in learning. This makes the target a biased estimate of $V^\pi(S)$. But the bias shrinks as $V$ improves. The benefit is variance reduction: a one-step target has much lower variance than a full Monte Carlo return in long episodes.

Watch Out

SARSA is on-policy, Q-learning is off-policy

SARSA updates $Q(S_t, A_t)$ toward $R + \gamma Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action actually taken. Q-learning updates toward $R + \gamma \max_a Q(S_{t+1}, a)$, which is the value under the greedy policy, regardless of which action was taken. This distinction matters for convergence guarantees under function approximation.
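The difference is just the backup target. A toy numerical contrast (the Q-values and the transition below are made up):

```python
gamma = 0.9
# Hypothetical action values at the next state, and one observed transition.
Q_next = {'a1': 0.2, 'a2': 0.7}
r = 1.0
a_taken = 'a1'                  # the action the behavior policy actually took

sarsa_target = r + gamma * Q_next[a_taken]            # on-policy: uses a_taken
q_learning_target = r + gamma * max(Q_next.values())  # off-policy: greedy max

print(round(sarsa_target, 2))       # 1.18
print(round(q_learning_target, 2))  # 1.63
```

Whenever the behavior policy explores (takes a non-greedy action), the two targets disagree, as they do here.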

Canonical Examples

Example

TD(0) in a random walk

Consider a 5-state random walk: states A, B, C, D, E with terminal states at each end. The agent moves left or right with equal probability; reaching the right terminal gives reward 1, all other transitions give reward 0. True values under no discounting are $V(A) = 1/6$, $V(B) = 2/6$, $V(C) = 3/6$, $V(D) = 4/6$, $V(E) = 5/6$. Initialize $V(s) = 0.5$ for all $s$. After the first transition from C to D with reward 0: $\delta = 0 + 1 \cdot V(D) - V(C) = 0.5 - 0.5 = 0$. No update. After a transition from D to the right terminal with reward 1: $\delta = 1 + 0 - 0.5 = 0.5$. With $\alpha = 0.1$: $V(D) \leftarrow 0.5 + 0.05 = 0.55$.
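The arithmetic in the example can be replayed directly (a sketch; the state names follow the example):

```python
alpha, gamma = 0.1, 1.0
V = {s: 0.5 for s in 'ABCDE'}    # all estimates initialized to 0.5

# First transition C -> D, reward 0: TD error is zero, so no update.
delta = 0.0 + gamma * V['D'] - V['C']
print(delta)                     # 0.0

# Transition D -> right terminal (value 0), reward 1.
delta = 1.0 + gamma * 0.0 - V['D']
V['D'] += alpha * delta
print(delta, round(V['D'], 2))   # 0.5 0.55
```

Note how the terminal state contributes a bootstrap value of 0, which is what lets the reward signal start propagating backward through the chain.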

Exercises

ExerciseCore

Problem

In TD(0), why does the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ have zero mean under the true value function $V^\pi$? State the precise condition.

ExerciseAdvanced

Problem

Show that the discounted sum of TD errors along a trajectory equals the Monte Carlo error. That is, for an episode ending at time $T$, prove:

$$\sum_{t=0}^{T-1} \gamma^t \delta_t = G_0 - V(S_0)$$

where $G_0 = \sum_{t=0}^{T-1} \gamma^t R_{t+1}$ is the discounted return, $V(S_T) = 0$, and $V$ is held fixed during the episode.

References

Canonical:

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 6-7, 12
  • Bertsekas & Tsitsiklis, Neuro-Dynamic Programming (1996), Chapter 5

Current:

  • Szepesvári, Algorithms for Reinforcement Learning (2010), Chapter 3
  • Dann, Lattimore, Brunskill, "Unifying PAC and Regret" (2017) for finite-time TD bounds

Last reviewed: April 2026
