Beta. Content is under active construction and has not been peer-reviewed.

Reinforcement Learning

Bellman Equations

The recursive backbone of RL. State-value and action-value Bellman equations, the contraction mapping property, convergence of value iteration, and why recursive decomposition is the central idea in sequential decision-making.


Why This Matters

Every reinforcement learning algorithm rests on the Bellman equations. Value iteration, policy iteration, Q-learning, SARSA, TD learning, actor-critic methods: all of them are either solving or approximating solutions to Bellman equations.

The key insight is recursive decomposition. The value of being in a state equals the immediate reward plus the discounted value of the next state. This single idea, expressed as a fixed-point equation, transforms an infinite-horizon optimization problem into a tractable recursion. Without it, you would need to enumerate all possible future trajectories, which is impossible in any nontrivial environment.

If you understand the Bellman equations and their contraction properties, you understand why RL algorithms converge and when they fail.

Prerequisites

This page assumes familiarity with Markov decision processes (states, actions, transitions, rewards, discount factor, policies) and basic expectation and variance.

[Figure: a state $s$ with two actions, $a_1$ ($R = 2$) leading to successors with values $V = 8$ and $V = 3$ under probabilities $0.7 / 0.3$, and $a_2$ ($R = 1$) leading to successors with values $V = 5$ and $V = 10$ under probabilities $0.4 / 0.6$, so $V(s) = \max(2 + \gamma(0.7 \cdot 8 + 0.3 \cdot 3),\ 1 + \gamma(0.4 \cdot 5 + 0.6 \cdot 10))$. Caption: choose the action that maximizes immediate reward plus discounted future value.]

Core Definitions

Definition

State-Value Function

The state-value function for policy $\pi$ gives the expected discounted return starting from state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s\right]$$

This is well-defined when $\gamma \in [0, 1)$ and rewards are bounded.

Definition

Action-Value Function

The action-value function for policy $\pi$ gives the expected discounted return starting from state $s$, taking action $a$, and then following $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s,\, a_t = a\right]$$

The relationship between $V^\pi$ and $Q^\pi$ is:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \, Q^\pi(s, a)$$

Definition

Optimal Value Functions

The optimal state-value function is $V^*(s) = \max_\pi V^\pi(s)$ for all $s$. The optimal action-value function is $Q^*(s,a) = \max_\pi Q^\pi(s,a)$ for all $s, a$. An optimal policy $\pi^*$ achieves $V^{\pi^*}(s) = V^*(s)$ for all states simultaneously.

The Four Bellman Equations

There are four Bellman equations: expectation and optimality versions for both $V$ and $Q$. All four express the same recursive structure.

Bellman Expectation Equations

For a fixed policy $\pi$:

$$V^\pi(s) = \sum_{a} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^\pi(s') \right]$$

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') \, Q^\pi(s',a')$$

These are linear systems. For a fixed policy in a finite MDP, the Bellman expectation equation for $V^\pi$ is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns, solvable by matrix inversion: $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$.
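As a concrete sketch, the linear solve takes a few lines of NumPy. The 2-state chain below is invented for illustration (any $P^\pi$ and $R^\pi$ would do), and the solve is done with `np.linalg.solve` rather than forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9
# Hypothetical policy-induced transition matrix P^pi and reward vector R^pi
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])

# Solve (I - gamma P^pi) V = R^pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# V^pi satisfies the Bellman expectation equation: V = R + gamma P V
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```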

Bellman Optimality Equations

$$V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s') \right]$$

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$$

These are nonlinear because of the max operator. They cannot be solved by matrix inversion. Instead, we solve them iteratively via value iteration.

Main Theorems

Theorem

Bellman Optimality Operator is a Gamma-Contraction

Statement

Define the Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ by:

$$(\mathcal{T}V)(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

Then $\mathcal{T}$ is a $\gamma$-contraction in the $\ell^\infty$ (max) norm:

$$\|\mathcal{T}V - \mathcal{T}U\|_\infty \leq \gamma \|V - U\|_\infty$$

By the Banach fixed-point theorem, $\mathcal{T}$ has a unique fixed point $V^*$, and the sequence $V_{k+1} = \mathcal{T}V_k$ converges to $V^*$ geometrically:

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$

Intuition

Each application of the Bellman update shrinks the distance between any value estimate and the true optimal value by a factor of $\gamma$. The discount factor does the work: it ensures that errors in distant future values are damped. After $k$ iterations, the error is at most $\gamma^k$ times the initial error.

Proof Sketch

For any $V, U \in \mathbb{R}^{|\mathcal{S}|}$ and any state $s$, let $a^*$ be the action achieving the max for $\mathcal{T}V$ at state $s$. Then:

$$(\mathcal{T}V)(s) - (\mathcal{T}U)(s) \leq R(s,a^*) + \gamma \sum_{s'} P(s'|s,a^*) V(s') - R(s,a^*) - \gamma \sum_{s'} P(s'|s,a^*) U(s')$$

$$= \gamma \sum_{s'} P(s'|s,a^*)(V(s') - U(s')) \leq \gamma \|V - U\|_\infty$$

The inequality uses that $P(\cdot|s,a^*)$ is a probability distribution, so the weighted average of $V(s') - U(s')$ is at most the maximum. By symmetry (swapping $V$ and $U$), we get $|(\mathcal{T}V)(s) - (\mathcal{T}U)(s)| \leq \gamma \|V - U\|_\infty$ for all $s$.

Why It Matters

This theorem is the mathematical foundation of value iteration. It guarantees three things: (1) the optimal value function $V^*$ exists and is unique, (2) value iteration converges to it from any initialization, and (3) the convergence rate is geometric with ratio $\gamma$. Without this result, there would be no guarantee that iterating the Bellman update produces anything useful.

Failure Mode

When $\gamma = 1$ (no discounting), the Bellman operator is no longer a contraction. Value iteration can diverge or oscillate. The undiscounted case requires additional structure, such as guaranteeing that all policies eventually reach a terminal state (the episodic setting). For average-reward MDPs with $\gamma = 1$, different fixed-point conditions and algorithms are needed.
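The undiscounted failure is easy to see numerically. A toy sketch on an invented one-state MDP with a self-loop and reward 1: with $\gamma = 1$ the Bellman update is $V \leftarrow 1 + V$, which has no fixed point, while any $\gamma < 1$ contracts to $R/(1-\gamma)$:

```python
# One state, one action, self-loop, reward 1. With gamma = 1 the Bellman
# update is V <- 1 + V: the iterates grow without bound.
V = 0.0
undiscounted = []
for _ in range(5):
    V = 1.0 + V
    undiscounted.append(V)
# undiscounted == [1.0, 2.0, 3.0, 4.0, 5.0] -- no convergence

# With gamma < 1 the same update is a contraction with fixed point 1/(1-gamma)
gamma, V = 0.9, 0.0
for _ in range(500):
    V = 1.0 + gamma * V
# V is now very close to 1 / (1 - 0.9) = 10
```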

With function approximation (neural networks instead of tables), the contraction property can be lost. The composition of a contraction (Bellman update) with a projection (function approximation) need not be a contraction. This is one leg of the "deadly triad" in RL.

Proposition

Bellman Expectation Operator Contraction

Statement

The Bellman expectation operator $\mathcal{T}^\pi V = R^\pi + \gamma P^\pi V$ is also a $\gamma$-contraction in the $\ell^\infty$ norm:

$$\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty \leq \gamma \|V - U\|_\infty$$

Therefore $V^\pi$ is the unique fixed point of $\mathcal{T}^\pi$, and iterative policy evaluation converges to $V^\pi$.

Intuition

The proof is simpler than for the optimality operator because there is no max. The operator $\mathcal{T}^\pi$ is affine, so $\mathcal{T}^\pi V - \mathcal{T}^\pi U = \gamma P^\pi (V - U)$. Since $P^\pi$ is a stochastic matrix, it cannot increase the max norm.

Proof Sketch

$\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty = \|\gamma P^\pi(V - U)\|_\infty = \gamma \|P^\pi(V-U)\|_\infty \leq \gamma \|V - U\|_\infty$. The last step uses that each row of $P^\pi$ sums to 1, so a weighted average of the entries of $V - U$ cannot exceed $\|V - U\|_\infty$.

Why It Matters

This guarantees that policy evaluation (computing $V^\pi$ for a given policy) converges. Policy evaluation is a subroutine of policy iteration, and approximate versions of it appear in TD learning and actor-critic methods.
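Iterative policy evaluation is just repeated application of $\mathcal{T}^\pi$. A minimal NumPy sketch on an invented 2-state chain, checked against the closed-form solve:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],      # invented policy-induced transitions
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])

V = np.zeros(2)
for _ in range(2000):
    V_new = R_pi + gamma * P_pi @ V          # one application of T^pi
    done = np.max(np.abs(V_new - V)) < 1e-12  # stop at (near) fixed point
    V = V_new
    if done:
        break

# The iterates converge to the same V^pi as (I - gamma P^pi)^{-1} R^pi
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
assert np.allclose(V, V_exact, atol=1e-8)
```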

Value Iteration as Repeated Operator Application

Value iteration is the algorithm that repeatedly applies $\mathcal{T}$:

  1. Initialize $V_0$ arbitrarily (e.g., $V_0 = 0$)
  2. For $k = 0, 1, 2, \ldots$: compute $V_{k+1}(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$ for all $s$
  3. Stop when $\|V_{k+1} - V_k\|_\infty < \epsilon(1 - \gamma) / (2\gamma)$
  4. Extract the policy: $\pi^*(s) = \arg\max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$

The stopping criterion in step 3 guarantees the extracted policy is $\epsilon$-optimal.

Complexity per iteration: $O(|\mathcal{S}|^2 |\mathcal{A}|)$, since for each state we must evaluate each action and sum over successor states. Iterations to convergence: $O\left(\frac{1}{1-\gamma} \log \frac{1}{\epsilon(1-\gamma)}\right)$.
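The loop above can be sketched in NumPy. The 2-state, 2-action MDP here is invented for illustration; `P[a, s, s']` is the transition tensor and `R[s, a]` the reward table:

```python
import numpy as np

gamma, eps = 0.9, 1e-6
# Invented transition tensor P[a, s, s'] and reward table R[s, a]
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
while True:
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)                    # Bellman optimality backup
    done = np.max(np.abs(V_new - V)) < eps * (1 - gamma) / (2 * gamma)
    V = V_new
    if done:
        break

policy = Q.argmax(axis=1)                    # greedy policy extraction
```

The stopping threshold is the one from step 3, so the extracted greedy policy is $\epsilon$-optimal for this toy problem.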

Connection to Dynamic Programming

The Bellman equations are the stochastic generalization of the DP recurrence. In deterministic DP, the transition is certain and the Bellman equation becomes:

$$V^*(s) = \max_a \left[ R(s,a) + \gamma V^*(f(s,a)) \right]$$

where $f(s,a)$ is the deterministic successor. Shortest-path algorithms (Dijkstra, Bellman-Ford), sequence alignment, and optimal control all solve specific instances of this equation.

The Bellman-Ford algorithm for shortest paths is literally value iteration on a deterministic MDP with $\gamma = 1$ and a guarantee of termination (finite horizon or no negative cycles).

Connection to TD Learning

TD(0) updates approximate the Bellman expectation equation using a single sample:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]$$

The quantity $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. It is a noisy, sampled version of the Bellman residual $\mathcal{T}^\pi V(s_t) - V(s_t)$. TD learning converges to the fixed point of $\mathcal{T}^\pi$ under appropriate step-size conditions (the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$).
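A tabular TD(0) sketch on an invented 2-state chain (the fixed policy is baked into the transition matrix; a constant step size is used for simplicity, so the estimate only converges to a neighborhood of $V^\pi$):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.01
P_pi = np.array([[0.5, 0.5],      # invented policy-induced transitions
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])       # reward received in the current state

V = np.zeros(2)
s = 0
for _ in range(200_000):
    s_next = rng.choice(2, p=P_pi[s])
    delta = R_pi[s] + gamma * V[s_next] - V[s]   # TD error delta_t
    V[s] += alpha * delta
    s = s_next

# Compare against the exact fixed point of T^pi; the TD estimate should be
# within sampling noise of V^pi
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
```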

Q-learning similarly approximates the Bellman optimality equation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

This converges to $Q^*$ under the same step-size conditions plus the requirement that all state-action pairs are visited infinitely often.
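A matching tabular Q-learning sketch on an invented 2-state, 2-action MDP, with $\epsilon$-greedy behavior so every state-action pair keeps being visited:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha, epsilon = 0.9, 0.05, 0.2
# Invented transition tensor P[a, s, s'] and reward table R[s, a]
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((2, 2))
s = 0
for _ in range(300_000):
    # Epsilon-greedy behavior policy; the backup target is greedy (off-policy)
    a = int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())
    s_next = int(rng.choice(2, p=P[a, s]))
    # Sampled Bellman optimality backup
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

greedy = Q.argmax(axis=1)   # should recover the optimal policy
```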

The Curse of Dimensionality

The Bellman equations are exact but require representing a value for every state (or state-action pair). For a continuous state space or a state space that grows exponentially with the number of state variables, exact solution is impossible. This is Bellman's own "curse of dimensionality."

Function approximation addresses this by parameterizing $V(s; \theta)$ or $Q(s, a; \theta)$ with a neural network or linear model. The Bellman update becomes a regression target:

$$\theta_{k+1} = \arg\min_\theta \sum_s \left( V(s; \theta) - \mathcal{T}V_k(s) \right)^2$$

This introduces the deadly triad: the combination of (1) function approximation, (2) bootstrapping (using $V_k$ in the target), and (3) off-policy learning can cause divergence. The Bellman operator is still a contraction, but projecting onto the function approximation class can undo the contraction. This remains one of the central open problems in RL theory.
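One step of this fitted iteration can be sketched with a one-parameter linear model $V(s; \theta) = \phi(s)\,\theta$ on an invented 3-state chain. All numbers are made up; with this particular uniform least-squares projection the iteration happens to converge, but other state weightings or off-policy sampling can break that:

```python
import numpy as np

gamma = 0.9
# Invented 3-state chain and a one-dimensional feature per state
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
R = np.array([0.0, 0.0, 1.0])
phi = np.array([1.0, 2.0, 3.0])   # V(s; theta) = phi[s] * theta

theta = 0.0
for _ in range(200):
    targets = R + gamma * P @ (phi * theta)      # T V_k evaluated at each state
    theta = float(phi @ targets / (phi @ phi))   # least-squares projection

# At convergence, theta is a fixed point of projection composed with T
residual = abs(theta - float(phi @ (R + gamma * P @ (phi * theta)) / (phi @ phi)))
```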

Common Confusions

Watch Out

Bellman expectation vs Bellman optimality equation

The expectation equation holds for a fixed policy $\pi$ and is linear in $V$. The optimality equation involves a max over actions and is nonlinear. The expectation equation can be solved by matrix inversion; the optimality equation requires iterative methods. Students often conflate the two, but they serve different purposes: the expectation equation evaluates a given policy, while the optimality equation finds the best one.

Watch Out

The Bellman operator contracts in max-norm, not L2-norm

The contraction is with respect to $\|\cdot\|_\infty$. The Bellman operator is not a contraction in the $L_2$ norm in general. This distinction matters for function approximation: least-squares projections minimize $L_2$ error, but the Bellman operator contracts in $L_\infty$. The mismatch between the contraction norm and the projection norm is a source of instability in approximate DP.

Watch Out

Convergence of value iteration vs convergence of the greedy policy

Value iteration converges to $V^*$ at rate $\gamma^k$, but the greedy policy with respect to $V_k$ can become optimal long before $V_k$ has converged. In many problems, the correct policy is found after a few iterations even though the value estimates are still far from $V^*$. This is because policy optimality only requires getting the argmax right, not the exact values.

Key Takeaways

  • The Bellman equations express value functions as fixed points of recursive operators
  • The expectation version is linear (solvable exactly); the optimality version is nonlinear (requires iteration)
  • The Bellman optimality operator is a $\gamma$-contraction in max-norm, guaranteeing a unique fixed point and geometric convergence
  • Value iteration applies $\mathcal{T}$ repeatedly; TD learning and Q-learning use sampled single-step approximations
  • Function approximation breaks the contraction guarantee, creating the deadly triad
  • The discount factor $\gamma$ controls both the agent's time preference and the convergence rate

Exercises

ExerciseCore

Problem

Consider a 2-state MDP with states $\{s_1, s_2\}$, a single action per state, transitions $P(s_1|s_1) = 0.6$, $P(s_2|s_1) = 0.4$, $P(s_1|s_2) = 0.3$, $P(s_2|s_2) = 0.7$, rewards $R(s_1) = 2$, $R(s_2) = 1$, and $\gamma = 0.9$. Write the Bellman expectation equations for $V^\pi(s_1)$ and $V^\pi(s_2)$ and solve the linear system.

ExerciseCore

Problem

Show that for any two value functions $V$ and $U$, the Bellman expectation operator satisfies $\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty = \gamma \|P^\pi(V - U)\|_\infty \leq \gamma \|V - U\|_\infty$. Why is the inequality sometimes strict?

ExerciseAdvanced

Problem

Suppose you run value iteration with $\gamma = 0.95$ starting from $V_0 = 0$, and the true optimal value satisfies $\|V^*\|_\infty \leq 100$. How many iterations are needed to guarantee $\|V_k - V^*\|_\infty \leq 0.01$?

ExerciseAdvanced

Problem

Explain why the Bellman optimality operator is a contraction in $\|\cdot\|_\infty$ but not necessarily in $\|\cdot\|_2$. Construct a 2-state example where $\|\mathcal{T}V - \mathcal{T}U\|_2 > \gamma \|V - U\|_2$.


References

Canonical:

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 3-4
  • Puterman, Markov Decision Processes (1994), Chapters 5-6
  • Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume I, Chapter 1

Theory:

  • Bertsekas & Tsitsiklis, Neuro-Dynamic Programming (1996), Chapters 2-3
  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1

Historical:

  • Bellman, Dynamic Programming (1957), the original formulation

Next Topics

  • Value iteration and policy iteration: the algorithms that solve Bellman equations in the tabular setting
  • TD learning: sample-based approximation of the Bellman expectation equation
  • Q-learning: sample-based approximation of the Bellman optimality equation

Last reviewed: April 2026
