Reinforcement Learning
Bellman Equations
The recursive backbone of RL. State-value and action-value Bellman equations, the contraction mapping property, convergence of value iteration, and why recursive decomposition is the central idea in sequential decision-making.
Why This Matters
Every reinforcement learning algorithm rests on the Bellman equations. Value iteration, policy iteration, Q-learning, SARSA, TD learning, actor-critic methods: all of them are either solving or approximating solutions to Bellman equations.
The key insight is recursive decomposition. The value of being in a state equals the immediate reward plus the discounted value of the next state. This single idea, expressed as a fixed-point equation, transforms an infinite-horizon optimization problem into a tractable recursion. Without it, you would need to enumerate all possible future trajectories, which is impossible in any nontrivial environment.
If you understand the Bellman equations and their contraction properties, you understand why RL algorithms converge and when they fail.
Prerequisites
This page assumes familiarity with Markov decision processes (states, actions, transitions, rewards, discount factor, policies) and basic expectation and variance.
Core Definitions
State-Value Function
The state-value function $V^\pi(s)$ for policy $\pi$ gives the expected discounted return starting from state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 = s\right]$$

This is well-defined when $\gamma \in [0, 1)$ and rewards are bounded.
Action-Value Function
The action-value function $Q^\pi(s, a)$ for policy $\pi$ gives the expected discounted return starting from state $s$, taking action $a$, and then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\Big|\, S_0 = s, A_0 = a\right]$$

The relationship between $V^\pi$ and $Q^\pi$ is:

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$
Optimal Value Functions
The optimal state-value function is $V^*(s) = \max_\pi V^\pi(s)$ for all $s$. The optimal action-value function is $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $(s, a)$. An optimal policy $\pi^*$ achieves $V^{\pi^*}(s) = V^*(s)$ for all states simultaneously.
The Four Bellman Equations
There are four Bellman equations: expectation and optimality versions for both $V$ and $Q$. All four express the same recursive structure.
Bellman Expectation Equations
For a fixed policy $\pi$:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$$

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]$$

These are linear systems. For a fixed policy in a finite MDP, the Bellman expectation equation for $V^\pi$ is a system of $|S|$ linear equations in $|S|$ unknowns, solvable by matrix inversion: $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$.
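As a concrete check, the matrix-inversion solution can be computed directly for a toy two-state MDP; the transition and reward numbers below are illustrative, not from any standard benchmark:

```python
import numpy as np

# Exact policy evaluation by matrix inversion on a toy 2-state MDP.
# P_pi[s, s'] is the transition matrix induced by the fixed policy and
# R_pi[s] the expected one-step reward; the numbers are illustrative.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])
gamma = 0.9

# Solve (I - gamma P_pi) V = R_pi, i.e. V = (I - gamma P_pi)^{-1} R_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)   # approximately [3.836, 2.466]
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically stable way to evaluate $(I - \gamma P^\pi)^{-1} R^\pi$.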
Bellman Optimality Equations

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$

These are nonlinear because of the max operator. They cannot be solved by matrix inversion. Instead, we solve them iteratively via value iteration.
Main Theorems
Bellman Optimality Operator is a Gamma-Contraction
Statement
Define the Bellman optimality operator $T$ by:

$$(TV)(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

Then $T$ is a $\gamma$-contraction in the $\|\cdot\|_\infty$ norm (max-norm):

$$\|TV - TW\|_\infty \le \gamma \|V - W\|_\infty$$

By the Banach fixed-point theorem, $T$ has a unique fixed point $V^*$, and the sequence $V_{k+1} = TV_k$ converges to $V^*$ geometrically:

$$\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$$
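The contraction inequality is easy to verify numerically. The sketch below builds a small randomly generated MDP and checks that one Bellman optimality backup brings two arbitrary value vectors at least a factor of $\gamma$ closer in max-norm:

```python
import numpy as np

# Numerical check of the gamma-contraction: one Bellman optimality backup
# on a random MDP shrinks the max-norm distance between two arbitrary
# value vectors by at least a factor of gamma.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)   # normalize into valid transition probabilities
R = rng.random((nS, nA))

def bellman_opt(V):
    # (TV)(s) = max_a sum_s' P(s' | s, a) [R(s, a) + gamma V(s')]
    return np.max(R + gamma * P @ V, axis=1)

V, W = rng.random(nS), rng.random(nS)
lhs = np.max(np.abs(bellman_opt(V) - bellman_opt(W)))
rhs = gamma * np.max(np.abs(V - W))
print(lhs, "<=", rhs)   # the contraction inequality holds
```

One random trial is not a proof, of course; the theorem guarantees the inequality holds for every MDP and every pair of value vectors.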
Intuition
Each application of the Bellman update shrinks the distance between any value estimate and the true optimal value by a factor of $\gamma$. The discount factor does the work: it ensures that errors in distant future values are damped. After $k$ iterations, the error is at most $\gamma^k$ times the initial error.
Proof Sketch
For any $V, W$ and any state $s$, let $a^*$ be the action achieving the max for $TV$ at state $s$. Then:

$$(TV)(s) - (TW)(s) \le \sum_{s'} P(s' \mid s, a^*)\left[R(s, a^*, s') + \gamma V(s')\right] - \sum_{s'} P(s' \mid s, a^*)\left[R(s, a^*, s') + \gamma W(s')\right] = \gamma \sum_{s'} P(s' \mid s, a^*)\left[V(s') - W(s')\right] \le \gamma \|V - W\|_\infty$$

The inequality uses that $P(\cdot \mid s, a^*)$ is a probability distribution, so the weighted average of $V(s') - W(s')$ is at most the maximum. By symmetry (swapping $V$ and $W$), we get $|(TV)(s) - (TW)(s)| \le \gamma \|V - W\|_\infty$ for all $s$.
Why It Matters
This theorem is the mathematical foundation of value iteration. It guarantees three things: (1) the optimal value function exists and is unique, (2) value iteration converges to it from any initialization, and (3) the convergence rate is geometric with ratio $\gamma$. Without this result, there would be no guarantee that iterating the Bellman update produces anything useful.
Failure Mode
When $\gamma = 1$ (no discounting), the Bellman operator is no longer a contraction. Value iteration can diverge or oscillate. The undiscounted case requires additional structure, such as guaranteeing that all policies eventually reach a terminal state (episodic setting). For average-reward MDPs with $\gamma = 1$, different fixed-point conditions and algorithms are needed.
With function approximation (neural networks instead of tables), the contraction property can be lost. The composition of a contraction (Bellman update) with a projection (function approximation) need not be a contraction. This is one leg of the "deadly triad" in RL.
Bellman Expectation Operator Contraction
Statement
The Bellman expectation operator $T^\pi$, defined by $(T^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$, is also a $\gamma$-contraction in the $\|\cdot\|_\infty$ norm:

$$\|T^\pi V - T^\pi W\|_\infty \le \gamma \|V - W\|_\infty$$

Therefore $V^\pi$ is the unique fixed point of $T^\pi$, and iterative policy evaluation converges to $V^\pi$.
Intuition
The proof is simpler than for the optimality operator because there is no max. The operator is linear in $V$: $T^\pi V = R^\pi + \gamma P^\pi V$. Since $P^\pi$ is a stochastic matrix, it cannot increase the max-norm.
Proof Sketch
$$\|T^\pi V - T^\pi W\|_\infty = \gamma \|P^\pi (V - W)\|_\infty \le \gamma \|V - W\|_\infty$$

The last step uses that each row of $P^\pi$ sums to 1, so a weighted average of the entries of $V - W$ cannot exceed $\|V - W\|_\infty$ in absolute value.
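A minimal sketch of iterative policy evaluation on an illustrative two-state Markov reward process: repeated application of $T^\pi$ converges to the same answer as solving the linear system directly.

```python
import numpy as np

# Iterative policy evaluation: repeatedly apply T^pi V = R_pi + gamma P_pi V
# on an illustrative 2-state chain, and compare with the direct solution.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):                    # error shrinks by a factor of gamma per sweep
    V = R_pi + gamma * P_pi @ V

V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(np.max(np.abs(V - V_exact)))      # essentially zero after 500 sweeps
```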
Why It Matters
This guarantees that policy evaluation (computing $V^\pi$ for a given policy) converges. Policy evaluation is a subroutine of policy iteration, and approximate versions of it appear in TD learning and actor-critic methods.
Value Iteration as Repeated Operator Application
Value iteration is the algorithm that repeatedly applies $T$:
- Initialize $V_0$ arbitrarily (e.g., $V_0 = 0$)
- For $k = 0, 1, 2, \ldots$: compute $V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$ for all $s$
- Stop when $\|V_{k+1} - V_k\|_\infty < \epsilon(1 - \gamma)/(2\gamma)$
- Extract policy: $\pi(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_{k+1}(s')\right]$

The stopping criterion in step 3 guarantees the extracted policy is $\epsilon$-optimal.
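The four steps above can be sketched as follows; the MDP arrays (`P[s, a, s']` and `R[s, a]`) are randomly generated placeholders, not a real environment:

```python
import numpy as np

# Tabular value iteration following the four steps above, on a randomly
# generated placeholder MDP.
nS, nA, gamma, eps = 3, 2, 0.9, 1e-6
rng = np.random.default_rng(1)
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)           # valid transition probabilities
R = rng.random((nS, nA))

V = np.zeros(nS)                            # step 1: initialize V_0 = 0
while True:
    V_new = np.max(R + gamma * P @ V, axis=1)   # step 2: Bellman optimality backup
    if np.max(np.abs(V_new - V)) < eps * (1 - gamma) / (2 * gamma):
        V = V_new                           # step 3: stop when a sweep changes little
        break
    V = V_new

policy = np.argmax(R + gamma * P @ V, axis=1)   # step 4: greedy policy extraction
print(V, policy)
```

With $\gamma = 0.9$ the loop terminates after a few hundred sweeps, consistent with the geometric convergence rate.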
Complexity per iteration: $O(|S|^2 |A|)$, since for each state we must evaluate each action and sum over successor states. Iterations to convergence: the error contracts by a factor of $\gamma$ per sweep, so roughly $O\!\left(\frac{\log(1/\epsilon)}{1 - \gamma}\right)$ iterations suffice to reach accuracy $\epsilon$.
Connection to Dynamic Programming
The Bellman equations are the stochastic generalization of the DP recurrence. In deterministic DP, the transition is certain and the Bellman equation becomes:

$$V^*(s) = \max_a \left[R(s, a) + \gamma V^*(f(s, a))\right]$$

where $f(s, a)$ is the deterministic successor. Shortest-path algorithms (Dijkstra, Bellman-Ford), sequence alignment, and optimal control all solve specific instances of this equation.

The Bellman-Ford algorithm for shortest paths is literally value iteration on a deterministic MDP with $\gamma = 1$ and a guarantee of termination (finite horizon or no negative cycles).
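As an illustration, here is Bellman-Ford on a small made-up graph, written so the Bellman-backup structure is explicit: a min over outgoing edges instead of a max over actions, $\gamma = 1$, and value 0 at the target node.

```python
import math

# Bellman-Ford phrased as value iteration: gamma = 1, min over edges
# instead of max over actions, and value 0 at the target node. The graph
# is a small made-up example; edges[(u, v)] is the cost of edge u -> v.
edges = {(0, 1): 4.0, (0, 2): 1.0, (2, 1): 1.0, (1, 3): 2.0, (2, 3): 6.0}
nodes = range(4)
dist = {n: math.inf for n in nodes}
dist[3] = 0.0                          # terminal state has value 0

# |V| - 1 sweeps suffice when there are no negative cycles.
for _ in range(len(nodes) - 1):
    for (u, v), cost in edges.items():
        dist[u] = min(dist[u], cost + dist[v])   # Bellman backup at node u

print(dist)   # dist[0] == 4.0 via the path 0 -> 2 -> 1 -> 3
```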
Connection to TD Learning
TD(0) updates approximate the Bellman expectation equation using a single sample:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

The quantity $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error. It is a noisy, sampled version of the Bellman residual $(T^\pi V)(s) - V(s)$. TD learning converges to the fixed point of $T^\pi$ under appropriate step-size conditions (Robbins-Monro conditions: $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$).
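A minimal tabular TD(0) sketch on an illustrative two-state Markov reward process, with step sizes satisfying the Robbins-Monro conditions; the estimates should approach the exact fixed point:

```python
import numpy as np

# Tabular TD(0) on a two-state Markov reward process. Transition and
# reward numbers are illustrative.
rng = np.random.default_rng(2)
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])       # reward received on leaving each state
gamma = 0.9

V = np.zeros(2)
s = 0
for t in range(200_000):
    alpha = 100.0 / (1000.0 + t)                 # satisfies the Robbins-Monro conditions
    s_next = rng.choice(2, p=P[s])
    td_error = r[s] + gamma * V[s_next] - V[s]   # sampled Bellman residual
    V[s] += alpha * td_error
    s = s_next

V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)   # fixed point of T^pi
print(V, V_exact)   # the estimates should be close to the exact values
```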
Q-learning similarly approximates the Bellman optimality equation:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$

This converges to $Q^*$ under the same step-size conditions plus a requirement that all state-action pairs are visited infinitely often.
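A tabular Q-learning sketch under exactly these conditions: a uniformly random behavior policy (so every state-action pair is visited infinitely often) and decaying per-pair step sizes. The MDP is randomly generated for illustration.

```python
import numpy as np

# Tabular Q-learning with a uniformly random behavior policy, so every
# (s, a) pair is visited infinitely often. The MDP is a random placeholder.
rng = np.random.default_rng(3)
nS, nA, gamma = 2, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

Q = np.zeros((nS, nA))
counts = np.zeros((nS, nA))
s = 0
for _ in range(300_000):
    a = int(rng.integers(nA))               # off-policy: random action selection
    counts[s, a] += 1
    alpha = counts[s, a] ** -0.6            # per-pair Robbins-Monro step sizes
    s_next = int(rng.choice(nS, p=P[s, a]))
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# Reference Q* computed by value iteration on the same model.
V = np.zeros(nS)
for _ in range(2000):
    V = np.max(R + gamma * P @ V, axis=1)
Q_star = R + gamma * P @ V
print(np.max(np.abs(Q - Q_star)))   # should be small
```

Note the behavior policy never uses $Q$: this is off-policy learning, and convergence to $Q^*$ holds anyway in the tabular case.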
The Curse of Dimensionality
The Bellman equations are exact but require representing a value for every state (or state-action pair). For a continuous state space or a state space that grows exponentially with the number of state variables, exact solution is impossible. This is Bellman's own "curse of dimensionality."
Function approximation addresses this by parameterizing $V_\theta$ or $Q_\theta$ with a neural network or linear model. The Bellman update becomes a regression target:

$$y = r + \gamma \max_{a'} Q_\theta(s', a'), \qquad \min_\theta \left(Q_\theta(s, a) - y\right)^2$$
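The regression-target construction can be sketched for a batch of transitions; `q_net` below is a hypothetical stand-in for a learned Q-function (tabular here for simplicity):

```python
import numpy as np

# One-step bootstrap targets for a batch of transitions (r, s', done).
# q_net is a hypothetical stand-in for a learned Q-function, tabular here.
gamma = 0.99
q_net = np.array([[1.0, 2.0],    # Q(s=0, a) for a = 0, 1
                  [0.5, 3.0]])   # Q(s=1, a) for a = 0, 1

rewards = np.array([1.0, 0.0, 0.5])
next_states = np.array([0, 1, 1])
done = np.array([False, False, True])    # no bootstrapping past a terminal state

# y = r + gamma * max_a' Q(s', a'), with the bootstrap term zeroed at terminals
targets = rewards + gamma * q_net[next_states].max(axis=1) * (~done)
print(targets)   # [2.98, 2.97, 0.5]
```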
This introduces the deadly triad: the combination of (1) function approximation, (2) bootstrapping (using $V_\theta$ or $Q_\theta$ in the target), and (3) off-policy learning can cause divergence. The Bellman operator is still a contraction, but projecting onto the function approximation class can undo the contraction. This remains one of the central open problems in RL theory.
Common Confusions
Bellman expectation vs Bellman optimality equation
The expectation equation holds for a fixed policy $\pi$ and is linear in $V^\pi$. The optimality equation involves a max over actions and is nonlinear. The expectation equation can be solved by matrix inversion; the optimality equation requires iterative methods. Students often conflate the two, but they serve different purposes: the expectation equation evaluates a given policy, while the optimality equation finds the best one.
The Bellman operator contracts in max-norm, not L2-norm
The contraction is with respect to $\|\cdot\|_\infty$. The Bellman operator is not a contraction in the $L_2$ norm in general. This distinction matters for function approximation: least-squares projections minimize $L_2$ error, but the Bellman operator contracts in $\|\cdot\|_\infty$. The mismatch between the contraction norm and the projection norm is a source of instability in approximate DP.
Convergence of value iteration vs convergence of the greedy policy
Value iteration converges to $V^*$ at rate $\gamma^k$, but the greedy policy with respect to $V_k$ can become optimal long before $V_k$ has converged. In many problems, the correct policy is found after a few iterations even though the value estimates are still far from $V^*$. This is because policy optimality only requires getting the argmax right, not the exact values.
Key Takeaways
- The Bellman equations express value functions as fixed points of recursive operators
- The expectation version is linear (solvable exactly); the optimality version is nonlinear (requires iteration)
- The Bellman optimality operator is a $\gamma$-contraction in max-norm, guaranteeing a unique fixed point and geometric convergence
- Value iteration applies $T$ repeatedly; TD learning and Q-learning use sampled single-step approximations
- Function approximation breaks the contraction guarantee, creating the deadly triad
- The discount factor $\gamma$ controls both the agent's time preference and the convergence rate
Exercises
Problem
Consider a 2-state MDP with states $s_1, s_2$, a single action per state, transitions $P(s_1 \mid s_1) = p$, $P(s_2 \mid s_1) = 1 - p$, $P(s_1 \mid s_2) = q$, $P(s_2 \mid s_2) = 1 - q$, rewards $r_1$ and $r_2$, and discount $\gamma \in [0, 1)$. Write the Bellman expectation equations for $V(s_1)$ and $V(s_2)$ and solve the linear system.
Problem
Show that for any two value functions $V$ and $W$, the Bellman expectation operator satisfies $\|T^\pi V - T^\pi W\|_\infty \le \gamma \|V - W\|_\infty$. Why is the inequality sometimes strict?
Problem
Suppose you run value iteration with discount $\gamma$ starting from $V_0 = 0$, and the true optimal value satisfies $\|V^*\|_\infty \le R_{\max}/(1 - \gamma)$. How many iterations $k$ are needed to guarantee $\|V_k - V^*\|_\infty \le \epsilon$?
Problem
Explain why the Bellman optimality operator is a contraction in $\|\cdot\|_\infty$ but not necessarily in $\|\cdot\|_2$. Construct a 2-state example where $\|TV - TW\|_2 > \gamma \|V - W\|_2$.
References
Canonical:
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 3-4
- Puterman, Markov Decision Processes (1994), Chapters 5-6
- Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume I, Chapter 1
Theory:
- Bertsekas & Tsitsiklis, Neuro-Dynamic Programming (1996), Chapters 2-3
- Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1
Historical:
- Bellman, Dynamic Programming (1957), the original formulation
Next Topics
- Value iteration and policy iteration: the algorithms that solve Bellman equations in the tabular setting
- TD learning: sample-based approximation of the Bellman expectation equation
- Q-learning: sample-based approximation of the Bellman optimality equation
Last reviewed: April 2026