Beta. Content is under active construction and has not been peer-reviewed.

Reinforcement Learning

Bellman Equations

The recursive backbone of RL. State-value and action-value Bellman equations, the contraction mapping property, convergence of value iteration, and why recursive decomposition is the central idea in sequential decision-making.


Why This Matters

Every reinforcement learning algorithm rests on the Bellman equations. Value iteration, policy iteration, Q-learning, SARSA, TD learning, actor-critic methods: all of them are either solving or approximating solutions to Bellman equations.

The key insight is recursive decomposition. The value of being in a state equals the immediate reward plus the discounted value of the next state. This single idea, expressed as a fixed-point equation, transforms an infinite-horizon optimization problem into a tractable recursion. Without it, you would need to enumerate all possible future trajectories, which is impossible in any nontrivial environment.

If you understand the Bellman equations and their contraction properties, you understand why RL algorithms converge and when they fail.

Prerequisites

This page assumes familiarity with Markov decision processes (states, actions, transitions, rewards, discount factor, policies) and basic expectation and variance.

[Figure: a state $s$ with two actions, $a_1$ ($R = 2$) leading to successors with values $V = 8$ and $V = 3$ under probabilities $0.7 / 0.3$, and $a_2$ ($R = 1$) leading to successors with values $V = 5$ and $V = 10$ under probabilities $0.4 / 0.6$, so $V(s) = \max(2 + \gamma(0.7 \cdot 8 + 0.3 \cdot 3),\ 1 + \gamma(0.4 \cdot 5 + 0.6 \cdot 10))$. Caption: choose the action that maximizes immediate reward plus discounted future value.]

Core Definitions

Definition

State-Value Function

The state-value function for policy $\pi$ gives the expected discounted return starting from state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s\right]$$

This is well-defined when $\gamma \in [0, 1)$ and rewards are bounded.

Definition

Action-Value Function

The action-value function for policy $\pi$ gives the expected discounted return starting from state $s$, taking action $a$, and then following $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s,\, a_t = a\right]$$

The relationship between $V^\pi$ and $Q^\pi$ is:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \, Q^\pi(s, a)$$

Definition

Optimal Value Functions

The optimal state-value function is $V^*(s) = \max_\pi V^\pi(s)$ for all $s$. The optimal action-value function is $Q^*(s,a) = \max_\pi Q^\pi(s,a)$ for all $s, a$. An optimal policy $\pi^*$ achieves $V^{\pi^*}(s) = V^*(s)$ for all states simultaneously.

The Four Bellman Equations

There are four Bellman equations: expectation and optimality versions for both $V$ and $Q$. All four express the same recursive structure.

Bellman Expectation Equations

For a fixed policy $\pi$:

$$V^\pi(s) = \sum_{a} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^\pi(s') \right]$$

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') \, Q^\pi(s',a')$$

These are linear systems. For a fixed policy in a finite MDP, the Bellman expectation equation for $V^\pi$ is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns, solvable by matrix inversion: $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$.
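As a concrete sketch, the linear solve takes a few lines of NumPy. The 2-state chain below is invented for illustration (any $P^\pi$ and $R^\pi$ would do), and the solve is done with `np.linalg.solve` rather than forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9
# Hypothetical policy-induced transition matrix P^pi and reward vector R^pi
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])

# Solve (I - gamma P^pi) V = R^pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# V^pi satisfies the Bellman expectation equation: V = R + gamma P V
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
```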

Bellman Optimality Equations

$$V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s') \right]$$

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$$

These are nonlinear because of the max operator. They cannot be solved by matrix inversion. Instead, we solve them iteratively via value iteration.

Main Theorems

Theorem

Bellman Optimality Operator is a Gamma-Contraction

Statement

Define the Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ by:

$$(\mathcal{T}V)(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

Then $\mathcal{T}$ is a $\gamma$-contraction in the $\ell^\infty$ (max) norm:

$$\|\mathcal{T}V - \mathcal{T}U\|_\infty \leq \gamma \|V - U\|_\infty$$

By the Banach fixed-point theorem, $\mathcal{T}$ has a unique fixed point $V^*$, and the sequence $V_{k+1} = \mathcal{T}V_k$ converges to $V^*$ geometrically:

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$

Intuition

Each application of the Bellman update shrinks the distance between any value estimate and the true optimal value by a factor of $\gamma$. The discount factor does the work: it ensures that errors in distant future values are damped. After $k$ iterations, the error is at most $\gamma^k$ times the initial error.

Proof Sketch

For any $V, U \in \mathbb{R}^{|\mathcal{S}|}$ and any state $s$, let $a^*$ be the action achieving the max for $\mathcal{T}V$ at state $s$. Then:

$$(\mathcal{T}V)(s) - (\mathcal{T}U)(s) \leq R(s,a^*) + \gamma \sum_{s'} P(s'|s,a^*) V(s') - R(s,a^*) - \gamma \sum_{s'} P(s'|s,a^*) U(s')$$

$$= \gamma \sum_{s'} P(s'|s,a^*)(V(s') - U(s')) \leq \gamma \|V - U\|_\infty$$

The inequality uses that $P(\cdot|s,a^*)$ is a probability distribution, so the weighted average of $V(s') - U(s')$ is at most the maximum. By symmetry (swapping $V$ and $U$), we get $|(\mathcal{T}V)(s) - (\mathcal{T}U)(s)| \leq \gamma \|V - U\|_\infty$ for all $s$.

Why It Matters

This theorem is the mathematical foundation of value iteration. It guarantees three things: (1) the optimal value function $V^*$ exists and is unique, (2) value iteration converges to it from any initialization, and (3) the convergence rate is geometric with ratio $\gamma$. Without this result, there would be no guarantee that iterating the Bellman update produces anything useful.

Failure Mode

When $\gamma = 1$ (no discounting), the Bellman operator is no longer a contraction. Value iteration can diverge or oscillate. The undiscounted case requires additional structure, such as guaranteeing that all policies eventually reach a terminal state (the episodic setting). For average-reward MDPs with $\gamma = 1$, different fixed-point conditions and algorithms are needed.
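The undiscounted failure is easy to see numerically. A toy sketch on an invented one-state MDP with a self-loop and reward 1: with $\gamma = 1$ the Bellman update is $V \leftarrow 1 + V$, which has no fixed point, while any $\gamma < 1$ contracts to $R/(1-\gamma)$:

```python
# One state, one action, self-loop, reward 1. With gamma = 1 the Bellman
# update is V <- 1 + V: the iterates grow without bound.
V = 0.0
undiscounted = []
for _ in range(5):
    V = 1.0 + V
    undiscounted.append(V)
# undiscounted == [1.0, 2.0, 3.0, 4.0, 5.0] -- no convergence

# With gamma < 1 the same update is a contraction with fixed point 1/(1-gamma)
gamma, V = 0.9, 0.0
for _ in range(500):
    V = 1.0 + gamma * V
# V is now very close to 1 / (1 - 0.9) = 10
```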

With function approximation (neural networks instead of tables), the contraction property can be lost. The composition of a contraction (Bellman update) with a projection (function approximation) need not be a contraction. This is one leg of the "deadly triad" in RL.

Proposition

Bellman Expectation Operator Contraction

Statement

The Bellman expectation operator $\mathcal{T}^\pi V = R^\pi + \gamma P^\pi V$ is also a $\gamma$-contraction in the $\ell^\infty$ norm:

$$\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty \leq \gamma \|V - U\|_\infty$$

Therefore $V^\pi$ is the unique fixed point of $\mathcal{T}^\pi$, and iterative policy evaluation converges to $V^\pi$.

Intuition

The proof is simpler than for the optimality operator because there is no max. The operator $\mathcal{T}^\pi$ is affine, so $\mathcal{T}^\pi V - \mathcal{T}^\pi U = \gamma P^\pi (V - U)$. Since $P^\pi$ is a stochastic matrix, it cannot increase the max norm.

Proof Sketch

$\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty = \|\gamma P^\pi(V - U)\|_\infty = \gamma \|P^\pi(V-U)\|_\infty \leq \gamma \|V - U\|_\infty$. The last step uses that each row of $P^\pi$ sums to 1, so a weighted average of the entries of $V - U$ cannot exceed $\|V - U\|_\infty$.

Why It Matters

This guarantees that policy evaluation (computing $V^\pi$ for a given policy) converges. Policy evaluation is a subroutine of policy iteration, and approximate versions of it appear in TD learning and actor-critic methods.
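Iterative policy evaluation is just repeated application of $\mathcal{T}^\pi$. A minimal NumPy sketch on an invented 2-state chain, checked against the closed-form solve:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],      # invented policy-induced transitions
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])

V = np.zeros(2)
for _ in range(2000):
    V_new = R_pi + gamma * P_pi @ V          # one application of T^pi
    done = np.max(np.abs(V_new - V)) < 1e-12  # stop at (near) fixed point
    V = V_new
    if done:
        break

# The iterates converge to the same V^pi as (I - gamma P^pi)^{-1} R^pi
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
assert np.allclose(V, V_exact, atol=1e-8)
```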

Value Iteration as Repeated Operator Application

Value iteration is the algorithm that repeatedly applies $\mathcal{T}$:

  1. Initialize $V_0$ arbitrarily (e.g., $V_0 = 0$)
  2. For $k = 0, 1, 2, \ldots$: compute $V_{k+1}(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$ for all $s$
  3. Stop when $\|V_{k+1} - V_k\|_\infty < \epsilon(1 - \gamma) / (2\gamma)$
  4. Extract the policy: $\pi^*(s) = \arg\max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$

The stopping criterion in step 3 guarantees the extracted policy is $\epsilon$-optimal.

Complexity per iteration: $O(|\mathcal{S}|^2 |\mathcal{A}|)$, since for each state we must evaluate each action and sum over successor states. Iterations to convergence: $O\left(\frac{1}{1-\gamma} \log \frac{1}{\epsilon(1-\gamma)}\right)$.
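The loop above can be sketched in NumPy. The 2-state, 2-action MDP here is invented for illustration; `P[a, s, s']` is the transition tensor and `R[s, a]` the reward table:

```python
import numpy as np

gamma, eps = 0.9, 1e-6
# Invented transition tensor P[a, s, s'] and reward table R[s, a]
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
while True:
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)                    # Bellman optimality backup
    done = np.max(np.abs(V_new - V)) < eps * (1 - gamma) / (2 * gamma)
    V = V_new
    if done:
        break

policy = Q.argmax(axis=1)                    # greedy policy extraction
```

The stopping threshold is the one from step 3, so the extracted greedy policy is $\epsilon$-optimal for this toy problem.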

Connection to Dynamic Programming

The Bellman equations are the stochastic generalization of the DP recurrence. In deterministic DP, the transition is certain and the Bellman equation becomes:

$$V^*(s) = \max_a \left[ R(s,a) + \gamma V^*(f(s,a)) \right]$$

where $f(s,a)$ is the deterministic successor. Shortest-path algorithms (Dijkstra, Bellman-Ford), sequence alignment, and optimal control all solve specific instances of this equation.

The Bellman-Ford algorithm for shortest paths is literally value iteration on a deterministic MDP with $\gamma = 1$ and a guarantee of termination (finite horizon or no negative cycles).

Connection to TD Learning

TD(0) updates approximate the Bellman expectation equation using a single sample:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]$$

The quantity $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. It is a noisy, sampled version of the Bellman residual $\mathcal{T}^\pi V(s_t) - V(s_t)$. TD learning converges to the fixed point of $\mathcal{T}^\pi$ under appropriate step-size conditions (the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$).
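A tabular TD(0) sketch on an invented 2-state chain (the fixed policy is baked into the transition matrix; a constant step size is used for simplicity, so the estimate only converges to a neighborhood of $V^\pi$):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.01
P_pi = np.array([[0.5, 0.5],      # invented policy-induced transitions
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])       # reward received in the current state

V = np.zeros(2)
s = 0
for _ in range(200_000):
    s_next = rng.choice(2, p=P_pi[s])
    delta = R_pi[s] + gamma * V[s_next] - V[s]   # TD error delta_t
    V[s] += alpha * delta
    s = s_next

# Compare against the exact fixed point of T^pi; the TD estimate should be
# within sampling noise of V^pi
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
```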

Q-learning similarly approximates the Bellman optimality equation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

This converges to $Q^*$ under the same step-size conditions plus the requirement that all state-action pairs are visited infinitely often.
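A matching tabular Q-learning sketch on an invented 2-state, 2-action MDP, with $\epsilon$-greedy behavior so every state-action pair keeps being visited:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha, epsilon = 0.9, 0.05, 0.2
# Invented transition tensor P[a, s, s'] and reward table R[s, a]
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((2, 2))
s = 0
for _ in range(300_000):
    # Epsilon-greedy behavior policy; the backup target is greedy (off-policy)
    a = int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())
    s_next = int(rng.choice(2, p=P[a, s]))
    # Sampled Bellman optimality backup
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

greedy = Q.argmax(axis=1)   # should recover the optimal policy
```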

The Curse of Dimensionality

The Bellman equations are exact but require representing a value for every state (or state-action pair). For a continuous state space or a state space that grows exponentially with the number of state variables, exact solution is impossible. This is Bellman's own "curse of dimensionality."

Function approximation addresses this by parameterizing $V(s; \theta)$ or $Q(s, a; \theta)$ with a neural network or linear model. The Bellman update becomes a regression target:

$$\theta_{k+1} = \arg\min_\theta \sum_s \left( V(s; \theta) - \mathcal{T}V_k(s) \right)^2$$

This introduces the deadly triad: the combination of (1) function approximation, (2) bootstrapping (using $V_k$ in the target), and (3) off-policy learning can cause divergence. The Bellman operator is still a contraction, but projecting onto the function approximation class can undo the contraction. This remains one of the central open problems in RL theory.
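One step of this fitted iteration can be sketched with a one-parameter linear model $V(s; \theta) = \phi(s)\,\theta$ on an invented 3-state chain. All numbers are made up; with this particular uniform least-squares projection the iteration happens to converge, but other state weightings or off-policy sampling can break that:

```python
import numpy as np

gamma = 0.9
# Invented 3-state chain and a one-dimensional feature per state
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
R = np.array([0.0, 0.0, 1.0])
phi = np.array([1.0, 2.0, 3.0])   # V(s; theta) = phi[s] * theta

theta = 0.0
for _ in range(200):
    targets = R + gamma * P @ (phi * theta)      # T V_k evaluated at each state
    theta = float(phi @ targets / (phi @ phi))   # least-squares projection

# At convergence, theta is a fixed point of projection composed with T
residual = abs(theta - float(phi @ (R + gamma * P @ (phi * theta)) / (phi @ phi)))
```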

Common Confusions

Watch Out

Bellman expectation vs Bellman optimality equation

The expectation equation holds for a fixed policy $\pi$ and is linear in $V$. The optimality equation involves a max over actions and is nonlinear. The expectation equation can be solved by matrix inversion; the optimality equation requires iterative methods. Students often conflate the two, but they serve different purposes: the expectation equation evaluates a given policy, while the optimality equation finds the best one.

Watch Out

The Bellman operator contracts in max-norm, not L2-norm

The contraction is with respect to $\|\cdot\|_\infty$. The Bellman operator is not a contraction in the $L_2$ norm in general. This distinction matters for function approximation: least-squares projections minimize $L_2$ error, but the Bellman operator contracts in $L_\infty$. The mismatch between the contraction norm and the projection norm is a source of instability in approximate DP.

Watch Out

Convergence of value iteration vs convergence of the greedy policy

Value iteration converges to $V^*$ at rate $\gamma^k$, but the greedy policy with respect to $V_k$ can become optimal long before $V_k$ has converged. In many problems, the correct policy is found after a few iterations even though the value estimates are still far from $V^*$. This is because policy optimality only requires getting the argmax right, not the exact values.

Key Takeaways

  • The Bellman equations express value functions as fixed points of recursive operators
  • The expectation version is linear (solvable exactly); the optimality version is nonlinear (requires iteration)
  • The Bellman optimality operator is a $\gamma$-contraction in max-norm, guaranteeing a unique fixed point and geometric convergence
  • Value iteration applies $\mathcal{T}$ repeatedly; TD learning and Q-learning use sampled single-step approximations
  • Function approximation breaks the contraction guarantee, creating the deadly triad
  • The discount factor $\gamma$ controls both the agent's time preference and the convergence rate

Exercises

ExerciseCore

Problem

Consider a 2-state MDP with states $\{s_1, s_2\}$, a single action per state, transitions $P(s_1|s_1) = 0.6$, $P(s_2|s_1) = 0.4$, $P(s_1|s_2) = 0.3$, $P(s_2|s_2) = 0.7$, rewards $R(s_1) = 2$, $R(s_2) = 1$, and $\gamma = 0.9$. Write the Bellman expectation equations for $V^\pi(s_1)$ and $V^\pi(s_2)$ and solve the linear system.

ExerciseCore

Problem

Show that for any two value functions $V$ and $U$, the Bellman expectation operator satisfies $\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty = \gamma \|P^\pi(V - U)\|_\infty \leq \gamma \|V - U\|_\infty$. Why is the inequality sometimes strict?

ExerciseAdvanced

Problem

Suppose you run value iteration with $\gamma = 0.95$ starting from $V_0 = 0$, and the true optimal value satisfies $\|V^*\|_\infty \leq 100$. How many iterations are needed to guarantee $\|V_k - V^*\|_\infty \leq 0.01$?

ExerciseAdvanced

Problem

Explain why the Bellman optimality operator is a contraction in $\|\cdot\|_\infty$ but not necessarily in $\|\cdot\|_2$. Construct a 2-state example where $\|\mathcal{T}V - \mathcal{T}U\|_2 > \gamma \|V - U\|_2$.


References

Canonical:

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 3-4
  • Puterman, Markov Decision Processes (1994), Chapters 5-6
  • Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume I, Chapter 1

Theory:

  • Bertsekas & Tsitsiklis, Neuro-Dynamic Programming (1996), Chapters 2-3
  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1

Historical:

  • Bellman, Dynamic Programming (1957), the original formulation

Next Topics

  • Value iteration and policy iteration: the algorithms that solve Bellman equations in the tabular setting
  • TD learning: sample-based approximation of the Bellman expectation equation
  • Q-learning: sample-based approximation of the Bellman optimality equation

Last reviewed: April 2026
