
RL Theory

Value Iteration and Policy Iteration

The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.


Why This Matters

Every tabular RL algorithm is either value iteration, policy iteration, or an approximation of one. These are the two ways to turn Bellman equations into algorithms. If you understand MDPs but cannot solve them, the theory is inert. Value iteration and policy iteration are the engines.

Beyond tabular settings, modern deep RL algorithms inherit the structure of these classical methods. DQN is approximate value iteration. PPO is approximate policy iteration. Understanding the exact algorithms clarifies what the approximate versions are trying to do and why they sometimes fail.

Figure: side-by-side flowcharts of the two algorithms. Value iteration: initialize $V(s) = 0$; apply the Bellman update $V(s) \leftarrow \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')]$; repeat until $\|V_{\text{new}} - V_{\text{old}}\| < \epsilon$; extract the greedy policy $\pi(s) = \arg\max_a [\cdots]$. Policy iteration: initialize $\pi$ arbitrarily; evaluate $V^\pi$ exactly; improve greedily via $\pi'(s) = \arg\max_a [\cdots]$; if $\pi' = \pi$, stop, else repeat. Value iteration: many small updates. Policy iteration: fewer iterations, but each is expensive (solves a linear system).

Mental Model

Value iteration: Start with any guess for the value function. Repeatedly ask, "If these values were correct, what would the best action be, and what value would that give?" Update the values accordingly. The Bellman contraction theorem guarantees this converges to the true optimal values at a geometric rate.

Policy iteration: Start with any policy. First, compute its exact value function (solve a linear system). Then improve the policy by acting greedily with respect to those values. Repeat. Each step strictly improves the policy until optimality is reached.

Formal Setup and Notation

We work in a finite MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ with $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions. All definitions of $V^\pi$, $Q^\pi$, $V^*$, and $Q^*$ follow from the MDP topic.

Definition

Bellman Optimality Operator

The Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ acts on value functions:

$$(\mathcal{T}V)(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

The optimal value function $V^*$ is the unique fixed point: $\mathcal{T}V^* = V^*$.

Definition

Bellman Expectation Operator

For a fixed policy $\pi$, the Bellman expectation operator $\mathcal{T}^\pi$ is:

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

The value function $V^\pi$ is the unique fixed point: $\mathcal{T}^\pi V^\pi = V^\pi$.

Definition

Greedy Policy

Given a value function $V$, the greedy policy $\pi_V$ is:

$$\pi_V(s) = \arg\max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

This is the policy that would be optimal if $V$ were the true optimal value function.

Value Iteration

The value iteration algorithm initializes $V_0$ arbitrarily (often to zero) and iterates:

$$V_{k+1}(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V_k(s') \right]$$

That is, $V_{k+1} = \mathcal{T} V_k$. After convergence, extract the optimal policy: $\pi^*(s) = \arg\max_a \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s')\right]$.

Each iteration costs $O(|\mathcal{S}|^2 |\mathcal{A}|)$: for each state and action, we sum over all next states.
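The update loop can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the source; it assumes a (hypothetical) dense representation with transitions `P[a, s, s']` and rewards `R[s, a]`:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Value iteration sketch for a finite MDP.

    P: transition tensor of shape (A, S, S), P[a, s, s2] = P(s2 | s, a)
    R: reward matrix of shape (S, A)
    Returns the converged values and the greedy policy extracted from them.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s2 P(s2 | s, a) * V(s2)
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)                  # Bellman optimality update
        if np.max(np.abs(V_new - V)) < eps:    # sup-norm stopping rule
            return V_new, Q.argmax(axis=1)     # extract greedy policy at the end
        V = V_new
```

The `einsum` performs all $|\mathcal{S}| \cdot |\mathcal{A}|$ backups in one pass, which is exactly the $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per-iteration cost described above.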

Main Theorems

Theorem

Value Iteration Converges Geometrically

Statement

For any initial $V_0$, the value iteration sequence $V_{k+1} = \mathcal{T}V_k$ satisfies:

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$

Therefore, to achieve $\|V_k - V^*\|_\infty \leq \epsilon$, it suffices to run:

$$k \geq \frac{\log(\|V_0 - V^*\|_\infty / \epsilon)}{1 - \gamma} \approx \frac{1}{1-\gamma}\log\frac{R_{\max}}{(1-\gamma)\epsilon}$$

iterations, where $R_{\max} = \max_{s,a} |R(s,a)|$ and we use $\|V_0 - V^*\|_\infty \leq R_{\max}/(1-\gamma)$.

Intuition

The Bellman operator is a $\gamma$-contraction in the $\ell^\infty$ norm. Each application shrinks the error by a factor of $\gamma$. Geometric convergence is the direct consequence of the Banach fixed-point theorem. The closer $\gamma$ is to 1 (more patient agents), the slower convergence becomes.

Proof Sketch

From the contraction property $\|\mathcal{T}V - \mathcal{T}U\|_\infty \leq \gamma \|V - U\|_\infty$ (proved in the MDP topic), apply inductively:

$$\|V_k - V^*\|_\infty = \|\mathcal{T}V_{k-1} - \mathcal{T}V^*\|_\infty \leq \gamma \|V_{k-1} - V^*\|_\infty \leq \cdots \leq \gamma^k \|V_0 - V^*\|_\infty$$

For the iteration bound, set $\gamma^k \|V_0 - V^*\|_\infty \leq \epsilon$ and solve for $k$, using $\log(1/\gamma) \geq 1 - \gamma$.

Why It Matters

This tells you exactly how many iterations value iteration needs. The $1/(1-\gamma)$ dependence means that long-horizon problems (large $\gamma$) are harder: not just heuristically, but provably slower to solve.

Failure Mode

When $\gamma$ is close to 1, convergence can be extremely slow. With $\gamma = 0.999$, you need roughly 100 times more iterations than with $\gamma = 0.9$. For near-undiscounted problems, policy iteration (whose iteration count does not depend on $\gamma$) can be vastly more efficient.
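To see the slowdown concretely, solve $\gamma^k \leq \epsilon$ for the smallest sufficient $k$. A back-of-the-envelope sketch (the target $\epsilon = 10^{-3}$ and unit initial error are arbitrary assumptions, not from the text):

```python
import math

def iterations_needed(gamma, eps=1e-3):
    """Smallest k with gamma**k <= eps, i.e. ceil(log(1/eps) / log(1/gamma))."""
    return math.ceil(math.log(1 / eps) / math.log(1 / gamma))

n_fast = iterations_needed(0.9)    # 66 iterations
n_slow = iterations_needed(0.999)  # 6905 iterations
ratio = n_slow / n_fast            # ~105x, tracking the 1/(1-gamma) scaling
```

The ratio tracks $\frac{1/(1-0.999)}{1/(1-0.9)} = \frac{1000}{10} = 100$, as the contraction bound predicts.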

Policy Iteration

Policy iteration alternates two phases:

Phase 1. Policy Evaluation. Given the current policy $\pi_k$, compute $V^{\pi_k}$ exactly by solving the linear system:

$$V^{\pi_k} = R^{\pi_k} + \gamma P^{\pi_k} V^{\pi_k}$$

which gives $V^{\pi_k} = (I - \gamma P^{\pi_k})^{-1} R^{\pi_k}$. This costs $O(|\mathcal{S}|^3)$ via direct matrix inversion, or it can be done iteratively.

Phase 2. Policy Improvement. Define the new policy greedily:

$$\pi_{k+1}(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(s, a) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^{\pi_k}(s') \right]$$

The policy improvement theorem (from the MDP topic) guarantees $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)$ for all $s$.
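Both phases fit in a short NumPy sketch. As before, this is an illustration under an assumed dense representation (transitions `P[a, s, s']`, rewards `R[s, a]`), not the source's code:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration sketch: exact evaluation + greedy improvement.

    P: transition tensor of shape (A, S, S); R: rewards of shape (S, A).
    """
    S = R.shape[0]
    pi = np.zeros(S, dtype=int)            # arbitrary initial policy
    while True:
        # Phase 1: solve (I - gamma * P^pi) V = R^pi exactly
        P_pi = P[pi, np.arange(S), :]      # row s is P(. | s, pi(s))
        R_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Phase 2: act greedily with respect to Q^pi
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):     # stable policy => optimal
            return V, pi
        pi = pi_new
```

The evaluation phase calls `np.linalg.solve` rather than forming $(I - \gamma P^{\pi})^{-1}$ explicitly, which is both cheaper and more numerically stable.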

Theorem

Policy Iteration Terminates in Finite Steps

Statement

Policy iteration terminates after at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations, producing the optimal policy $\pi^*$. In practice, convergence is typically much faster, often in $O(|\mathcal{S}| \cdot |\mathcal{A}|)$ or fewer iterations.

Intuition

There are at most $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic policies. The policy improvement theorem guarantees that each iteration strictly improves the value function (unless the policy is already optimal). Since no policy is ever revisited, the algorithm must terminate.

Proof Sketch

By the policy improvement theorem, $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)$ for all $s$, with equality everywhere if and only if $\pi_k$ is optimal. If $\pi_k$ is not optimal, then $V^{\pi_{k+1}}(s) > V^{\pi_k}(s)$ for at least one state, so the value function strictly increases and no policy is ever revisited. Since there are finitely many deterministic policies, the sequence of distinct policies is finite, and it can only terminate at a policy that greedy improvement leaves unchanged, i.e. at $\pi^*$.

Why It Matters

The worst-case bound $|\mathcal{A}|^{|\mathcal{S}|}$ is exponential but is almost never tight. Empirically, policy iteration converges in a small number of iterations (often fewer than 20), independent of the state space size. This makes policy iteration surprisingly efficient despite the per-iteration cost of solving a linear system.

Failure Mode

The per-iteration cost is $O(|\mathcal{S}|^3)$ for exact policy evaluation. For large state spaces, this is prohibitive, and approximate policy evaluation (e.g., a few steps of iterative evaluation) is used instead. This gives "modified policy iteration," which interpolates between value iteration and policy iteration.

Value Iteration vs. Policy Iteration

  • Per-iteration cost: value iteration $O(|\mathcal{S}|^2 |\mathcal{A}|)$; policy iteration $O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)$.
  • Number of iterations: value iteration $O\big(\tfrac{1}{1-\gamma} \log \tfrac{1}{\epsilon}\big)$; policy iteration at most $|\mathcal{A}|^{|\mathcal{S}|}$, typically very few.
  • Dependence on $\gamma$: value iteration slows as $\gamma \to 1$; policy iteration's iteration count does not depend on $\gamma$.
  • When to prefer: value iteration for small state spaces and moderate $\gamma$; policy iteration for moderate state spaces with $\gamma$ close to 1.

Modified policy iteration is the practical middle ground: instead of solving the linear system exactly, run $m$ steps of iterative policy evaluation (applications of the Bellman expectation operator) before improving. When $m = 1$, this recovers value iteration. When $m = \infty$, this recovers policy iteration.
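The interpolation is easy to express in code. A sketch under the same assumed dense representation (transitions `P[a, s, s']`, rewards `R[s, a]`; `m` is the number of evaluation sweeps):

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m=5, eps=1e-8):
    """Modified policy iteration sketch: m evaluation sweeps per improvement.

    m = 1 reproduces value iteration; m -> infinity approaches policy iteration.
    P has shape (A, S, S); R has shape (S, A).
    """
    S = R.shape[0]
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi = Q.argmax(axis=1)              # greedy improvement
        V_new = Q[np.arange(S), pi]        # 1st evaluation sweep (= the max)
        P_pi = P[pi, np.arange(S), :]
        R_pi = R[np.arange(S), pi]
        for _ in range(m - 1):             # m - 1 further T^pi sweeps
            V_new = R_pi + gamma * P_pi @ V_new
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, pi
        V = V_new
```

With `m=1` the inner loop body never runs, so the update is exactly the Bellman optimality backup of value iteration.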

Canonical Examples

Example

Grid world: value iteration

Consider a 4x4 grid world with an absorbing goal state giving reward +1 and all other transitions giving reward 0, with $\gamma = 0.9$. Initialize $V_0(s) = 0$ for all states. In the first sweeps, only the states near the goal acquire nonzero values; each further sweep propagates value one step outward, so states $k$ steps from the goal acquire nonzero values after about $k$ iterations. After about $\tfrac{1}{1-0.9} \cdot \log(1/\epsilon) \approx 30$ iterations (for modest $\epsilon$), the values have converged to within $\epsilon$.
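The example can be reproduced with a small script. This sketch assumes one concrete encoding, not spelled out in the text: deterministic moves, walls block movement, and the goal is absorbing and pays +1 per step. Under that convention $V^*(\text{goal}) = 1/(1-\gamma) = 10$ and a state $d$ steps from the goal converges to $10\gamma^d$:

```python
import numpy as np

N, gamma = 4, 0.9                      # 4x4 grid, discount 0.9
S, A = N * N, 4
goal = S - 1                           # bottom-right corner
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

P = np.zeros((A, S, S))                # P[a, s, s2]
R = np.zeros((S, A))
for s in range(S):
    r, c = divmod(s, N)
    for a, (dr, dc) in enumerate(moves):
        if s == goal:
            P[a, s, s] = 1.0           # absorbing goal...
            R[s, a] = 1.0              # ...paying +1 per step
            continue
        nr = min(max(r + dr, 0), N - 1)  # walls block movement
        nc = min(max(c + dc, 0), N - 1)
        P[a, s, nr * N + nc] = 1.0

V = np.zeros(S)
for _ in range(500):                   # Bellman optimality sweeps
    V = (R + gamma * np.einsum("ast,t->sa", P, V)).max(axis=1)
```

Early sweeps only assign value near the goal; each extra sweep pushes it one step further out, matching the propagation picture above (goal's neighbors converge to 9, the far corner to $10 \cdot 0.9^6 \approx 5.31$).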

Example

Grid world: policy iteration

Same grid world, initialized with a random policy. After one round of policy evaluation (solving a 16x16 linear system), the value function already reveals which states are closer to the goal, and policy improvement makes every state point toward higher value. Typically 2-3 iterations of policy iteration suffice, compared to 30+ for value iteration; each iteration is more expensive, but there are far fewer of them.

Common Confusions

Watch Out

Value iteration does not maintain an explicit policy

Value iteration updates value functions directly. There is no policy variable during the iterations. The policy is extracted only at the end by acting greedily with respect to the converged values. In contrast, policy iteration maintains an explicit policy at every step.

Watch Out

Policy iteration convergence is not geometric

Value iteration converges at geometric rate $\gamma$. Policy iteration does not have a rate in the same sense. It converges in a finite but potentially exponential number of steps. The practical observation that policy iteration converges in very few iterations is empirical, not a consequence of the contraction theorem.

Watch Out

Neither algorithm scales to large state spaces

Both algorithms require iterating over all states and actions. For an Atari game with $\sim 10^{60}$ possible screens, exact value iteration and policy iteration are impossible. This is why we need function approximation (deep Q-networks, policy gradient methods), the subject of the next topics in the RL sequence.

Summary

  • Value iteration: $V_{k+1}(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$
  • Converges at geometric rate $\gamma^k$ by the contraction theorem
  • Policy iteration: evaluate $\pi_k$ exactly, then improve greedily
  • Terminates in at most $|\mathcal{A}|^{|\mathcal{S}|}$ steps, usually much fewer
  • Value iteration is cheaper per iteration; policy iteration needs fewer iterations
  • Modified policy iteration interpolates between the two extremes

Exercises

ExerciseCore

Problem

Consider an MDP with 3 states, 2 actions, and $\gamma = 0.5$. If $V_0(s) = 0$ for all $s$ and $R_{\max} = 1$, how many iterations of value iteration are needed to guarantee $\|V_k - V^*\|_\infty \leq 0.01$?

ExerciseCore

Problem

Why does setting $m = 1$ in modified policy iteration recover value iteration? Write out the update and show it matches.

ExerciseAdvanced

Problem

Prove that value iteration with $\gamma = 0.99$ requires at least 10 times more iterations than with $\gamma = 0.9$ to achieve the same accuracy, assuming the same MDP structure and $R_{\max}$.


References

Canonical:

  • Puterman, Markov Decision Processes (1994), Chapters 6.2-6.4
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 4

Current:

  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1
  • Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume II, Chapter 1


Last reviewed: April 2026
