
RL Theory

Value Iteration and Policy Iteration

The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.


Why This Matters

Every tabular RL algorithm is either value iteration, policy iteration, or an approximation of one. These are the two ways to turn Bellman equations into algorithms. If you understand MDPs but cannot solve them, the theory is inert. Value iteration and policy iteration are the engines.

Beyond tabular settings, modern deep RL algorithms inherit the structure of these classical methods. DQN is approximate value iteration. PPO is approximate policy iteration. Understanding the exact algorithms clarifies what the approximate versions are trying to do and why they sometimes fail.

Figure: side-by-side flowcharts of the two algorithms. Value iteration: initialize $V(s) = 0$; apply the Bellman update $V(s) \leftarrow \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')]$; repeat until $\|V_{\text{new}} - V_{\text{old}}\| < \epsilon$; extract the greedy policy $\pi(s) = \arg\max_a [\cdots]$. Policy iteration: initialize $\pi$ arbitrarily; evaluate $V^\pi$ exactly; improve greedily via $\pi'(s) = \arg\max_a [\cdots]$; if $\pi' = \pi$, stop, else repeat. Value iteration: many small updates. Policy iteration: fewer iterations, but each is expensive (solves a linear system).

Mental Model

Value iteration: Start with any guess for the value function. Repeatedly ask, "If these values were correct, what would the best action be, and what value would that give?" Update the values accordingly. The Bellman contraction theorem guarantees this converges to the true optimal values at a geometric rate.

Policy iteration: Start with any policy. First, compute its exact value function (solve a linear system). Then improve the policy by acting greedily with respect to those values. Repeat. Each step strictly improves the policy until optimality is reached.

Formal Setup and Notation

We work in a finite MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ with $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions. All definitions of $V^\pi$, $Q^\pi$, $V^*$, and $Q^*$ follow from the MDP topic.

Definition

Bellman Optimality Operator

The Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ acts on value functions:

$$(\mathcal{T}V)(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

The optimal value function $V^*$ is the unique fixed point: $\mathcal{T}V^* = V^*$.

Definition

Bellman Expectation Operator

For a fixed policy $\pi$, the Bellman expectation operator $\mathcal{T}^\pi$ is:

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

The value function $V^\pi$ is the unique fixed point: $\mathcal{T}^\pi V^\pi = V^\pi$.

Definition

Greedy Policy

Given a value function $V$, the greedy policy $\pi_V$ is:

$$\pi_V(s) = \arg\max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

This is the policy that would be optimal if $V$ were the true optimal value function.

Value Iteration

The value iteration algorithm initializes $V_0$ arbitrarily (often to zero) and iterates:

$$V_{k+1}(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V_k(s') \right]$$

That is, $V_{k+1} = \mathcal{T} V_k$. After convergence, extract the optimal policy: $\pi^*(s) = \arg\max_a \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s')\right]$.

Each iteration costs $O(|\mathcal{S}|^2 |\mathcal{A}|)$: for each state and action, we sum over all next states.
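The update loop can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the source; it assumes a (hypothetical) dense representation with transitions `P[a, s, s']` and rewards `R[s, a]`:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Value iteration sketch for a finite MDP.

    P: transition tensor of shape (A, S, S), P[a, s, s2] = P(s2 | s, a)
    R: reward matrix of shape (S, A)
    Returns the converged values and the greedy policy extracted from them.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s2 P(s2 | s, a) * V(s2)
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)                  # Bellman optimality update
        if np.max(np.abs(V_new - V)) < eps:    # sup-norm stopping rule
            return V_new, Q.argmax(axis=1)     # extract greedy policy at the end
        V = V_new
```

The `einsum` performs all $|\mathcal{S}| \cdot |\mathcal{A}|$ backups in one pass, which is exactly the $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per-iteration cost described above.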

Main Theorems

Theorem

Value Iteration Converges Geometrically

Statement

For any initial $V_0$, the value iteration sequence $V_{k+1} = \mathcal{T}V_k$ satisfies:

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$

Therefore, to achieve $\|V_k - V^*\|_\infty \leq \epsilon$, it suffices to run:

$$k \geq \frac{\log(\|V_0 - V^*\|_\infty / \epsilon)}{1 - \gamma} \approx \frac{1}{1-\gamma}\log\frac{R_{\max}}{(1-\gamma)\epsilon}$$

iterations, where $R_{\max} = \max_{s,a} |R(s,a)|$ and we use $\|V_0 - V^*\|_\infty \leq R_{\max}/(1-\gamma)$.

Intuition

The Bellman operator is a $\gamma$-contraction in the $\ell^\infty$ norm. Each application shrinks the error by a factor of $\gamma$. Geometric convergence is the direct consequence of the Banach fixed-point theorem. The closer $\gamma$ is to 1 (more patient agents), the slower convergence becomes.

Proof Sketch

From the contraction property $\|\mathcal{T}V - \mathcal{T}U\|_\infty \leq \gamma \|V - U\|_\infty$ (proved in the MDP topic), apply inductively:

$$\|V_k - V^*\|_\infty = \|\mathcal{T}V_{k-1} - \mathcal{T}V^*\|_\infty \leq \gamma \|V_{k-1} - V^*\|_\infty \leq \cdots \leq \gamma^k \|V_0 - V^*\|_\infty$$

For the iteration bound, set $\gamma^k \|V_0 - V^*\|_\infty \leq \epsilon$ and solve for $k$, using $\log(1/\gamma) \geq 1 - \gamma$.

Why It Matters

This tells you exactly how many iterations value iteration needs. The $1/(1-\gamma)$ dependence means that long-horizon problems (large $\gamma$) are harder: not just heuristically, but provably slower to solve.

Failure Mode

When $\gamma$ is close to 1, convergence can be extremely slow. With $\gamma = 0.999$, you need roughly 100 times more iterations than with $\gamma = 0.9$. For near-undiscounted problems, policy iteration (whose iteration count does not depend on $\gamma$) can be vastly more efficient.
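To see the slowdown concretely, solve $\gamma^k \leq \epsilon$ for the smallest sufficient $k$. A back-of-the-envelope sketch (the target $\epsilon = 10^{-3}$ and unit initial error are arbitrary assumptions, not from the text):

```python
import math

def iterations_needed(gamma, eps=1e-3):
    """Smallest k with gamma**k <= eps, i.e. ceil(log(1/eps) / log(1/gamma))."""
    return math.ceil(math.log(1 / eps) / math.log(1 / gamma))

n_fast = iterations_needed(0.9)    # 66 iterations
n_slow = iterations_needed(0.999)  # 6905 iterations
ratio = n_slow / n_fast            # ~105x, tracking the 1/(1-gamma) scaling
```

The ratio tracks $\frac{1/(1-0.999)}{1/(1-0.9)} = \frac{1000}{10} = 100$, as the contraction bound predicts.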

Policy Iteration

Policy iteration alternates two phases:

Phase 1. Policy Evaluation. Given the current policy $\pi_k$, compute $V^{\pi_k}$ exactly by solving the linear system:

$$V^{\pi_k} = R^{\pi_k} + \gamma P^{\pi_k} V^{\pi_k}$$

which gives $V^{\pi_k} = (I - \gamma P^{\pi_k})^{-1} R^{\pi_k}$. This costs $O(|\mathcal{S}|^3)$ via direct matrix inversion, or it can be done iteratively.

Phase 2. Policy Improvement. Define the new policy greedily:

$$\pi_{k+1}(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(s, a) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^{\pi_k}(s') \right]$$

The policy improvement theorem (from the MDP topic) guarantees $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)$ for all $s$.
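Both phases fit in a short NumPy sketch. As before, this is an illustration under an assumed dense representation (transitions `P[a, s, s']`, rewards `R[s, a]`), not the source's code:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration sketch: exact evaluation + greedy improvement.

    P: transition tensor of shape (A, S, S); R: rewards of shape (S, A).
    """
    S = R.shape[0]
    pi = np.zeros(S, dtype=int)            # arbitrary initial policy
    while True:
        # Phase 1: solve (I - gamma * P^pi) V = R^pi exactly
        P_pi = P[pi, np.arange(S), :]      # row s is P(. | s, pi(s))
        R_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Phase 2: act greedily with respect to Q^pi
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):     # stable policy => optimal
            return V, pi
        pi = pi_new
```

The evaluation phase calls `np.linalg.solve` rather than forming $(I - \gamma P^{\pi})^{-1}$ explicitly, which is both cheaper and more numerically stable.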

Theorem

Policy Iteration Terminates in Finite Steps

Statement

Policy iteration terminates after at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations, producing the optimal policy $\pi^*$. In practice, convergence is typically much faster, often in $O(|\mathcal{S}| \cdot |\mathcal{A}|)$ or fewer iterations.

Intuition

There are at most $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic policies. The policy improvement theorem guarantees that each iteration strictly improves the value function (unless the policy is already optimal). Since no policy is ever revisited, the algorithm must terminate.

Proof Sketch

By the policy improvement theorem, $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)$ for all $s$, with equality everywhere if and only if $\pi_k$ is optimal. If $\pi_k$ is not optimal, then $V^{\pi_{k+1}}(s) > V^{\pi_k}(s)$ for at least one state, so the value function strictly increases and no policy is ever revisited. Since there are finitely many deterministic policies, the sequence of distinct policies is finite, and it can only terminate at a policy that greedy improvement leaves unchanged, i.e. at $\pi^*$.

Why It Matters

The worst-case bound $|\mathcal{A}|^{|\mathcal{S}|}$ is exponential but is almost never tight. Empirically, policy iteration converges in a small number of iterations (often fewer than 20), independent of the state space size. This makes policy iteration surprisingly efficient despite the per-iteration cost of solving a linear system.

Failure Mode

The per-iteration cost is $O(|\mathcal{S}|^3)$ for exact policy evaluation. For large state spaces, this is prohibitive, and approximate policy evaluation (e.g., a few steps of iterative evaluation) is used instead. This gives "modified policy iteration," which interpolates between value iteration and policy iteration.

Value Iteration vs. Policy Iteration

  • Per-iteration cost: value iteration $O(|\mathcal{S}|^2 |\mathcal{A}|)$; policy iteration $O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)$.
  • Number of iterations: value iteration $O\big(\tfrac{1}{1-\gamma} \log \tfrac{1}{\epsilon}\big)$; policy iteration at most $|\mathcal{A}|^{|\mathcal{S}|}$, typically very few.
  • Dependence on $\gamma$: value iteration slows as $\gamma \to 1$; policy iteration's iteration count does not depend on $\gamma$.
  • When to prefer: value iteration for small state spaces and moderate $\gamma$; policy iteration for moderate state spaces with $\gamma$ close to 1.

Modified policy iteration is the practical middle ground: instead of solving the linear system exactly, run $m$ steps of iterative policy evaluation (applications of the Bellman expectation operator) before improving. When $m = 1$, this recovers value iteration. When $m = \infty$, this recovers policy iteration.
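The interpolation is easy to express in code. A sketch under the same assumed dense representation (transitions `P[a, s, s']`, rewards `R[s, a]`; `m` is the number of evaluation sweeps):

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m=5, eps=1e-8):
    """Modified policy iteration sketch: m evaluation sweeps per improvement.

    m = 1 reproduces value iteration; m -> infinity approaches policy iteration.
    P has shape (A, S, S); R has shape (S, A).
    """
    S = R.shape[0]
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi = Q.argmax(axis=1)              # greedy improvement
        V_new = Q[np.arange(S), pi]        # 1st evaluation sweep (= the max)
        P_pi = P[pi, np.arange(S), :]
        R_pi = R[np.arange(S), pi]
        for _ in range(m - 1):             # m - 1 further T^pi sweeps
            V_new = R_pi + gamma * P_pi @ V_new
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, pi
        V = V_new
```

With `m=1` the inner loop body never runs, so the update is exactly the Bellman optimality backup of value iteration.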

Canonical Examples

Example

Grid world: value iteration

Consider a 4x4 grid world with an absorbing goal state giving reward +1 and all other transitions giving reward 0, with $\gamma = 0.9$. Initialize $V_0(s) = 0$ for all states. In the first sweeps, only the states near the goal acquire nonzero values; each further sweep propagates value one step outward, so states $k$ steps from the goal acquire nonzero values after about $k$ iterations. After about $\tfrac{1}{1-0.9} \cdot \log(1/\epsilon) \approx 30$ iterations (for modest $\epsilon$), the values have converged to within $\epsilon$.
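The example can be reproduced with a small script. This sketch assumes one concrete encoding, not spelled out in the text: deterministic moves, walls block movement, and the goal is absorbing and pays +1 per step. Under that convention $V^*(\text{goal}) = 1/(1-\gamma) = 10$ and a state $d$ steps from the goal converges to $10\gamma^d$:

```python
import numpy as np

N, gamma = 4, 0.9                      # 4x4 grid, discount 0.9
S, A = N * N, 4
goal = S - 1                           # bottom-right corner
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

P = np.zeros((A, S, S))                # P[a, s, s2]
R = np.zeros((S, A))
for s in range(S):
    r, c = divmod(s, N)
    for a, (dr, dc) in enumerate(moves):
        if s == goal:
            P[a, s, s] = 1.0           # absorbing goal...
            R[s, a] = 1.0              # ...paying +1 per step
            continue
        nr = min(max(r + dr, 0), N - 1)  # walls block movement
        nc = min(max(c + dc, 0), N - 1)
        P[a, s, nr * N + nc] = 1.0

V = np.zeros(S)
for _ in range(500):                   # Bellman optimality sweeps
    V = (R + gamma * np.einsum("ast,t->sa", P, V)).max(axis=1)
```

Early sweeps only assign value near the goal; each extra sweep pushes it one step further out, matching the propagation picture above (goal's neighbors converge to 9, the far corner to $10 \cdot 0.9^6 \approx 5.31$).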

Example

Grid world: policy iteration

Same grid world, initialized with a random policy. After one round of policy evaluation (solving a 16x16 linear system), the value function already reveals which states are closer to the goal, and policy improvement makes every state point toward higher value. Typically 2-3 iterations of policy iteration suffice, compared to 30+ for value iteration; each iteration is more expensive, but there are far fewer of them.

Common Confusions

Watch Out

Value iteration does not maintain an explicit policy

Value iteration updates value functions directly. There is no policy variable during the iterations. The policy is extracted only at the end by acting greedily with respect to the converged values. In contrast, policy iteration maintains an explicit policy at every step.

Watch Out

Policy iteration convergence is not geometric

Value iteration converges at geometric rate $\gamma$. Policy iteration does not have a rate in the same sense. It converges in a finite but potentially exponential number of steps. The practical observation that policy iteration converges in very few iterations is empirical, not a consequence of the contraction theorem.

Watch Out

Neither algorithm scales to large state spaces

Both algorithms require iterating over all states and actions. For an Atari game with $\sim 10^{60}$ possible screens, exact value iteration and policy iteration are impossible. This is why we need function approximation (deep Q-networks, policy gradient methods), the subject of the next topics in the RL sequence.

Summary

  • Value iteration: $V_{k+1}(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$
  • Converges at geometric rate $\gamma^k$ by the contraction theorem
  • Policy iteration: evaluate $\pi_k$ exactly, then improve greedily
  • Terminates in at most $|\mathcal{A}|^{|\mathcal{S}|}$ steps, usually much fewer
  • Value iteration is cheaper per iteration; policy iteration needs fewer iterations
  • Modified policy iteration interpolates between the two extremes

Exercises

ExerciseCore

Problem

Consider an MDP with 3 states, 2 actions, and $\gamma = 0.5$. If $V_0(s) = 0$ for all $s$ and $R_{\max} = 1$, how many iterations of value iteration are needed to guarantee $\|V_k - V^*\|_\infty \leq 0.01$?

ExerciseCore

Problem

Why does setting $m = 1$ in modified policy iteration recover value iteration? Write out the update and show it matches.

ExerciseAdvanced

Problem

Prove that value iteration with $\gamma = 0.99$ requires at least 10 times more iterations than with $\gamma = 0.9$ to achieve the same accuracy, assuming the same MDP structure and $R_{\max}$.


References

Canonical:

  • Puterman, Markov Decision Processes (1994), Chapters 6.2-6.4
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 4

Current:

  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1
  • Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume II, Chapter 1


Last reviewed: April 2026
