RL Theory
Value Iteration and Policy Iteration
The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.
Why This Matters
Every tabular RL algorithm is either value iteration, policy iteration, or an approximation of one. These are the two ways to turn Bellman equations into algorithms. If you understand MDPs but cannot solve them, the theory is inert. Value iteration and policy iteration are the engines.
Beyond tabular settings, modern deep RL algorithms inherit the structure of these classical methods. DQN is approximate value iteration. PPO is approximate policy iteration. Understanding the exact algorithms clarifies what the approximate versions are trying to do and why they sometimes fail.
Mental Model
Value iteration: Start with any guess for the value function. Repeatedly ask, "If these values were correct, what would the best action be, and what value would that give?" Update the values accordingly. The Bellman contraction theorem guarantees this converges to the true optimal values at a geometric rate.
Policy iteration: Start with any policy. First, compute its exact value function (solve a linear system). Then improve the policy by acting greedily with respect to those values. Repeat. Each step strictly improves the policy until optimality is reached.
Formal Setup and Notation
We work in a finite MDP with $S$ states and $A$ actions. The definitions of the transition kernel $P$, the reward function $R$, the discount factor $\gamma$, and the value functions $V^\pi$ and $V^*$ follow from the MDP topic.
Bellman Optimality Operator
The Bellman optimality operator $T$ acts on value functions:

$$(TV)(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]$$

The optimal value function $V^*$ is the unique fixed point: $TV^* = V^*$.
Bellman Expectation Operator
For a fixed policy $\pi$, the Bellman expectation operator $T^\pi$ is:

$$(T^\pi V)(s) = \sum_a \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]$$

The value function $V^\pi$ is the unique fixed point: $T^\pi V^\pi = V^\pi$.
Greedy Policy
Given a value function $V$, the greedy policy is:

$$\pi_V(s) = \arg\max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]$$

This is the policy that would be optimal if $V$ were the true optimal value function.
Value Iteration
The value iteration algorithm initializes $V_0$ arbitrarily (often to zero) and iterates:

$$V_{k+1}(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \Big]$$

That is, $V_{k+1} = T V_k$. After convergence, extract the optimal policy: $\pi(s) = \arg\max_a \big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \big]$.
Each iteration costs $O(S^2 A)$: for each state and action, we sum over all next states.
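As a concrete sketch, value iteration is a few lines of NumPy. The array layout is an assumption of ours, not fixed by the text: `P[s, a, s2]` holds transition probabilities and `R[s, a]` expected immediate rewards.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Tabular value iteration.

    P: transition tensor, shape (S, A, S); P[s, a, s2] = Pr(s2 | s, a).
    R: expected reward matrix, shape (S, A).
    Returns the converged values and the greedy policy.
    """
    S, A = R.shape
    V = np.zeros(S)                    # arbitrary initialization; zeros are conventional
    for _ in range(max_iters):
        # Q[s, a] = R[s, a] + gamma * sum_{s2} P[s, a, s2] * V[s2]
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)         # values and greedy policy extraction
```

The `P @ V` contraction over the last axis computes all $S \times A$ backups in one call, which is what makes the per-iteration cost $O(S^2 A)$.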
Main Theorems
Value Iteration Converges Geometrically
Statement
For any initial $V_0$, the value iteration sequence satisfies:

$$\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$$

Therefore, to achieve $\|V_k - V^*\|_\infty \le \epsilon$, it suffices to run:

$$k \ge \frac{1}{1-\gamma} \log \frac{\|V_0 - V^*\|_\infty}{\epsilon}$$

iterations, where $\|\cdot\|_\infty$ is the max norm and we use $\log(1/\gamma) \ge 1 - \gamma$.
Intuition
The Bellman operator is a $\gamma$-contraction in the $\|\cdot\|_\infty$ norm. Each application shrinks the error by a factor of $\gamma$. Geometric convergence is the direct consequence of the Banach fixed point theorem. The closer $\gamma$ is to 1 (more patient agents), the slower convergence becomes.
Proof Sketch
From the contraction property (proved in the MDP topic), apply inductively:

$$\|V_k - V^*\|_\infty = \|T V_{k-1} - T V^*\|_\infty \le \gamma \|V_{k-1} - V^*\|_\infty \le \cdots \le \gamma^k \|V_0 - V^*\|_\infty$$

For the iteration bound, set $\gamma^k \|V_0 - V^*\|_\infty \le \epsilon$ and solve for $k$, using $\log(1/\gamma) \ge 1 - \gamma$.
Why It Matters
This tells you exactly how many iterations value iteration needs. The $\frac{1}{1-\gamma}$ dependence means that long-horizon problems (large $\gamma$) are harder: not just heuristically, but provably slower to solve.
Failure Mode
When $\gamma$ is close to 1, convergence can be extremely slow. With $\gamma = 0.999$, you need roughly 100 times more iterations than with $\gamma = 0.9$. For near-undiscounted problems, policy iteration (whose iteration count does not depend on $\gamma$) can be vastly more efficient.
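The slowdown is easy to see numerically by evaluating the sufficient iteration count from the convergence theorem (the helper name `vi_iteration_bound` is ours):

```python
import math

def vi_iteration_bound(gamma, eps, init_err):
    # k >= log(init_err / eps) / (1 - gamma) suffices,
    # using the inequality log(1/gamma) >= 1 - gamma.
    return math.ceil(math.log(init_err / eps) / (1 - gamma))

# Same target accuracy, increasingly patient agents:
for g in (0.9, 0.99, 0.999):
    print(g, vi_iteration_bound(g, eps=1e-2, init_err=1.0))
# roughly 10x more iterations for each extra nine in gamma
```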
Policy Iteration
Policy iteration alternates two phases:
Phase 1. Policy Evaluation. Given the current policy $\pi$, compute $V^\pi$ exactly by solving the linear system:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$

which gives $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$. This costs $O(S^3)$ via direct matrix inversion or can be done iteratively.
Phase 2. Policy Improvement. Define the new policy greedily:

$$\pi'(s) = \arg\max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \Big]$$

The policy improvement theorem (from the MDP topic) guarantees $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$.
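A minimal sketch of both phases, under the same assumed array conventions as elsewhere in the document's setting (`P[s, a, s2]` transition probabilities, `R[s, a]` expected rewards):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Exact policy iteration: solve the linear system for V^pi, then improve greedily."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)           # arbitrary initial deterministic policy
    while True:
        # Phase 1: exact evaluation, V^pi = (I - gamma P^pi)^{-1} R^pi
        P_pi = P[np.arange(S), policy]        # row s is P(. | s, policy[s]), shape (S, S)
        R_pi = R[np.arange(S), policy]        # shape (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Phase 2: greedy improvement (ties broken toward the lowest action index)
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```

Using `np.linalg.solve` rather than an explicit inverse is the standard numerically stable way to apply $(I - \gamma P^\pi)^{-1}$.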
Policy Iteration Terminates in Finite Steps
Statement
Policy iteration terminates after at most $A^S$ iterations, producing the optimal policy $\pi^*$. In practice, convergence is typically much faster, often in a few dozen iterations or fewer.
Intuition
There are at most $A^S$ deterministic policies. The policy improvement theorem guarantees that each iteration strictly improves the value function (unless the policy is already optimal). Since no policy is ever revisited, the algorithm must terminate.
Proof Sketch
By the policy improvement theorem, $V^{\pi_{k+1}}(s) \ge V^{\pi_k}(s)$ for all $s$, with equality everywhere if and only if $\pi_k$ is optimal. If $\pi_k$ is not optimal, then $V^{\pi_{k+1}}(s) > V^{\pi_k}(s)$ for at least one state. Since value functions uniquely determine policies (in the non-degenerate case) and there are finitely many deterministic policies, the sequence of distinct policies is finite and must reach $\pi^*$.
Why It Matters
The worst-case bound of $A^S$ is exponential but is almost never tight. Empirically, policy iteration converges in a small number of iterations (often fewer than 20), independent of the state space size. This makes policy iteration surprisingly efficient despite the per-iteration cost of solving a linear system.
Failure Mode
The per-iteration cost is $O(S^3)$ for exact policy evaluation. For large state spaces, this is prohibitive, and approximate policy evaluation (e.g., a few steps of iterative evaluation) is used instead. This gives "modified policy iteration," which interpolates between value iteration and policy iteration.
Value Iteration vs. Policy Iteration
| | Value Iteration | Policy Iteration |
|---|---|---|
| Per-iteration cost | $O(S^2 A)$ | $O(S^3 + S^2 A)$ |
| Number of iterations | $O\big(\frac{1}{1-\gamma} \log \frac{1}{\epsilon}\big)$ | At most $A^S$, typically very few |
| Depends on $\gamma$? | Yes: slower as $\gamma \to 1$ | Iteration count does not depend on $\gamma$ |
| When to prefer | Small state spaces, moderate $\gamma$ | Moderate state spaces, $\gamma$ close to 1 |
Modified policy iteration is the practical middle ground: instead of solving the linear system exactly, run $m$ steps of iterative policy evaluation (Bellman expectation operator) before improving. When $m = 1$, this recovers value iteration. When $m \to \infty$, this recovers policy iteration.
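A sketch of the interpolation, again assuming the tabular layout `P[s, a, s2]` and `R[s, a]`:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, m=5, tol=1e-8, max_iters=100_000):
    """m sweeps of the Bellman expectation operator per greedy improvement.
    m = 1 behaves like value iteration; large m approaches exact policy iteration."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(max_iters):
        policy = (R + gamma * P @ V).argmax(axis=1)   # greedy improvement
        P_pi = P[np.arange(S), policy]                # (S, S) transitions under policy
        R_pi = R[np.arange(S), policy]
        V_new = V
        for _ in range(m):                            # partial evaluation: m applications of T^pi
            V_new = R_pi + gamma * P_pi @ V_new
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, policy
        V = V_new
    return V, policy
```

With $m = 1$, the partial evaluation applies $T^{\pi}$ once to a $V$ whose greedy policy is $\pi$, which is exactly the update $V \leftarrow TV$.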
Canonical Examples
Grid world: value iteration
Consider a 4x4 grid world with a goal state giving reward +1 and all other transitions giving reward 0, with $\gamma = 0.9$. Initialize $V_0(s) = 0$ for all states. After one iteration, only the states adjacent to the goal have nonzero values. After $k$ iterations, the value has propagated outward: states within $k$ steps of the goal have acquired nonzero values. After about 30 iterations, the values have converged to well within any practical tolerance.
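This propagation can be reproduced in a few lines. The sketch below picks one concrete convention (our assumption, not fixed by the text): deterministic moves, walls that block, and reward +1 on the transition into an absorbing goal.

```python
import numpy as np

# 4x4 grid, states numbered row-major 0..15; the goal (state 15) is absorbing.
N = 4
S, A = N * N, 4
GOAL = 15
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
P = np.zeros((S, A, S))
R = np.zeros((S, A))
for s in range(S):
    r, c = divmod(s, N)
    for a, (dr, dc) in enumerate(moves):
        if s == GOAL:
            P[s, a, s] = 1.0                  # absorbing, no further reward
            continue
        nr = min(max(r + dr, 0), N - 1)       # walls block movement
        nc = min(max(c + dc, 0), N - 1)
        s2 = nr * N + nc
        P[s, a, s2] = 1.0
        if s2 == GOAL:
            R[s, a] = 1.0

gamma = 0.9
V = np.zeros(S)
for k in range(1, 31):
    V = (R + gamma * P @ V).max(axis=1)       # one Bellman optimality sweep
    if k == 1:
        nonzero = np.flatnonzero(V > 0)       # only the goal's neighbors so far

print("nonzero after 1 sweep:", nonzero.tolist())
print("V at the far corner:", V[0])           # gamma^5: the reward arrives on a 6-move path
```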
Grid world: policy iteration
Same grid world. Initialize with a random policy. After policy evaluation (solving a 16x16 linear system), the value function reveals which states are closer to the goal. Policy improvement makes every state point toward higher value. Typically, 2-3 iterations of policy iteration suffice, compared to 30+ for value iteration. Each iteration is more expensive but there are far fewer.
Common Confusions
Value iteration does not maintain an explicit policy
Value iteration updates value functions directly. There is no policy variable during the iterations. The policy is extracted only at the end by acting greedily with respect to the converged values. In contrast, policy iteration maintains an explicit policy at every step.
Policy iteration convergence is not geometric
Value iteration converges at geometric rate $\gamma$. Policy iteration does not have a rate in the same sense: it converges in a finite but potentially exponential number of steps. The practical observation that policy iteration converges in very few iterations is empirical, not a consequence of the contraction theorem.
Neither algorithm scales to large state spaces
Both algorithms require iterating over all states and actions. For an Atari game, where the number of possible screens is astronomical, exact value iteration and policy iteration are impossible. This is why we need function approximation (deep Q-networks, policy gradient methods), the next topics in the RL sequence.
Summary
- Value iteration: $V_{k+1} = T V_k$, with the greedy policy extracted after convergence
- Converges at geometric rate $\gamma$ by the contraction theorem
- Policy iteration: evaluate $V^\pi$ exactly, then improve greedily
- Terminates in at most $A^S$ steps, usually much fewer
- Value iteration is cheaper per iteration; policy iteration needs fewer iterations
- Modified policy iteration interpolates between the two extremes
Exercises
Problem
Consider an MDP with 3 states, 2 actions, and $\gamma = 0.9$. If $|V_0(s) - V^*(s)| \le 1$ for all $s$ and $\epsilon = 0.01$, how many iterations of value iteration are needed to guarantee $\|V_k - V^*\|_\infty \le \epsilon$?
Problem
Why does setting $m = 1$ in modified policy iteration recover value iteration? Write out the update and show it matches.
Problem
Prove that value iteration with $\gamma = 0.99$ requires at least 10 times more iterations than with $\gamma = 0.9$ to achieve the same accuracy, assuming the same MDP structure and the same initial error $\|V_0 - V^*\|_\infty$.
Related Comparisons
References
Canonical:
- Puterman, Markov Decision Processes (1994), Chapters 6.2-6.4
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 4
Current:
- Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1
- Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume II, Chapter 1
Next Topics
The natural next steps from value iteration and policy iteration:
- Q-learning: model-free value iteration using sampled transitions
- Policy gradient theorem: optimizing parameterized policies via gradient ascent
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
Builds on This
- Options and Temporal Abstraction (Layer 3)
- Q-Learning (Layer 2)
- Temporal Difference Learning (Layer 2)