What Each Does
Both algorithms solve finite Markov decision processes with known transition probabilities and rewards. They find the optimal value function and optimal policy satisfying the Bellman optimality equation.
Value iteration repeatedly applies the Bellman optimality operator to the value function until convergence, then extracts the greedy policy.
Policy iteration starts with an arbitrary policy, computes its exact value function (policy evaluation), then improves the policy by acting greedily with respect to that value function (policy improvement). It alternates these steps until the policy stops changing.
Side-by-Side Algorithms
Value Iteration
Initialize $V_0(s) = 0$ for all states $s$. Repeat for $k = 0, 1, 2, \dots$:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

Stop when $\|V_{k+1} - V_k\|_\infty < \epsilon$. Extract the policy:

$$\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_{k+1}(s')\right]$$
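As a concrete sketch, the loop above fits in a few lines of NumPy. The array layout is an illustrative assumption, not part of the algorithm: `P` holds transition probabilities with shape `(A, S, S)` and `R` holds expected rewards with shape `(A, S)`.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Tabular value iteration.

    P: (A, S, S) transition probabilities, P[a, s, s'] = P(s' | s, a).
    R: (A, S) expected immediate rewards.
    Returns a near-optimal value function and its greedy policy.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V      # one-step lookahead values, shape (A, S)
        V_new = Q.max(axis=0)      # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < eps:
            V = V_new
            break
        V = V_new
    # Greedy policy extraction from the converged values
    policy = (R + gamma * P @ V).argmax(axis=0)
    return V, policy
```

Note that the stopping threshold bounds the change between sweeps, not the distance to $V^*$; the latter is at most $\gamma \epsilon / (1 - \gamma)$.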
Policy Iteration
Initialize $\pi_0$ arbitrarily. Repeat for $k = 0, 1, 2, \dots$:

Evaluate: Solve for $V^{\pi_k}$ exactly:

$$V^{\pi_k}(s) = \sum_{s'} P(s' \mid s, \pi_k(s))\left[R(s, \pi_k(s), s') + \gamma V^{\pi_k}(s')\right]$$

This is a system of $|S|$ linear equations in $|S|$ unknowns.

Improve: Set $\pi_{k+1}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi_k}(s')\right]$.

Stop when $\pi_{k+1} = \pi_k$.
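The evaluate/improve loop, with evaluation done as one exact linear solve, can be sketched as follows (assuming the same illustrative tabular layout: `P` with shape `(A, S, S)` and `R` with shape `(A, S)`):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration with exact policy evaluation.

    P: (A, S, S) transition probabilities, R: (A, S) expected rewards.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)          # arbitrary initial policy
    while True:
        # Evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(S)]       # (S, S): row s follows policy[s]
        R_pi = R[policy, np.arange(S)]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to V.
        new_policy = (R + gamma * P @ V).argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```

Because the evaluation step is exact, the returned `V` is the true value function of the final (optimal) policy, not an $\epsilon$-approximation.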
Where Each Is Stronger
Value iteration wins on per-iteration cost
Each value iteration update costs $O(|S|^2 |A|)$: for each state, compute the Bellman backup over all actions, which requires summing over up to $|S|$ successor states. There is no linear system to solve. The update is a simple max over one-step lookaheads.
Policy iteration wins on number of iterations
Policy iteration converges in at most $|A|^{|S|}$ iterations (the number of deterministic policies), but in practice it converges in very few iterations, often fewer than 10 even for large MDPs. Each policy improvement step makes a strict improvement unless the policy is already optimal. Value iteration, by contrast, may need many iterations because convergence is geometric with rate $\gamma$.
Value iteration wins when the state space is large
When $|S|$ is large, solving the linear system in policy evaluation becomes expensive: $O(|S|^3)$ for direct methods. Value iteration avoids this entirely, trading more iterations for cheaper per-iteration cost.
Where Each Fails
Value iteration fails when $\gamma$ is close to 1
The contraction rate of the Bellman optimality operator is $\gamma$. For $\gamma = 0.9$, achieving $\epsilon$-accuracy requires roughly $10 \log(1/\epsilon)$ iterations. For $\gamma = 0.99$, this becomes roughly $100 \log(1/\epsilon)$. The closer $\gamma$ is to 1, the slower convergence gets.
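The scaling can be made concrete by solving $\gamma^k \le \epsilon$ for the smallest integer $k$ (a back-of-envelope sketch; the bound ignores constants involving the reward scale):

```python
import math

def iterations_needed(gamma, eps=1e-6):
    # Smallest k with gamma^k <= eps:  k >= log(1/eps) / log(1/gamma)
    return math.ceil(math.log(1 / eps) / math.log(1 / gamma))

for gamma in (0.9, 0.99, 0.999):
    print(gamma, iterations_needed(gamma))
# gamma=0.9 -> 132, gamma=0.99 -> 1375, gamma=0.999 -> 13809 (for eps=1e-6)
```

Each extra "nine" in the discount factor multiplies the iteration count by roughly ten.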
Policy iteration fails on memory and linear algebra cost
Exact policy evaluation requires solving $(I - \gamma P^{\pi}) V^{\pi} = R^{\pi}$, a dense linear system of size $|S| \times |S|$. For $|S|$ on the order of $10^6$, storing the dense transition matrix alone requires about $10^{12}$ entries. This is infeasible. In practice, iterative methods replace exact solves, blurring the line between the two algorithms.
Both fail for continuous or large state spaces
Both algorithms enumerate over all states explicitly. For continuous state spaces, or discrete ones beyond a few million states, neither is practical without function approximation. This is where approximate dynamic programming and deep RL take over.
Key Costs Compared
| | Value Iteration | Policy Iteration |
|---|---|---|
| Per-iteration cost | $O(\vert S\vert^2 \vert A\vert)$ | $O(\vert S\vert^3 + \vert S\vert^2 \vert A\vert)$ |
| Number of iterations | $O\big(\log(1/\epsilon)/(1-\gamma)\big)$ | Worst case $\vert A\vert^{\vert S\vert}$, typically 5 to 15 |
| Convergence type | Geometric ($\gamma$-contraction) | Finite (policy space is finite) |
| Memory | $O(\vert S\vert)$ for $V$ | $O(\vert S\vert^2)$ for transition matrix |

where $|S|$ and $|A|$ denote the number of states and actions respectively.
Convergence Guarantees
Value Iteration Convergence
Statement
The Bellman optimality operator $T$, defined by:

$$(TV)(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right],$$

is a $\gamma$-contraction in the sup norm $\|\cdot\|_\infty$. That is, $\|TV - TW\|_\infty \le \gamma \|V - W\|_\infty$ for all $V, W$. By the Banach fixed point theorem, $T$ has a unique fixed point $V^*$, and $V_k = T^k V_0 \to V^*$ geometrically:

$$\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty.$$
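The contraction inequality is easy to check numerically on a random MDP (a sketch; the random construction and array shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random MDP: each row of P[a] is a probability distribution over next states.
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((A, S))

def T(V):
    # Bellman optimality operator: max over actions of one-step lookahead
    return (R + gamma * P @ V).max(axis=0)

V, W = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(T(V) - T(W)))   # ||TV - TW||_inf
rhs = gamma * np.max(np.abs(V - W)) # gamma * ||V - W||_inf
assert lhs <= rhs + 1e-12           # the contraction inequality holds
```

The inequality holds for any pair of value functions, not just these random ones; the script is only a spot check of the theorem.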
Intuition
Each application of $T$ brings the value function closer to $V^*$ by a factor of $\gamma$. The discount factor acts as a contraction rate: future rewards are worth less, so errors in the value function are attenuated when propagated backward.
Failure Mode
When $\gamma = 1$ (undiscounted), $T$ is no longer a contraction, and value iteration may not converge. The undiscounted case requires special treatment (e.g., average-reward formulations).
Policy Iteration Finite Convergence
Statement
Policy iteration terminates in at most $|A|^{|S|}$ steps, and the final policy is optimal. Moreover, each policy improvement step strictly increases $V^{\pi_k}(s)$ for at least one state $s$, unless $\pi_k$ is already optimal.
Intuition
The set of deterministic policies is finite. Each improvement step produces a strictly better policy (or the optimal one). A strictly increasing sequence in a finite set must terminate.
Failure Mode
The worst-case bound is exponential and not tight. In practice, convergence is much faster. However, the per-iteration cost of exact policy evaluation dominates for large state spaces.
When a Researcher Would Use Each
Small MDP with high discount factor
Use policy iteration. When $|S|$ is small enough for exact policy evaluation (say, up to a few thousand states) and $\gamma$ is close to 1, policy iteration converges in a handful of iterations while value iteration would need hundreds or thousands.
Moderate MDP with sparse transitions
Use value iteration. If the transition matrix is sparse (each state transitions to few successors), the per-iteration cost of value iteration drops well below $O(|S|^2 |A|)$, while policy evaluation still requires a linear solve that does not exploit sparsity as easily.
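A sketch of sparse value iteration using SciPy's CSR format, assuming each state has a fixed number `k` of random successors (the construction and all sizes are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
S, A, k, gamma = 1000, 4, 5, 0.95   # each state has k random successors

# Build one sparse (S, S) transition matrix per action in CSR form.
P = []
for _ in range(A):
    cols = rng.integers(0, S, size=(S, k))            # successor indices
    probs = rng.random((S, k))
    probs /= probs.sum(axis=1, keepdims=True)         # rows sum to 1
    indptr = np.arange(0, S * k + 1, k)               # k entries per row
    P.append(csr_matrix((probs.ravel(), cols.ravel(), indptr), shape=(S, S)))
R = rng.random((A, S))

V = np.zeros(S)
for _ in range(500):
    # Each sweep touches only the S*k stored transitions per action,
    # so the cost is O(S*k*A) rather than O(S^2 * A).
    V_new = np.max([R[a] + gamma * P[a].dot(V) for a in range(A)], axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```

Here a dense representation would store $4 \times 10^6$ transition entries per action; the sparse one stores $5 \times 10^3$.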
Common Confusions
Modified policy iteration blurs the boundary
Modified policy iteration performs a fixed number $m$ of Bellman backups for policy evaluation instead of solving the linear system exactly. It interpolates between value iteration ($m = 1$) and policy iteration ($m \to \infty$). In practice, small values such as $m = 5$ to $m = 10$ often work best, making the two algorithms end points of a spectrum rather than distinct choices.
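A sketch of modified policy iteration under the same illustrative tabular layout (`P` with shape `(A, S, S)`, `R` with shape `(A, S)`); the stopping rule on successive value functions is one common choice, not the only one:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, m=5, eps=1e-8, max_iter=10_000):
    """P: (A, S, S) transitions, R: (A, S) rewards, m evaluation backups per step."""
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Improvement: greedy policy with respect to the current V.
        policy = (R + gamma * P @ V).argmax(axis=0)
        P_pi = P[policy, np.arange(S)]
        R_pi = R[policy, np.arange(S)]
        V_old = V
        # Partial evaluation: m backups instead of an exact linear solve.
        for _ in range(m):
            V = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V - V_old)) < eps:
            return V, policy
    return V, policy
```

Setting `m=1` recovers a value-iteration sweep; letting `m` grow approaches exact policy evaluation without ever forming or solving the $|S| \times |S|$ system.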
Policy iteration does not always converge faster in wall-clock time
Policy iteration needs fewer iterations, but each iteration is more expensive. For large sparse MDPs, value iteration with many cheap iterations can finish faster in wall-clock time than policy iteration with few expensive iterations. The right comparison is total compute, not iteration count.
References
Canonical:
- Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming (2005), Sections 6.2-6.4
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapter 4
Current:
- Bertsekas, Dynamic Programming and Optimal Control, Vol. II (2012), Chapter 2