
Comparison

Value Iteration vs. Policy Iteration

Both algorithms find optimal policies for finite MDPs. Value iteration applies the Bellman optimality operator repeatedly and extracts the policy at the end. Policy iteration alternates between full policy evaluation and greedy improvement, converging in fewer iterations but with more work per iteration.

What Each Does

Both algorithms solve finite Markov decision processes with known transition probabilities and rewards. They find the optimal value function V^* and optimal policy \pi^* satisfying the Bellman optimality equation.

Value iteration repeatedly applies the Bellman optimality operator to the value function until convergence, then extracts the greedy policy.

Policy iteration starts with an arbitrary policy, computes its exact value function (policy evaluation), then improves the policy by acting greedily with respect to that value function (policy improvement). It alternates these steps until the policy stops changing.

Side-by-Side Algorithms

Definition

Value Iteration

Initialize V_0(s) = 0 for all states. Repeat for k = 0, 1, 2, \ldots:

V_{k+1}(s) = \max_{a \in \mathcal{A}} \left[R(s, a) + \gamma \sum_{s'} P(s'|s,a)\, V_k(s')\right]

Stop when \|V_{k+1} - V_k\|_\infty < \epsilon. Extract the policy:

\pi^*(s) = \arg\max_{a} \left[R(s, a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s')\right]
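The backup and extraction steps above can be sketched in a few lines of NumPy. The 2-state, 2-action MDP below is a hypothetical example chosen only so the loop converges quickly; the convention P[a, s, s'], R[s, a] is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration.
# P[a, s, s'] = P(s'|s,a); R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, eps=1e-8):
    n_states = P.shape[1]
    V = np.zeros(n_states)                          # V_0 = 0
    while True:
        # Bellman backup: Q[s, a] = R[s, a] + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:         # sup-norm stopping rule
            return V_new, Q.argmax(axis=1)          # greedy policy extraction
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
```

For this MDP, action 1 is optimal in both states (state 1 is absorbing under action 1 with reward 2, giving value 2/(1 - 0.9) = 20).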

Definition

Policy Iteration

Initialize \pi_0 arbitrarily. Repeat for k = 0, 1, 2, \ldots:

Evaluate: Solve for V^{\pi_k} exactly:

V^{\pi_k}(s) = R(s, \pi_k(s)) + \gamma \sum_{s'} P(s'|s, \pi_k(s))\, V^{\pi_k}(s')

This is a system of |\mathcal{S}| linear equations in |\mathcal{S}| unknowns.

Improve: Set \pi_{k+1}(s) = \arg\max_{a} \left[R(s, a) + \gamma \sum_{s'} P(s'|s,a)\, V^{\pi_k}(s')\right].

Stop when \pi_{k+1} = \pi_k.
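The evaluate/improve loop can be sketched directly. This uses the same hypothetical convention as the value iteration sketch (P[a, s, s'], R[s, a], an assumed layout); the exact evaluation step is a single linear solve.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'], R[s, a].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma):
    n_states = P.shape[1]
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    while True:
        # Evaluate: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]     # row s is P(.|s, pi(s))
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improve: act greedily with respect to V^pi.
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # policy stopped changing
            return V, policy
        policy = new_policy

V_pi, pi_opt = policy_iteration(P, R, gamma)
```

Unlike the value iteration sketch, the returned values are exact (up to floating point), because the final evaluation is a direct solve rather than a truncated fixed-point iteration.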

Where Each Is Stronger

Value iteration wins on per-iteration cost

Each value iteration update costs O(|\mathcal{S}|^2 |\mathcal{A}|): for each state, compute the Bellman backup over all actions, which requires summing over successor states. There is no linear system to solve; the update is a simple max over one-step lookaheads.

Policy iteration wins on number of iterations

Policy iteration converges in at most |\mathcal{A}|^{|\mathcal{S}|} iterations (the number of deterministic policies), but in practice it converges in very few iterations, often fewer than 10 even for large MDPs. Each policy improvement step makes a strict improvement unless the policy is already optimal. Value iteration, by contrast, may need many iterations because its convergence is geometric with rate \gamma.

Value iteration wins when the state space is large

When |\mathcal{S}| is large, solving the |\mathcal{S}| \times |\mathcal{S}| linear system in policy evaluation becomes expensive: O(|\mathcal{S}|^3) for direct methods. Value iteration avoids this entirely, trading more iterations for cheaper per-iteration cost.

Where Each Fails

Value iteration fails when \gamma is close to 1

The contraction rate of the Bellman optimality operator is \gamma. For \gamma = 0.99, achieving \epsilon-accuracy requires roughly \log(\epsilon^{-1}) / \log(\gamma^{-1}) \approx 100 \log(\epsilon^{-1}) iterations. For \gamma = 0.999, this becomes \sim 1000 \log(\epsilon^{-1}). The closer \gamma is to 1, the slower convergence gets.
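The iteration counts above are easy to check numerically. This sketch computes the smallest k with \gamma^k \leq \epsilon, which corresponds to normalizing \|V_0 - V^*\|_\infty to 1 (an assumption made purely for illustration).

```python
import math

def iterations_needed(gamma, eps):
    # Smallest k with gamma^k <= eps, i.e. k >= log(1/eps) / log(1/gamma).
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / gamma))

k_99 = iterations_needed(0.99, 1e-3)    # on the order of 700 backups
k_999 = iterations_needed(0.999, 1e-3)  # roughly 10x more
```

Pushing \gamma from 0.99 to 0.999 multiplies the required number of backups by about 10, matching the 1/\log(\gamma^{-1}) \approx 1/(1-\gamma) scaling.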

Policy iteration fails on memory and linear algebra cost

Exact policy evaluation requires solving (I - \gamma P^{\pi})V = R^{\pi}, a linear system of size |\mathcal{S}| \times |\mathcal{S}|. For |\mathcal{S}| = 10^6, storing the transition matrix alone requires 10^{12} entries, which is infeasible. In practice, iterative methods replace exact solves, blurring the line between the two algorithms.

Both fail for continuous or large state spaces

Both algorithms enumerate all states explicitly. For continuous state spaces, or discrete state spaces larger than \sim 10^6, neither is practical without function approximation. This is where approximate dynamic programming and deep RL take over.

Key Costs Compared

|  | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Per-iteration cost | O(S^2 A) | O(S^3 + S^2 A) |
| Number of iterations | O(\log(1/\epsilon) / (1 - \gamma)) | Worst case O(A^S), typically 5 to 15 |
| Convergence type | Geometric (\gamma-contraction) | Finite (policy space is finite) |
| Memory | O(S) for V | O(S^2) for the transition matrix |

where S and A denote the number of states and actions respectively.

Convergence Guarantees

Theorem

Value Iteration Convergence

Statement

The Bellman optimality operator T defined by:

TV(s) = \max_{a} \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s')\right]

is a \gamma-contraction in \|\cdot\|_\infty. That is, \|TV - TW\|_\infty \leq \gamma \|V - W\|_\infty for all V, W. By the Banach fixed-point theorem, V_k \to V^* geometrically:

\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty

Intuition

Each application of T brings the value function closer to V^* by a factor of \gamma. The discount factor acts as a contraction rate: future rewards are worth less, so errors in the value function are attenuated as they propagate backward.
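The contraction property is easy to verify empirically. The sketch below builds a randomly generated hypothetical MDP, applies T to random pairs of value functions, and checks the inequality with a small floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_states, n_actions = 0.9, 5, 3

# Random hypothetical MDP: each row of P[a, s] is a distribution over s'.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def T(V):
    # Bellman optimality operator.
    return (R + gamma * np.einsum('asn,n->sa', P, V)).max(axis=1)

# ||TV - TW||_inf <= gamma * ||V - W||_inf on random pairs.
for _ in range(100):
    V, W = rng.normal(size=n_states), rng.normal(size=n_states)
    assert np.max(np.abs(T(V) - T(W))) <= gamma * np.max(np.abs(V - W)) + 1e-12
```

A useful special case: shifting V by a constant c shifts TV by exactly \gamma c, so the bound is tight.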

Failure Mode

When \gamma = 1 (undiscounted), T is no longer a contraction, and value iteration may not converge. The undiscounted case requires special treatment (e.g., average-reward formulations).

Theorem

Policy Iteration Finite Convergence

Statement

Policy iteration terminates in at most |\mathcal{A}|^{|\mathcal{S}|} steps, and the final policy \pi^* is optimal. Moreover, each improvement step yields V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s) for all s, with strict inequality in at least one state, unless \pi_k is already optimal.

Intuition

The set of deterministic policies is finite. Each improvement step produces a strictly better policy (or the optimal one). A strictly increasing sequence in a finite set must terminate.

Failure Mode

The worst-case bound |\mathcal{A}|^{|\mathcal{S}|} is exponential and not tight. In practice, convergence is much faster. However, the per-iteration cost of exact policy evaluation dominates for large state spaces.

When a Researcher Would Use Each

Example

Small MDP with high discount factor

Use policy iteration. When |\mathcal{S}| is small enough for exact policy evaluation (say, < 10{,}000 states) and \gamma is close to 1, policy iteration converges in a handful of iterations while value iteration would need hundreds or thousands.

Example

Moderate MDP with sparse transitions

Use value iteration. If the transition matrix is sparse (each state transitions to only a few successors), the per-iteration cost of value iteration drops well below O(|\mathcal{S}|^2 |\mathcal{A}|), while policy evaluation still requires a linear solve that does not exploit sparsity as easily.
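With a sparse representation, each backup touches only the listed successors. A minimal sketch on a small hypothetical MDP stored as adjacency lists, where transitions[s][a] is a list of (next_state, prob, reward) triples (an assumed layout for illustration):

```python
import numpy as np

# Hypothetical sparse MDP: each (state, action) pair lists only its successors.
# transitions[s][a] = [(next_state, prob, reward), ...]
transitions = {
    0: {0: [(1, 1.0, 0.0)], 1: [(0, 0.5, 1.0), (2, 0.5, 0.0)]},
    1: {0: [(2, 1.0, 0.0)], 1: [(3, 1.0, 5.0)]},
    2: {0: [(3, 1.0, 1.0)], 1: [(0, 1.0, 0.0)]},
    3: {0: [(3, 1.0, 0.0)], 1: [(3, 1.0, 0.0)]},   # state 3 is absorbing
}
gamma = 0.9

def sparse_backup(V):
    # Cost scales with the number of stored transitions, not |S|^2 |A|.
    V_new = np.zeros_like(V)
    for s, actions in transitions.items():
        V_new[s] = max(
            sum(p * (r + gamma * V[ns]) for ns, p, r in succ)
            for succ in actions.values()
        )
    return V_new

V = np.zeros(len(transitions))
for _ in range(500):
    V = sparse_backup(V)
```

Here rewards are attached to transitions rather than to (state, action) pairs; the backup is the same Bellman update, just restricted to the nonzero entries.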

Common Confusions

Watch Out

Modified policy iteration blurs the boundary

Modified policy iteration performs a fixed number of Bellman backups for policy evaluation instead of solving the linear system exactly. With m backups per evaluation, it interpolates between value iteration (m = 1) and policy iteration (m = \infty). In practice, m = 10 to 20 often works best, making the two algorithms end points of a spectrum rather than distinct choices.
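The interpolation can be sketched directly: m backups under the current greedy policy replace the exact solve. This reuses the same hypothetical P[a, s, s']/R[s, a] MDP convention as the earlier sketches.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'], R[s, a].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def modified_policy_iteration(P, R, gamma, m=10, n_rounds=100):
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(n_rounds):
        # Greedy improvement with respect to the current V.
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        policy = Q.argmax(axis=1)
        # Approximate evaluation: m policy backups instead of an exact solve.
        P_pi = P[policy, np.arange(n_states)]
        R_pi = R[np.arange(n_states), policy]
        for _ in range(m):
            V = R_pi + gamma * P_pi @ V
    return V, policy

V_mpi, pi_mpi = modified_policy_iteration(P, R, gamma)
```

Setting m = 1 recovers value iteration's backup (with the max folded into the improvement step), and letting the inner loop run to convergence recovers policy iteration's exact evaluation.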

Watch Out

Policy iteration does not always converge faster in wall-clock time

Policy iteration needs fewer iterations, but each iteration is more expensive. For large sparse MDPs, value iteration with many cheap iterations can finish faster in wall-clock time than policy iteration with few expensive iterations. The right comparison is total compute, not iteration count.
