
Comparison

Value Iteration vs. Policy Iteration

Both algorithms find optimal policies for finite MDPs. Value iteration applies the Bellman optimality operator repeatedly and extracts the policy at the end. Policy iteration alternates between full policy evaluation and greedy improvement, converging in fewer iterations but with more work per iteration.

What Each Does

Both algorithms solve finite Markov decision processes with known transition probabilities and rewards. They find the optimal value function V^* and optimal policy \pi^* satisfying the Bellman optimality equation.

Value iteration repeatedly applies the Bellman optimality operator to the value function until convergence, then extracts the greedy policy.

Policy iteration starts with an arbitrary policy, computes its exact value function (policy evaluation), then improves the policy by acting greedily with respect to that value function (policy improvement). It alternates these steps until the policy stops changing.

Side-by-Side Algorithms

Definition

Value Iteration

Initialize V_0(s) = 0 for all states. Repeat for k = 0, 1, 2, \ldots:

V_{k+1}(s) = \max_{a \in \mathcal{A}} \left[R(s, a) + \gamma \sum_{s'} P(s'|s,a)\, V_k(s')\right]

Stop when \|V_{k+1} - V_k\|_\infty < \epsilon. Extract the policy:

\pi^*(s) = \arg\max_{a} \left[R(s, a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s')\right]
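The backup and extraction steps above can be sketched in a few lines of NumPy. The 2-state, 2-action MDP below is a hypothetical example chosen only so the loop converges quickly; the convention P[a, s, s'], R[s, a] is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration.
# P[a, s, s'] = P(s'|s,a); R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, eps=1e-8):
    n_states = P.shape[1]
    V = np.zeros(n_states)                          # V_0 = 0
    while True:
        # Bellman backup: Q[s, a] = R[s, a] + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:         # sup-norm stopping rule
            return V_new, Q.argmax(axis=1)          # greedy policy extraction
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
```

For this MDP, action 1 is optimal in both states (state 1 is absorbing under action 1 with reward 2, giving value 2/(1 - 0.9) = 20).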

Definition

Policy Iteration

Initialize \pi_0 arbitrarily. Repeat for k = 0, 1, 2, \ldots:

Evaluate: Solve for V^{\pi_k} exactly:

V^{\pi_k}(s) = R(s, \pi_k(s)) + \gamma \sum_{s'} P(s'|s, \pi_k(s))\, V^{\pi_k}(s')

This is a system of |\mathcal{S}| linear equations in |\mathcal{S}| unknowns.

Improve: Set \pi_{k+1}(s) = \arg\max_{a} \left[R(s, a) + \gamma \sum_{s'} P(s'|s,a)\, V^{\pi_k}(s')\right].

Stop when \pi_{k+1} = \pi_k.
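The evaluate/improve loop can be sketched directly. This uses the same hypothetical convention as the value iteration sketch (P[a, s, s'], R[s, a], an assumed layout); the exact evaluation step is a single linear solve.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'], R[s, a].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma):
    n_states = P.shape[1]
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    while True:
        # Evaluate: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]     # row s is P(.|s, pi(s))
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improve: act greedily with respect to V^pi.
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # policy stopped changing
            return V, policy
        policy = new_policy

V_pi, pi_opt = policy_iteration(P, R, gamma)
```

Unlike the value iteration sketch, the returned values are exact (up to floating point), because the final evaluation is a direct solve rather than a truncated fixed-point iteration.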

Where Each Is Stronger

Value iteration wins on per-iteration cost

Each value iteration update costs O(|\mathcal{S}|^2 |\mathcal{A}|): for each state, compute the Bellman backup over all actions, which requires summing over successor states. There is no linear system to solve; the update is a simple max over one-step lookaheads.

Policy iteration wins on number of iterations

Policy iteration converges in at most |\mathcal{A}|^{|\mathcal{S}|} iterations (the number of deterministic policies), but in practice it converges in very few iterations, often fewer than 10 even for large MDPs. Each policy improvement step makes a strict improvement unless the policy is already optimal. Value iteration, by contrast, may need many iterations because its convergence is geometric with rate \gamma.

Value iteration wins when the state space is large

When |\mathcal{S}| is large, solving the |\mathcal{S}| \times |\mathcal{S}| linear system in policy evaluation becomes expensive: O(|\mathcal{S}|^3) for direct methods. Value iteration avoids this entirely, trading more iterations for cheaper per-iteration cost.

Where Each Fails

Value iteration fails when \gamma is close to 1

The contraction rate of the Bellman optimality operator is \gamma. For \gamma = 0.99, achieving \epsilon-accuracy requires roughly \log(\epsilon^{-1}) / \log(\gamma^{-1}) \approx 100 \log(\epsilon^{-1}) iterations. For \gamma = 0.999, this becomes \sim 1000 \log(\epsilon^{-1}). The closer \gamma is to 1, the slower convergence gets.
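The iteration counts above are easy to check numerically. This sketch computes the smallest k with \gamma^k \leq \epsilon, which corresponds to normalizing \|V_0 - V^*\|_\infty to 1 (an assumption made purely for illustration).

```python
import math

def iterations_needed(gamma, eps):
    # Smallest k with gamma^k <= eps, i.e. k >= log(1/eps) / log(1/gamma).
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / gamma))

k_99 = iterations_needed(0.99, 1e-3)    # on the order of 700 backups
k_999 = iterations_needed(0.999, 1e-3)  # roughly 10x more
```

Pushing \gamma from 0.99 to 0.999 multiplies the required number of backups by about 10, matching the 1/\log(\gamma^{-1}) \approx 1/(1-\gamma) scaling.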

Policy iteration fails on memory and linear algebra cost

Exact policy evaluation requires solving (I - \gamma P^{\pi})V = R^{\pi}, a linear system of size |\mathcal{S}| \times |\mathcal{S}|. For |\mathcal{S}| = 10^6, storing the transition matrix alone requires 10^{12} entries, which is infeasible. In practice, iterative methods replace exact solves, blurring the line between the two algorithms.

Both fail for continuous or large state spaces

Both algorithms enumerate all states explicitly. For continuous state spaces, or discrete state spaces larger than \sim 10^6, neither is practical without function approximation. This is where approximate dynamic programming and deep RL take over.

Key Costs Compared

|  | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Per-iteration cost | O(S^2 A) | O(S^3 + S^2 A) |
| Number of iterations | O(\log(1/\epsilon) / (1 - \gamma)) | Worst case O(A^S), typically 5 to 15 |
| Convergence type | Geometric (\gamma-contraction) | Finite (policy space is finite) |
| Memory | O(S) for V | O(S^2) for the transition matrix |

where S and A denote the number of states and actions respectively.

Convergence Guarantees

Theorem

Value Iteration Convergence

Statement

The Bellman optimality operator T defined by:

TV(s) = \max_{a} \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s')\right]

is a \gamma-contraction in \|\cdot\|_\infty. That is, \|TV - TW\|_\infty \leq \gamma \|V - W\|_\infty for all V, W. By the Banach fixed-point theorem, V_k \to V^* geometrically:

\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty

Intuition

Each application of T brings the value function closer to V^* by a factor of \gamma. The discount factor acts as a contraction rate: future rewards are worth less, so errors in the value function are attenuated as they propagate backward.
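The contraction property is easy to verify empirically. The sketch below builds a randomly generated hypothetical MDP, applies T to random pairs of value functions, and checks the inequality with a small floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_states, n_actions = 0.9, 5, 3

# Random hypothetical MDP: each row of P[a, s] is a distribution over s'.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def T(V):
    # Bellman optimality operator.
    return (R + gamma * np.einsum('asn,n->sa', P, V)).max(axis=1)

# ||TV - TW||_inf <= gamma * ||V - W||_inf on random pairs.
for _ in range(100):
    V, W = rng.normal(size=n_states), rng.normal(size=n_states)
    assert np.max(np.abs(T(V) - T(W))) <= gamma * np.max(np.abs(V - W)) + 1e-12
```

A useful special case: shifting V by a constant c shifts TV by exactly \gamma c, so the bound is tight.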

Failure Mode

When \gamma = 1 (undiscounted), T is no longer a contraction, and value iteration may not converge. The undiscounted case requires special treatment (e.g., average-reward formulations).

Theorem

Policy Iteration Finite Convergence

Statement

Policy iteration terminates in at most |\mathcal{A}|^{|\mathcal{S}|} steps, and the final policy \pi^* is optimal. Moreover, each improvement step yields V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s) for all s, with strict inequality in at least one state, unless \pi_k is already optimal.

Intuition

The set of deterministic policies is finite. Each improvement step produces a strictly better policy (or the optimal one). A strictly increasing sequence in a finite set must terminate.

Failure Mode

The worst-case bound |\mathcal{A}|^{|\mathcal{S}|} is exponential and not tight. In practice, convergence is much faster. However, the per-iteration cost of exact policy evaluation dominates for large state spaces.

When a Researcher Would Use Each

Example

Small MDP with high discount factor

Use policy iteration. When |\mathcal{S}| is small enough for exact policy evaluation (say, < 10{,}000 states) and \gamma is close to 1, policy iteration converges in a handful of iterations while value iteration would need hundreds or thousands.

Example

Moderate MDP with sparse transitions

Use value iteration. If the transition matrix is sparse (each state transitions to only a few successors), the per-iteration cost of value iteration drops well below O(|\mathcal{S}|^2 |\mathcal{A}|), while policy evaluation still requires a linear solve that does not exploit sparsity as easily.
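With a sparse representation, each backup touches only the listed successors. A minimal sketch on a small hypothetical MDP stored as adjacency lists, where transitions[s][a] is a list of (next_state, prob, reward) triples (an assumed layout for illustration):

```python
import numpy as np

# Hypothetical sparse MDP: each (state, action) pair lists only its successors.
# transitions[s][a] = [(next_state, prob, reward), ...]
transitions = {
    0: {0: [(1, 1.0, 0.0)], 1: [(0, 0.5, 1.0), (2, 0.5, 0.0)]},
    1: {0: [(2, 1.0, 0.0)], 1: [(3, 1.0, 5.0)]},
    2: {0: [(3, 1.0, 1.0)], 1: [(0, 1.0, 0.0)]},
    3: {0: [(3, 1.0, 0.0)], 1: [(3, 1.0, 0.0)]},   # state 3 is absorbing
}
gamma = 0.9

def sparse_backup(V):
    # Cost scales with the number of stored transitions, not |S|^2 |A|.
    V_new = np.zeros_like(V)
    for s, actions in transitions.items():
        V_new[s] = max(
            sum(p * (r + gamma * V[ns]) for ns, p, r in succ)
            for succ in actions.values()
        )
    return V_new

V = np.zeros(len(transitions))
for _ in range(500):
    V = sparse_backup(V)
```

Here rewards are attached to transitions rather than to (state, action) pairs; the backup is the same Bellman update, just restricted to the nonzero entries.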

Common Confusions

Watch Out

Modified policy iteration blurs the boundary

Modified policy iteration performs a fixed number of Bellman backups for policy evaluation instead of solving the linear system exactly. With m backups per evaluation, it interpolates between value iteration (m = 1) and policy iteration (m = \infty). In practice, m = 10 to 20 often works best, making the two algorithms end points of a spectrum rather than distinct choices.
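The interpolation can be sketched directly: m backups under the current greedy policy replace the exact solve. This reuses the same hypothetical P[a, s, s']/R[s, a] MDP convention as the earlier sketches.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'], R[s, a].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def modified_policy_iteration(P, R, gamma, m=10, n_rounds=100):
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(n_rounds):
        # Greedy improvement with respect to the current V.
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        policy = Q.argmax(axis=1)
        # Approximate evaluation: m policy backups instead of an exact solve.
        P_pi = P[policy, np.arange(n_states)]
        R_pi = R[np.arange(n_states), policy]
        for _ in range(m):
            V = R_pi + gamma * P_pi @ V
    return V, policy

V_mpi, pi_mpi = modified_policy_iteration(P, R, gamma)
```

Setting m = 1 recovers value iteration's backup (with the max folded into the improvement step), and letting the inner loop run to convergence recovers policy iteration's exact evaluation.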

Watch Out

Policy iteration does not always converge faster in wall-clock time

Policy iteration needs fewer iterations, but each iteration is more expensive. For large sparse MDPs, value iteration with many cheap iterations can finish faster in wall-clock time than policy iteration with few expensive iterations. The right comparison is total compute, not iteration count.
