
RL Theory

Markov Decision Processes

The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.

Core · Tier 1 · Stable · ~70 min

Why This Matters

[Figure: MDP transition diagram over states s₀, s₁, s₂, s₃ and a terminal state, with actions a₁, a₂ and their rewards. Agent chooses action, environment returns next state + reward.]

Every reinforcement learning algorithm is either solving an MDP directly or approximating a solution to one. When AlphaGo plays a move, it is approximately solving an MDP. When a robot learns to walk, the problem is formulated as an MDP. When a language model is fine-tuned with RLHF, the training loop treats token generation as an MDP.

MDPs provide the mathematical language for sequential decision-making. Without understanding MDPs, reinforcement learning is just heuristics.

Mental Model

You are an agent in an environment. At each time step, you observe a state, take an action, receive a reward, and transition to a new state. Your goal is to choose actions that maximize the total reward you accumulate over time. The transition to the next state depends only on the current state and action, not on the history. This is the Markov property.

The Bellman equations express a recursive structure: the value of being in a state equals the immediate reward plus the discounted value of the next state. This recursion is the engine behind every dynamic programming and RL algorithm.

Formal Setup and Notation

Definition

Markov Decision Process

A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition function, where $P(s' | s, a)$ is the probability of transitioning to state $s'$ when taking action $a$ in state $s$
  • $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, where $R(s, a)$ is the expected immediate reward for taking action $a$ in state $s$
  • $\gamma \in [0, 1)$ is the discount factor
Definition

Policy

A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ maps each state to a probability distribution over actions. A deterministic policy maps each state to a single action: $\pi: \mathcal{S} \to \mathcal{A}$.

Definition

Return

The return from time step $t$ is the discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

The discount factor $\gamma < 1$ ensures this sum is finite when rewards are bounded.
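Because $G_t = r_t + \gamma G_{t+1}$, the return of a finite reward sequence can be accumulated backwards in one pass. A minimal sketch (the reward values are made up for illustration):

```python
# Discounted return G_t = sum_k gamma^k r_{t+k} for a finite reward sequence,
# accumulated backwards using the recursion G_t = r_t + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical rewards r_t, r_{t+1}, r_{t+2}:
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # 1 + 0.9*0 + 0.81*2 ≈ 2.62
```

Accumulating from the end avoids recomputing powers of $\gamma$ at every step.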

Core Definitions

Definition

State-Value Function

The state-value function for a policy $\pi$ gives the expected return starting from state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s\right]$$

Definition

Action-Value Function

The action-value function for a policy $\pi$ gives the expected return starting from state $s$, taking action $a$, and following $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s, a_t = a\right]$$

The relationship between $V$ and $Q$ is:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \, Q^\pi(s, a)$$
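This relationship is just a dot product per state. A minimal sketch with made-up Q-values and action probabilities for a single state:

```python
import numpy as np

# Made-up numbers for a single state s with two actions.
Q_s = np.array([1.0, 3.0])     # Q^pi(s, a) for each action a
pi_s = np.array([0.25, 0.75])  # pi(a|s), must sum to 1

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a) -- a dot product.
V_s = float(pi_s @ Q_s)
print(V_s)  # 0.25*1 + 0.75*3 = 2.5
```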

Main Theorems

Theorem

Bellman Expectation Equation

Statement

For any policy $\pi$, the value function $V^\pi$ satisfies:

$$V^\pi(s) = \sum_{a} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^\pi(s') \right]$$

In matrix form, writing $R^\pi$ for the expected reward vector under $\pi$ and $P^\pi$ for the transition matrix under $\pi$:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$

Intuition

The value of a state is the expected immediate reward plus the discounted expected value of the next state. This recursion expresses the idea that the future looks the same from any point in a Markov process, so we can break the infinite-horizon problem into one step plus the rest.

Proof Sketch

Expand the definition of $V^\pi(s)$, split the sum into the first reward and the remaining terms, and use the Markov property to replace the conditional expectation of future returns with $V^\pi(s')$.

Why It Matters

The Bellman expectation equation is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns. For a fixed policy, the value function can be computed exactly by solving $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$. This is the basis of policy evaluation.
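A sketch of this exact solve with numpy, on a hypothetical two-state chain (solving the linear system directly is numerically preferable to forming the inverse):

```python
import numpy as np

# Hypothetical two-state chain under a fixed policy:
# state 0 stays or moves to the absorbing state 1 with equal probability.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])   # P^pi[s, s']
R_pi = np.array([1.0, 0.0])     # expected one-step reward in each state
gamma = 0.9

# Solve (I - gamma P^pi) V = R^pi rather than inverting explicitly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)  # V(s0) = 1 / (1 - 0.45) ≈ 1.818, V(s1) = 0
```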

Theorem

Bellman Optimality Equation

Statement

The optimal value function $V^*$ satisfies:

$$V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s') \right]$$

Equivalently, the optimal action-value function satisfies:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s', a')$$

Intuition

The optimal value of a state is obtained by choosing the action that maximizes the immediate reward plus discounted future value. The max replaces the expectation over the policy because the optimal agent always picks the best action.

Proof Sketch

Define $V^*(s) = \sup_\pi V^\pi(s)$. Show that any policy that is greedy with respect to $V^*$ achieves $V^*$. The max arises because the optimal policy is deterministic for finite MDPs (for any optimal value function, a greedy deterministic policy is optimal).

Why It Matters

Unlike the Bellman expectation equation, the Bellman optimality equation is nonlinear (because of the max). It cannot be solved by matrix inversion. Instead, we solve it iteratively via value iteration or policy iteration.

Theorem

Bellman Operator is a Contraction

Statement

Define the Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ by:

$$(\mathcal{T}V)(s) = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

Then $\mathcal{T}$ is a $\gamma$-contraction in the $\ell^\infty$ norm:

$$\|\mathcal{T}V - \mathcal{T}U\|_\infty \leq \gamma \|V - U\|_\infty$$

By the Banach fixed point theorem, $\mathcal{T}$ has a unique fixed point $V^*$, and value iteration $V_{k+1} = \mathcal{T}V_k$ converges to $V^*$ at rate $\gamma^k$.

Intuition

Applying the Bellman update brings any value estimate closer to the true optimal value. The discount factor $\gamma < 1$ is doing the heavy lifting: each application of $\mathcal{T}$ shrinks errors by a factor of $\gamma$.

Proof Sketch

For any two value functions $V, U$ and any state $s$, let $a^*$ achieve the max for $V$. Then:

$$(\mathcal{T}V)(s) - (\mathcal{T}U)(s) \leq \gamma \sum_{s'} P(s'|s,a^*)\,(V(s') - U(s')) \leq \gamma \|V - U\|_\infty$$

By symmetry the same bound holds with $V$ and $U$ swapped, giving the result.

Why It Matters

This theorem guarantees that value iteration converges, and tells us the rate: after $k$ iterations, $\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$. The contraction property is the reason dynamic programming works.
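The bound is easy to spot-check numerically. The sketch below builds a small two-state, two-action MDP with made-up transitions and rewards, iterates the operator, and asserts the $\gamma^k$ error bound at every step:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def T(V):
    # Bellman optimality operator: (T V)(s) = max_a [R(s,a) + gamma E[V(s')]]
    return np.max(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)

# Converge to (numerical) V* by iterating far past any tolerance we test.
V_star = np.zeros(2)
for _ in range(1000):
    V_star = T(V_star)

# Check ||V_k - V*|| <= gamma^k ||V_0 - V*|| for the first 20 iterations.
V = np.zeros(2)
err0 = np.max(np.abs(V - V_star))
for k in range(1, 21):
    V = T(V)
    assert np.max(np.abs(V - V_star)) <= gamma**k * err0 + 1e-9
print("gamma^k bound verified")
```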

Failure Mode

If $\gamma = 1$ (undiscounted), the operator is not a contraction and value iteration may not converge. The undiscounted case requires additional structure (e.g., all policies reach a terminal state).

Theorem

Policy Improvement Theorem

Statement

Let $\pi$ be a policy and define the greedy policy $\pi'$ by:

$$\pi'(s) = \arg\max_{a} Q^\pi(s, a)$$

Then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s \in \mathcal{S}$, with equality for all $s$ if and only if $\pi$ is already optimal.

Intuition

If you evaluate a policy, then act greedily with respect to the resulting value function, you can only do better. This is the engine behind policy iteration: evaluate, improve, repeat.

Proof Sketch

Start from $V^\pi(s) \leq \max_a Q^\pi(s,a) = Q^\pi(s, \pi'(s))$. Expand the right side using the Bellman equation and iterate, obtaining $V^\pi(s) \leq V^{\pi'}(s)$.

Why It Matters

Combined with policy evaluation, this gives the policy iteration algorithm: evaluate $\pi$ exactly, improve to $\pi'$, and repeat. Since there are finitely many deterministic policies and each step improves the value, policy iteration terminates at the optimal policy in finitely many steps.

Algorithms from the Theory

Policy Evaluation. Given a fixed policy $\pi$, solve $V^\pi = R^\pi + \gamma P^\pi V^\pi$ either by matrix inversion (exact, $O(|\mathcal{S}|^3)$) or by iterative application of the Bellman expectation operator (converges at rate $\gamma^k$).

Policy Iteration. Alternate between:

  1. Evaluate: Compute $V^\pi$ exactly
  2. Improve: Set $\pi'(s) = \arg\max_a Q^\pi(s,a)$ for all $s$
  3. Repeat until $\pi' = \pi$

Policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of deterministic policies), but in practice converges much faster.
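The three steps can be sketched in a few lines of numpy; the two-state, two-action MDP below is hypothetical, chosen only to exercise the loop:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'] and R[s, a] are made up.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
n = 2  # number of states

pi = np.zeros(n, dtype=int)  # arbitrary initial deterministic policy
while True:
    # 1. Evaluate: solve (I - gamma P^pi) V = R^pi exactly.
    P_pi = P[pi, np.arange(n)]           # row s is P(.|s, pi(s))
    R_pi = R[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    # 2. Improve: greedy wrt Q^pi(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) V(s').
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    pi_new = np.argmax(Q, axis=1)
    # 3. Stop when the policy is stable.
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print(pi, V)  # optimal deterministic policy and its exact value function
```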

Value Iteration. Repeatedly apply the Bellman optimality operator: $V_{k+1}(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$. This converges to $V^*$ by the contraction theorem. Extract the optimal policy at the end: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
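A matching sketch of value iteration with greedy policy extraction, on the same style of made-up two-state MDP:

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[a, s, s'] transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q(s,a) backup
    V_new = np.max(Q, axis=1)                     # Bellman optimality update
    if np.max(np.abs(V_new - V)) < 1e-8:          # stop on small sup-norm change
        break
    V = V_new

pi_star = np.argmax(Q, axis=1)  # greedy policy from the final backup
print(pi_star, V_new)
```

Stopping when the sup-norm change is below a tolerance gives a value estimate whose true error is at most $\gamma/(1-\gamma)$ times that tolerance.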

Connection to Dynamic Programming

MDPs are the probabilistic generalization of deterministic dynamic programming. In a deterministic system, transitions are certain: $P(s'|s,a) = 1$ for a single $s'$. The Bellman optimality equation reduces to:

$$V^*(s) = \max_a [R(s,a) + \gamma V^*(f(s,a))]$$

where $f(s,a)$ is the deterministic next state. This is exactly the recursive structure exploited in classical DP algorithms (shortest paths, sequence alignment, etc.).

Common Confusions

Watch Out

MDP vs. bandit

A multi-armed bandit is an MDP with a single state. There are no transitions and no sequential structure. The entire difficulty of MDPs comes from the fact that actions affect future states. In bandits, the challenge is purely exploration vs. exploitation within a single decision.

Watch Out

Value iteration vs. policy iteration

Value iteration updates the value function directly using the Bellman optimality operator. Policy iteration alternates between exact policy evaluation and greedy improvement. Policy iteration often converges in fewer iterations (each iteration is more expensive), but value iteration avoids solving a linear system at each step. For large state spaces, both are replaced by approximate methods.

Watch Out

Discount factor as a modeling choice vs. technical necessity

Students often ask why we need $\gamma < 1$. There are two reasons: (1) it models the preference for sooner rewards, and (2) it is technically necessary for the Bellman operator to be a contraction. Without discounting, value functions can be infinite and convergence guarantees break.

Summary

  • An MDP is defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$
  • $V^\pi(s)$ is the expected discounted return under policy $\pi$ from state $s$
  • The Bellman expectation equation is linear: $V^\pi = R^\pi + \gamma P^\pi V^\pi$
  • The Bellman optimality equation is nonlinear: $V^*(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')]$
  • The Bellman optimality operator is a $\gamma$-contraction, so value iteration converges
  • Policy iteration (evaluate exactly, then improve greedily) is guaranteed to find the optimal policy

Exercises

ExerciseCore

Problem

Consider an MDP with two states $\{s_1, s_2\}$, one action $\{a\}$, transitions $P(s_1|s_1,a) = 0.5$, $P(s_2|s_1,a) = 0.5$, $P(s_2|s_2,a) = 1$, rewards $R(s_1,a) = 1$, $R(s_2,a) = 0$, and $\gamma = 0.9$. Compute $V^\pi(s_1)$ and $V^\pi(s_2)$.

ExerciseCore

Problem

Prove that the Bellman expectation operator $\mathcal{T}^\pi V = R^\pi + \gamma P^\pi V$ is also a $\gamma$-contraction in the $\ell^\infty$ norm.

ExerciseAdvanced

Problem

Show that policy iteration terminates in a finite number of steps for any finite MDP with $\gamma < 1$.


References

Canonical:

  • Puterman, Markov Decision Processes (1994), Chapters 2, 6
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 3-4

Current:

  • Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume II
  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1


Last reviewed: April 2026
