
RL Theory

Markov Decision Processes

The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.

Core · Tier 1 · Stable · ~70 min

Why This Matters

[Figure: MDP transition diagram over states s₀, s₁, s₂, s₃ and a terminal state, with actions a₁, a₂ and their rewards. Agent chooses action, environment returns next state + reward.]

Every reinforcement learning algorithm is either solving an MDP directly or approximating a solution to one. When AlphaGo plays a move, it is approximately solving an MDP. When a robot learns to walk, the problem is formulated as an MDP. When a language model is fine-tuned with RLHF, the training loop treats token generation as an MDP.

MDPs provide the mathematical language for sequential decision-making. Without understanding MDPs, reinforcement learning is just heuristics.

Mental Model

You are an agent in an environment. At each time step, you observe a state, take an action, receive a reward, and transition to a new state. Your goal is to choose actions that maximize the total reward you accumulate over time. The transition to the next state depends only on the current state and action, not on the history. This is the Markov property.

The Bellman equations express a recursive structure: the value of being in a state equals the immediate reward plus the discounted value of the next state. This recursion is the engine behind every dynamic programming and RL algorithm.

Formal Setup and Notation

Definition

Markov Decision Process

A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition function, where $P(s' | s, a)$ is the probability of transitioning to state $s'$ when taking action $a$ in state $s$
  • $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, where $R(s, a)$ is the expected immediate reward for taking action $a$ in state $s$
  • $\gamma \in [0, 1)$ is the discount factor
Definition

Policy

A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ maps each state to a probability distribution over actions. A deterministic policy maps each state to a single action: $\pi: \mathcal{S} \to \mathcal{A}$.

Definition

Return

The return from time step $t$ is the discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

The discount factor $\gamma < 1$ ensures this sum is finite when rewards are bounded.
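Because $G_t = r_t + \gamma G_{t+1}$, the return of a finite reward sequence can be accumulated backwards in one pass. A minimal sketch (the reward values are made up for illustration):

```python
# Discounted return G_t = sum_k gamma^k r_{t+k} for a finite reward sequence,
# accumulated backwards using the recursion G_t = r_t + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical rewards r_t, r_{t+1}, r_{t+2}:
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # 1 + 0.9*0 + 0.81*2 ≈ 2.62
```

Accumulating from the end avoids recomputing powers of $\gamma$ at every step.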

Core Definitions

Definition

State-Value Function

The state-value function for a policy $\pi$ gives the expected return starting from state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s\right]$$

Definition

Action-Value Function

The action-value function for a policy $\pi$ gives the expected return starting from state $s$, taking action $a$, and following $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s, a_t = a\right]$$

The relationship between $V$ and $Q$ is:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \, Q^\pi(s, a)$$
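This relationship is just a dot product per state. A minimal sketch with made-up Q-values and action probabilities for a single state:

```python
import numpy as np

# Made-up numbers for a single state s with two actions.
Q_s = np.array([1.0, 3.0])     # Q^pi(s, a) for each action a
pi_s = np.array([0.25, 0.75])  # pi(a|s), must sum to 1

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a) -- a dot product.
V_s = float(pi_s @ Q_s)
print(V_s)  # 0.25*1 + 0.75*3 = 2.5
```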

Main Theorems

Theorem

Bellman Expectation Equation

Statement

For any policy $\pi$, the value function $V^\pi$ satisfies:

$$V^\pi(s) = \sum_{a} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^\pi(s') \right]$$

In matrix form, writing $R^\pi$ for the expected reward vector under $\pi$ and $P^\pi$ for the transition matrix under $\pi$:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$

Intuition

The value of a state is the expected immediate reward plus the discounted expected value of the next state. This recursion expresses the idea that the future looks the same from any point in a Markov process, so we can break the infinite-horizon problem into one step plus the rest.

Proof Sketch

Expand the definition of $V^\pi(s)$, split the sum into the first reward and the remaining terms, and use the Markov property to replace the conditional expectation of future returns with $V^\pi(s')$.

Why It Matters

The Bellman expectation equation is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns. For a fixed policy, the value function can be computed exactly by solving $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$. This is the basis of policy evaluation.
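A sketch of this exact solve with numpy, on a hypothetical two-state chain (solving the linear system directly is numerically preferable to forming the inverse):

```python
import numpy as np

# Hypothetical two-state chain under a fixed policy:
# state 0 stays or moves to the absorbing state 1 with equal probability.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])   # P^pi[s, s']
R_pi = np.array([1.0, 0.0])     # expected one-step reward in each state
gamma = 0.9

# Solve (I - gamma P^pi) V = R^pi rather than inverting explicitly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)  # V(s0) = 1 / (1 - 0.45) ≈ 1.818, V(s1) = 0
```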

Theorem

Bellman Optimality Equation

Statement

The optimal value function $V^*$ satisfies:

$$V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V^*(s') \right]$$

Equivalently, the optimal action-value function satisfies:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s', a')$$

Intuition

The optimal value of a state is obtained by choosing the action that maximizes the immediate reward plus discounted future value. The max replaces the expectation over the policy because the optimal agent always picks the best action.

Proof Sketch

Define $V^*(s) = \sup_\pi V^\pi(s)$. Show that any policy that is greedy with respect to $V^*$ achieves $V^*$. The max arises because the optimal policy is deterministic for finite MDPs (for any optimal value function, a greedy deterministic policy is optimal).

Why It Matters

Unlike the Bellman expectation equation, the Bellman optimality equation is nonlinear (because of the max). It cannot be solved by matrix inversion. Instead, we solve it iteratively via value iteration or policy iteration.

Theorem

Bellman Operator is a Contraction

Statement

Define the Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ by:

$$(\mathcal{T}V)(s) = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) \, V(s') \right]$$

Then $\mathcal{T}$ is a $\gamma$-contraction in the $\ell^\infty$ norm:

$$\|\mathcal{T}V - \mathcal{T}U\|_\infty \leq \gamma \|V - U\|_\infty$$

By the Banach fixed point theorem, $\mathcal{T}$ has a unique fixed point $V^*$, and value iteration $V_{k+1} = \mathcal{T}V_k$ converges to $V^*$ at rate $\gamma^k$.

Intuition

Applying the Bellman update brings any value estimate closer to the true optimal value. The discount factor $\gamma < 1$ is doing the heavy lifting: each application of $\mathcal{T}$ shrinks errors by a factor of $\gamma$.

Proof Sketch

For any two value functions $V, U$ and any state $s$, let $a^*$ achieve the max for $V$. Then:

$$(\mathcal{T}V)(s) - (\mathcal{T}U)(s) \leq \gamma \sum_{s'} P(s'|s,a^*)\,(V(s') - U(s')) \leq \gamma \|V - U\|_\infty$$

By symmetry the same bound holds with $V$ and $U$ swapped, giving the result.

Why It Matters

This theorem guarantees that value iteration converges, and tells us the rate: after $k$ iterations, $\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$. The contraction property is the reason dynamic programming works.
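The bound is easy to spot-check numerically. The sketch below builds a small two-state, two-action MDP with made-up transitions and rewards, iterates the operator, and asserts the $\gamma^k$ error bound at every step:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def T(V):
    # Bellman optimality operator: (T V)(s) = max_a [R(s,a) + gamma E[V(s')]]
    return np.max(R + gamma * np.einsum('ast,t->sa', P, V), axis=1)

# Converge to (numerical) V* by iterating far past any tolerance we test.
V_star = np.zeros(2)
for _ in range(1000):
    V_star = T(V_star)

# Check ||V_k - V*|| <= gamma^k ||V_0 - V*|| for the first 20 iterations.
V = np.zeros(2)
err0 = np.max(np.abs(V - V_star))
for k in range(1, 21):
    V = T(V)
    assert np.max(np.abs(V - V_star)) <= gamma**k * err0 + 1e-9
print("gamma^k bound verified")
```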

Failure Mode

If $\gamma = 1$ (undiscounted), the operator is not a contraction and value iteration may not converge. The undiscounted case requires additional structure (e.g., all policies reach a terminal state).

Theorem

Policy Improvement Theorem

Statement

Let $\pi$ be a policy and define the greedy policy $\pi'$ by:

$$\pi'(s) = \arg\max_{a} Q^\pi(s, a)$$

Then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s \in \mathcal{S}$, with equality for all $s$ if and only if $\pi$ is already optimal.

Intuition

If you evaluate a policy, then act greedily with respect to the resulting value function, you can only do better. This is the engine behind policy iteration: evaluate, improve, repeat.

Proof Sketch

Start from $V^\pi(s) \leq \max_a Q^\pi(s,a) = Q^\pi(s, \pi'(s))$. Expand the right side using the Bellman equation and iterate, obtaining $V^\pi(s) \leq V^{\pi'}(s)$.

Why It Matters

Combined with policy evaluation, this gives the policy iteration algorithm: evaluate $\pi$ exactly, improve to $\pi'$, and repeat. Since there are finitely many deterministic policies and each step improves the value, policy iteration terminates at the optimal policy in finitely many steps.

Algorithms from the Theory

Policy Evaluation. Given a fixed policy $\pi$, solve $V^\pi = R^\pi + \gamma P^\pi V^\pi$ either by matrix inversion (exact, $O(|\mathcal{S}|^3)$) or by iterative application of the Bellman expectation operator (converges at rate $\gamma^k$).

Policy Iteration. Alternate between:

  1. Evaluate: Compute $V^\pi$ exactly
  2. Improve: Set $\pi'(s) = \arg\max_a Q^\pi(s,a)$ for all $s$
  3. Repeat until $\pi' = \pi$

Policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of deterministic policies), but in practice converges much faster.
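The three steps can be sketched in a few lines of numpy; the two-state, two-action MDP below is hypothetical, chosen only to exercise the loop:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'] and R[s, a] are made up.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
n = 2  # number of states

pi = np.zeros(n, dtype=int)  # arbitrary initial deterministic policy
while True:
    # 1. Evaluate: solve (I - gamma P^pi) V = R^pi exactly.
    P_pi = P[pi, np.arange(n)]           # row s is P(.|s, pi(s))
    R_pi = R[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    # 2. Improve: greedy wrt Q^pi(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) V(s').
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    pi_new = np.argmax(Q, axis=1)
    # 3. Stop when the policy is stable.
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print(pi, V)  # optimal deterministic policy and its exact value function
```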

Value Iteration. Repeatedly apply the Bellman optimality operator: $V_{k+1}(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')]$. This converges to $V^*$ by the contraction theorem. Extract the optimal policy at the end: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
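A matching sketch of value iteration with greedy policy extraction, on the same style of made-up two-state MDP:

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[a, s, s'] transitions, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q(s,a) backup
    V_new = np.max(Q, axis=1)                     # Bellman optimality update
    if np.max(np.abs(V_new - V)) < 1e-8:          # stop on small sup-norm change
        break
    V = V_new

pi_star = np.argmax(Q, axis=1)  # greedy policy from the final backup
print(pi_star, V_new)
```

Stopping when the sup-norm change is below a tolerance gives a value estimate whose true error is at most $\gamma/(1-\gamma)$ times that tolerance.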

Connection to Dynamic Programming

MDPs are the probabilistic generalization of deterministic dynamic programming. In a deterministic system, transitions are certain: $P(s'|s,a) = 1$ for a single $s'$. The Bellman optimality equation reduces to:

$$V^*(s) = \max_a [R(s,a) + \gamma V^*(f(s,a))]$$

where $f(s,a)$ is the deterministic next state. This is exactly the recursive structure exploited in classical DP algorithms (shortest paths, sequence alignment, etc.).

Common Confusions

Watch Out

MDP vs. bandit

A multi-armed bandit is an MDP with a single state. There are no transitions and no sequential structure. The entire difficulty of MDPs comes from the fact that actions affect future states. In bandits, the challenge is purely exploration vs. exploitation within a single decision.

Watch Out

Value iteration vs. policy iteration

Value iteration updates the value function directly using the Bellman optimality operator. Policy iteration alternates between exact policy evaluation and greedy improvement. Policy iteration often converges in fewer iterations (each iteration is more expensive), but value iteration avoids solving a linear system at each step. For large state spaces, both are replaced by approximate methods.

Watch Out

Discount factor as a modeling choice vs. technical necessity

Students often ask why we need $\gamma < 1$. There are two reasons: (1) it models the preference for sooner rewards, and (2) it is technically necessary for the Bellman operator to be a contraction. Without discounting, value functions can be infinite and convergence guarantees break.

Summary

  • An MDP is defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$
  • $V^\pi(s)$ is the expected discounted return under policy $\pi$ from state $s$
  • The Bellman expectation equation is linear: $V^\pi = R^\pi + \gamma P^\pi V^\pi$
  • The Bellman optimality equation is nonlinear: $V^*(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')]$
  • The Bellman optimality operator is a $\gamma$-contraction, so value iteration converges
  • Policy iteration (evaluate exactly, then improve greedily) is guaranteed to find the optimal policy

Exercises

ExerciseCore

Problem

Consider an MDP with two states $\{s_1, s_2\}$, one action $\{a\}$, transitions $P(s_1|s_1,a) = 0.5$, $P(s_2|s_1,a) = 0.5$, $P(s_2|s_2,a) = 1$, rewards $R(s_1,a) = 1$, $R(s_2,a) = 0$, and $\gamma = 0.9$. Compute $V^\pi(s_1)$ and $V^\pi(s_2)$.

ExerciseCore

Problem

Prove that the Bellman expectation operator $\mathcal{T}^\pi V = R^\pi + \gamma P^\pi V$ is also a $\gamma$-contraction in the $\ell^\infty$ norm.

ExerciseAdvanced

Problem

Show that policy iteration terminates in a finite number of steps for any finite MDP with $\gamma < 1$.


References

Canonical:

  • Puterman, Markov Decision Processes (1994), Chapters 2, 6
  • Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 3-4

Current:

  • Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume II
  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1


Last reviewed: April 2026
