RL Theory
Markov Decision Processes
The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.
Why This Matters
Every reinforcement learning algorithm is either solving an MDP directly or approximating a solution to one. When AlphaGo plays a move, it is approximately solving an MDP. When a robot learns to walk, the problem is formulated as an MDP. When a language model is fine-tuned with RLHF, the training loop treats token generation as an MDP.
MDPs provide the mathematical language for sequential decision-making. Without understanding MDPs, reinforcement learning is just heuristics.
Mental Model
You are an agent in an environment. At each time step, you observe a state, take an action, receive a reward, and transition to a new state. Your goal is to choose actions that maximize the total reward you accumulate over time. The transition to the next state depends only on the current state and action, not on the history. This is the Markov property.
The Bellman equations express a recursive structure: the value of being in a state equals the immediate reward plus the discounted value of the next state. This recursion is the engine behind every dynamic programming and RL algorithm.
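The observe–act–reward–transition loop can be sketched in code. Everything below (the two-state MDP, its transition probabilities, and its rewards) is a made-up toy example for illustration, not something from the text:

```python
import numpy as np

# A minimal agent-environment loop for a toy 2-state, 2-action MDP.
# All states, transition probabilities, and rewards are invented.
rng = np.random.default_rng(0)

P = {  # P[(s, a)] = distribution over next states
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): -1.0}

def policy(s):
    # A uniformly random policy over the two actions.
    return rng.integers(2)

s, total_reward = 0, 0.0
for t in range(100):
    a = policy(s)
    total_reward += R[(s, a)]
    # Markov property: the next state depends only on (s, a), not on history.
    s = rng.choice(2, p=P[(s, a)])
```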
Formal Setup and Notation
Markov Decision Process
A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function, where $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ when taking action $a$ in state $s$
- $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, where $R(s, a)$ is the expected immediate reward for taking action $a$ in state $s$
- $\gamma \in [0, 1)$ is the discount factor
Policy
A policy $\pi(a \mid s)$ maps each state to a probability distribution over actions. A deterministic policy maps each state to a single action: $\pi : \mathcal{S} \to \mathcal{A}$.
Return
The return from time step $t$ is the discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma < 1$ ensures this sum is finite when rewards are bounded.
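As a quick numeric check of the return formula (the reward sequence and $\gamma = 0.9$ are arbitrary):

```python
# Discounted return G_0 for a short reward sequence, assuming gamma = 0.9.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
# 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```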
Core Definitions
State-Value Function
The state-value function for a policy $\pi$ gives the expected return starting from state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$
Action-Value Function
The action-value function for a policy $\pi$ gives the expected return starting from state $s$, taking action $a$, and following $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$

The relationship between $V^\pi$ and $Q^\pi$ is:

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$
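The relationship is a single weighted sum per state; a small numeric sketch with made-up $Q$ and $\pi$ values:

```python
import numpy as np

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a), on illustrative values.
Q = np.array([[1.0, 3.0],
              [0.0, 2.0]])    # Q[s, a]
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])   # pi[s, a] = pi(a|s); rows sum to 1
V = (pi * Q).sum(axis=1)
# V = [0.5*1 + 0.5*3, 0.2*0 + 0.8*2] = [2.0, 1.6]
```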
Main Theorems
Bellman Expectation Equation
Statement
For any policy $\pi$, the value function $V^\pi$ satisfies:

$$V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right]$$

In matrix form, writing $r^\pi$ for the expected reward vector under $\pi$ and $P^\pi$ for the transition matrix under $\pi$:

$$v^\pi = r^\pi + \gamma P^\pi v^\pi$$
Intuition
The value of a state is the expected immediate reward plus the discounted expected value of the next state. This recursion expresses the idea that the future looks the same from any point in a Markov process, so we can break the infinite-horizon problem into one step plus the rest.
Proof Sketch
Expand the definition of $V^\pi$, split the sum into the first reward and the remaining terms, and use the Markov property to replace the conditional expectation of future returns with $V^\pi(S_{t+1})$.
Why It Matters
The Bellman expectation equation is a system of $|\mathcal{S}|$ linear equations in $|\mathcal{S}|$ unknowns. For a fixed policy, the value function can be computed exactly by solving $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$. This is the basis of policy evaluation.
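A minimal sketch of exact policy evaluation via a linear solve, on an invented two-state chain (the transition matrix and rewards are illustrative, not from the text):

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma P^pi) v = r^pi.
gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])   # P_pi[s, s'] under the fixed policy (rows sum to 1)
r_pi = np.array([1.0, 0.0])     # expected one-step reward in each state

v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: v satisfies the Bellman expectation equation.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
```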
Bellman Optimality Equation
Statement
The optimal value function $V^*$ satisfies:

$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$$

Equivalently, the optimal action-value function $Q^*$ satisfies:

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$
Intuition
The optimal value of a state is obtained by choosing the action that maximizes the immediate reward plus discounted future value. The max replaces the expectation over the policy because the optimal agent always picks the best action.
Proof Sketch
Define $V^*(s) = \max_\pi V^\pi(s)$. Show that any policy that is greedy with respect to $V^*$ achieves $V^*$. The max arises because the optimal policy is deterministic for finite MDPs (for any optimal value function, a greedy deterministic policy is optimal).
Why It Matters
Unlike the Bellman expectation equation, the Bellman optimality equation is nonlinear (because of the max). It cannot be solved by matrix inversion. Instead, we solve it iteratively via value iteration or policy iteration.
Bellman Operator is a Contraction
Statement
Define the Bellman optimality operator $T$ by:

$$(TV)(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]$$

Then $T$ is a $\gamma$-contraction in the $\|\cdot\|_\infty$ norm:

$$\|TV_1 - TV_2\|_\infty \le \gamma\, \|V_1 - V_2\|_\infty$$

By the Banach fixed-point theorem, $T$ has a unique fixed point $V^*$, and value iteration converges to $V^*$ at rate $\gamma^k$.
Intuition
Applying the Bellman update brings any value estimate closer to the true optimal value. The discount factor is doing the heavy lifting: each application of $T$ shrinks errors by a factor of $\gamma$.
Proof Sketch
For any two value functions $V_1, V_2$ and any state $s$, let $a^*$ achieve the max for $(TV_1)(s)$. Then:

$$(TV_1)(s) - (TV_2)(s) \le \gamma \sum_{s'} P(s' \mid s, a^*)\left[ V_1(s') - V_2(s') \right] \le \gamma\, \|V_1 - V_2\|_\infty$$

By symmetry the same bound holds with $V_1$ and $V_2$ swapped, giving the result.
Why It Matters
This theorem guarantees that value iteration converges, and tells us the rate. After $k$ iterations, $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$. The contraction property is the reason dynamic programming works.
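One way to see the contraction concretely: apply $T$ to two arbitrary value vectors on a small MDP and compare sup-norm distances. The MDP below is randomly generated purely for illustration:

```python
import numpy as np

# Empirical check of the gamma-contraction:
#   ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf
rng = np.random.default_rng(1)
gamma = 0.9
nS, nA = 2, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over s'
R = rng.standard_normal((nS, nA))

def T(V):
    # Bellman optimality operator: (T V)(s) = max_a [R(s,a) + gamma * E_{s'} V(s')]
    return (R + gamma * P @ V).max(axis=1)

V1, V2 = rng.standard_normal(nS), rng.standard_normal(nS)
lhs = np.abs(T(V1) - T(V2)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12
```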
Failure Mode
If $\gamma = 1$ (undiscounted), the operator is not a contraction and value iteration may not converge. The undiscounted case requires additional structure (e.g., all policies reach a terminal state).
Policy Improvement Theorem
Statement
Let $\pi$ be a policy and define the greedy policy $\pi'$ by:

$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$

Then $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$, with equality for all $s$ if and only if $\pi$ is already optimal.
Intuition
If you evaluate a policy, then act greedily with respect to the resulting value function, you can only do better. This is the engine behind policy iteration: evaluate, improve, repeat.
Proof Sketch
Start from $V^\pi(s) \le \max_a Q^\pi(s, a) = Q^\pi(s, \pi'(s))$. Expand the right side using the Bellman equation and iterate, obtaining $V^\pi(s) \le V^{\pi'}(s)$.
Why It Matters
Combined with policy evaluation, this gives the policy iteration algorithm: evaluate $\pi_k$ exactly, improve to $\pi_{k+1}$, and repeat. Since there are finitely many deterministic policies and each step improves the value, policy iteration terminates at the optimal policy in finitely many steps.
Algorithms from the Theory
Policy Evaluation. Given a fixed policy $\pi$, solve $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$ either by matrix inversion (exact, $O(|\mathcal{S}|^3)$) or by iterative application of the Bellman expectation operator (converges at rate $\gamma$).
Policy Iteration. Alternate between:
- Evaluate: Compute $V^{\pi_k}$ exactly
- Improve: Set $\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s, a)$ for all $s$
- Repeat until $\pi_{k+1} = \pi_k$
Policy iteration converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of deterministic policies), but in practice converges much faster.
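The evaluate-improve loop can be sketched compactly on an invented two-state, two-action MDP (all numbers below are illustrative):

```python
import numpy as np

# Policy iteration: exact evaluation (linear solve) + greedy improvement,
# repeated until the policy is stable.
gamma = 0.9
# P[s, a, s'] and R[s, a] are made-up numbers.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
nS, nA = R.shape

pi = np.zeros(nS, dtype=int)           # arbitrary initial deterministic policy
while True:
    # Evaluate: solve (I - gamma P^pi) v = r^pi exactly.
    P_pi = P[np.arange(nS), pi]        # (nS, nS) transition matrix under pi
    r_pi = R[np.arange(nS), pi]
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    # Improve: act greedily with respect to Q^pi(s, a).
    Q = R + gamma * P @ v              # (nS, nA)
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break                          # policy stable => optimal
    pi = pi_new
```

At termination the value function satisfies the Bellman optimality equation, which is a useful sanity check.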
Value Iteration. Repeatedly apply the Bellman optimality operator: $V_{k+1} = TV_k$. This converges to $V^*$ by the contraction theorem. Extract the optimal policy at the end: $\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$.
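A value-iteration sketch on an invented two-state, two-action MDP (numbers are illustrative), including the final greedy policy extraction:

```python
import numpy as np

# Value iteration: V <- T V until the sup-norm change is tiny,
# then extract the greedy policy.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s'], made-up
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])                # R[s, a], made-up

V = np.zeros(2)
for _ in range(1000):
    V_new = (R + gamma * P @ V).max(axis=1)   # apply the Bellman operator T
    done = np.abs(V_new - V).max() < 1e-10
    V = V_new
    if done:
        break
pi_star = (R + gamma * P @ V).argmax(axis=1)  # greedy policy w.r.t. V ~ V*
```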
Connection to Dynamic Programming
MDPs are the probabilistic generalization of deterministic dynamic programming. In a deterministic system, transitions are certain: $P(s' \mid s, a) = 1$ for a single $s'$. The Bellman optimality equation reduces to:

$$V^*(s) = \max_a \left[ R(s, a) + \gamma\, V^*(f(s, a)) \right]$$

where $f(s, a)$ is the deterministic next state. This is exactly the recursive structure exploited in classical DP algorithms (shortest paths, sequence alignment, etc.).
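In the deterministic case the recursion can be evaluated in a single backward pass. A tiny example on a made-up four-node DAG, taking $\gamma = 1$ (safe here because every path terminates):

```python
# Deterministic special case: the Bellman recursion on a small DAG,
# maximizing total reward along a path. Node names and rewards are made up.
# next_state[(s, a)] is the single certain successor f(s, a).
next_state = {("start", "up"): "mid1", ("start", "down"): "mid2",
              ("mid1", "go"): "end", ("mid2", "go"): "end"}
reward = {("start", "up"): 1.0, ("start", "down"): 3.0,
          ("mid1", "go"): 5.0, ("mid2", "go"): 2.0}

V = {"end": 0.0}
# Process states in reverse topological order so each successor is solved first.
for s in ["mid1", "mid2", "start"]:
    acts = [a for (s2, a) in next_state if s2 == s]
    V[s] = max(reward[(s, a)] + V[next_state[(s, a)]] for a in acts)
# V["start"] = max(1 + 5, 3 + 2) = 6.0
```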
Common Confusions
MDP vs. bandit
A multi-armed bandit is an MDP with a single state. There are no transitions and no sequential structure. The entire difficulty of MDPs comes from the fact that actions affect future states. In bandits, the challenge is purely exploration vs. exploitation within a single decision.
Value iteration vs. policy iteration
Value iteration updates the value function directly using the Bellman optimality operator. Policy iteration alternates between exact policy evaluation and greedy improvement. Policy iteration often converges in fewer iterations (each iteration is more expensive), but value iteration avoids solving a linear system at each step. For large state spaces, both are replaced by approximate methods.
Discount factor as a modeling choice vs. technical necessity
Students often ask why we need $\gamma < 1$. There are two reasons: (1) it models the preference for sooner rewards, and (2) it is technically necessary for the Bellman operator to be a contraction. Without discounting, value functions can be infinite and convergence guarantees break.
Summary
- An MDP is defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$
- $V^\pi(s)$ is the expected discounted return under policy $\pi$ from state $s$
- The Bellman expectation equation is linear: $v^\pi = r^\pi + \gamma P^\pi v^\pi$
- The Bellman optimality equation is nonlinear: $V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$
- The Bellman optimality operator $T$ is a $\gamma$-contraction, so value iteration converges
- Policy iteration: evaluate exactly, then improve greedily; guaranteed to find the optimal policy
Exercises
Problem
Consider an MDP with two states $\{s_1, s_2\}$, one action $a$, transitions $P(s_1 \mid s_1, a) = 0.5$, $P(s_2 \mid s_1, a) = 0.5$, $P(s_2 \mid s_2, a) = 1$, rewards $R(s_1, a) = 1$, $R(s_2, a) = 0$, and $\gamma = 0.9$. Compute $V(s_1)$ and $V(s_2)$.
Problem
Prove that the Bellman expectation operator $T^\pi$, defined by $(T^\pi V)(s) = \sum_a \pi(a \mid s)\left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]$, is also a $\gamma$-contraction in the $\|\cdot\|_\infty$ norm.
Problem
Show that policy iteration terminates in a finite number of steps for any finite MDP with $\gamma < 1$.
References
Canonical:
- Puterman, Markov Decision Processes (1994), Chapters 2, 6
- Sutton & Barto, Reinforcement Learning: An Introduction (2018), Chapters 3-4
Current:
- Bertsekas, Dynamic Programming and Optimal Control (4th ed.), Volume II
- Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 1
Next Topics
The natural next steps from MDPs:
- Policy gradient theorem: optimizing parameterized policies directly
- Dynamic programming: the algorithmic paradigm behind value iteration and policy iteration
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
Builds on This
- Active SLAM and POMDPs (Layer 4)
- Agentic RL and Tool Use (Layer 5)
- Bellman Equations (Layer 2)
- The Era of Experience (Layer 4)
- Exploration vs Exploitation (Layer 2)
- Markov Games and Self-Play (Layer 3)
- Mean-Field Games (Layer 4)
- Model-Based Reinforcement Learning (Layer 3)
- Options and Temporal Abstraction (Layer 3)
- Policy Gradient Theorem (Layer 3)
- Policy Representations (Layer 3)
- Reinforcement Learning Environments and Benchmarks (Layer 3)
- Reward Design and Reward Misspecification (Layer 3)
- RLHF and Alignment (Layer 4)
- Robust Adversarial Policies (Layer 4)
- Self-Play and Multi-Agent RL (Layer 3)
- Temporal Difference Learning (Layer 2)
- Value Iteration and Policy Iteration (Layer 2)
- World Models and Planning (Layer 4)