Robust Adversarial Policies
Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.
Why This Matters
Standard RL optimizes a policy for a single MDP. In deployment, the true environment dynamics differ from the training simulator. This gap (sim-to-real, distribution shift, model misspecification) can cause catastrophic failure. Robust MDPs address this by optimizing against the worst-case dynamics within a plausible set, producing policies that degrade gracefully rather than failing abruptly.
Mental Model
Think of a two-player game. The agent chooses actions to maximize cumulative reward. An adversary (Nature) chooses the transition dynamics to minimize it, subject to staying within a specified uncertainty set. The robust policy is the agent's minimax strategy in this game.
Formal Setup
Robust MDP
A robust MDP extends a standard MDP $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ by replacing the fixed transition kernel $P$ with an uncertainty set $\mathcal{P}$. The set $\mathcal{P}$ contains all transition kernels the agent considers plausible. The agent maximizes the worst-case value:
$$\pi^* = \arg\max_{\pi} \min_{P \in \mathcal{P}} V^{\pi}_{P},$$
where $V^{\pi}_{P}$ is the value of policy $\pi$ under dynamics $P$.
Rectangular Uncertainty Set
An uncertainty set $\mathcal{P}$ is $(s,a)$-rectangular if it decomposes as a product:
$$\mathcal{P} = \prod_{(s,a) \in \mathcal{S} \times \mathcal{A}} \mathcal{P}(s,a),$$
where $\mathcal{P}(s,a) \subseteq \Delta(\mathcal{S})$ constrains the transition distribution independently for each state-action pair. Rectangularity is what makes dynamic programming tractable for robust MDPs.
Common choices for $\mathcal{P}(s,a)$, each centered at a nominal model $\bar{P}(\cdot \mid s,a)$: (1) Total variation ball of radius $\epsilon$: $\{p \in \Delta(\mathcal{S}) : \tfrac{1}{2}\|p - \bar{P}(\cdot \mid s,a)\|_1 \le \epsilon\}$. (2) KL divergence ball: $\{p : D_{\mathrm{KL}}(p \,\|\, \bar{P}(\cdot \mid s,a)) \le \epsilon\}$. (3) Wasserstein ball: $\{p : W_1(p, \bar{P}(\cdot \mid s,a)) \le \epsilon\}$.
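For the total-variation ball, the inner (adversarial) minimization has a simple greedy solution: shift probability mass, up to the radius, from high-value successor states onto the lowest-value one. A minimal sketch, assuming a tabular setting; the function name and the example numbers are illustrative, not from any particular library:

```python
import numpy as np

def tv_ball_worst_case(p_nominal, values, radius):
    """Worst-case expected value over the total-variation ball
    {p : 0.5 * ||p - p_nominal||_1 <= radius}.

    The minimizing adversary piles mass onto the lowest-value
    successor state, draining it from the highest-value states.
    """
    p = np.array(p_nominal, dtype=float)
    v = np.asarray(values, dtype=float)
    worst = np.argmin(v)
    budget = min(radius, 1.0 - p[worst])   # can't exceed probability 1
    p[worst] += budget
    for s in np.argsort(v)[::-1]:          # drain high-value states first
        if s == worst or budget <= 0:
            continue
        take = min(p[s], budget)
        p[s] -= take
        budget -= take
    return float(p @ v)

# Nominal puts 0.5/0.5 on values (1.0, 0.0); radius 0.2 shifts 0.2 mass
# onto the zero-value state, dropping the expectation from 0.5 to ~0.3.
print(tv_ball_worst_case([0.5, 0.5], [1.0, 0.0], 0.2))
```

The KL and Wasserstein balls do not admit this greedy form; their inner problems are still convex and are typically solved via a one-dimensional dual.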
Main Theorems
Robust Bellman Equation
Statement
Under a rectangular uncertainty set, the robust value function satisfies:
$$V^*(s) = \max_{a \in \mathcal{A}} \left[ r(s,a) + \gamma \min_{p \in \mathcal{P}(s,a)} \sum_{s'} p(s') \, V^*(s') \right].$$
The inner minimization is a convex optimization problem over the local uncertainty set $\mathcal{P}(s,a)$.
Intuition
Rectangularity means the adversary chooses transition probabilities independently at each state-action pair. This decouples the adversary's strategy across time steps, enabling a Bellman-style recursion. Without rectangularity, the adversary could correlate its choices across states, breaking dynamic programming.
Proof Sketch
The key step is showing that the minimax over the adversary's policy can be interchanged with the dynamic programming recursion when the uncertainty set is rectangular. Under rectangularity, the adversary's global strategy is a product of local strategies. By a minimax theorem (the inner problem is convex-concave), the order of max and min can be swapped at each step. The result is a contraction operator, so a unique fixed point exists by Banach's fixed-point theorem.
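The contraction step of the sketch can be written out explicitly. This is a standard sup-norm argument, stated here under the rectangularity assumption:

```latex
% Robust Bellman operator (rectangular uncertainty set):
(\mathcal{T}V)(s) = \max_{a}\Big[\, r(s,a)
  + \gamma \min_{p \in \mathcal{P}(s,a)} \textstyle\sum_{s'} p(s')\,V(s') \Big]

% For any value functions V, W and any fixed (s,a):
\Big| \min_{p} \textstyle\sum_{s'} p(s')\,V(s')
  - \min_{p} \textstyle\sum_{s'} p(s')\,W(s') \Big|
  \;\le\; \max_{p \in \mathcal{P}(s,a)} \textstyle\sum_{s'} p(s')\,\big|V(s') - W(s')\big|
  \;\le\; \|V - W\|_{\infty}

% Hence the operator is a gamma-contraction in the sup norm:
\|\mathcal{T}V - \mathcal{T}W\|_{\infty} \;\le\; \gamma\,\|V - W\|_{\infty}
```

Banach's fixed-point theorem then gives a unique fixed point $V^*$, and robust value iteration converges to it geometrically.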
Why It Matters
This theorem makes robust MDPs computationally tractable. Without it, solving a robust MDP would require optimizing over all possible adversary strategies jointly across time, an intractable problem. With it, robust value iteration has the same structure as standard value iteration, with an added inner minimization at each step.
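The "standard value iteration plus an inner minimization" structure can be sketched directly. A minimal NumPy implementation, assuming an $(s,a)$-rectangular total-variation uncertainty set with a common radius; the function names and the toy two-state MDP at the bottom are illustrative assumptions, not from the literature:

```python
import numpy as np

def tv_worst_case(p_nominal, v, radius):
    """min_p p @ v over {p : 0.5 * ||p - p_nominal||_1 <= radius}."""
    p = np.array(p_nominal, dtype=float)
    worst = np.argmin(v)
    budget = min(radius, 1.0 - p[worst])
    p[worst] += budget                     # pile mass on the worst state
    for s in np.argsort(v)[::-1]:          # drain high-value states first
        if s == worst or budget <= 0:
            continue
        take = min(p[s], budget)
        p[s] -= take
        budget -= take
    return float(p @ v)

def robust_value_iteration(P, R, radius, gamma=0.9, tol=1e-8):
    """Robust VI: standard VI with an adversarial inner minimization.

    P: (S, A, S) nominal transition probabilities, R: (S, A) rewards.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = np.array([[R[s, a] + gamma * tv_worst_case(P[s, a], V, radius)
                       for a in range(A)] for s in range(S)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state, 1-action MDP: state 0 yields reward 1 and nominally stays put.
P = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
R = np.array([[1.0], [0.0]])
V_nominal = robust_value_iteration(P, R, radius=0.0)  # V[0] = 1/(1-0.9) = 10
V_robust = robust_value_iteration(P, R, radius=0.1)   # V[0] = 1/(1-0.81) ≈ 5.26
```

With radius 0.1 the adversary leaks 0.1 probability mass per step from the rewarding state into the absorbing zero-reward state, cutting the value roughly in half; this is the conservatism-performance tradeoff in miniature.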
Failure Mode
Non-rectangular uncertainty sets break the Bellman recursion. If the adversary can correlate perturbations across different states, the adversary's optimal strategy depends on the full trajectory, not just local transitions. This makes the problem much harder (PSPACE-hard in general). Also, overly large uncertainty sets produce conservative policies that perform poorly in the nominal environment.
Connection to Distributionally Robust Optimization
Robust MDPs are the RL counterpart of distributionally robust optimization (DRO). In supervised learning, DRO minimizes the worst-case expected loss over a set of distributions near the empirical distribution. In RL, the robust MDP minimizes the worst-case return over dynamics near the estimated model.
The radius of the uncertainty set controls the conservatism-performance tradeoff. Too small: the policy is fragile to model error. Too large: the policy is overly conservative, sacrificing nominal performance for robustness that may not be needed.
Adversarial Policies as Attacks
A separate line of work studies adversarial policies as attacks: given a trained agent, can a second agent learn to exploit it? Gleave et al. (2020) showed that adversarial policies can defeat state-of-the-art RL agents by taking seemingly random actions that create unusual observations for the victim. The attack works not by being physically dominant but by pushing the victim out of distribution.
This connects back to robust MDPs: if the agent is robust to perturbations in its observation model, adversarial policies become less effective.
Common Confusions
Robust MDP is not the same as reward robustness
Robust MDPs typically perturb transition dynamics, not the reward function. Reward robustness is a separate concern (reward misspecification, reward hacking). The two problems have different mathematical structures and different solutions.
Rectangularity is an assumption, not a theorem
Rectangularity makes the math work. But in reality, model errors are often correlated across states (e.g., a systematic bias in the dynamics model). Using a rectangular uncertainty set may underestimate the true worst case because the adversary is artificially weakened by the independence assumption.
Exercises
Problem
Consider a robust MDP with two states $s_1, s_2$, one action $a$, and a total variation uncertainty set of radius $\epsilon$ around a nominal transition $\bar{P}(\cdot \mid s_1, a)$. The values are $V(s_1)$ and $V(s_2)$, with $V(s_1) > V(s_2)$. Compute the worst-case expected next-state value for $(s_1, a)$ in terms of $\epsilon$, $\bar{P}$, $V(s_1)$, and $V(s_2)$.
Problem
Explain why non-rectangular uncertainty sets make the robust Bellman equation invalid. Give a concrete example where an adversary benefits from correlating perturbations across two different states.
References
Canonical:
- Iyengar, "Robust Dynamic Programming," Mathematics of Operations Research (2005)
- Nilim & El Ghaoui, "Robust Control of Markov Decision Processes," Operations Research (2005)
Current:
- Gleave et al., "Adversarial Policies: Attacking Deep Reinforcement Learning," ICLR (2020)
- Panaganti & Kalathil, "Robust Reinforcement Learning using Offline Data," NeurIPS (2022)
Next Topics
- Distributionally robust optimization: the supervised learning counterpart of robust MDPs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Minimax Lower Bounds (Layer 3)
- Maximum Likelihood Estimation (Layer 0B)