
RL Theory

Robust Adversarial Policies

Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.


Why This Matters

Standard RL optimizes a policy for a single MDP. In deployment, the true environment dynamics differ from the training simulator. This gap (sim-to-real, distribution shift, model misspecification) can cause catastrophic failure. Robust MDPs address this by optimizing against the worst-case dynamics within a plausible set, producing policies that degrade gracefully rather than failing abruptly.

Mental Model

Think of a two-player game. The agent chooses actions to maximize cumulative reward. An adversary (Nature) chooses the transition dynamics to minimize it, subject to staying within a specified uncertainty set. The robust policy is the agent's minimax strategy in this game.

Formal Setup

Definition

Robust MDP

A robust MDP extends a standard MDP by replacing the fixed transition kernel $P$ with an uncertainty set $\mathcal{U}$. The set $\mathcal{U}$ contains all transition kernels the agent considers plausible. The agent maximizes the worst-case value:

$$\max_{\pi} \min_{P \in \mathcal{U}} V^{\pi}_P$$

where $V^{\pi}_P$ is the value of policy $\pi$ under dynamics $P$.

Definition

Rectangular Uncertainty Set

An uncertainty set $\mathcal{U}$ is $(s,a)$-rectangular if it decomposes as a product:

$$\mathcal{U} = \prod_{(s,a)} \mathcal{U}_{s,a}$$

where $\mathcal{U}_{s,a}$ constrains the transition distribution $P(\cdot \mid s, a)$ independently for each state-action pair. Rectangularity is what makes dynamic programming tractable for robust MDPs.

Common choices for $\mathcal{U}_{s,a}$: (1) Total variation ball of radius $\alpha$ around a nominal $\hat{P}(\cdot \mid s, a)$. (2) KL divergence ball: $\{Q : D_{\text{KL}}(Q \,\|\, \hat{P}) \leq \alpha\}$. (3) Wasserstein ball: $\{Q : W_1(Q, \hat{P}) \leq \alpha\}$.

Main Theorems

Theorem

Robust Bellman Equation

Statement

Under a rectangular uncertainty set, the robust value function $V^*(s) = \max_\pi \min_{P \in \mathcal{U}} V^{\pi}_P(s)$ satisfies:

$$V^*(s) = \max_{a \in A} \left[ r(s, a) + \gamma \min_{P_{s,a} \in \mathcal{U}_{s,a}} \sum_{s'} P_{s,a}(s') \, V^*(s') \right]$$

The inner minimization is a convex optimization problem over the local uncertainty set $\mathcal{U}_{s,a}$.
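For a total variation ball the inner minimization even has a closed-form greedy solution: a TV budget of $\alpha$ lets the adversary move up to $\alpha$ probability mass, and the worst case moves it from the highest-value successor states onto the lowest-value one. A minimal sketch (the function name is illustrative, and TV is taken as half the L1 distance):

```python
import numpy as np

def tv_worst_case(p_hat, V, alpha):
    """min over Q with TV(Q, p_hat) <= alpha of sum_s' Q(s') V(s').

    Greedy closed form: pile up to alpha of mass onto the single
    lowest-value state, draining the highest-value states first.
    """
    q = p_hat.astype(float).copy()
    order = np.argsort(V)             # states sorted by value, ascending
    lo = order[0]                     # lowest-value successor state
    budget = min(alpha, 1.0 - q[lo])  # cannot exceed the mass available
    q[lo] += budget
    for i in order[::-1]:             # drain highest-value states first
        if i == lo:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
        if budget <= 1e-12:
            break
    return float(q @ V)

p_hat = np.array([0.6, 0.3, 0.1])
V = np.array([5.0, 1.0, 8.0])
print(round(tv_worst_case(p_hat, V, 0.0), 6))  # 4.1 (nominal expectation)
print(round(tv_worst_case(p_hat, V, 0.2), 6))  # 3.0 (adversary moves 0.2 of mass)
```

For KL and Wasserstein balls there is no such one-line greedy rule, but the inner problem remains convex and can be solved by standard duality arguments.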

Intuition

Rectangularity means the adversary chooses transition probabilities independently at each state-action pair. This decouples the adversary's strategy across time steps, enabling a Bellman-style recursion. Without rectangularity, the adversary could correlate its choices across states, breaking dynamic programming.

Proof Sketch

The key step is showing that the minimax over the adversary's policy can be interchanged with the dynamic programming recursion when the uncertainty set is rectangular. Under rectangularity, the adversary's global strategy is a product of local strategies. By a minimax theorem (the inner problem is convex-concave), the order of max and min can be swapped at each step. The result is a contraction operator, so a unique fixed point exists by Banach's fixed-point theorem.

Why It Matters

This theorem makes robust MDPs computationally tractable. Without it, solving a robust MDP would require optimizing over all possible adversary strategies jointly across time, an intractable problem. With it, robust value iteration has the same structure as standard value iteration, with an added inner minimization at each step.
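Concretely, robust value iteration for a TV ball can be sketched in a few lines. Everything below (the toy MDP, the greedy TV inner solver, the function names) is illustrative, not a reference implementation:

```python
import numpy as np

def tv_worst_case(p_hat, V, alpha):
    # Worst-case expected next-state value over a TV ball of radius alpha:
    # shift up to alpha mass from the highest-value states to the lowest.
    q = p_hat.astype(float).copy()
    order = np.argsort(V)
    lo = order[0]
    budget = min(alpha, 1.0 - q[lo])
    q[lo] += budget
    for i in order[::-1]:
        if i == lo:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    return float(q @ V)

def robust_value_iteration(P_hat, r, gamma, alpha, tol=1e-10, max_iter=10_000):
    # P_hat: nominal kernel, shape (S, A, S); r: rewards, shape (S, A).
    # Identical to standard value iteration except for the inner min.
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = np.array([[r[s, a] + gamma * tv_worst_case(P_hat[s, a], V, alpha)
                       for a in range(A)] for s in range(S)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# A toy 2-state, 2-action MDP.
P_hat = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_nominal = robust_value_iteration(P_hat, r, gamma=0.9, alpha=0.0)
V_robust = robust_value_iteration(P_hat, r, gamma=0.9, alpha=0.2)
print(V_robust <= V_nominal + 1e-9)  # robustness never increases the value
```

With $\alpha = 0$ this reduces exactly to standard value iteration, which makes the comparison between nominal and robust values easy to run side by side.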

Failure Mode

Non-rectangular uncertainty sets break the Bellman recursion. If the adversary can correlate perturbations across different states, the adversary's optimal strategy depends on the full trajectory, not just local transitions. This makes the problem much harder (PSPACE-hard in general). Also, overly large uncertainty sets produce conservative policies that perform poorly in the nominal environment.

Connection to Distributionally Robust Optimization

Robust MDPs are the RL counterpart of distributionally robust optimization (DRO). In supervised learning, DRO minimizes the worst-case expected loss over a set of distributions near the empirical distribution. In RL, the robust MDP minimizes the worst-case return over dynamics near the estimated model.

The radius $\alpha$ of the uncertainty set controls the conservatism-performance tradeoff. Too small: the policy is fragile to model error. Too large: the policy is overly conservative, sacrificing nominal performance for robustness that may not be needed.
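The effect of $\alpha$ is easy to see numerically. Using a greedy solver for the TV-ball inner problem (an illustrative sketch, not library code), the guaranteed one-step value falls monotonically as the ball grows:

```python
import numpy as np

def tv_worst_case(p_hat, V, alpha):
    # Worst case over a TV ball of radius alpha: shift up to alpha
    # mass from the highest-value successors onto the lowest-value one.
    q = p_hat.astype(float).copy()
    order = np.argsort(V)
    lo = order[0]
    budget = min(alpha, 1.0 - q[lo])
    q[lo] += budget
    for i in order[::-1]:
        if i == lo:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    return float(q @ V)

p_hat = np.array([0.8, 0.15, 0.05])
V = np.array([10.0, 4.0, 0.0])
for alpha in (0.0, 0.1, 0.2, 0.3):
    print(alpha, round(tv_worst_case(p_hat, V, alpha), 4))
# Guaranteed values: 8.6, 7.6, 6.6, 5.6 -- each 0.1 of radius
# costs 1.0 of guaranteed value here. This is the price of robustness.
```

A policy optimized against a large $\alpha$ hedges against dynamics that may never occur, which is exactly the conservatism the paragraph above describes.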

Adversarial Policies as Attacks

A separate line of work studies adversarial policies as attacks: given a trained agent, can a second agent learn to exploit it? Gleave et al. (2020) showed that adversarial policies can defeat state-of-the-art RL agents by taking seemingly random actions that create unusual observations for the victim. The attack works not by being physically dominant but by pushing the victim out of distribution.

This connects back to robust MDPs: if the agent is robust to perturbations in its observation model, adversarial policies become less effective.

Common Confusions

Watch Out

Robust MDP is not the same as reward robustness

Robust MDPs typically perturb transition dynamics, not the reward function. Reward robustness is a separate concern (reward misspecification, reward hacking). The two problems have different mathematical structures and different solutions.

Watch Out

Rectangularity is an assumption, not a theorem

Rectangularity makes the math work. But in reality, model errors are often correlated across states (e.g., a systematic bias in the dynamics model). Using a rectangular uncertainty set may underestimate the true worst case because the adversary is artificially weakened by the independence assumption.

Exercises

ExerciseCore

Problem

Consider a robust MDP with two states, one action, and a total variation uncertainty set of radius $\alpha$ around a nominal transition distribution $\hat{P} = [0.8, 0.2]$. The values are $V^*(s_1) = 10$ and $V^*(s_2) = 2$. Compute the worst-case expected next-state value $\min_{P \in \mathcal{U}} \sum_{s'} P(s') V^*(s')$ for $\alpha = 0.1$.

ExerciseAdvanced

Problem

Explain why non-rectangular uncertainty sets make the robust Bellman equation invalid. Give a concrete example where an adversary benefits from correlating perturbations across two different states.

References

Canonical:

  • Iyengar, "Robust Dynamic Programming," Mathematics of Operations Research (2005)
  • Nilim & El Ghaoui, "Robust Control of Markov Decision Processes," Operations Research (2005)

Current:

  • Gleave et al., "Adversarial Policies: Attacking Deep Reinforcement Learning," ICLR (2020)
  • Panaganti & Kalathil, "Robust Reinforcement Learning using Offline Data," NeurIPS (2022)

Next Topics

  • Distributionally robust optimization: the supervised learning counterpart of robust MDPs

Last reviewed: April 2026
