
RL Theory

Robust Adversarial Policies

Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.


Why This Matters

Standard RL optimizes a policy for a single MDP. In deployment, the true environment dynamics differ from the training simulator. This gap (sim-to-real, distribution shift, model misspecification) can cause catastrophic failure. Robust MDPs address this by optimizing against the worst-case dynamics within a plausible set, producing policies that degrade gracefully rather than failing abruptly.

Mental Model

Think of a two-player game. The agent chooses actions to maximize cumulative reward. An adversary (Nature) chooses the transition dynamics to minimize it, subject to staying within a specified uncertainty set. The robust policy is the agent's minimax strategy in this game.

Formal Setup

Definition

Robust MDP

A robust MDP extends a standard MDP by replacing the fixed transition kernel $P$ with an uncertainty set $\mathcal{U}$. The set $\mathcal{U}$ contains all transition kernels the agent considers plausible. The agent maximizes the worst-case value:

$$\max_{\pi} \min_{P \in \mathcal{U}} V^{\pi}_P$$

where $V^{\pi}_P$ is the value of policy $\pi$ under dynamics $P$.

Definition

Rectangular Uncertainty Set

An uncertainty set $\mathcal{U}$ is $(s,a)$-rectangular if it decomposes as a product:

$$\mathcal{U} = \prod_{(s,a)} \mathcal{U}_{s,a}$$

where $\mathcal{U}_{s,a}$ constrains the transition distribution $P(\cdot \mid s, a)$ independently for each state-action pair. Rectangularity is what makes dynamic programming tractable for robust MDPs.

Common choices for $\mathcal{U}_{s,a}$: (1) Total variation ball of radius $\alpha$ around a nominal $\hat{P}(\cdot \mid s, a)$. (2) KL divergence ball: $\{Q : D_{\text{KL}}(Q \,\|\, \hat{P}) \leq \alpha\}$. (3) Wasserstein ball: $\{Q : W_1(Q, \hat{P}) \leq \alpha\}$.

Main Theorems

Theorem

Robust Bellman Equation

Statement

Under a rectangular uncertainty set, the robust value function $V^*(s) = \max_\pi \min_{P \in \mathcal{U}} V^{\pi}_P(s)$ satisfies:

$$V^*(s) = \max_{a \in A} \left[ r(s, a) + \gamma \min_{P_{s,a} \in \mathcal{U}_{s,a}} \sum_{s'} P_{s,a}(s') \, V^*(s') \right]$$

The inner minimization is a convex optimization problem over the local uncertainty set $\mathcal{U}_{s,a}$.
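For a total variation ball the inner minimization even has a closed-form greedy solution: a TV budget of $\alpha$ lets the adversary move up to $\alpha$ probability mass, and the worst case moves it from the highest-value successor states onto the lowest-value one. A minimal sketch (the function name is illustrative, and TV is taken as half the L1 distance):

```python
import numpy as np

def tv_worst_case(p_hat, V, alpha):
    """min over Q with TV(Q, p_hat) <= alpha of sum_s' Q(s') V(s').

    Greedy closed form: pile up to alpha of mass onto the single
    lowest-value state, draining the highest-value states first.
    """
    q = p_hat.astype(float).copy()
    order = np.argsort(V)             # states sorted by value, ascending
    lo = order[0]                     # lowest-value successor state
    budget = min(alpha, 1.0 - q[lo])  # cannot exceed the mass available
    q[lo] += budget
    for i in order[::-1]:             # drain highest-value states first
        if i == lo:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
        if budget <= 1e-12:
            break
    return float(q @ V)

p_hat = np.array([0.6, 0.3, 0.1])
V = np.array([5.0, 1.0, 8.0])
print(round(tv_worst_case(p_hat, V, 0.0), 6))  # 4.1 (nominal expectation)
print(round(tv_worst_case(p_hat, V, 0.2), 6))  # 3.0 (adversary moves 0.2 of mass)
```

For KL and Wasserstein balls there is no such one-line greedy rule, but the inner problem remains convex and can be solved by standard duality arguments.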

Intuition

Rectangularity means the adversary chooses transition probabilities independently at each state-action pair. This decouples the adversary's strategy across time steps, enabling a Bellman-style recursion. Without rectangularity, the adversary could correlate its choices across states, breaking dynamic programming.

Proof Sketch

The key step is showing that the minimax over the adversary's policy can be interchanged with the dynamic programming recursion when the uncertainty set is rectangular. Under rectangularity, the adversary's global strategy is a product of local strategies. By a minimax theorem (the inner problem is convex-concave), the order of max and min can be swapped at each step. The result is a contraction operator, so a unique fixed point exists by Banach's fixed-point theorem.

Why It Matters

This theorem makes robust MDPs computationally tractable. Without it, solving a robust MDP would require optimizing over all possible adversary strategies jointly across time, an intractable problem. With it, robust value iteration has the same structure as standard value iteration, with an added inner minimization at each step.
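Concretely, robust value iteration for a TV ball can be sketched in a few lines. Everything below (the toy MDP, the greedy TV inner solver, the function names) is illustrative, not a reference implementation:

```python
import numpy as np

def tv_worst_case(p_hat, V, alpha):
    # Worst-case expected next-state value over a TV ball of radius alpha:
    # shift up to alpha mass from the highest-value states to the lowest.
    q = p_hat.astype(float).copy()
    order = np.argsort(V)
    lo = order[0]
    budget = min(alpha, 1.0 - q[lo])
    q[lo] += budget
    for i in order[::-1]:
        if i == lo:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    return float(q @ V)

def robust_value_iteration(P_hat, r, gamma, alpha, tol=1e-10, max_iter=10_000):
    # P_hat: nominal kernel, shape (S, A, S); r: rewards, shape (S, A).
    # Identical to standard value iteration except for the inner min.
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = np.array([[r[s, a] + gamma * tv_worst_case(P_hat[s, a], V, alpha)
                       for a in range(A)] for s in range(S)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# A toy 2-state, 2-action MDP.
P_hat = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_nominal = robust_value_iteration(P_hat, r, gamma=0.9, alpha=0.0)
V_robust = robust_value_iteration(P_hat, r, gamma=0.9, alpha=0.2)
print(V_robust <= V_nominal + 1e-9)  # robustness never increases the value
```

With $\alpha = 0$ this reduces exactly to standard value iteration, which makes the comparison between nominal and robust values easy to run side by side.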

Failure Mode

Non-rectangular uncertainty sets break the Bellman recursion. If the adversary can correlate perturbations across different states, the adversary's optimal strategy depends on the full trajectory, not just local transitions. This makes the problem much harder (PSPACE-hard in general). Also, overly large uncertainty sets produce conservative policies that perform poorly in the nominal environment.

Connection to Distributionally Robust Optimization

Robust MDPs are the RL counterpart of distributionally robust optimization (DRO). In supervised learning, DRO minimizes the worst-case expected loss over a set of distributions near the empirical distribution. In RL, the robust MDP minimizes the worst-case return over dynamics near the estimated model.

The radius $\alpha$ of the uncertainty set controls the conservatism-performance tradeoff. Too small: the policy is fragile to model error. Too large: the policy is overly conservative, sacrificing nominal performance for robustness that may not be needed.
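The effect of $\alpha$ is easy to see numerically. Using a greedy solver for the TV-ball inner problem (an illustrative sketch, not library code), the guaranteed one-step value falls monotonically as the ball grows:

```python
import numpy as np

def tv_worst_case(p_hat, V, alpha):
    # Worst case over a TV ball of radius alpha: shift up to alpha
    # mass from the highest-value successors onto the lowest-value one.
    q = p_hat.astype(float).copy()
    order = np.argsort(V)
    lo = order[0]
    budget = min(alpha, 1.0 - q[lo])
    q[lo] += budget
    for i in order[::-1]:
        if i == lo:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    return float(q @ V)

p_hat = np.array([0.8, 0.15, 0.05])
V = np.array([10.0, 4.0, 0.0])
for alpha in (0.0, 0.1, 0.2, 0.3):
    print(alpha, round(tv_worst_case(p_hat, V, alpha), 4))
# Guaranteed values: 8.6, 7.6, 6.6, 5.6 -- each 0.1 of radius
# costs 1.0 of guaranteed value here. This is the price of robustness.
```

A policy optimized against a large $\alpha$ hedges against dynamics that may never occur, which is exactly the conservatism the paragraph above describes.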

Adversarial Policies as Attacks

A separate line of work studies adversarial policies as attacks: given a trained agent, can a second agent learn to exploit it? Gleave et al. (2020) showed that adversarial policies can defeat state-of-the-art RL agents by taking seemingly random actions that create unusual observations for the victim. The attack works not by being physically dominant but by pushing the victim out of distribution.

This connects back to robust MDPs: if the agent is robust to perturbations in its observation model, adversarial policies become less effective.

Common Confusions

Watch Out

Robust MDP is not the same as reward robustness

Robust MDPs typically perturb transition dynamics, not the reward function. Reward robustness is a separate concern (reward misspecification, reward hacking). The two problems have different mathematical structures and different solutions.

Watch Out

Rectangularity is an assumption, not a theorem

Rectangularity makes the math work. But in reality, model errors are often correlated across states (e.g., a systematic bias in the dynamics model). Using a rectangular uncertainty set may underestimate the true worst case because the adversary is artificially weakened by the independence assumption.

Exercises

ExerciseCore

Problem

Consider a robust MDP with two states, one action, and a total variation uncertainty set of radius $\alpha$ around a nominal transition distribution $\hat{P} = [0.8, 0.2]$. The values are $V^*(s_1) = 10$ and $V^*(s_2) = 2$. Compute the worst-case expected next-state value $\min_{P \in \mathcal{U}} \sum_{s'} P(s') V^*(s')$ for $\alpha = 0.1$.

ExerciseAdvanced

Problem

Explain why non-rectangular uncertainty sets make the robust Bellman equation invalid. Give a concrete example where an adversary benefits from correlating perturbations across two different states.

References

Canonical:

  • Iyengar, "Robust Dynamic Programming," Mathematics of Operations Research (2005)
  • Nilim & El Ghaoui, "Robust Control of Markov Decision Processes," Operations Research (2005)

Current:

  • Gleave et al., "Adversarial Policies: Attacking Deep Reinforcement Learning," ICLR (2020)
  • Panaganti & Kalathil, "Robust Reinforcement Learning using Offline Data," NeurIPS (2022)

Next Topics

  • Distributionally robust optimization: the supervised learning counterpart of robust MDPs

Last reviewed: April 2026
