
RL Theory

Offline Reinforcement Learning

Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.



Why This Matters

Standard RL (online RL) learns by interacting with an environment: take an action, observe the result, update the policy. This is impossible or dangerous in many settings. You cannot crash real cars to train a self-driving policy. You cannot give patients random treatments to learn a medical protocol.

Offline RL (also called batch RL) learns from a fixed dataset of $(s, a, r, s')$ transitions collected by some behavior policy. The learner never interacts with the environment. This sounds like supervised learning, but the key difference is that the learned policy may visit states and take actions not in the dataset, creating a distributional shift that causes catastrophic overestimation of Q-values.

Formal Setup

Definition

Offline RL Setting

Given a fixed dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N$ collected by a behavior policy $\pi_\beta$, learn a policy $\pi$ that maximizes expected discounted return $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t\right]$ without any additional environment interaction.

The dataset distribution is $d^{\pi_\beta}(s, a)$, the state-action visitation frequency of $\pi_\beta$. The learned policy $\pi$ induces a different distribution $d^\pi(s, a)$. When $d^\pi$ puts mass on $(s, a)$ pairs not covered by $\mathcal{D}$, we have distributional shift.
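As a concrete toy illustration, distributional shift can be reduced to a support question: which $(s, a)$ pairs does the learned policy select that the behavior policy never visited? The dataset and policy below are hypothetical:

```python
# Hypothetical discrete example: check which (s, a) pairs a learned
# greedy policy wants that the behavior policy never visited.
dataset = [(0, 0), (0, 1), (1, 0), (2, 0)]   # (s, a) pairs appearing in D
support = set(dataset)

# Suppose the learned policy greedily picks these actions per state.
learned_policy = {0: 1, 1: 1, 2: 0}

# Out-of-distribution pairs: requested by the policy, absent from D.
ood = [(s, a) for s, a in learned_policy.items() if (s, a) not in support]
print(ood)  # [(1, 1)]
```

In realistic settings $d^\pi$ and $d^{\pi_\beta}$ are distributions over continuous spaces, but the underlying question, whether the policy's queries are supported by data, is the same.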

The Core Problem: Extrapolation Error

Proposition

Extrapolation Error in Offline Q-Learning

Statement

When standard Q-learning is applied to a fixed dataset $\mathcal{D}$, the Bellman backup for a state-action pair $(s, a)$ requires evaluating $\max_{a'} Q(s', a')$ at the next state $s'$. If the maximizing action $a'$ is out-of-distribution (rarely or never taken by $\pi_\beta$ at state $s'$), the Q-value $Q(s', a')$ is unconstrained by data and can be arbitrarily overestimated. This overestimation propagates through Bellman backups, causing the learned Q-function to assign high values to state-action pairs that are poorly supported by data.

Intuition

Q-learning bootstraps: it uses its own Q-value estimates to generate training targets. In online RL, the agent visits the overestimated states and gets corrected by real rewards. In offline RL, there is no such correction. The $\max$ operator in the Bellman backup actively seeks out-of-distribution actions (where Q is high due to extrapolation, not due to actual high reward), creating a self-reinforcing overestimation cycle.

Proof Sketch

Consider a tabular MDP where action $a^*$ is never taken at state $s'$ in $\mathcal{D}$. The Q-value $Q(s', a^*)$ is initialized randomly and never updated by data (no $(s', a^*, \cdot, \cdot)$ transitions exist). If $Q(s', a^*) > Q(s', a)$ for all $a$ in the data, then $\max_{a'} Q(s', a') = Q(s', a^*)$. This inflated target propagates backward to any $(s, a)$ that transitions to $s'$. With function approximation, the problem is worse: overestimation at one state can leak to nearby states through generalization.
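The tabular argument can be reproduced in a few lines. This is a minimal sketch using a hypothetical 2-state, 2-action MDP in which one action at state 1 never appears in the data:

```python
import numpy as np

# Hypothetical toy MDP: state 0 transitions to state 1; state 1 loops to
# itself. The dataset only ever contains action 0, so Q(s=1, a=1) is
# never updated by any transition.
np.random.seed(0)
gamma, lr = 0.9, 0.5

# Random initialization: Q(1, 1) starts at some arbitrary positive value
# and, with no data touching it, keeps that value forever.
Q = np.random.uniform(0.0, 10.0, size=(2, 2))
q_unseen_before = Q[1, 1]

# Dataset: only action 0, reward 1.0 at state 0 and 0.0 at state 1.
dataset = [(0, 0, 1.0, 1), (1, 0, 0.0, 1)]

for _ in range(200):
    for s, a, r, s_next in dataset:
        # The max over next actions happily selects the unseen action.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])

print(Q[1, 1] == q_unseen_before)  # True: unseen action never corrected
print(Q[0, 0] > 2.0)               # True: far above the data's true return
```

Under the data, the only achievable return from state 0 is 1.0 (reward 1, then 0 forever), yet $Q(0, 0)$ converges to roughly $1 + \gamma \, Q(1, 1)$: the random, never-corrected value of the unseen action has leaked backward through the $\max$.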

Why It Matters

This is the central failure mode of offline RL. Every successful offline RL algorithm addresses this problem, either by constraining the policy to stay near $\pi_\beta$, penalizing out-of-distribution actions, or avoiding the Bellman backup entirely.

Failure Mode

The severity depends on dataset coverage. If $\pi_\beta$ is close to the optimal policy and covers most relevant state-action pairs, extrapolation error is small. If $\pi_\beta$ is a poor policy that barely explores, the learned Q-function is unreliable everywhere outside the data support.

Conservative Q-Learning (CQL)

CQL (Kumar et al., 2020) addresses extrapolation error by adding a regularizer that pushes down Q-values for out-of-distribution actions.

Theorem

CQL Lower Bound on Q-Values

Statement

CQL minimizes the objective:

$$\min_Q \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D}} \left[ \log \sum_a \exp(Q(s, a)) \right] - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q(s, a)] \right) + \frac{1}{2} \, \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q(s,a) - \mathcal{B}^\pi \hat{Q}(s,a) \right)^2 \right]$$

where $\mathcal{B}^\pi$ is the Bellman operator. Under tabular or linear function approximation, the resulting Q-function satisfies:

$$\hat{Q}^{\text{CQL}}(s, a) \leq Q^\pi(s, a) \quad \text{for all } (s, a)$$

That is, CQL produces a lower bound on the true Q-values, ensuring that policy evaluation is pessimistic rather than optimistic.

Intuition

The first term, $\log \sum_a \exp Q(s,a)$, is a soft maximum over actions: minimizing it pushes down the Q-values of the highest-valued actions, exactly where the $\max$ operator would extrapolate. The second term, $-\mathbb{E}_{\mathcal{D}}[Q(s,a)]$, pushes Q-values for actions seen in the data back up. Their difference therefore pushes down Q-values for actions not in the data. The Bellman loss term ensures consistency with observed transitions. The net effect: in-distribution actions have accurate Q-values, out-of-distribution actions have conservatively low Q-values.
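The push-down effect of the regularizer can be seen in isolation with a toy numpy sketch: a single hypothetical state with four actions, Bellman term omitted, gradient steps taken on the $\alpha$ term alone:

```python
import numpy as np

# Hypothetical single-state setting with 4 actions. The behavior policy
# only ever takes actions 0 and 1; actions 2 and 3 are out-of-distribution.
# We run gradient descent on the CQL regularizer alone (no Bellman loss)
# to show which Q-values it pushes down.
np.random.seed(1)
q = np.random.randn(4)            # Q(s, a) for the single state
data_actions = np.array([0, 1])   # actions observed in the dataset
alpha, lr = 1.0, 0.1

for _ in range(500):
    # Gradient of logsumexp(q) w.r.t. q is softmax(q); gradient of the
    # data term is the empirical action distribution (uniform here).
    softmax = np.exp(q - q.max())
    softmax /= softmax.sum()
    data_dist = np.zeros(4)
    data_dist[data_actions] = 1.0 / len(data_actions)
    q -= lr * alpha * (softmax - data_dist)

# In-distribution Q-values end up above all out-of-distribution ones.
print(q[:2].min() > q[2:].max())  # True
```

At the fixed point the softmax over Q matches the data distribution, which can only happen if out-of-distribution actions have much lower Q-values, the pessimism the theorem formalizes.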

Proof Sketch

At the fixed point of CQL, the regularizer ensures that $\mathbb{E}_{\mu}[Q(s,a)] \leq \mathbb{E}_{\pi_\beta}[Q(s,a)]$ for any policy $\mu$. Combined with the Bellman consistency constraint on in-distribution data, this yields a pointwise lower bound on the true Q-function.

Why It Matters

A pessimistic Q-function leads to a conservative policy: the agent only takes actions it is confident are good based on data. This avoids the catastrophic overestimation of naive offline Q-learning. The degree of pessimism is controlled by $\alpha$.

Failure Mode

If $\alpha$ is too large, the Q-function is excessively pessimistic and the learned policy refuses to deviate from $\pi_\beta$ at all, learning nothing beyond the behavior policy. If $\alpha$ is too small, extrapolation error returns. Without environment interaction for validation, there is no principled way to set $\alpha$.

Decision Transformer

Decision Transformer (Chen et al., 2021) bypasses Q-learning entirely. It frames RL as sequence modeling: given a desired return $\hat{R}$, past states, actions, and rewards, predict the next action.

The model is a GPT-style transformer that takes as input the sequence $(R_1, s_1, a_1, R_2, s_2, a_2, \ldots)$, where $R_t$ is the return-to-go (the sum of future rewards from step $t$). At test time, condition on a high target return to generate high-reward behavior.
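The returns-to-go are just a reverse cumulative sum over the reward sequence. A small sketch (undiscounted, as in the original paper):

```python
import numpy as np

def returns_to_go(rewards):
    # R_t = sum of rewards from step t to the end of the trajectory.
    # Reverse, cumulative-sum, reverse again computes every R_t in one pass.
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(returns_to_go(rewards))  # [4. 3. 3. 1.]
```

These values are interleaved with states and actions to form the token sequence the transformer trains on.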

Decision Transformer avoids bootstrapping entirely: no Bellman backup, no Q-values, no $\max$ operator. This sidesteps extrapolation error but introduces a different limitation: it can only produce behaviors that exist in (or can be interpolated from) the dataset. Unlike Q-learning methods, it cannot "stitch" together suboptimal trajectories to find better-than-data policies.

Common Confusions

Watch Out

Offline RL is not imitation learning

Imitation learning clones the behavior policy: it maximizes $\mathbb{E}_{\mathcal{D}}[\log \pi(a \mid s)]$. Offline RL seeks a policy that is better than $\pi_\beta$ by using the reward signal in the data. If $\pi_\beta$ is suboptimal, imitation learning inherits its mistakes. Offline RL methods like CQL can potentially improve upon $\pi_\beta$ by stitching together good sub-trajectories from different episodes.

Watch Out

More data does not always help in offline RL

In supervised learning, more data is almost always better. In offline RL, data from a poor behavior policy can hurt: it fills the dataset with low-reward transitions and makes the out-of-distribution problem worse for high-reward actions. The quality and coverage of the behavior policy matter more than dataset size.

Canonical Examples

Example

D4RL benchmark

The D4RL benchmark (Fu et al., 2020) provides offline datasets for MuJoCo locomotion tasks. The "medium" dataset contains transitions from a policy trained to 50% of expert performance. Naive DQN on this dataset achieves near-zero return due to extrapolation error. CQL achieves 50-80% of expert performance. BC (behavioral cloning) achieves roughly 50% (matching the data policy). The gap between BC and CQL demonstrates that offline RL can improve beyond the behavior policy.

Key Takeaways

  • Offline RL learns from fixed data; no environment interaction
  • The core challenge is distributional shift causing Q-value overestimation for out-of-distribution actions
  • CQL adds a pessimistic regularizer that lower-bounds the true Q-function
  • Decision Transformer frames RL as sequence modeling, avoiding Bellman backups
  • Q-learning methods can stitch trajectories; sequence modeling methods cannot
  • Dataset quality (coverage and optimality of $\pi_\beta$) is the critical factor

Exercises

ExerciseCore

Problem

Consider a 3-state MDP with 2 actions. The dataset contains only transitions where action $a_1$ is taken in all states. Explain why standard Q-learning will overestimate the value of action $a_2$ and how CQL prevents this.

ExerciseAdvanced

Problem

Decision Transformer conditions on a target return $\hat{R}$ to generate actions. Explain why conditioning on a very high target return (much higher than any trajectory in the dataset) does not reliably produce a high-return policy. What assumption about the dataset must hold for return conditioning to work?

References

Canonical:

  • Levine et al., "Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems" (2020)
  • Kumar et al., "Conservative Q-Learning for Offline Reinforcement Learning" (NeurIPS 2020)

Current:

  • Chen et al., "Decision Transformer: Reinforcement Learning via Sequence Modeling" (NeurIPS 2021)
  • Fu et al., "D4RL: Datasets for Deep Data-Driven Reinforcement Learning" (2020)

Next Topics

  • Policy optimization (PPO/TRPO): online policy optimization that offline RL tries to replace
  • Agentic RL and tool use: applying RL (including offline) to LLM agents

Last reviewed: April 2026
