
RL Theory

Offline Reinforcement Learning

Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.



Why This Matters

Standard RL (online RL) learns by interacting with an environment: take an action, observe the result, update the policy. This is impossible or dangerous in many settings. You cannot crash real cars to train a self-driving policy. You cannot give patients random treatments to learn a medical protocol.

Offline RL (also called batch RL) learns from a fixed dataset of $(s, a, r, s')$ transitions collected by some behavior policy. The learner never interacts with the environment. This sounds like supervised learning, but the key difference is that the learned policy may visit states and take actions not in the dataset, creating a distributional shift that causes catastrophic overestimation of Q-values.

Formal Setup

Definition

Offline RL Setting

Given a fixed dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N$ collected by a behavior policy $\pi_\beta$, learn a policy $\pi$ that maximizes expected discounted return $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t\right]$ without any additional environment interaction.

The dataset distribution is $d^{\pi_\beta}(s, a)$, the state-action visitation frequency of $\pi_\beta$. The learned policy $\pi$ induces a different distribution $d^\pi(s, a)$. When $d^\pi$ puts mass on $(s, a)$ pairs not covered by $\mathcal{D}$, we have distributional shift.
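As a concrete toy illustration, distributional shift can be reduced to a support question: which $(s, a)$ pairs does the learned policy select that the behavior policy never visited? The dataset and policy below are hypothetical:

```python
# Hypothetical discrete example: check which (s, a) pairs a learned
# greedy policy wants that the behavior policy never visited.
dataset = [(0, 0), (0, 1), (1, 0), (2, 0)]   # (s, a) pairs appearing in D
support = set(dataset)

# Suppose the learned policy greedily picks these actions per state.
learned_policy = {0: 1, 1: 1, 2: 0}

# Out-of-distribution pairs: requested by the policy, absent from D.
ood = [(s, a) for s, a in learned_policy.items() if (s, a) not in support]
print(ood)  # [(1, 1)]
```

In realistic settings $d^\pi$ and $d^{\pi_\beta}$ are distributions over continuous spaces, but the underlying question, whether the policy's queries are supported by data, is the same.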

The Core Problem: Extrapolation Error

Proposition

Extrapolation Error in Offline Q-Learning

Statement

When standard Q-learning is applied to a fixed dataset $\mathcal{D}$, the Bellman backup for a state-action pair $(s, a)$ requires evaluating $\max_{a'} Q(s', a')$ at the next state $s'$. If the maximizing action $a'$ is out-of-distribution (rarely or never taken by $\pi_\beta$ at state $s'$), the Q-value $Q(s', a')$ is unconstrained by data and can be arbitrarily overestimated. This overestimation propagates through Bellman backups, causing the learned Q-function to assign high values to state-action pairs that are poorly supported by data.

Intuition

Q-learning bootstraps: it uses its own Q-value estimates to generate training targets. In online RL, the agent visits the overestimated states and gets corrected by real rewards. In offline RL, there is no such correction. The $\max$ operator in the Bellman backup actively seeks out-of-distribution actions (where Q is high due to extrapolation, not due to actual high reward), creating a self-reinforcing overestimation cycle.

Proof Sketch

Consider a tabular MDP where action $a^*$ is never taken at state $s'$ in $\mathcal{D}$. The Q-value $Q(s', a^*)$ is initialized randomly and never updated by data (no $(s', a^*, \cdot, \cdot)$ transitions exist). If $Q(s', a^*) > Q(s', a)$ for all $a$ in the data, then $\max_{a'} Q(s', a') = Q(s', a^*)$. This inflated target propagates backward to any $(s, a)$ that transitions to $s'$. With function approximation, the problem is worse: overestimation at one state can leak to nearby states through generalization.
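The tabular argument can be reproduced in a few lines. This is a minimal sketch using a hypothetical 2-state, 2-action MDP in which one action at state 1 never appears in the data:

```python
import numpy as np

# Hypothetical toy MDP: state 0 transitions to state 1; state 1 loops to
# itself. The dataset only ever contains action 0, so Q(s=1, a=1) is
# never updated by any transition.
np.random.seed(0)
gamma, lr = 0.9, 0.5

# Random initialization: Q(1, 1) starts at some arbitrary positive value
# and, with no data touching it, keeps that value forever.
Q = np.random.uniform(0.0, 10.0, size=(2, 2))
q_unseen_before = Q[1, 1]

# Dataset: only action 0, reward 1.0 at state 0 and 0.0 at state 1.
dataset = [(0, 0, 1.0, 1), (1, 0, 0.0, 1)]

for _ in range(200):
    for s, a, r, s_next in dataset:
        # The max over next actions happily selects the unseen action.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])

print(Q[1, 1] == q_unseen_before)  # True: unseen action never corrected
print(Q[0, 0] > 2.0)               # True: far above the data's true return
```

Under the data, the only achievable return from state 0 is 1.0 (reward 1, then 0 forever), yet $Q(0, 0)$ converges to roughly $1 + \gamma \, Q(1, 1)$: the random, never-corrected value of the unseen action has leaked backward through the $\max$.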

Why It Matters

This is the central failure mode of offline RL. Every successful offline RL algorithm addresses this problem, either by constraining the policy to stay near $\pi_\beta$, penalizing out-of-distribution actions, or avoiding the Bellman backup entirely.

Failure Mode

The severity depends on dataset coverage. If $\pi_\beta$ is close to the optimal policy and covers most relevant state-action pairs, extrapolation error is small. If $\pi_\beta$ is a poor policy that barely explores, the learned Q-function is unreliable everywhere outside the data support.

Conservative Q-Learning (CQL)

CQL (Kumar et al., 2020) addresses extrapolation error by adding a regularizer that pushes down Q-values for out-of-distribution actions.

Theorem

CQL Lower Bound on Q-Values

Statement

CQL minimizes the objective:

$$\min_Q \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D}} \left[ \log \sum_a \exp(Q(s, a)) \right] - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q(s, a)] \right) + \frac{1}{2} \, \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q(s,a) - \mathcal{B}^\pi \hat{Q}(s,a) \right)^2 \right]$$

where $\mathcal{B}^\pi$ is the Bellman operator. Under tabular or linear function approximation, the resulting Q-function satisfies:

$$\hat{Q}^{\text{CQL}}(s, a) \leq Q^\pi(s, a) \quad \text{for all } (s, a)$$

That is, CQL produces a lower bound on the true Q-values, ensuring that policy evaluation is pessimistic rather than optimistic.

Intuition

The first term, $\log \sum_a \exp Q(s,a)$, is a soft maximum over actions: minimizing it pushes down the Q-values of the highest-valued actions, exactly where the $\max$ operator would extrapolate. The second term, $-\mathbb{E}_{\mathcal{D}}[Q(s,a)]$, pushes Q-values for actions seen in the data back up. Their difference therefore pushes down Q-values for actions not in the data. The Bellman loss term ensures consistency with observed transitions. The net effect: in-distribution actions have accurate Q-values, out-of-distribution actions have conservatively low Q-values.
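The push-down effect of the regularizer can be seen in isolation with a toy numpy sketch: a single hypothetical state with four actions, Bellman term omitted, gradient steps taken on the $\alpha$ term alone:

```python
import numpy as np

# Hypothetical single-state setting with 4 actions. The behavior policy
# only ever takes actions 0 and 1; actions 2 and 3 are out-of-distribution.
# We run gradient descent on the CQL regularizer alone (no Bellman loss)
# to show which Q-values it pushes down.
np.random.seed(1)
q = np.random.randn(4)            # Q(s, a) for the single state
data_actions = np.array([0, 1])   # actions observed in the dataset
alpha, lr = 1.0, 0.1

for _ in range(500):
    # Gradient of logsumexp(q) w.r.t. q is softmax(q); gradient of the
    # data term is the empirical action distribution (uniform here).
    softmax = np.exp(q - q.max())
    softmax /= softmax.sum()
    data_dist = np.zeros(4)
    data_dist[data_actions] = 1.0 / len(data_actions)
    q -= lr * alpha * (softmax - data_dist)

# In-distribution Q-values end up above all out-of-distribution ones.
print(q[:2].min() > q[2:].max())  # True
```

At the fixed point the softmax over Q matches the data distribution, which can only happen if out-of-distribution actions have much lower Q-values, the pessimism the theorem formalizes.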

Proof Sketch

At the fixed point of CQL, the regularizer ensures that $\mathbb{E}_{\mu}[Q(s,a)] \leq \mathbb{E}_{\pi_\beta}[Q(s,a)]$ for any policy $\mu$. Combined with the Bellman consistency constraint on in-distribution data, this yields a pointwise lower bound on the true Q-function.

Why It Matters

A pessimistic Q-function leads to a conservative policy: the agent only takes actions it is confident are good based on data. This avoids the catastrophic overestimation of naive offline Q-learning. The degree of pessimism is controlled by $\alpha$.

Failure Mode

If $\alpha$ is too large, the Q-function is excessively pessimistic and the learned policy refuses to deviate from $\pi_\beta$ at all, learning nothing beyond the behavior policy. If $\alpha$ is too small, extrapolation error returns. Without environment interaction for validation, there is no principled way to set $\alpha$.

Decision Transformer

Decision Transformer (Chen et al., 2021) bypasses Q-learning entirely. It frames RL as sequence modeling: given a desired return $\hat{R}$, past states, actions, and rewards, predict the next action.

The model is a GPT-style transformer that takes as input the sequence $(R_1, s_1, a_1, R_2, s_2, a_2, \ldots)$, where $R_t$ is the return-to-go (the sum of future rewards from step $t$). At test time, condition on a high target return to generate high-reward behavior.
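The returns-to-go are just a reverse cumulative sum over the reward sequence. A small sketch (undiscounted, as in the original paper):

```python
import numpy as np

def returns_to_go(rewards):
    # R_t = sum of rewards from step t to the end of the trajectory.
    # Reverse, cumulative-sum, reverse again computes every R_t in one pass.
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(returns_to_go(rewards))  # [4. 3. 3. 1.]
```

These values are interleaved with states and actions to form the token sequence the transformer trains on.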

Decision Transformer avoids bootstrapping entirely: no Bellman backup, no Q-values, no $\max$ operator. This sidesteps extrapolation error but introduces a different limitation: it can only produce behaviors that exist in (or can be interpolated from) the dataset. Unlike Q-learning methods, it cannot "stitch" together suboptimal trajectories to find better-than-data policies.

Common Confusions

Watch Out

Offline RL is not imitation learning

Imitation learning clones the behavior policy: it maximizes $\mathbb{E}_{\mathcal{D}}[\log \pi(a \mid s)]$. Offline RL seeks a policy that is better than $\pi_\beta$ by using the reward signal in the data. If $\pi_\beta$ is suboptimal, imitation learning inherits its mistakes. Offline RL methods like CQL can potentially improve upon $\pi_\beta$ by stitching together good sub-trajectories from different episodes.

Watch Out

More data does not always help in offline RL

In supervised learning, more data is almost always better. In offline RL, data from a poor behavior policy can hurt: it fills the dataset with low-reward transitions and makes the out-of-distribution problem worse for high-reward actions. The quality and coverage of the behavior policy matter more than dataset size.

Canonical Examples

Example

D4RL benchmark

The D4RL benchmark (Fu et al., 2020) provides offline datasets for MuJoCo locomotion tasks. The "medium" dataset contains transitions from a policy trained to 50% of expert performance. Naive DQN on this dataset achieves near-zero return due to extrapolation error. CQL achieves 50-80% of expert performance. BC (behavioral cloning) achieves roughly 50% (matching the data policy). The gap between BC and CQL demonstrates that offline RL can improve beyond the behavior policy.

Key Takeaways

  • Offline RL learns from fixed data; no environment interaction
  • The core challenge is distributional shift causing Q-value overestimation for out-of-distribution actions
  • CQL adds a pessimistic regularizer that lower-bounds the true Q-function
  • Decision Transformer frames RL as sequence modeling, avoiding Bellman backups
  • Q-learning methods can stitch trajectories; sequence modeling methods cannot
  • Dataset quality (coverage and optimality of $\pi_\beta$) is the critical factor

Exercises

ExerciseCore

Problem

Consider a 3-state MDP with 2 actions. The dataset contains only transitions where action $a_1$ is taken in all states. Explain why standard Q-learning will overestimate the value of action $a_2$ and how CQL prevents this.

ExerciseAdvanced

Problem

Decision Transformer conditions on a target return $\hat{R}$ to generate actions. Explain why conditioning on a very high target return (much higher than any trajectory in the dataset) does not reliably produce a high-return policy. What assumption about the dataset must hold for return conditioning to work?

References

Canonical:

  • Levine et al., "Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems" (2020)
  • Kumar et al., "Conservative Q-Learning for Offline Reinforcement Learning" (NeurIPS 2020)

Current:

  • Chen et al., "Decision Transformer: Reinforcement Learning via Sequence Modeling" (NeurIPS 2021)
  • Fu et al., "D4RL: Datasets for Deep Data-Driven Reinforcement Learning" (2020)

Next Topics

  • Policy optimization (PPO/TRPO): online policy optimization that offline RL tries to replace
  • Agentic RL and tool use: applying RL (including offline) to LLM agents

Last reviewed: April 2026
