RL Theory
Offline Reinforcement Learning
Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.
Why This Matters
Standard RL (online RL) learns by interacting with an environment: take an action, observe the result, update the policy. This is impossible or dangerous in many settings. You cannot crash real cars to train a self-driving policy. You cannot give patients random treatments to learn a medical protocol.
Offline RL (also called batch RL) learns from a fixed dataset of transitions collected by some behavior policy. The learner never interacts with the environment. This sounds like supervised learning, but the key difference is that the learned policy may visit states and take actions not in the dataset, creating a distributional shift that causes catastrophic overestimation of Q-values.
Formal Setup
Offline RL Setting
Given a fixed dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}$ collected by a behavior policy $\pi_\beta$, learn a policy $\pi$ that maximizes expected discounted return without any additional environment interaction.
The dataset distribution is $d^{\pi_\beta}(s, a)$, the state-action visitation frequency of $\pi_\beta$. The learned policy $\pi$ induces a different distribution $d^{\pi}(s, a)$. When $d^{\pi}$ puts mass on $(s, a)$ pairs not covered by $d^{\pi_\beta}$, we have distributional shift.
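Distributional shift can be made concrete by checking dataset support directly: count which state-action pairs the behavior policy actually visited, and flag anything outside that support. A minimal sketch (the states, action names, and threshold are illustrative, not from any benchmark):

```python
from collections import Counter

# Toy dataset of (state, action) pairs collected by a behavior policy.
dataset = [(0, "left"), (0, "left"), (1, "left"), (1, "right"), (2, "left")]

counts = Counter(dataset)

def is_out_of_distribution(state, action, min_count=1):
    """A pair is OOD if the behavior policy (almost) never took it."""
    return counts[(state, action)] < min_count

# A learned policy that wants "right" in state 0 is unsupported by data:
print(is_out_of_distribution(0, "right"))  # True: Q(0, "right") is unconstrained
print(is_out_of_distribution(0, "left"))   # False: covered by the dataset
```

Real algorithms never hard-threshold like this (states are continuous, counts are implicit in function approximation), but the question they must answer is exactly the one this check makes explicit.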
The Core Problem: Extrapolation Error
Extrapolation Error in Offline Q-Learning
Statement
When standard Q-learning is applied to a fixed dataset $\mathcal{D}$, the Bellman backup for a state-action pair $(s, a)$ requires evaluating $\max_{a'} Q(s', a')$ at the next state $s'$. If the maximizing action $a'$ is out-of-distribution (rarely or never taken by $\pi_\beta$ at state $s'$), the Q-value $Q(s', a')$ is unconstrained by data and can be arbitrarily overestimated. This overestimation propagates through Bellman backups, causing the learned Q-function to assign high values to state-action pairs that are poorly supported by data.
Intuition
Q-learning bootstraps: it uses its own Q-value estimates to generate training targets. In online RL, the agent visits the overestimated states and gets corrected by real rewards. In offline RL, there is no such correction. The $\max$ operator in the Bellman backup actively seeks out-of-distribution actions (where $Q$ is high due to extrapolation, not due to actual high reward), creating a self-reinforcing overestimation cycle.
Proof Sketch
Consider a tabular MDP where action $a'$ is never taken at state $s'$ in $\mathcal{D}$. The Q-value $Q(s', a')$ is initialized randomly and never updated by data (no transitions exist). If $Q(s', a') > Q(s', a)$ for every action $a$ that does appear in the data, then the backup target $r + \gamma \max_{a''} Q(s', a'')$ equals $r + \gamma Q(s', a')$, an arbitrary inflated value. This inflated target propagates backward to any $(s, a)$ that transitions to $s'$. With function approximation, the problem is worse: overestimation at one state can leak to nearby states through generalization.
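The tabular argument can be run end to end in a few lines. The sketch below (all constants illustrative) builds a two-state MDP where every reward is zero, so the true optimal Q-value is zero everywhere, yet naive offline Q-learning converges to a large positive value because the never-seen action's initialization leaks through the $\max$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular toy: 2 states, 2 actions. All rewards are 0, so true Q* = 0.
n_states, n_actions, gamma, lr = 2, 2, 0.99, 0.5

# Q(s, a=1) is never updated by data, so whatever value it starts with
# is treated as truth by the max in the backup.
Q = rng.normal(loc=0.0, scale=1.0, size=(n_states, n_actions))
Q[:, 1] = 2.0  # pretend the random init happened to be high for action 1

# Offline dataset: only action 0 ever taken, reward 0, looping s0 <-> s1.
dataset = [(0, 0, 0.0, 1), (1, 0, 0.0, 0)] * 50

for s, a, r, s_next in dataset:
    # Standard Q-learning backup: max over actions at the next state.
    target = r + gamma * Q[s_next].max()  # max picks never-seen action 1
    Q[s, a] += lr * (target - Q[s, a])

print(Q[0, 0])  # ~1.98: inflated toward gamma * 2.0, despite zero rewards
```

No amount of further training on this dataset corrects the estimate: the transitions needed to falsify $Q(\cdot, 1) = 2.0$ simply do not exist.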
Why It Matters
This is the central failure mode of offline RL. Every successful offline RL algorithm addresses this problem, either by constraining the policy to stay near $\pi_\beta$, penalizing out-of-distribution actions, or avoiding the Bellman backup entirely.
Failure Mode
The severity depends on dataset coverage. If $\pi_\beta$ is close to the optimal policy and covers most relevant state-action pairs, extrapolation error is small. If $\pi_\beta$ is a poor policy that barely explores, the learned Q-function is unreliable everywhere outside the data support.
Conservative Q-Learning (CQL)
CQL (Kumar et al., 2020) addresses extrapolation error by adding a regularizer that pushes down Q-values for out-of-distribution actions.
CQL Lower Bound on Q-Values
Statement
CQL minimizes the objective:

$$\min_Q \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\left[Q(s, a)\right] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[Q(s, a)\right] \right) + \frac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[\left(Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s, a)\right)^{2}\right]$$

where $\hat{\mathcal{B}}^{\pi}$ is the empirical Bellman operator and $\mu$ is the distribution used to query actions outside the data. Under tabular or linear function approximation, for sufficiently large $\alpha$ the resulting Q-function satisfies:

$$\hat{Q}^{\pi}(s, a) \le Q^{\pi}(s, a) \quad \text{for all } (s, a).$$

That is, CQL produces a lower bound on the true Q-values, ensuring that policy evaluation is pessimistic rather than optimistic.
Intuition
The first term is a soft-max (log-sum-exp) over actions: minimizing it pushes down Q-values for the highest-valued actions, which are exactly the ones the $\max$ operator would exploit. The second term enters with a negative sign, so minimizing the objective pushes up Q-values for actions seen in data. Their difference pushes down Q-values for actions not in the data. The Bellman loss term ensures consistency with observed transitions. The net effect: in-distribution actions have accurate Q-values, out-of-distribution actions have conservatively low Q-values.
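The push-down/push-up dynamics of the regularizer can be isolated from the rest of the algorithm. The sketch below (a tabular toy with assumed values, gradient descent on the penalty term alone, ignoring the Bellman loss) shows two actions starting with identical Q-values; only action 0 appears in the data, and the CQL-style penalty drives the other action's Q-value well below it:

```python
import numpy as np

# One state, two actions; action 0 is the one seen in the dataset.
n_actions, alpha, lr = 2, 1.0, 0.1
Q = np.array([1.0, 1.0])

for _ in range(200):
    # Gradient of alpha * (logsumexp(Q) - Q[data_action]) w.r.t. Q:
    # the logsumexp term contributes softmax(Q), the data term -1 at index 0.
    softmax = np.exp(Q) / np.exp(Q).sum()
    grad = alpha * softmax
    grad[0] -= alpha          # data action is pushed back up
    Q -= lr * grad            # descend on the CQL penalty alone

print(Q)  # Q[1] (out-of-distribution) driven well below Q[0] (in data)
```

Note the self-limiting behavior: as the gap grows, the softmax mass on the out-of-distribution action vanishes and the penalty gradient goes to zero, so in-data Q-values are not dragged down indefinitely.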
Proof Sketch
At the fixed point of CQL, the regularizer shifts each Q-value down by $\alpha \left[ \frac{\mu(a \mid s)}{\pi_\beta(a \mid s)} - 1 \right]$ relative to the ordinary Bellman fixed point, so any action that $\mu$ queries more often than $\pi_\beta$ took it is penalized. Combined with the Bellman consistency constraint on in-distribution data, this yields a pointwise lower bound on the true Q-function for sufficiently large $\alpha$.
Why It Matters
A pessimistic Q-function leads to a conservative policy: the agent only takes actions it is confident are good based on data. This avoids the catastrophic overestimation of naive offline Q-learning. The degree of pessimism is controlled by the regularization weight $\alpha$.
Failure Mode
If $\alpha$ is too large, the Q-function is excessively pessimistic and the learned policy refuses to deviate from $\pi_\beta$ at all, learning nothing beyond the behavior policy. If $\alpha$ is too small, extrapolation error returns. There is no principled way to set $\alpha$ without environment interaction to validate.
Decision Transformer
Decision Transformer (Chen et al., 2021) bypasses Q-learning entirely. It frames RL as sequence modeling: given a desired target return and the history of returns-to-go, states, and actions, predict the next action.
The model is a GPT-style transformer that takes as input the sequence $(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots)$ where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go (sum of future rewards from step $t$). At test time, condition on a high target return to generate high-reward behavior.
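Constructing the input sequence from a raw trajectory is mechanical. A small sketch (toy state/action labels are placeholders for the real embeddings):

```python
def returns_to_go(rewards, gamma=1.0):
    """R_t = sum over t' >= t of gamma^(t'-t) * r_t'.
    Decision Transformer uses the undiscounted version (gamma=1)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rewards = [1.0, 0.0, 2.0]
states, actions = ["s1", "s2", "s3"], ["a1", "a2", "a3"]
rtg = returns_to_go(rewards)

# Interleave into the DT token order: (R_1, s_1, a_1, R_2, s_2, a_2, ...)
sequence = [tok for triple in zip(rtg, states, actions) for tok in triple]
print(rtg)       # [3.0, 2.0, 2.0]
print(sequence)  # [3.0, 's1', 'a1', 2.0, 's2', 'a2', 2.0, 's3', 'a3']
```

At test time the first return-to-go token is not computed from data but set to the desired target return, and it is decremented by each observed reward as the episode unfolds.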
Decision Transformer avoids bootstrapping entirely: no Bellman backup, no Q-values, no $\max$ operator. This sidesteps extrapolation error but introduces a different limitation: it can only produce behaviors that exist in (or can be interpolated from) the dataset. Unlike Q-learning methods, it cannot "stitch" together suboptimal trajectories to find better-than-data policies.
Common Confusions
Offline RL is not imitation learning
Imitation learning clones the behavior policy: it maximizes the likelihood $\mathbb{E}_{(s, a) \sim \mathcal{D}}[\log \pi(a \mid s)]$ of the dataset actions. Offline RL seeks a policy better than $\pi_\beta$ by using the reward signal in the data. If $\pi_\beta$ is suboptimal, imitation learning inherits its mistakes. Offline RL methods like CQL can potentially improve upon $\pi_\beta$ by stitching together good sub-trajectories from different episodes.
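Stitching is easy to see in the tabular case. In the assumed toy chain MDP below, neither logged trajectory reaches the goal from the start state, but their transitions jointly cover a rewarding path; Bellman backups over the combined dataset recover it, while cloning either trajectory alone cannot:

```python
import numpy as np

# Trajectory A: s0 --a0--> s1 --a0--> s2   (all rewards 0)
# Trajectory B: s1 --a1--> s3              (reward 1, terminal)
# Transitions as (state, action, reward, next_state, done):
transitions = [
    (0, 0, 0.0, 1, False),
    (1, 0, 0.0, 2, False),
    (1, 1, 1.0, 3, True),
]
gamma = 0.9
Q = np.zeros((4, 2))

# Repeated exact Bellman backups over the pooled dataset.
for _ in range(50):
    for s, a, r, s_next, done in transitions:
        Q[s, a] = r + (0.0 if done else gamma * Q[s_next].max())

print(Q[0, 0])  # 0.9: take A's first step, then switch to B's action at s1
```

This toy is deliberately fully covered by the data, so no extrapolation error arises; in realistic settings stitching and out-of-distribution overestimation are two sides of the same bootstrapping mechanism.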
More data does not always help in offline RL
In supervised learning, more data is almost always better. In offline RL, data from a poor behavior policy can hurt: it fills the dataset with low-reward transitions and makes the out-of-distribution problem worse for high-reward actions. The quality and coverage of the behavior policy matter more than dataset size.
Canonical Examples
D4RL benchmark
The D4RL benchmark (Fu et al., 2020) provides offline datasets for MuJoCo locomotion tasks. The "medium" dataset contains transitions from a policy trained to 50% of expert performance. Naive DQN on this dataset achieves near-zero return due to extrapolation error. CQL achieves 50-80% of expert performance. BC (behavioral cloning) achieves roughly 50% (matching the data policy). The gap between BC and CQL demonstrates that offline RL can improve beyond the behavior policy.
Key Takeaways
- Offline RL learns from fixed data; no environment interaction
- The core challenge is distributional shift causing Q-value overestimation for out-of-distribution actions
- CQL adds a pessimistic regularizer that lower-bounds the true Q-function
- Decision Transformer frames RL as sequence modeling, avoiding Bellman backups
- Q-learning methods can stitch trajectories; sequence modeling methods cannot
- Dataset quality (coverage and optimality of $\pi_\beta$) is the critical factor
Exercises
Problem
Consider a 3-state MDP with 2 actions. The dataset contains only transitions where action $a_1$ is taken in all states. Explain why standard Q-learning will overestimate the value of action $a_2$ and how CQL prevents this.
Problem
Decision Transformer conditions on a target return to generate actions. Explain why conditioning on a very high target return (much higher than any trajectory in the dataset) does not reliably produce a high-return policy. What assumption about the dataset must hold for return conditioning to work?
References
Canonical:
- Levine et al., "Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems" (2020)
- Kumar et al., "Conservative Q-Learning for Offline Reinforcement Learning" (NeurIPS 2020)
Current:
- Chen et al., "Decision Transformer: Reinforcement Learning via Sequence Modeling" (NeurIPS 2021)
- Fu et al., "D4RL: Datasets for Deep Data-Driven Reinforcement Learning" (2020)
Next Topics
- Policy optimization (PPO/TRPO): online policy optimization that offline RL tries to replace
- Agentic RL and tool use: applying RL (including offline) to LLM agents
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Q-Learning (Layer 2)
- Value Iteration and Policy Iteration (Layer 2)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)