RL Theory
Options and Temporal Abstraction
The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.
Why This Matters
Primitive actions in most MDPs operate at a single time scale: one step, one action. Complex tasks naturally decompose into subtasks that span many steps: "navigate to the door", "pick up the object", "stir the pot". The options framework (Sutton, Precup, Singh, 1999) formalizes temporally extended actions. An option is a policy that runs for multiple steps until a termination condition is met. Reasoning over options rather than primitive actions reduces the effective planning horizon and enables transfer of subtask solutions across different tasks.
Mental Model
Think of options as subroutines. A primitive action is a single instruction. An option is a function call: it takes over control, executes a sequence of primitive actions according to its internal policy, and returns control when its termination condition triggers. A policy over options is like a high-level program that calls subroutines. The agent plans at two levels: which option to invoke, and within each option, which primitive actions to take.
Formal Setup
Option
An option consists of three components:
- Initiation set $\mathcal{I} \subseteq \mathcal{S}$: the set of states where the option can be started
- Internal policy $\pi(a \mid s)$: a policy over primitive actions that the option follows while active
- Termination function $\beta(s) \in [0, 1]$: the probability of terminating the option upon entering state $s$
A primitive action $a$ is a special case: $\mathcal{I} = \mathcal{S}$, the policy always selects $a$, and $\beta(s) = 1$ for all $s$ (terminates after one step).
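As a concrete sketch (the class and field names below are illustrative, not notation from the original paper), an option can be represented as a small container bundling its three components, and a primitive action becomes a degenerate option:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    """A temporally extended action: initiation set, internal policy, termination."""
    initiation_set: Set[State]             # states where the option may be started
    policy: Callable[[State], Action]      # internal policy over primitive actions
    termination: Callable[[State], float]  # beta(s): probability of stopping in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

def primitive_option(a: Action, all_states: Set[State]) -> Option:
    """A primitive action as a special-case option: always available,
    always picks `a`, and terminates after one step (beta = 1 everywhere)."""
    return Option(
        initiation_set=set(all_states),
        policy=lambda s: a,
        termination=lambda s: 1.0,
    )
```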
Semi-Markov Decision Process
A semi-MDP is an MDP where actions take variable amounts of time. When the agent selects option $o$ in state $s$, the option runs for $k$ steps (a random variable depending on $s$ and $o$), accumulates discounted reward $r(s, o) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \right]$, and transitions to a new state $s'$. The SMDP transition model folds the duration and discount together: $p(s' \mid s, o) = \sum_{k=1}^{\infty} \gamma^{k} \Pr(s_{t+k} = s', k \mid s_t = s, o)$.
Planning over options reduces the original MDP to an SMDP over the option set.
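The SMDP quantities $r(s, o)$, $k$, and the terminal state can be estimated simply by rolling an option out. A minimal sketch, assuming the hypothetical `Option` container above and an `env_step(state, action) -> (next_state, reward)` environment function:

```python
import random

def execute_option(env_step, option, s, gamma):
    """Run an option from state s until its termination condition fires.

    Returns (s_final, discounted_return, k): exactly the quantities that the
    semi-MDP model p(s', k | s, o) and reward r(s, o) summarize.
    """
    assert option.can_start(s)
    discounted_return, discount, k = 0.0, 1.0, 0
    while True:
        a = option.policy(s)            # intra-option policy picks a primitive action
        s, r = env_step(s, a)           # one primitive environment step
        discounted_return += discount * r
        discount *= gamma
        k += 1
        if random.random() < option.termination(s):  # terminate with probability beta(s)
            return s, discounted_return, k
```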
Policy Over Options
A policy over options $\pi_\Omega(o \mid s)$ selects which option to execute in each state where the previous option has terminated. The full hierarchical execution is: at state $s$, select option $o \sim \pi_\Omega(\cdot \mid s)$, execute $\pi_o$ until termination, then select a new option in the resulting state.
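Hierarchical execution is then a loop that alternates between the high-level choice and option execution. A minimal sketch, reusing the `execute_option` helper above (`policy_over_options` is a hypothetical function mapping a state to an option):

```python
def run_hierarchical(env_step, policy_over_options, s0, gamma, num_decisions=10):
    """Execute a policy over options: pick an option, run it to termination, repeat."""
    s, total_return, discount = s0, 0.0, 1.0
    for _ in range(num_decisions):
        o = policy_over_options(s)                       # high-level decision
        s, g, k = execute_option(env_step, o, s, gamma)  # low-level rollout (one SMDP step)
        total_return += discount * g
        discount *= gamma ** k                           # discount by the option's duration
    return total_return
```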
Core Theory
Bellman Equation for Options
Statement
The value function over options $V_\Omega$ satisfies the Bellman equation:

$$V_\Omega(s) = \max_{o \in \mathcal{O}(s)} \Big[ r(s, o) + \sum_{s'} p(s' \mid s, o)\, V_\Omega(s') \Big]$$

where $r(s, o)$ is the expected discounted reward accumulated while executing option $o$ from state $s$, $p(s' \mid s, o)$ is the discounted multi-step transition probability, and conditioning on a single primitive step of the option gives the equivalent one-step form for the option-value function $Q_\Omega$:

$$Q_\Omega(s, o) = \sum_a \pi_o(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s', o) \Big], \qquad U(s', o) = \big(1 - \beta_o(s')\big)\, Q_\Omega(s', o) + \beta_o(s')\, V_\Omega(s')$$

The function $U$ captures the continuation value: with probability $1 - \beta_o(s')$ the option continues (value $Q_\Omega(s', o)$), and with probability $\beta_o(s')$ it terminates and a new option is chosen (value $V_\Omega(s')$).
Intuition
This is the standard Bellman equation, but actions are replaced by options that run for variable durations. The key difference is the $U$ function, which accounts for the option possibly continuing or terminating at the next state. When all options are primitive actions ($\beta_o(s') = 1$ everywhere), $U(s', o) = V_\Omega(s')$ and this reduces to the standard Bellman equation.
Proof Sketch
Condition on the first step of the option. The option takes action $a \sim \pi_o(\cdot \mid s)$, transitions to $s'$, and then either terminates (probability $\beta_o(s')$) or continues (probability $1 - \beta_o(s')$). If it terminates, the agent selects a new option optimally, giving value $V_\Omega(s') = \max_{o'} Q_\Omega(s', o')$. If it continues, the option keeps running from $s'$, giving value $Q_\Omega(s', o)$. Unrolling this recursion and taking the max over available options gives the stated equation.
Why It Matters
This equation enables value iteration and policy iteration over options, reducing the planning problem from reasoning over sequences of primitive actions to reasoning over temporally extended options. If good options are available, the effective planning depth is reduced from the task horizon to the number of option invocations.
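To make the planning claim concrete, here is a minimal value-iteration sketch over a tabular SMDP. It assumes the option models `r[s, o]` (expected discounted reward) and `p[s, o, s_next]` (multi-step transition probabilities with the $\gamma^k$ discount already folded in, as in the transition model above) are given:

```python
import numpy as np

def smdp_value_iteration(r, p, num_iters=100):
    """Value iteration with the Bellman equation over options.

    r: array of shape (S, O), expected discounted reward r(s, o)
    p: array of shape (S, O, S), discounted multi-step transition model p(s' | s, o)
    """
    S, O = r.shape
    V = np.zeros(S)
    for _ in range(num_iters):
        V = (r + p @ V).max(axis=1)   # V(s) = max_o [ r(s,o) + sum_s' p(s'|s,o) V(s') ]
    Q = r + p @ V                     # option values under the final estimate
    return V, Q
```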
Failure Mode
The equation assumes the options are fixed. If the options are poorly designed (e.g., an option that wanders randomly for many steps), planning over them can be worse than planning over primitive actions. The equation also does not address how to discover good options. It only tells you how to plan with given options.
Option-Critic Architecture
The Option-Critic (Bacon, Harb, Precup, 2017) learns options end-to-end using policy gradient methods. The architecture has three learned components:
- Policy over options $\pi_\Omega(o \mid s)$: selects which option to initiate
- Intra-option policies $\pi_o(a \mid s)$, one for each option $o$: select primitive actions
- Termination functions $\beta_o(s)$, one for each option $o$: decide when to stop
All three are parameterized by neural networks and trained simultaneously. The key insight is that gradients for the termination function can be derived from the advantage of continuing the current option vs. switching:

$$\nabla_\vartheta J \propto -\,\mathbb{E}\Big[ \nabla_\vartheta \beta_{o,\vartheta}(s')\, A_\Omega(s', o) \Big], \qquad A_\Omega(s', o) = Q_\Omega(s', o) - V_\Omega(s')$$

If the current option has higher value than the average over options ($A_\Omega(s', o) > 0$), the gradient pushes termination probability down (keep going). If it has lower value ($A_\Omega(s', o) < 0$), push termination probability up (stop and switch).
The degenerate option problem. Without regularization, Option-Critic tends to learn trivial solutions: one option that does everything, or options that terminate immediately (reducing to primitive actions). Adding a termination penalty (small cost for switching options) or a deliberation cost encourages temporally extended, distinct options.
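A minimal numerical sketch of the termination update (plain numpy, sigmoid-parameterized termination; the variable names and the deliberation-cost constant `eta` are illustrative, not the paper's notation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def termination_gradient(logit, q_omega, v_omega, eta=0.01):
    """Gradient-ascent direction for the termination logit of the current option.

    The termination gradient is proportional to -d(beta)/d(theta) * A(s', o),
    with A(s', o) = Q(s', o) - V(s'). Adding a deliberation cost eta shifts the
    advantage upward, so the option only terminates when switching is worth at
    least eta; this discourages degenerate, instantly-terminating options.
    """
    beta = sigmoid(logit)                  # beta_o(s'): probability of terminating
    advantage = q_omega - v_omega + eta    # A(s', o) plus deliberation cost
    dbeta_dlogit = beta * (1.0 - beta)     # derivative of the sigmoid
    return -dbeta_dlogit * advantage       # ascent direction on the expected return

# A > 0 (option better than average): gradient lowers beta, the option keeps going.
# A < 0 (option worse than average): gradient raises beta, the option terminates.
grad = termination_gradient(logit=0.0, q_omega=1.2, v_omega=1.0)
```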
Why Temporal Abstraction Matters
Reduced planning depth. A task that requires 1000 primitive steps might require only 10 option invocations. Planning 10 steps ahead is tractable; planning 1000 is not.
Transfer and compositionality. A "navigate to room" option learned in one task can be reused in another task that also requires navigation. Options provide a natural unit of transfer.
Exploration. Temporally extended exploration (committing to a direction for many steps) connects to the exploration vs. exploitation trade-off and can be more effective than random single-step exploration for tasks with sparse rewards and bottleneck states.
Common Confusions
Options are not the same as macro-actions
Macro-actions are fixed sequences of primitive actions. Options are more general: they have stochastic internal policies that can react to the current state, and probabilistic termination conditions that adapt to the situation. An option for "navigate to the door" will take different actions depending on obstacles, while a macro-action would execute the same fixed sequence regardless.
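A toy contrast (the names and state fields here are made up for illustration): a macro-action replays a fixed action list, while an option's internal policy reacts to the state it observes:

```python
# Macro-action: a fixed, open-loop sequence of primitive actions.
macro_go_to_door = ["forward", "forward", "left", "forward"]

# Option: closed-loop internal policy plus a termination condition.
def go_to_door_policy(state):
    if state["obstacle_ahead"]:
        return "left"          # route around an obstacle the macro-action would hit
    return "forward"

def go_to_door_termination(state):
    # beta(s): terminate with certainty once the door is reached.
    return 1.0 if state["at_door"] else 0.0
```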
Semi-MDPs do not require options
Any situation where actions take variable time can be modeled as a semi-MDP. Options are one way to construct the temporally extended actions, but semi-MDPs also arise naturally in queuing systems, inventory management, and any setting where events occur at irregular intervals.
Key Takeaways
- An option is a temporally extended action: initiation set + internal policy + termination condition
- Options reduce the original MDP to a semi-MDP over the option set
- The Bellman equation for options generalizes standard value iteration to variable-duration actions
- Option-Critic learns options end-to-end via policy gradients on the intra-option policy and termination function
- Temporal abstraction reduces planning depth, enables transfer, and improves exploration
- Without regularization, learned options tend to degenerate to trivial solutions
Exercises
Problem
Consider an MDP with states $s_1, s_2, s_3$ and an option with initiation set $\{s_1\}$, internal policy that deterministically goes $s_1 \to s_2 \to s_3$, and termination function $\beta(s_1) = 0$, $\beta(s_2) = 0$, $\beta(s_3) = 1$. If each transition gives reward 1 and the discount factor is $\gamma$, what is the expected discounted reward $r(s_1, o)$ of executing this option starting from state $s_1$?
Problem
In the Option-Critic framework, the termination gradient is proportional to $-\nabla_\vartheta \beta_{o,\vartheta}(s')\, A_\Omega(s', o)$. Explain why a pure advantage-based termination gradient, without any regularization, leads to degenerate options. Describe two regularization approaches and their tradeoffs.
References
Canonical:
- Sutton, Precup, Singh, Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Artificial Intelligence (1999)
- Precup, Temporal Abstraction in Reinforcement Learning, PhD Thesis, University of Massachusetts (2000)
Current:
- Bacon, Harb, Precup, The Option-Critic Architecture, AAAI (2017)
- Riemer, Liu, Tesauro, Learning Abstract Options, NeurIPS (2018)
Next Topics
- Feudal networks and goal-conditioned hierarchical RL
- Skill discovery and unsupervised option learning
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Value Iteration and Policy Iteration (Layer 2)