

Options and Temporal Abstraction

The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.


Why This Matters

Primitive actions in most MDPs operate at a single time scale: one step, one action. Complex tasks naturally decompose into subtasks that span many steps: "navigate to the door", "pick up the object", "stir the pot". The options framework (Sutton, Precup, Singh, 1999) formalizes temporally extended actions. An option is a policy that runs for multiple steps until a termination condition is met. Reasoning over options rather than primitive actions reduces the effective planning horizon and enables transfer of subtask solutions across different tasks.

Mental Model

Think of options as subroutines. A primitive action is a single instruction. An option is a function call: it takes over control, executes a sequence of primitive actions according to its internal policy, and returns control when its termination condition triggers. A policy over options is like a high-level program that calls subroutines. The agent plans at two levels: which option to invoke, and within each option, which primitive actions to take.

Formal Setup

Definition

Option

An option $\omega$ consists of three components:

  • Initiation set $\mathcal{I}_\omega \subseteq S$: the set of states where the option can be started
  • Internal policy $\pi_\omega(a \mid s)$: a policy over primitive actions that the option follows while active
  • Termination function $\beta_\omega(s) \in [0,1]$: the probability of terminating the option upon entering state $s$

A primitive action $a$ is a special case: $\mathcal{I} = S$, the policy always selects $a$, and $\beta(s) = 1$ for all $s$ (terminates after one step).
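The three components can be captured in a small sketch (the class and helper names are illustrative assumptions, not a standard API):

```python
from dataclasses import dataclass
from typing import Callable, Set

# Illustrative sketch of the option tuple (I, pi, beta).
@dataclass
class Option:
    initiation: Set[str]                 # I: states where the option may start
    policy: Callable[[str], str]         # pi_omega: state -> primitive action
    termination: Callable[[str], float]  # beta_omega: state -> P(terminate)

def primitive(a: str, states: Set[str]) -> Option:
    """A primitive action as an option: I = S, always pick a, beta = 1."""
    return Option(initiation=set(states),
                  policy=lambda s: a,
                  termination=lambda s: 1.0)
```

For example, `primitive("left", {"A", "B"})` builds the one-step option corresponding to the primitive action `left`.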

Definition

Semi-Markov Decision Process

A semi-MDP is an MDP where actions take variable amounts of time. When the agent selects option $\omega$ in state $s$, the option runs for $k$ steps (a random variable depending on $\pi_\omega$ and $\beta_\omega$), accumulates discounted reward $r = \sum_{t=0}^{k-1} \gamma^t r_t$, and transitions to a new state $s'$. The SMDP transition model is:

$$P(s', k \mid s, \omega) = \text{probability of reaching } s' \text{ in } k \text{ steps under option } \omega \text{ from } s$$

Planning over options reduces the original MDP to an SMDP over the option set.
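One way to see the SMDP model concretely is to estimate $r(s, \omega)$ and $P(s', k \mid s, \omega)$ by rolling the option out repeatedly. A hedged sketch, where `policy`, `termination`, and the environment's `step` function are assumed interfaces for illustration:

```python
import random
from collections import Counter

def estimate_option_model(s0, policy, termination, step, gamma=0.9,
                          n_rollouts=1000, rng=random.random):
    """Monte Carlo estimate of r(s0, w) and P(s', k | s0, w) for one option."""
    reward_sum, outcomes = 0.0, Counter()
    for _ in range(n_rollouts):
        s, k, discount, ret = s0, 0, 1.0, 0.0
        while True:
            s, r = step(s, policy(s))       # one primitive step under pi_omega
            ret += discount * r             # accumulate gamma^t * r_t
            discount *= gamma
            k += 1
            if rng() < termination(s):      # beta_omega(s'): stop here?
                break
        reward_sum += ret
        outcomes[(s, k)] += 1
    r_hat = reward_sum / n_rollouts
    p_hat = {sk: c / n_rollouts for sk, c in outcomes.items()}
    return r_hat, p_hat
```

With a deterministic option and environment, every rollout is identical and the estimates are exact; with stochasticity, they converge as `n_rollouts` grows.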

Definition

Policy Over Options

A policy over options $\pi_\Omega(\omega \mid s)$ selects which option to execute in each state where the previous option has terminated. The full hierarchical execution is: at state $s$, select option $\omega \sim \pi_\Omega(\cdot \mid s)$, execute $\pi_\omega$ until termination, then select a new option in the resulting state.
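The call-and-return execution described above can be sketched as follows (the environment's `step` function and the option interface are assumptions for illustration, not a standard API):

```python
import random

def execute_option(s, option, step, gamma=0.99, rng=random.random):
    """Run one option to termination; return (s', duration k, discounted reward)."""
    k, ret, discount = 0, 0.0, 1.0
    while True:
        s, r = step(s, option.policy(s))
        ret += discount * r
        discount *= gamma
        k += 1
        if rng() < option.termination(s):   # beta_omega(s'): stop here?
            return s, k, ret

def run_hierarchical(s0, pi_omega, step, n_decisions, gamma=0.99):
    """Policy over options: select an option, run it to termination, repeat."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(n_decisions):
        omega = pi_omega(s)                 # choose among options with s in I_omega
        s, k, r = execute_option(s, omega, step, gamma)
        total += discount * r
        discount *= gamma ** k              # discount by the option's duration
    return s, total
```

Note the outer loop only observes the state at option boundaries; the inner loop is where primitive actions happen.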

Core Theory

Theorem

Bellman Equation for Options

Statement

The option-value function satisfies the intra-option Bellman equation:

$$Q_\Omega(s, \omega) = \sum_a \pi_\omega(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s') \right]$$

where $r(s, a)$ and $P(s' \mid s, a)$ are the primitive reward and transition models, $V_\Omega(s) = \max_{\omega: s \in \mathcal{I}_\omega} Q_\Omega(s, \omega)$, and:

$$U(s') = (1 - \beta_\omega(s'))\, Q_\Omega(s', \omega) + \beta_\omega(s')\, V_\Omega(s')$$

At the SMDP level, this is equivalent to $V_\Omega(s) = \max_{\omega: s \in \mathcal{I}_\omega} \left[ r(s, \omega) + \sum_{s'} P(s' \mid s, \omega)\, V_\Omega(s') \right]$, where $r(s, \omega)$ is the expected discounted reward accumulated while executing option $\omega$ from state $s$ and $P(s' \mid s, \omega)$ is the discounted multi-step model of the state in which the option terminates.

The function $U(s')$ captures the continuation value: with probability $1 - \beta_\omega(s')$ the option continues (value $Q_\Omega(s', \omega)$), and with probability $\beta_\omega(s')$ it terminates and a new option is chosen (value $V_\Omega(s')$).

Intuition

This is the standard Bellman equation, but actions are replaced by options that run for variable durations. The key difference is the $U$ function, which accounts for the option possibly continuing or terminating at the next state. When all options are primitive actions ($\beta = 1$ everywhere), $U(s') = V_\Omega(s')$ and this reduces to the standard Bellman equation.

Proof Sketch

Condition on the first step of the option. The option takes action $a \sim \pi_\omega(\cdot \mid s)$, transitions to $s'$, and then either terminates (probability $\beta_\omega(s')$) or continues (probability $1 - \beta_\omega(s')$). If it terminates, the agent selects a new option optimally, giving value $V_\Omega(s')$. If it continues, the option keeps running from $s'$, giving value $Q_\Omega(s', \omega)$. Unrolling this recursion and taking the max over available options gives the stated equation.

Why It Matters

This equation enables value iteration and policy iteration over options, reducing the planning problem from reasoning over sequences of primitive actions to reasoning over temporally extended options. If good options are available, the effective planning depth is reduced from the task horizon to the number of option invocations.
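As a concrete instance, here is tabular value iteration over a given SMDP option model (a sketch under stated assumptions: `R[s][w]` is the expected discounted reward of running option `w` from `s`, and `P[s][w]` its discounted termination-state distribution, both assumed precomputed — since the discount $\gamma^k$ is folded into the model, no explicit $\gamma$ appears in the backup):

```python
def smdp_value_iteration(states, R, P, available, n_iters=200):
    """Bellman backups over options: V(s) = max_w [ r(s,w) + sum_s' P(s'|s,w) V(s') ]."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: max(R[s][w] + sum(p * V[s2] for s2, p in P[s][w].items())
                    for w in available[s])
             for s in states}
    return V
```

The structure is identical to ordinary value iteration; only the models are multi-step.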

Failure Mode

The equation assumes the options are fixed. If the options are poorly designed (e.g., an option that wanders randomly for many steps), planning over them can be worse than planning over primitive actions. The equation also does not address how to discover good options. It only tells you how to plan with given options.

Option-Critic Architecture

The Option-Critic (Bacon, Harb, Precup, 2017) learns options end-to-end using policy gradient methods. The architecture has three learned components:

  1. Policy over options $\pi_\Omega(\omega \mid s)$: selects which option to initiate
  2. Intra-option policies $\pi_\omega(a \mid s)$ for each option $\omega$: select primitive actions
  3. Termination functions $\beta_\omega(s)$ for each option $\omega$: decide when to stop

All three are parameterized by neural networks and trained simultaneously. The key insight is that gradients for the termination function can be derived from the advantage of continuing the current option vs. switching:

$$\Delta\theta \propto -\,\nabla_\theta \beta_\omega(s')\, A_\Omega(s', \omega), \qquad A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$$

If the current option has higher value than the average over options ($A_\Omega > 0$), the gradient pushes termination probability down (keep going). If it has lower value ($A_\Omega < 0$), it pushes termination probability up (stop and switch).

The degenerate option problem. Without regularization, Option-Critic tends to learn trivial solutions: one option that does everything, or options that terminate immediately (reducing to primitive actions). Adding a termination penalty (small cost for switching options) or a deliberation cost encourages temporally extended, distinct options.
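A minimal sketch of the termination update with a deliberation-cost-style regularizer, assuming a sigmoid termination function $\beta(s) = \sigma(\theta^\top \phi(s))$ over linear features (the function names, the feature vector `phi`, and the scalar cost `eta` are all illustrative assumptions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def termination_step(theta, phi, q_sw, v_s, lr=0.1, eta=0.01):
    """One update of beta's parameters: theta <- theta - lr * dbeta/dtheta * (A + eta)."""
    beta = sigmoid(sum(t * p for t, p in zip(theta, phi)))
    advantage = q_sw - v_s + eta               # A_Omega(s, w) plus deliberation cost
    scale = beta * (1.0 - beta) * advantage    # dbeta/dtheta_i = beta(1-beta) phi_i
    return [t - lr * scale * p for t, p in zip(theta, phi)]
```

When the current option's value exceeds the state value (positive advantage), the update lowers $\beta$ (keep going); the cost `eta` shifts the advantage upward, biasing every update toward continuing and discouraging the immediate-termination degeneracy.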

Why Temporal Abstraction Matters

Reduced planning depth. A task that requires 1000 primitive steps might require only 10 option invocations. Planning 10 steps ahead is tractable; planning 1000 is not.

Transfer and compositionality. A "navigate to room" option learned in one task can be reused in another task that also requires navigation. Options provide a natural unit of transfer.

Exploration. Temporally extended exploration (committing to a direction for many steps) connects to the exploration vs. exploitation trade-off and can be more effective than random single-step exploration for tasks with sparse rewards and bottleneck states.

Common Confusions

Watch Out

Options are not the same as macro-actions

Macro-actions are fixed sequences of primitive actions. Options are more general: they have stochastic internal policies that can react to the current state, and probabilistic termination conditions that adapt to the situation. An option for "navigate to the door" will take different actions depending on obstacles, while a macro-action would execute the same fixed sequence regardless.

Watch Out

Semi-MDPs do not require options

Any situation where actions take variable time can be modeled as a semi-MDP. Options are one way to construct the temporally extended actions, but semi-MDPs also arise naturally in queuing systems, inventory management, and any setting where events occur at irregular intervals.

Key Takeaways

  • An option is a temporally extended action: initiation set + internal policy + termination condition
  • Options reduce the original MDP to a semi-MDP over the option set
  • The Bellman equation for options generalizes standard value iteration to variable-duration actions
  • Option-Critic learns options end-to-end via policy gradients on the intra-option policy and termination function
  • Temporal abstraction reduces planning depth, enables transfer, and improves exploration
  • Without regularization, learned options tend to degenerate to trivial solutions

Exercises

ExerciseCore

Problem

Consider an MDP with states $\{A, B, C, D\}$ and an option $\omega$ with initiation set $\{A, B\}$, internal policy that deterministically goes $A \to B \to C$, and termination function $\beta(A) = 0$, $\beta(B) = 0$, $\beta(C) = 1$. If each transition gives reward 1 and $\gamma = 0.9$, what is the expected discounted reward of executing this option starting from state $A$?

ExerciseAdvanced

Problem

In the Option-Critic framework, the termination gradient is proportional to $A_\Omega(s, \omega) = Q_\Omega(s, \omega) - V_\Omega(s)$. Explain why a pure advantage-based termination gradient, without any regularization, leads to degenerate options. Describe two regularization approaches and their tradeoffs.

References

Canonical:

  • Sutton, Precup, Singh, Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Artificial Intelligence (1999)
  • Precup, Temporal Abstraction in Reinforcement Learning, PhD Thesis, University of Massachusetts (2000)

Current:

  • Bacon, Harb, Precup, The Option-Critic Architecture, AAAI (2017)
  • Riemer, Liu, Tesauro, Learning Abstract Options, NeurIPS (2018)

Next Topics

  • Feudal networks and goal-conditioned hierarchical RL
  • Skill discovery and unsupervised option learning

Last reviewed: April 2026
