

Active SLAM and POMDPs

Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states.


Why This Matters

Standard SLAM treats the robot as a passive observer: it receives sensor data from whatever trajectory it happens to follow and builds the best map it can. Active SLAM asks a different question: what actions should the robot take to build the best possible map? This transforms SLAM from an estimation problem into a decision problem. The natural framework is the POMDP, where the robot plans over belief states (probability distributions over its unknown pose and map).

Mental Model

The robot does not know its exact position or the full map. It maintains a belief: a probability distribution over all possible (pose, map) pairs. Each action the robot takes produces a new observation, which updates the belief. Some actions are more informative than others. Moving toward an unexplored corridor reduces map uncertainty. Revisiting a known landmark reduces pose uncertainty. The robot must balance exploring new regions (reducing map uncertainty) against exploiting known landmarks (reducing localization uncertainty). This is the exploration-exploitation tradeoff applied to spatial reasoning.

Formal Setup

Definition

POMDP

A partially observable Markov decision process consists of: a state space $S$, an action space $A$, a transition model $T(s' \mid s, a)$, a reward function $R(s, a)$, an observation space $\Omega$, an observation model $O(o \mid s', a)$, and a discount factor $\gamma \in [0, 1)$. The agent does not observe the state $s$ directly; it receives observations $o \in \Omega$ generated stochastically from the true state.

In Active SLAM, the state $s = (x, m)$ consists of the robot pose $x$ and the map $m$. The transition model describes how the robot moves (with noise). The observation model describes sensor measurements given the pose and map. The reward function encodes the objective: reduce uncertainty, maximize coverage, or reach a goal.

Definition

Belief State

The belief state $b$ is a probability distribution over states: $b(s) = P(s \mid a_1, o_1, \ldots, a_t, o_t)$. It is a sufficient statistic for the history of actions and observations. For Active SLAM with Gaussian assumptions, the belief is parameterized by a mean vector $\mu$ and a covariance matrix $\Sigma$ over the joint (pose, map) state.

The belief update follows Bayes' rule. After taking action $a$ and receiving observation $o$:

$$b'(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{P(o \mid b, a)}$$

This is the standard POMDP belief update, which for Gaussian beliefs reduces to the extended Kalman filter prediction and update steps.
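For intuition, the discrete version of this update is a few lines of code. The following is a minimal sketch (assuming NumPy; the function name `belief_update` and the tensor layout are choices made for this illustration, not part of any standard API):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes belief update for a discrete POMDP.

    b: belief over states, shape (n_states,)
    a: action index
    o: observation index
    T: transitions, T[a, s, s'] = P(s' | s, a)
    O: observations, O[a, s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # prediction: sum_s T(s'|s,a) b(s)
    unnorm = O[a][:, o] * predicted   # weight by O(o | s', a)
    return unnorm / unnorm.sum()      # normalize by P(o | b, a)

# Example: uniform prior, a "stay" action, and an informative sensor.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # one action: stay put
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])   # sensor favors o=0 in state 0
b = belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, O=O)
# b is now concentrated on state 0: [0.4/0.55, 0.15/0.55] ≈ [0.727, 0.273]
```

The same multiply-and-normalize structure underlies the EKF update; the Gaussian case just carries it out in closed form on $(\mu, \Sigma)$.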

Core Theory

Theorem

Bellman Equation in Belief Space

Statement

The optimal value function over belief states satisfies:

$$V^*(b) = \max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in \Omega} P(o \mid b, a)\, V^*(\tau(b, a, o)) \right]$$

where $R(b, a) = \sum_s b(s)\, R(s, a)$ is the expected immediate reward, $P(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_s T(s' \mid s, a)\, b(s)$ is the observation probability, and $\tau(b, a, o)$ is the updated belief after action $a$ and observation $o$.

Intuition

A POMDP over states is equivalent to a (fully observable) MDP over beliefs. The belief state captures everything the agent knows. The Bellman equation in belief space has the same structure as the standard Bellman equation, but the "state" is now a probability distribution. The expectation over observations reflects that the agent does not know which observation it will receive.

Proof Sketch

Since the belief bb is a sufficient statistic for the history, the optimal policy depends on the history only through bb. The POMDP therefore reduces to a belief-MDP with transition kernel τ\tau and reward R(b,a)R(b,a). Apply the standard Bellman optimality theorem to this belief-MDP.

Why It Matters

This result converts Active SLAM from a problem with hidden state into a planning problem in belief space. In principle, any MDP planning algorithm (value iteration, policy gradient) can be applied. In practice, the belief space is infinite-dimensional, so approximations are necessary.
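To make the Bellman equation concrete, it can be evaluated by brute-force finite-horizon lookahead on a tiny discrete POMDP. A minimal sketch follows (assuming NumPy; the function name and tensor layout are choices for this illustration, and the recursion branches over every action and observation, so it is exponential in the horizon, which is precisely why it does not scale):

```python
import numpy as np

def belief_value(b, depth, T, O, R, gamma=0.95):
    """Finite-horizon optimal value of a belief by exhaustive lookahead.

    b: belief, shape (n_states,)
    T: T[a, s, s'] = P(s' | s, a)
    O: O[a, s', o] = P(o | s', a)
    R: R[s, a] = immediate reward
    """
    if depth == 0:
        return 0.0
    n_actions, n_obs = T.shape[0], O.shape[2]
    best = -np.inf
    for a in range(n_actions):
        expected_r = b @ R[:, a]                  # R(b, a)
        future = 0.0
        for o in range(n_obs):
            predicted = T[a].T @ b                # sum_s T(s'|s,a) b(s)
            unnorm = O[a][:, o] * predicted
            p_o = unnorm.sum()                    # P(o | b, a)
            if p_o > 1e-12:
                b_next = unnorm / p_o             # tau(b, a, o)
                future += p_o * belief_value(b_next, depth - 1, T, O, R, gamma)
        best = max(best, expected_r + gamma * future)
    return best
```

Each line maps onto a term of the Bellman equation: the expected reward $R(b,a)$, the observation probability $P(o \mid b, a)$, and the recursive value of the updated belief $\tau(b, a, o)$.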

Failure Mode

The belief space is continuous and high-dimensional (for SLAM, the belief includes uncertainty over the entire map). Exact solutions are intractable except for very small problems. Practical Active SLAM systems use heuristics such as planning one step ahead with information-theoretic objectives, or sampling a finite set of candidate actions.

Information-Theoretic Objectives

Since exact POMDP solutions are intractable for SLAM, practitioners often use greedy or short-horizon approximations with information-theoretic reward functions.

Definition

Shannon Entropy of Belief

The entropy of a belief $b$ is $H(b) = -\sum_s b(s) \log b(s)$ (discrete) or $H(b) = -\int b(s) \log b(s)\, ds$ (continuous). For a Gaussian belief with covariance $\Sigma$, the differential entropy is $H(b) = \frac{1}{2} \log \det(2\pi e\, \Sigma)$. Reducing entropy corresponds to reducing uncertainty about the state.
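As a quick numerical check of the discrete formula, here is a minimal sketch (assuming NumPy; the helper name `belief_entropy` is illustrative):

```python
import numpy as np

def belief_entropy(b):
    """Shannon entropy of a discrete belief, in nats.

    Zero-probability states contribute nothing (0 log 0 := 0).
    """
    b = np.asarray(b, dtype=float)
    nz = b[b > 0]
    return -np.sum(nz * np.log(nz))

# A uniform two-state belief has entropy log 2; a certain belief has entropy 0.
h_uniform = belief_entropy([0.5, 0.5])   # log(2) ≈ 0.693 nats
h_certain = belief_entropy([1.0, 0.0])   # 0.0 nats
```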

A common Active SLAM objective is to choose the action $a^*$ that maximizes expected information gain:

$$a^* = \arg\max_a \; \mathbb{E}_{o \mid b, a}\left[H(b) - H(\tau(b, a, o))\right]$$

This is the mutual information between the future observation and the state, conditioned on the current belief and action. For Gaussian beliefs, the expected information gain depends on how much the covariance $\Sigma$ shrinks after incorporating the expected observation.
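Under the Gaussian assumption with a linear observation model $z = Hs + \eta$, $\eta \sim \mathcal{N}(0, R)$, this shrinkage has a closed form via the Kalman covariance update. A minimal sketch (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma): (1/2) log det(2*pi*e*Sigma)."""
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * Sigma)
    return 0.5 * logdet

def expected_info_gain(Sigma, H, R):
    """Entropy reduction from one linear-Gaussian measurement z = H s + eta."""
    S = H @ Sigma @ H.T + R                  # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)       # Kalman gain
    Sigma_post = Sigma - K @ H @ Sigma       # posterior covariance
    return gaussian_entropy(Sigma) - gaussian_entropy(Sigma_post)

# Measuring the first coordinate of a unit-covariance state with unit noise:
gain = expected_info_gain(np.eye(2), np.array([[1.0, 0.0]]), np.array([[1.0]]))
# gain = (1/2) log 2 ≈ 0.347 nats
```

Note that the result depends only on $\Sigma$, $H$, and $R$, never on the mean $\mu$: for linear-Gaussian observations, how much you will learn is known before you observe.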

Exploration vs. Exploitation

Active SLAM exhibits a specific form of the exploration-exploitation tradeoff. Exploring unknown regions adds new landmarks to the map but increases pose uncertainty (the robot moves away from known landmarks). Exploiting known landmarks improves localization but does not expand the map. Good Active SLAM policies interleave: explore to extend the map, then revisit known areas to tighten the pose estimate before exploring again.

This mirrors the exploration-exploitation tradeoff in reinforcement learning, but the "reward" is information rather than task performance. In multi-objective Active SLAM, the robot also has a task goal (reach a destination, inspect an area), and must balance task progress against information gathering.

Common Confusions

Watch Out

POMDPs are not just MDPs with noise

Adding observation noise to an MDP does not make it a POMDP in any useful sense. The key feature of a POMDP is that the agent must maintain a belief state and plan over it. An agent that ignores its uncertainty and treats the most likely state as the true state (certainty-equivalent control) can perform arbitrarily badly in POMDPs. The belief state is not optional.
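The classic tiger problem of Kaelbling, Littman, and Cassandra makes this failure concrete. A minimal numerical sketch with the standard parameter values (assumed here):

```python
import numpy as np

# Tiger problem: the tiger is behind the left or right door with equal probability.
# Opening the tiger's door costs -100, the other door pays +10; listening costs -1
# and reports the correct side with probability 0.85.
b = np.array([0.5, 0.5])                   # P(tiger-left), P(tiger-right)
reward_open = np.array([[-100.0, 10.0],    # open-left, indexed by true state
                        [10.0, -100.0]])   # open-right

# Certainty-equivalent agent: treat the MAP state as true and open the other door.
map_state = int(np.argmax(b))              # tie broken arbitrarily
ce_value = b @ reward_open[1 - map_state]  # expected reward of opening now: -45

# Belief-aware agent: listen once, update the belief, then open the safer door.
p_hear_correct = 0.85                      # posterior on the heard side, from a uniform prior
open_after = p_hear_correct * 10.0 + (1 - p_hear_correct) * (-100.0)   # -6.5
listen_value = -1.0 + open_after           # -7.5, far better than -45
```

The certainty-equivalent agent ignores that its belief is uniform and acts as if it knew the state; the belief-aware agent pays a small cost to gather information first.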

Watch Out

Information gain is not the same as uncertainty reduction in the map alone

Active SLAM must reduce uncertainty in both the pose and the map jointly. A policy that minimizes map entropy while ignoring pose uncertainty will produce overconfident maps with incorrect geometry. The joint covariance over (pose, map) is what matters.

Key Takeaways

  • Active SLAM selects actions to maximize mapping and localization quality
  • The POMDP formulation captures partial observability: the robot does not know its exact pose or the full map
  • Belief states (distributions over possible states) are the natural planning space
  • The Bellman equation applies in belief space but is intractable for realistic SLAM problems
  • Greedy information-theoretic objectives (maximize expected information gain) are the practical workhorse
  • The exploration-exploitation tradeoff in Active SLAM mirrors the one in RL but concerns spatial information

Exercises

ExerciseCore

Problem

Consider a robot on a 1D line with two possible positions: $x = 0$ or $x = 1$. The belief is $b = (p, 1-p)$ where $p = P(x = 0)$. The robot can take action "stay" (no movement) or "move" (deterministically moves to the other position). There is a landmark at $x = 0$ that produces observation $o = 1$ with probability 0.9 if the robot is at $x = 0$, and with probability 0.1 if it is at $x = 1$. Starting from $b = (0.5, 0.5)$, compute the expected entropy of the belief after taking action "stay" and receiving an observation.

ExerciseAdvanced

Problem

In a Gaussian Active SLAM system, the joint state is $s = (x, m) \in \mathbb{R}^d$ with belief $\mathcal{N}(\mu, \Sigma)$. A candidate action produces a linear observation $z = Hs + \eta$ where $\eta \sim \mathcal{N}(0, R)$. Show that the expected information gain is $\frac{1}{2} \log \det(\Sigma) - \frac{1}{2} \log \det\!\left(\Sigma - \Sigma H^\top (H \Sigma H^\top + R)^{-1} H \Sigma\right)$ and explain why this depends only on $\Sigma$, $H$, and $R$, not on $\mu$.

References

Canonical:

  • Kaelbling, Littman, Cassandra, Planning and Acting in Partially Observable Stochastic Domains, Artificial Intelligence (1998)
  • Stachniss, Grisetti, Burgard, Information Gain-based Exploration Using Rao-Blackwellized Particle Filters, RSS (2005)

Current:

  • Placed et al., A Survey on Active Simultaneous Localization and Mapping, Robotics and Autonomous Systems (2023)
  • Lluvia, Lazkano, Ansuategui, Active Mapping and Robot Exploration: A Survey, Sensors (2021)

Next Topics

  • Further research in this area connects to multi-robot exploration and deep RL for navigation

Last reviewed: April 2026
