

Active SLAM and POMDPs

Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states.


Why This Matters

Standard SLAM treats the robot as a passive observer: it receives sensor data from whatever trajectory it happens to follow and builds the best map it can. Active SLAM asks a different question: what actions should the robot take to build the best possible map? This transforms SLAM from an estimation problem into a decision problem. The natural framework is the POMDP, where the robot plans over belief states (probability distributions over its unknown pose and map).

Mental Model

The robot does not know its exact position or the full map. It maintains a belief: a probability distribution over all possible (pose, map) pairs. Each action the robot takes produces a new observation, which updates the belief. Some actions are more informative than others. Moving toward an unexplored corridor reduces map uncertainty. Revisiting a known landmark reduces pose uncertainty. The robot must balance exploring new regions (reducing map uncertainty) against exploiting known landmarks (reducing localization uncertainty). This is the exploration-exploitation tradeoff applied to spatial reasoning.

Formal Setup

Definition

POMDP

A partially observable Markov decision process consists of: a state space $S$, an action space $A$, a transition model $T(s' \mid s, a)$, a reward function $R(s, a)$, an observation space $\Omega$, an observation model $O(o \mid s', a)$, and a discount factor $\gamma \in [0, 1)$. The agent does not observe the state $s$ directly; it receives observations $o \in \Omega$ generated stochastically from the true state.

In Active SLAM, the state $s = (x, m)$ consists of the robot pose $x$ and the map $m$. The transition model describes how the robot moves (with noise). The observation model describes sensor measurements given the pose and map. The reward function encodes the objective: reduce uncertainty, maximize coverage, or reach a goal.

Definition

Belief State

The belief state $b$ is a probability distribution over states: $b(s) = P(s \mid a_1, o_1, \ldots, a_t, o_t)$. It is a sufficient statistic for the history of actions and observations. For Active SLAM with Gaussian assumptions, the belief is parameterized by a mean vector $\mu$ and a covariance matrix $\Sigma$ over the joint (pose, map) state.

The belief update follows Bayes' rule. After taking action $a$ and receiving observation $o$:

$$b'(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{P(o \mid b, a)}$$

This is the standard POMDP belief update, which for Gaussian beliefs reduces to the extended Kalman filter prediction and update steps.
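For intuition, the discrete version of this update is a few lines of code. The following is a minimal sketch (assuming NumPy; the function name `belief_update` and the tensor layout are choices made for this illustration, not part of any standard API):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes belief update for a discrete POMDP.

    b: belief over states, shape (n_states,)
    a: action index
    o: observation index
    T: transitions, T[a, s, s'] = P(s' | s, a)
    O: observations, O[a, s', o] = P(o | s', a)
    """
    predicted = T[a].T @ b            # prediction: sum_s T(s'|s,a) b(s)
    unnorm = O[a][:, o] * predicted   # weight by O(o | s', a)
    return unnorm / unnorm.sum()      # normalize by P(o | b, a)

# Example: uniform prior, a "stay" action, and an informative sensor.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # one action: stay put
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])   # sensor favors o=0 in state 0
b = belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, O=O)
# b is now concentrated on state 0: [0.4/0.55, 0.15/0.55] ≈ [0.727, 0.273]
```

The same multiply-and-normalize structure underlies the EKF update; the Gaussian case just carries it out in closed form on $(\mu, \Sigma)$.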

Core Theory

Theorem

Bellman Equation in Belief Space

Statement

The optimal value function over belief states satisfies:

$$V^*(b) = \max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in \Omega} P(o \mid b, a)\, V^*(\tau(b, a, o)) \right]$$

where $R(b, a) = \sum_s b(s)\, R(s, a)$ is the expected immediate reward, $P(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_s T(s' \mid s, a)\, b(s)$ is the observation probability, and $\tau(b, a, o)$ is the updated belief after action $a$ and observation $o$.

Intuition

A POMDP over states is equivalent to a (fully observable) MDP over beliefs. The belief state captures everything the agent knows. The Bellman equation in belief space has the same structure as the standard Bellman equation, but the "state" is now a probability distribution. The expectation over observations reflects that the agent does not know which observation it will receive.

Proof Sketch

Since the belief bb is a sufficient statistic for the history, the optimal policy depends on the history only through bb. The POMDP therefore reduces to a belief-MDP with transition kernel τ\tau and reward R(b,a)R(b,a). Apply the standard Bellman optimality theorem to this belief-MDP.

Why It Matters

This result converts Active SLAM from a problem with hidden state into a planning problem in belief space. In principle, any MDP planning algorithm (value iteration, policy gradient) can be applied. In practice, the belief space is infinite-dimensional, so approximations are necessary.
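To make the Bellman equation concrete, it can be evaluated by brute-force finite-horizon lookahead on a tiny discrete POMDP. A minimal sketch follows (assuming NumPy; the function name and tensor layout are choices for this illustration, and the recursion branches over every action and observation, so it is exponential in the horizon, which is precisely why it does not scale):

```python
import numpy as np

def belief_value(b, depth, T, O, R, gamma=0.95):
    """Finite-horizon optimal value of a belief by exhaustive lookahead.

    b: belief, shape (n_states,)
    T: T[a, s, s'] = P(s' | s, a)
    O: O[a, s', o] = P(o | s', a)
    R: R[s, a] = immediate reward
    """
    if depth == 0:
        return 0.0
    n_actions, n_obs = T.shape[0], O.shape[2]
    best = -np.inf
    for a in range(n_actions):
        expected_r = b @ R[:, a]                  # R(b, a)
        future = 0.0
        for o in range(n_obs):
            predicted = T[a].T @ b                # sum_s T(s'|s,a) b(s)
            unnorm = O[a][:, o] * predicted
            p_o = unnorm.sum()                    # P(o | b, a)
            if p_o > 1e-12:
                b_next = unnorm / p_o             # tau(b, a, o)
                future += p_o * belief_value(b_next, depth - 1, T, O, R, gamma)
        best = max(best, expected_r + gamma * future)
    return best
```

Each line maps onto a term of the Bellman equation: the expected reward $R(b,a)$, the observation probability $P(o \mid b, a)$, and the recursive value of the updated belief $\tau(b, a, o)$.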

Failure Mode

The belief space is continuous and high-dimensional (for SLAM, the belief includes uncertainty over the entire map). Exact solutions are intractable except for very small problems. Practical Active SLAM systems use heuristics such as planning one step ahead with information-theoretic objectives, or sampling a finite set of candidate actions.

Information-Theoretic Objectives

Since exact POMDP solutions are intractable for SLAM, practitioners often use greedy or short-horizon approximations with information-theoretic reward functions.

Definition

Shannon Entropy of Belief

The entropy of a belief $b$ is $H(b) = -\sum_s b(s) \log b(s)$ (discrete) or $H(b) = -\int b(s) \log b(s)\, ds$ (continuous). For a Gaussian belief with covariance $\Sigma$, the differential entropy is $H(b) = \frac{1}{2} \log \det(2\pi e\, \Sigma)$. Reducing entropy corresponds to reducing uncertainty about the state.
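As a quick numerical check of the discrete formula, here is a minimal sketch (assuming NumPy; the helper name `belief_entropy` is illustrative):

```python
import numpy as np

def belief_entropy(b):
    """Shannon entropy of a discrete belief, in nats.

    Zero-probability states contribute nothing (0 log 0 := 0).
    """
    b = np.asarray(b, dtype=float)
    nz = b[b > 0]
    return -np.sum(nz * np.log(nz))

# A uniform two-state belief has entropy log 2; a certain belief has entropy 0.
h_uniform = belief_entropy([0.5, 0.5])   # log(2) ≈ 0.693 nats
h_certain = belief_entropy([1.0, 0.0])   # 0.0 nats
```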

A common Active SLAM objective is to choose the action $a^*$ that maximizes expected information gain:

$$a^* = \arg\max_a \; \mathbb{E}_{o \mid b, a}\left[H(b) - H(\tau(b, a, o))\right]$$

This is the mutual information between the future observation and the state, conditioned on the current belief and action. For Gaussian beliefs, the expected information gain depends on how much the covariance $\Sigma$ shrinks after incorporating the expected observation.
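Under the Gaussian assumption with a linear observation model $z = Hs + \eta$, $\eta \sim \mathcal{N}(0, R)$, this shrinkage has a closed form via the Kalman covariance update. A minimal sketch (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma): (1/2) log det(2*pi*e*Sigma)."""
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * Sigma)
    return 0.5 * logdet

def expected_info_gain(Sigma, H, R):
    """Entropy reduction from one linear-Gaussian measurement z = H s + eta."""
    S = H @ Sigma @ H.T + R                  # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)       # Kalman gain
    Sigma_post = Sigma - K @ H @ Sigma       # posterior covariance
    return gaussian_entropy(Sigma) - gaussian_entropy(Sigma_post)

# Measuring the first coordinate of a unit-covariance state with unit noise:
gain = expected_info_gain(np.eye(2), np.array([[1.0, 0.0]]), np.array([[1.0]]))
# gain = (1/2) log 2 ≈ 0.347 nats
```

Note that the result depends only on $\Sigma$, $H$, and $R$, never on the mean $\mu$: for linear-Gaussian observations, how much you will learn is known before you observe.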

Exploration vs. Exploitation

Active SLAM exhibits a specific form of the exploration-exploitation tradeoff. Exploring unknown regions adds new landmarks to the map but increases pose uncertainty (the robot moves away from known landmarks). Exploiting known landmarks improves localization but does not expand the map. Good Active SLAM policies interleave: explore to extend the map, then revisit known areas to tighten the pose estimate before exploring again.

This mirrors the exploration-exploitation tradeoff in reinforcement learning, but the "reward" is information rather than task performance. In multi-objective Active SLAM, the robot also has a task goal (reach a destination, inspect an area), and must balance task progress against information gathering.

Common Confusions

Watch Out

POMDPs are not just MDPs with noise

Adding observation noise to an MDP does not make it a POMDP in any useful sense. The key feature of a POMDP is that the agent must maintain a belief state and plan over it. An agent that ignores its uncertainty and treats the most likely state as the true state (certainty-equivalent control) can perform arbitrarily badly in POMDPs. The belief state is not optional.
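The classic tiger problem of Kaelbling, Littman, and Cassandra makes this failure concrete. A minimal numerical sketch with the standard parameter values (assumed here):

```python
import numpy as np

# Tiger problem: the tiger is behind the left or right door with equal probability.
# Opening the tiger's door costs -100, the other door pays +10; listening costs -1
# and reports the correct side with probability 0.85.
b = np.array([0.5, 0.5])                   # P(tiger-left), P(tiger-right)
reward_open = np.array([[-100.0, 10.0],    # open-left, indexed by true state
                        [10.0, -100.0]])   # open-right

# Certainty-equivalent agent: treat the MAP state as true and open the other door.
map_state = int(np.argmax(b))              # tie broken arbitrarily
ce_value = b @ reward_open[1 - map_state]  # expected reward of opening now: -45

# Belief-aware agent: listen once, update the belief, then open the safer door.
p_hear_correct = 0.85                      # posterior on the heard side, from a uniform prior
open_after = p_hear_correct * 10.0 + (1 - p_hear_correct) * (-100.0)   # -6.5
listen_value = -1.0 + open_after           # -7.5, far better than -45
```

The certainty-equivalent agent ignores that its belief is uniform and acts as if it knew the state; the belief-aware agent pays a small cost to gather information first.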

Watch Out

Information gain is not the same as uncertainty reduction in the map alone

Active SLAM must reduce uncertainty in both the pose and the map jointly. A policy that minimizes map entropy while ignoring pose uncertainty will produce overconfident maps with incorrect geometry. The joint covariance over (pose, map) is what matters.

Key Takeaways

  • Active SLAM selects actions to maximize mapping and localization quality
  • The POMDP formulation captures partial observability: the robot does not know its exact pose or the full map
  • Belief states (distributions over possible states) are the natural planning space
  • The Bellman equation applies in belief space but is intractable for realistic SLAM problems
  • Greedy information-theoretic objectives (maximize expected information gain) are the practical workhorse
  • The exploration-exploitation tradeoff in Active SLAM mirrors the one in RL but concerns spatial information

Exercises

ExerciseCore

Problem

Consider a robot on a 1D line with two possible positions: $x = 0$ or $x = 1$. The belief is $b = (p, 1-p)$ where $p = P(x = 0)$. The robot can take action "stay" (no movement) or "move" (deterministically moves to the other position). There is a landmark at $x = 0$ that produces observation $o = 1$ with probability 0.9 if the robot is at $x = 0$, and with probability 0.1 if it is at $x = 1$. Starting from $b = (0.5, 0.5)$, compute the expected entropy of the belief after taking action "stay" and receiving an observation.

ExerciseAdvanced

Problem

In a Gaussian Active SLAM system, the joint state is $s = (x, m) \in \mathbb{R}^d$ with belief $\mathcal{N}(\mu, \Sigma)$. A candidate action produces a linear observation $z = Hs + \eta$ where $\eta \sim \mathcal{N}(0, R)$. Show that the expected information gain is $\frac{1}{2} \log \det(\Sigma) - \frac{1}{2} \log \det\!\left(\Sigma - \Sigma H^\top (H \Sigma H^\top + R)^{-1} H \Sigma\right)$ and explain why this depends only on $\Sigma$, $H$, and $R$, not on $\mu$.

References

Canonical:

  • Kaelbling, Littman, Cassandra, Planning and Acting in Partially Observable Stochastic Domains, Artificial Intelligence (1998)
  • Stachniss, Grisetti, Burgard, Information Gain-based Exploration Using Rao-Blackwellized Particle Filters, RSS (2005)

Current:

  • Placed et al., A Survey on Active Simultaneous Localization and Mapping, Robotics and Autonomous Systems (2023)
  • Lluvia, Lazkano, Ansuategui, Active Mapping and Robot Exploration: A Survey, Sensors (2021)

Next Topics

  • Further research in this area connects to multi-robot exploration and deep RL for navigation

Last reviewed: April 2026
