Active SLAM and POMDPs
Choosing robot actions to simultaneously map an environment and localize within it, formulated as a partially observable Markov decision process over belief states.
Why This Matters
Standard SLAM treats the robot as a passive observer: it receives sensor data from whatever trajectory it happens to follow and builds the best map it can. Active SLAM asks a different question: what actions should the robot take to build the best possible map? This transforms SLAM from an estimation problem into a decision problem. The natural framework is the POMDP, where the robot plans over belief states (probability distributions over its unknown pose and map).
Mental Model
The robot does not know its exact position or the full map. It maintains a belief: a probability distribution over all possible (pose, map) pairs. Each action the robot takes produces a new observation, which updates the belief. Some actions are more informative than others. Moving toward an unexplored corridor reduces map uncertainty. Revisiting a known landmark reduces pose uncertainty. The robot must balance exploring new regions (reducing map uncertainty) against exploiting known landmarks (reducing localization uncertainty). This is the exploration-exploitation tradeoff applied to spatial reasoning.
Formal Setup
POMDP
A partially observable Markov decision process consists of: state space $\mathcal{S}$, action space $\mathcal{A}$, transition model $T(s' \mid s, a)$, reward function $R(s, a)$, observation space $\Omega$, observation model $O(o \mid s', a)$, and discount factor $\gamma \in [0, 1)$. The agent does not observe the state directly; it receives observations generated stochastically from the true state.
In Active SLAM, the state is $s = (x, m)$, where $x$ is the robot pose and $m$ is the map. The transition model $T(x' \mid x, a)$ describes how the robot moves (with noise). The observation model $O(z \mid x, m)$ describes sensor measurements given the pose and map. The reward function encodes the objective: reduce uncertainty, maximize coverage, or reach a goal.
Belief State
The belief state is a probability distribution over states: $b_t(s) = p(s \mid o_{1:t}, a_{1:t})$. It is a sufficient statistic for the history of actions and observations. For Active SLAM with Gaussian assumptions, the belief is parameterized by a mean vector $\mu$ and covariance matrix $\Sigma$ over the joint (pose, map) state.
The belief update follows Bayes' rule. After taking action $a$ and receiving observation $o$:

$$b'(s') = \eta \, O(o \mid s', a) \sum_{s} T(s' \mid s, a) \, b(s),$$

where $\eta$ is a normalizing constant.
This is the standard POMDP belief update, which for Gaussian beliefs reduces to the extended Kalman filter prediction and update steps.
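For a small discrete state space, the prediction and correction steps above can be sketched directly (a minimal sketch; the two-state model and its 90%-accurate sensor are illustrative numbers, not from the text):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One POMDP belief update (Bayes filter).

    b : (n,) prior belief over states
    T : (n_actions, n, n) transition model, T[a, s, s'] = P(s' | s, a)
    O : (n_actions, n, n_obs) observation model, O[a, s', o] = P(o | s', a)
    """
    predicted = b @ T[a]                      # prediction: sum_s T(s'|s,a) b(s)
    unnormalized = O[a][:, o] * predicted     # correction: weight by P(o | s', a)
    return unnormalized / unnormalized.sum()  # normalize (the eta factor)

# Two states, one action ("stay"), two observations.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])     # deterministic stay
O = np.array([[[0.9, 0.1], [0.1, 0.9]]])     # sensor is 90% accurate
b = np.array([0.5, 0.5])
b_new = belief_update(b, a=0, o=0, T=T, O=O)
print(b_new)                                  # → [0.9 0.1]
```

A single accurate observation sharpens the uniform prior to a 90/10 posterior; with Gaussian beliefs the same two steps become the EKF prediction and update.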
Core Theory
Bellman Equation in Belief Space
Statement
The optimal value function over belief states satisfies:

$$V^*(b) = \max_{a} \Big[ r(b, a) + \gamma \sum_{o} P(o \mid b, a) \, V^*(b^{a,o}) \Big],$$

where $r(b, a) = \sum_s b(s) R(s, a)$ is the expected immediate reward, $P(o \mid b, a)$ is the observation probability, and $b^{a,o}$ is the updated belief after action $a$ and observation $o$.
Intuition
A POMDP over states is equivalent to a (fully observable) MDP over beliefs. The belief state captures everything the agent knows. The Bellman equation in belief space has the same structure as the standard Bellman equation, but the "state" is now a probability distribution. The expectation over observations reflects that the agent does not know which observation it will receive.
Proof Sketch
Since the belief is a sufficient statistic for the history, the optimal policy depends on the history only through $b$. The POMDP therefore reduces to a belief-MDP with transition kernel $\tau(b' \mid b, a) = \sum_{o} P(o \mid b, a) \, \mathbb{1}[b' = b^{a,o}]$ and reward $r(b, a)$. Apply the standard Bellman optimality theorem to this belief-MDP.
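For a two-state POMDP the belief is a single number $P(s = 0)$, so the belief-space Bellman backup can be approximated on a grid (a sketch under illustrative assumptions: one "stay" action, a 90%-accurate sensor, reward 1 in state 0, and nearest-neighbor lookup standing in for value-function approximation):

```python
import numpy as np

def bellman_backup(b, V, T, O, R, gamma, grid):
    """One Bellman backup at belief b; V is tabulated on a grid over P(s=0)."""
    best = -np.inf
    for a in range(T.shape[0]):
        total = b @ R[:, a]                    # expected immediate reward r(b, a)
        pred = b @ T[a]                        # predicted belief after action a
        for o in range(O.shape[2]):
            p_o = pred @ O[a][:, o]            # P(o | b, a)
            if p_o < 1e-12:
                continue
            b_next = O[a][:, o] * pred / p_o   # updated belief b^{a,o}
            idx = np.argmin(np.abs(grid - b_next[0]))
            total += gamma * p_o * V[idx]      # expectation over observations
        best = max(best, total)
    return best

grid = np.linspace(0.0, 1.0, 101)              # discretized belief space
V = np.zeros_like(grid)
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])       # "stay" action
O = np.array([[[0.9, 0.1], [0.1, 0.9]]])       # 90%-accurate sensor
R = np.array([[1.0], [0.0]])                   # reward 1 only in state 0
for _ in range(200):                           # value iteration to convergence
    V = np.array([bellman_backup(np.array([p, 1.0 - p]), V, T, O, R, 0.9, grid)
                  for p in grid])
```

Because the reward is linear in the belief and the belief is a martingale under this model, the converged value function is linear, $V(p) \approx 10p$ with $\gamma = 0.9$; the grid is tractable only because the belief here is one-dimensional, which is exactly what fails for realistic SLAM.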
Why It Matters
This result converts Active SLAM from a problem with hidden state into a planning problem in belief space. In principle, any MDP planning algorithm (value iteration, policy gradient) can be applied. In practice, the belief space is infinite-dimensional, so approximations are necessary.
Failure Mode
The belief space is continuous and high-dimensional (for SLAM, the belief includes uncertainty over the entire map). Exact solutions are intractable except for very small problems. Practical Active SLAM systems use heuristics such as planning one step ahead with information-theoretic objectives, or sampling a finite set of candidate actions.
Information-Theoretic Objectives
Since exact POMDP solutions are intractable for SLAM, practitioners often use greedy or short-horizon approximations with information-theoretic reward functions.
Shannon Entropy of Belief
The entropy of a belief is $H(b) = -\sum_s b(s) \log b(s)$ (discrete) or $H(b) = -\int b(s) \log b(s) \, ds$ (continuous). For a Gaussian belief with covariance $\Sigma \in \mathbb{R}^{n \times n}$, the differential entropy is $H = \tfrac{1}{2} \log\!\big( (2\pi e)^n \det \Sigma \big)$. Reducing entropy corresponds to reducing uncertainty about the state.
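The Gaussian differential entropy is a one-liner via a log-determinant (a minimal sketch; the example covariance is illustrative):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy (nats) of N(mu, Sigma): 1/2 log((2*pi*e)^n det Sigma)."""
    n = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)   # numerically stable log-determinant
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)

Sigma = np.diag([1.0, 0.25])
print(gaussian_entropy(Sigma))   # shrinking any eigenvalue lowers the entropy
```

Using `slogdet` rather than `log(det(...))` avoids overflow/underflow when the joint (pose, map) covariance is large, which is the common case in SLAM.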
A common Active SLAM objective is to choose the action that maximizes expected information gain:

$$a^* = \arg\max_{a} \; \Big[ H(b) - \mathbb{E}_{o \sim P(\cdot \mid b, a)} \big[ H(b^{a,o}) \big] \Big].$$
This is the mutual information between the future observation and the state, conditioned on the current belief and action. For Gaussian beliefs, the expected information gain depends on how much the covariance shrinks after incorporating the expected observation.
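Greedy action selection by expected information gain can be sketched for a discrete belief (illustrative setup: two candidate actions, one pointing the sensor at an informative landmark, one pointing it away):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))            # entropy in bits

def expected_info_gain(b, a, T, O):
    """I(s; o | b, a) = H(predicted belief) - E_o[H(b^{a,o})]."""
    pred = b @ T[a]
    gain = entropy(pred)
    for o in range(O.shape[2]):
        p_o = pred @ O[a][:, o]               # P(o | b, a)
        if p_o > 0:
            post = O[a][:, o] * pred / p_o    # posterior belief b^{a,o}
            gain -= p_o * entropy(post)
    return gain

# a=0 observes an informative landmark (90% accurate); a=1 looks away
# (uninformative, 50/50 sensor). Numbers are illustrative.
T = np.array([np.eye(2), np.eye(2)])
O = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
b = np.array([0.5, 0.5])
gains = [expected_info_gain(b, a, T, O) for a in range(2)]
best = int(np.argmax(gains))                  # → 0: the informative action wins
```

From the uniform prior, the informative action is worth about 0.53 bits while the uninformative one is worth exactly zero, so the greedy rule points the sensor at the landmark.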
Exploration vs. Exploitation
Active SLAM exhibits a specific form of the exploration-exploitation tradeoff. Exploring unknown regions adds new landmarks to the map but increases pose uncertainty (the robot moves away from known landmarks). Exploiting known landmarks improves localization but does not expand the map. Good Active SLAM policies interleave: explore to extend the map, then revisit known areas to tighten the pose estimate before exploring again.
This mirrors the exploration-exploitation tradeoff in reinforcement learning, but the "reward" is information rather than task performance. In multi-objective Active SLAM, the robot also has a task goal (reach a destination, inspect an area), and must balance task progress against information gathering.
Common Confusions
POMDPs are not just MDPs with noise
Adding observation noise to an MDP does not make it a POMDP in any useful sense. The key feature of a POMDP is that the agent must maintain a belief state and plan over it. An agent that ignores its uncertainty and treats the most likely state as the true state (certainty-equivalent control) can perform arbitrarily badly in POMDPs. The belief state is not optional.
Information gain is not the same as uncertainty reduction in the map alone
Active SLAM must reduce uncertainty in both the pose and the map jointly. A policy that minimizes map entropy while ignoring pose uncertainty will produce overconfident maps with incorrect geometry. The joint covariance over (pose, map) is what matters.
Key Takeaways
- Active SLAM selects actions to maximize mapping and localization quality
- The POMDP formulation captures partial observability: the robot does not know its exact pose or the full map
- Belief states (distributions over possible states) are the natural planning space
- The Bellman equation applies in belief space but is intractable for realistic SLAM problems
- Greedy information-theoretic objectives (maximize expected information gain) are the practical workhorse
- The exploration-exploitation tradeoff in Active SLAM mirrors the one in RL but concerns spatial information
Exercises
Problem
Consider a robot on a 1D line with two possible positions: $x = 0$ or $x = 1$. The belief is $b = (P(x{=}0), P(x{=}1))$ with $P(x{=}0) = 0.5$. The robot can take action "stay" (no movement) or "move" (deterministically moves to the other position). There is a landmark at $x = 0$ that produces observation $z = \text{detect}$ with probability 0.9 if the robot is at $x = 0$, and with probability 0.1 if at $x = 1$. Starting from $b = (0.5, 0.5)$, compute the expected entropy of the belief after taking action "stay" and receiving an observation.
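A numerical check for this problem can be sketched (assuming, as an illustrative reading, two positions with a 50/50 prior and a 0.9/0.1 detection likelihood):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

prior = np.array([0.5, 0.5])                 # (P(x=0), P(x=1))
likelihood = np.array([[0.9, 0.1],           # P(detect | x=0), P(no detect | x=0)
                       [0.1, 0.9]])          # P(detect | x=1), P(no detect | x=1)

expected_H = 0.0
for z in range(2):                           # average over both observations
    p_z = prior @ likelihood[:, z]           # marginal probability of observation z
    post = likelihood[:, z] * prior / p_z    # Bayes update under "stay"
    expected_H += p_z * entropy_bits(post)
print(expected_H)                            # ≈ 0.469 bits, down from 1 bit
```

Either observation yields a (0.9, 0.1) posterior, so the expected entropy drops from 1 bit to about 0.469 bits, an expected information gain of roughly 0.531 bits.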
Problem
In a Gaussian Active SLAM system, the joint state is $s = (x, m)$ with belief $b = \mathcal{N}(\mu, \Sigma)$. A candidate action produces a linear observation $z = H s + v$ where $v \sim \mathcal{N}(0, R)$. Show that the expected information gain is $\tfrac{1}{2} \log \det\!\big( I + H \Sigma H^\top R^{-1} \big)$ and explain why this depends only on $H$, $\Sigma$, and $R$, not on $\mu$.
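The identity in this exercise is easy to check numerically: the entropy drop $\tfrac{1}{2}(\log\det\Sigma - \log\det\Sigma')$, with $\Sigma' = (\Sigma^{-1} + H^\top R^{-1} H)^{-1}$, should match the closed form (a sketch with arbitrary illustrative dimensions and random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 2                                  # illustrative state / observation dims
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)              # prior covariance (positive definite)
H = rng.standard_normal((m, n))              # observation Jacobian
R = np.diag([0.5, 2.0])                      # observation noise covariance

# Posterior covariance after incorporating z = H s + v (information form).
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma) + H.T @ np.linalg.inv(R) @ H)

# Entropy reduction = 1/2 (log det Sigma - log det Sigma_post).
gain_direct = 0.5 * (np.linalg.slogdet(Sigma)[1]
                     - np.linalg.slogdet(Sigma_post)[1])

# Closed form from the exercise: 1/2 log det(I + H Sigma H^T R^^-1).
gain_closed = 0.5 * np.linalg.slogdet(
    np.eye(m) + H @ Sigma @ H.T @ np.linalg.inv(R))[1]

print(gain_direct, gain_closed)              # the two agree; mu never appears
```

The mean $\mu$ cancels because Gaussian entropy depends only on the covariance, and the Kalman covariance update itself never uses the measurement value, only the model matrices.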
References
Canonical:
- Kaelbling, Littman, Cassandra, Planning and Acting in Partially Observable Stochastic Domains, Artificial Intelligence (1998)
- Stachniss, Grisetti, Burgard, Information Gain-based Exploration Using Rao-Blackwellized Particle Filters, RSS (2005)
Current:
- Placed et al., A Survey on Active Simultaneous Localization and Mapping, Robotics and Autonomous Systems (2023)
- Lluvia, Lazkano, Ansuategui, Active Mapping and Robot Exploration: A Survey, Sensors (2021)
Next Topics
- Further research in this area connects to multi-robot exploration and deep RL for navigation
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- GraphSLAM and Factor Graphs (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)