Visual and Semantic SLAM
Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.
Why This Matters
Classical SLAM systems use laser range finders to measure distances. These sensors are expensive, heavy, and produce geometric point clouds with no notion of what objects are present. Cameras are cheap, lightweight, and capture rich appearance information. Visual SLAM extracts geometric structure from image sequences. Semantic SLAM goes further: it labels map elements with object categories, producing maps that a robot can reason about ("navigate to the chair" rather than "navigate to point (3.2, 1.7)").
Mental Model
Visual SLAM solves two problems simultaneously from camera images: where is the robot, and what does the world look like? Feature points tracked across frames give geometric constraints. When the robot revisits a location, matching features trigger loop closures that correct accumulated drift. Semantic SLAM adds a third question: what are the objects in the scene? Attaching labels like "door", "table", or "wall" to map regions enables higher-level planning and more robust data association.
Visual SLAM Foundations
Feature Extraction
Visual SLAM begins with detecting and describing keypoints in images. A keypoint detector finds distinctive image locations (corners, blobs). A descriptor encodes the local appearance around each keypoint as a vector.
Feature Descriptor
A feature descriptor is a $d$-dimensional vector $\mathbf{f}_i \in \mathbb{R}^d$ encoding the local image patch around keypoint $i$. Two keypoints in different images are matched when the descriptor distance $\|\mathbf{f}_i - \mathbf{f}_j\|$ is below a threshold. Standard descriptors include ORB (binary, 256 bits), SIFT (float, 128 dimensions), and SuperPoint (learned, 256 dimensions).
ORB (Oriented FAST and Rotated BRIEF) is fast and produces binary descriptors matched with Hamming distance. SuperPoint is a learned detector/descriptor trained on synthetic homographies, offering better performance under illumination changes at higher compute cost.
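A minimal sketch of ORB extraction and Hamming-distance matching with OpenCV; the image filenames are placeholders, and the feature count is an arbitrary illustrative choice.

```python
# Sketch: detect ORB keypoints in two frames and match their binary
# descriptors. Assumes img1.png / img2.png exist on disk.
import cv2

img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)          # detector + 256-bit binary descriptor
kp1, des1 = orb.detectAndCompute(img1, None)  # keypoints and descriptors per image
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with Hamming distance; crossCheck keeps
# only mutually-best matches, a cheap first outlier filter.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} putative matches; best distance {matches[0].distance}")
```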
Visual Odometry
Visual odometry estimates the camera motion between consecutive frames using matched features.
Essential Matrix
For calibrated cameras, the essential matrix $E$ encodes the relative rotation $R$ and translation $t$ between two views. It satisfies $E = [t]_\times R$, where $[t]_\times$ is the skew-symmetric matrix of $t$. For corresponding points $x_1, x_2$ in normalized coordinates: $x_2^\top E\, x_1 = 0$.
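A small numerical check of this definition, using a synthetic camera pair: build $E = [t]_\times R$ from an assumed rotation and translation, project one 3D point into both normalized views, and confirm the constraint holds.

```python
# Sketch: verify x2^T E x1 = 0 for a synthetic two-view setup.
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed relative motion: small rotation about y, translation along x.
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R

X = np.array([0.5, -0.2, 4.0])   # 3D point in the first camera's frame
x1 = X / X[2]                    # normalized coordinates in view 1
X2 = R @ X + t                   # same point in the second camera's frame
x2 = X2 / X2[2]                  # normalized coordinates in view 2

print(x2 @ E @ x1)               # ~0: the epipolar constraint holds
```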
Epipolar Constraint for Visual Odometry
Statement
Given $n \geq 5$ point correspondences $(x_i, x_i')$ in normalized image coordinates from two calibrated views, the essential matrix $E$ satisfying $x_i'^\top E\, x_i = 0$ for all $i$ can be recovered up to scale. Decomposing $E$ via SVD yields the relative rotation $R$ and translation direction $t / \|t\|$.
Intuition
Each point correspondence constrains the camera motion by one equation. Five equations suffice to pin down the five degrees of freedom of $E$ (3 rotation, 2 translation direction). Translation magnitude is unrecoverable from images alone without additional information.
Proof Sketch
Stack the epipolar constraints into a linear system $A\,\mathrm{vec}(E) = 0$, where each row of $A$ comes from one correspondence. With 8 or more points the null space of $A$ is generically one-dimensional, giving $E$ up to scale (the eight-point algorithm); with the minimal 5 points the null space is four-dimensional, and the constraint that a valid essential matrix has two equal singular values and one zero singular value resolves the remaining ambiguity (the five-point algorithm). SVD of $E$ gives $R$ and the direction of $t$, up to a four-fold ambiguity resolved by requiring triangulated points to lie in front of both cameras.
Why It Matters
This is the geometric foundation of visual odometry. Every visual SLAM system uses this or a variant (the fundamental matrix for uncalibrated cameras) to estimate frame-to-frame motion. RANSAC is applied on top to handle outlier matches.
Failure Mode
Fails when the scene is planar (degenerate configuration requiring a homography instead), when there is pure rotation with no translation (epipolar geometry is undefined), or when feature matches are predominantly incorrect. Also degrades with rolling shutter cameras where the pinhole model is violated.
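A hedged sketch of this estimation step using OpenCV's essential-matrix solver with RANSAC. The inputs are assumptions: `pts1`/`pts2` are Nx2 arrays of matched pixel coordinates (e.g. from the ORB sketch above) and `K` is the 3x3 camera intrinsic matrix.

```python
# Sketch: frame-to-frame relative pose from matched features.
import cv2
import numpy as np

def relative_pose(pts1, pts2, K):
    # RANSAC rejects outlier matches while fitting E.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # recoverPose selects the (R, t) decomposition of E that places the
    # triangulated points in front of both cameras; |t| = 1, since the
    # translation magnitude is unrecoverable from images alone.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t, inliers
```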
Loop Closure Detection
As the robot moves, visual odometry accumulates drift. Loop closure detects when the robot returns to a previously visited location by matching current features against a database of past keyframes.
The standard approach uses bag-of-visual-words: feature descriptors are quantized into a visual vocabulary (trained offline via k-means on a large descriptor set), and each keyframe is represented as a histogram over visual words. A new frame queries the database using tf-idf scoring. When a match is found, a geometric verification step (computing the relative pose) confirms the loop closure, and the pose graph is optimized.
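A minimal sketch of that pipeline under stated assumptions: `train_descriptors` is a hypothetical offline descriptor set and `db_hists` holds one histogram per past keyframe. For simplicity this treats descriptors as float vectors with flat k-means; production systems (e.g. DBoW2) use a hierarchical vocabulary and Hamming distance for binary descriptors.

```python
# Sketch: visual vocabulary, bag-of-words histograms, and tf-idf scoring.
import numpy as np
from sklearn.cluster import KMeans

VOCAB_SIZE = 1000
vocab = KMeans(n_clusters=VOCAB_SIZE, n_init=10).fit(train_descriptors)

def bow_histogram(descriptors):
    """Quantize one frame's descriptors into a visual-word histogram."""
    words = vocab.predict(descriptors.astype(np.float32))
    return np.bincount(words, minlength=VOCAB_SIZE).astype(np.float64)

def tfidf_scores(query_hist, db_hists):
    """Cosine similarity between tf-idf weighted histograms."""
    db = np.asarray(db_hists, dtype=np.float64)
    # idf: down-weight visual words that appear in many past keyframes
    idf = np.log((1 + len(db)) / (1 + (db > 0).sum(axis=0)))
    q = query_hist * idf
    D = db * idf
    q /= np.linalg.norm(q) + 1e-12
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12
    return D @ q   # one similarity score per past keyframe
```

The top-scoring keyframes are then passed to geometric verification before the loop closure is accepted.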
Semantic SLAM
Semantic Map
A semantic map augments a geometric map with object-level labels. Each map element (point, surface, or region) carries a category label $c \in \{1, \dots, K\}$ with an associated probability $p(c)$. The map encodes both where things are and what they are.
Why Semantics Helps
Semantic labels improve SLAM in three ways. First, data association becomes more robust: matching a detected "chair" in the current frame to a "chair" in the map is more reliable than matching anonymous point clouds. Second, semantic maps enable task-level planning: a robot told to "go to the kitchen" can use semantic labels to identify the goal. Third, semantic priors constrain geometry: floors are horizontal, walls are vertical, and objects have expected sizes.
Architectures
ORB-SLAM (Mur-Artal et al., 2015) is a feature-based visual SLAM system with three parallel threads: tracking (visual odometry), local mapping (bundle adjustment over recent keyframes), and loop closing (bag-of-words detection + pose graph optimization). ORB-SLAM2 extends this to stereo and RGB-D cameras. ORB-SLAM3 adds IMU integration and multi-map support.
RTAB-MAP (Labbe and Michaud, 2019) is a graph-based SLAM system designed for long-term operation. It manages memory by transferring old nodes to long-term storage and retrieving them when loop closures are detected, allowing it to run indefinitely without growing memory usage.
Semantic SLAM systems typically run a CNN-based object detector or segmentation network (Mask R-CNN, YOLO) in parallel with the geometric SLAM pipeline and fuse the outputs. The object detections provide semantic labels that are projected into the 3D map.
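As a sketch of that projection step for an RGB-D setup: back-project the pixels inside a detection mask through the depth image and transform them into the world frame, where the label can be attached to nearby map points. The inputs (`mask`, `depth`, `K`, `T_world_cam`) and the function name are illustrative, not part of any particular system.

```python
# Sketch: lift a 2D detection into 3D world-frame points via depth.
import numpy as np

def backproject_detection(mask, depth, K, T_world_cam):
    """Return Nx3 world-frame points for pixels inside a detection mask."""
    v, u = np.nonzero(mask)          # pixel rows/cols of the detection
    z = depth[v, u]
    valid = z > 0                    # ignore pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # pinhole back-projection into the camera frame
    pts_cam = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    # rigid transform (4x4 homogeneous) into the world frame
    pts_h = np.hstack([pts_cam, np.ones((len(pts_cam), 1))])
    return (T_world_cam @ pts_h.T).T[:, :3]
```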
Common Confusions
Visual odometry is not visual SLAM
Visual odometry estimates the local trajectory from frame to frame. Visual SLAM also builds a persistent map and performs loop closures to correct global drift. A system doing only visual odometry will accumulate unbounded drift over time.
Semantic SLAM does not require perfect segmentation
The semantic labels do not need to be correct in every frame. Bayesian fusion over multiple observations from different viewpoints allows the map to converge to correct labels even when individual detections are noisy. A chair misclassified as a table in one frame will be corrected by ten frames that classify it correctly.
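A minimal sketch of this fusion rule, assuming observations are conditionally independent given the true class (the same model as the exercise below): each update multiplies the belief by the detector's per-class probabilities, done in log space for numerical stability. The `LabelBelief` class and the example numbers are illustrative.

```python
# Sketch: Bayesian fusion of noisy per-frame class probabilities.
import numpy as np

class LabelBelief:
    def __init__(self, num_classes, prior=None):
        p = np.full(num_classes, 1.0 / num_classes) if prior is None else prior
        self.log_belief = np.log(p)

    def update(self, class_probs, eps=1e-9):
        """Fuse one observation: posterior ∝ prior × likelihood."""
        self.log_belief += np.log(np.asarray(class_probs) + eps)
        self.log_belief -= np.max(self.log_belief)   # keep values bounded

    @property
    def posterior(self):
        p = np.exp(self.log_belief)
        return p / p.sum()

# One bad detection (chair mistaken for table) is outvoted by good ones.
belief = LabelBelief(num_classes=3)         # classes: [chair, table, wall]
belief.update([0.2, 0.7, 0.1])              # the one misclassified frame
for _ in range(10):
    belief.update([0.8, 0.15, 0.05])        # ten correct frames
print(belief.posterior)                     # mass concentrates on "chair"
```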
Key Takeaways
- Visual SLAM replaces lidar with cameras, using feature matching and epipolar geometry for pose estimation
- Loop closure detection corrects accumulated drift by recognizing revisited locations
- Semantic SLAM attaches category labels to map elements, improving data association and enabling task-level planning
- ORB-SLAM is the standard open-source feature-based visual SLAM system
- Semantic labels are fused probabilistically over time, so individual detection errors are tolerable
Exercises
Problem
A monocular visual odometry system estimates translation only up to scale. Explain why, and describe two ways to recover absolute scale.
Problem
In a semantic SLAM system, you observe an object from $N$ viewpoints. Each viewpoint $t$ produces a categorical distribution $p(z_t \mid c)$ over $K$ classes from a segmentation network. Assuming observations are conditionally independent given the true class $c$, write the Bayesian update for the posterior $p(c \mid z_{1:N})$ and show how it simplifies to a product of likelihoods.
References
Canonical:
- Mur-Artal, Montiel, Tardos, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, IEEE T-RO (2015)
- Hartley and Zisserman, Multiple View Geometry in Computer Vision (2004), Chapters 9-11
Current:
- Campos et al., ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM, IEEE T-RO (2021)
- Labbe and Michaud, RTAB-Map as an Open-Source Lidar and Visual SLAM Library for Large-Scale and Long-Term Online Operation, Journal of Field Robotics (2019)
Next Topics
- Active SLAM and POMDPs: choosing actions to maximize map quality
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- GraphSLAM and Factor Graphs (Layer 3)