
Visual and Semantic SLAM

Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.


Why This Matters

Classical SLAM systems use laser range finders to measure distances. These sensors are expensive, heavy, and produce geometric point clouds with no notion of what objects are present. Cameras are cheap, lightweight, and capture rich appearance information. Visual SLAM extracts geometric structure from image sequences. Semantic SLAM goes further: it labels map elements with object categories, producing maps that a robot can reason about ("navigate to the chair" rather than "navigate to point (3.2, 1.7)").

Mental Model

Visual SLAM solves two problems simultaneously from camera images: where is the robot, and what does the world look like? Feature points tracked across frames give geometric constraints. When the robot revisits a location, matching features trigger loop closures that correct accumulated drift. Semantic SLAM adds a third question: what are the objects in the scene? Attaching labels like "door", "table", or "wall" to map regions enables higher-level planning and more robust data association.

Visual SLAM Foundations

Feature Extraction

Visual SLAM begins with detecting and describing keypoints in images. A keypoint detector finds distinctive image locations (corners, blobs). A descriptor encodes the local appearance around each keypoint as a vector.

Definition

Feature Descriptor

A feature descriptor $d_i$ is a $k$-dimensional vector encoding the local image patch around keypoint $i$. Two keypoints in different images are matched when $\|d_i - d_j\|$ is below a threshold. Standard descriptors include ORB (binary, 256 bits), SIFT (float, 128 dimensions), and SuperPoint (learned, 256 dimensions).

ORB (Oriented FAST and Rotated BRIEF) is fast and produces binary descriptors matched with Hamming distance. SuperPoint is a learned detector/descriptor trained on synthetic homographies, offering better performance under illumination changes at higher compute cost.
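To make the matching step concrete, here is a minimal pure-Python sketch of brute-force matching of binary descriptors with Hamming distance. The toy 16-bit integers stand in for real 256-bit ORB descriptors, and the function names and threshold are illustrative, not from any particular library:

```python
# Sketch of binary-descriptor matching with Hamming distance, as used
# for ORB features. Toy 16-bit descriptors; real ORB uses 256 bits.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def match(query, database, max_dist=4):
    """Brute-force nearest-neighbour matching with a distance threshold."""
    matches = []
    for qi, q in enumerate(query):
        best_j, best_d = min(
            ((j, hamming(q, d)) for j, d in enumerate(database)),
            key=lambda jd: jd[1],
        )
        if best_d <= max_dist:  # reject weak matches outright
            matches.append((qi, best_j, best_d))
    return matches

# Toy data: query[0] differs from database[0] by a single bit;
# query[1] is far from everything and goes unmatched.
query = [0b1010110011010001, 0b0000111100001111]
database = [0b1010110011010101, 0b1111000011110000]
print(match(query, database))  # [(0, 0, 1)]
```

Real systems replace the brute-force loop with approximate nearest-neighbour search, but the distance function and thresholding logic are the same.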

Visual Odometry

Visual odometry estimates the camera motion between consecutive frames using matched features.

Definition

Essential Matrix

For calibrated cameras, the essential matrix $E \in \mathbb{R}^{3 \times 3}$ encodes the relative rotation $R$ and translation $t$ between two views. It satisfies $E = [t]_\times R$, where $[t]_\times$ is the skew-symmetric matrix of $t$. For corresponding points $p_1, p_2$ in normalized coordinates: $p_2^\top E\, p_1 = 0$.

Theorem

Epipolar Constraint for Visual Odometry

Statement

Given $n \geq 5$ point correspondences $\{(p_1^{(i)}, p_2^{(i)})\}$ in normalized image coordinates from two calibrated views, the essential matrix $E$ satisfying $p_2^{(i)\top} E\, p_1^{(i)} = 0$ for all $i$ can be recovered up to scale. Decomposing $E$ via SVD yields the relative rotation $R$ and translation direction $t/\|t\|$.

Intuition

Each point correspondence constrains the camera motion by one equation. Five equations suffice to pin down the five degrees of freedom of $E$ (3 rotation, 2 translation direction). Translation magnitude is unrecoverable from images alone without additional information.

Proof Sketch

Stack the epipolar constraints into a linear system $A\,\mathrm{vec}(E) = 0$ where each row comes from one correspondence. With 5 or more points, the null space of $A$ is one-dimensional (generically), giving $E$ up to scale. The constraint that $E$ has two equal singular values and one zero singular value resolves remaining ambiguity. SVD of $E = U \Sigma V^\top$ gives $R$ and $t$.
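The constraint is easy to verify numerically. The sketch below (plain Python with illustrative helper names) builds $E = [t]_\times R$ from a known motion, projects a 3D point into both views, and checks $p_2^\top E\, p_1 = 0$; scaling $t$ leaves the residual at zero, which is the up-to-scale ambiguity in action:

```python
import math

def skew(t):
    """Skew-symmetric matrix [t]_x so that skew(t) @ v = t x v."""
    tx, ty, tz = t
    return [[0, -tz, ty], [tz, 0, -tx], [-ty, tx, 0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

# Ground-truth relative motion between the two views
R = rot_z(0.1)
t = [1.0, 0.2, 0.0]
E = matmul(skew(t), R)

# A 3D point in view 1's frame; in view 2's frame: X2 = R X1 + t
X1 = [0.5, -0.3, 4.0]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]

# Normalized image coordinates (divide by depth)
p1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
p2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

residual = sum(p2[i] * matvec(E, p1)[i] for i in range(3))
print(abs(residual) < 1e-12)  # epipolar constraint holds

# Doubling t scales E but the constraint still holds: scale is lost.
E2 = matmul(skew([2 * v for v in t]), R)
residual2 = sum(p2[i] * matvec(E2, p1)[i] for i in range(3))
print(abs(residual2) < 1e-12)
```

The second check is exactly why monocular visual odometry recovers only the translation direction $t/\|t\|$.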

Why It Matters

This is the geometric foundation of visual odometry. Every visual SLAM system uses this or a variant (the fundamental matrix for uncalibrated cameras) to estimate frame-to-frame motion. RANSAC is applied on top to handle outlier matches.

Failure Mode

Fails when the scene is planar (degenerate configuration requiring a homography instead), when there is pure rotation with no translation (epipolar geometry is undefined), or when feature matches are predominantly incorrect. Also degrades with rolling shutter cameras where the pinhole model is violated.

Loop Closure Detection

As the robot moves, visual odometry accumulates drift. Loop closure detects when the robot returns to a previously visited location by matching current features against a database of past keyframes.

The standard approach uses bag-of-visual-words: feature descriptors are quantized into a visual vocabulary (trained offline via k-means on a large descriptor set), and each keyframe is represented as a histogram over visual words. A new frame queries the database using tf-idf scoring. When a match is found, a geometric verification step (computing the relative pose) confirms the loop closure, and the pose graph is optimized.
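A minimal sketch of this retrieval step, assuming keyframes have already been quantized into visual-word ids (toy data; names and the tiny vocabulary are illustrative):

```python
import math
from collections import Counter

# Bag-of-visual-words retrieval with tf-idf scoring for loop-closure
# candidates. "Words" are toy integer ids; a real system quantizes
# ORB/SIFT descriptors against an offline-trained vocabulary.

def tfidf_vector(words, idf):
    """tf-idf weighted histogram over visual words."""
    counts = Counter(words)
    n = len(words)
    return {w: (c / n) * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Keyframe database: lists of visual-word ids per keyframe
keyframes = {
    "kf0": [1, 1, 2, 3, 5],
    "kf1": [7, 8, 8, 9, 9],
    "kf2": [1, 2, 3, 3, 5],
}

# idf from the database: log(N / #keyframes containing the word),
# so words seen everywhere carry little weight
N = len(keyframes)
df = Counter(w for words in keyframes.values() for w in set(words))
idf = {w: math.log(N / d) for w, d in df.items()}

db_vecs = {k: tfidf_vector(words, idf) for k, words in keyframes.items()}
query = tfidf_vector([1, 1, 2, 3, 5], idf)  # revisits kf0's location

best = max(db_vecs, key=lambda k: cosine(query, db_vecs[k]))
print(best)  # kf0
```

In a full pipeline the top-scoring candidate would then go through geometric verification before the pose graph is updated.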

Semantic SLAM

Definition

Semantic Map

A semantic map augments a geometric map with object-level labels. Each map element (point, surface, or region) carries a category label $c \in \{1, \ldots, K\}$ with an associated probability $P(c \mid \text{observations})$. The map encodes both where things are and what they are.

Why Semantics Helps

Semantic labels improve SLAM in three ways. First, data association becomes more robust: matching a detected "chair" in the current frame to a "chair" in the map is more reliable than matching anonymous point clouds. Second, semantic maps enable task-level planning: a robot told to "go to the kitchen" can use semantic labels to identify the goal. Third, semantic priors constrain geometry: floors are horizontal, walls are vertical, and objects have expected sizes.

Architectures

ORB-SLAM (Mur-Artal et al., 2015) is a feature-based visual SLAM system with three parallel threads: tracking (visual odometry), local mapping (bundle adjustment over recent keyframes), and loop closing (bag-of-words detection + pose graph optimization). ORB-SLAM2 extends this to stereo and RGB-D cameras. ORB-SLAM3 adds IMU integration and multi-map support.

RTAB-MAP (Labbe and Michaud, 2019) is a graph-based SLAM system designed for long-term operation. It manages memory by transferring old nodes to long-term storage and retrieving them when loop closures are detected, allowing it to run indefinitely without growing memory usage.

Semantic SLAM systems typically run a CNN-based object detector or segmentation network (Mask R-CNN, YOLO) in parallel with the geometric SLAM pipeline and fuse the outputs. The object detections provide semantic labels that are projected into the 3D map.

Common Confusions

Watch Out

Visual odometry is not visual SLAM

Visual odometry estimates the local trajectory from frame to frame. Visual SLAM also builds a persistent map and performs loop closures to correct global drift. A system doing only visual odometry will accumulate unbounded drift over time.
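A toy calculation makes the drift concrete: composing per-frame motion estimates that each carry a small constant heading bias (the bias value here is an illustrative assumption) produces endpoint error that keeps growing with trajectory length:

```python
import math

# Why visual odometry alone drifts: integrate per-frame motion
# estimates with a tiny constant heading bias and measure endpoint
# error against the drift-free straight-line trajectory.

def integrate(steps, step_len=0.1, heading_bias=0.001):
    x = y = theta = 0.0
    for _ in range(steps):
        theta += heading_bias           # rotational error accumulates
        x += step_len * math.cos(theta)
        y += step_len * math.sin(theta)
    return x, y

def endpoint_error(steps, step_len=0.1):
    x, y = integrate(steps, step_len)
    # drift-free odometry would travel straight along x
    return math.hypot(x - step_len * steps, y)

print(endpoint_error(100) < endpoint_error(1000))  # True: drift grows
```

A loop closure back to the start would collapse this accumulated error in one pose-graph optimization, which is exactly what visual odometry alone cannot do.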

Watch Out

Semantic SLAM does not require perfect segmentation

The semantic labels do not need to be correct in every frame. Bayesian fusion over multiple observations from different viewpoints allows the map to converge to correct labels even when individual detections are noisy. A chair misclassified as a table in one frame will be corrected by ten frames that classify it correctly.
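The fusion described above can be sketched in a few lines, assuming the per-frame class distributions are used directly as likelihoods (a simplification; real systems may also model detector confusion):

```python
# Bayesian label fusion for one map element over multiple views,
# assuming per-frame class distributions are conditionally
# independent given the true class.

def fuse(prior, observations):
    """Posterior over K classes after multiplying in each frame's likelihood."""
    post = list(prior)
    for q in observations:
        post = [p * l for p, l in zip(post, q)]
        s = sum(post)
        post = [p / s for p in post]  # renormalize after each update
    return post

prior = [0.5, 0.5]        # classes: 0 = chair, 1 = table

# one frame misclassifies the chair as a table...
noisy = [[0.3, 0.7]]
# ...but several later frames get it right
good = [[0.8, 0.2]] * 4

post = fuse(prior, noisy + good)
print(post[0] > 0.9)  # True: the chair label dominates anyway
```

The single bad detection is simply outvoted by the product of likelihoods, which is the mechanism behind the "does not require perfect segmentation" point.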

Key Takeaways

  • Visual SLAM replaces lidar with cameras, using feature matching and epipolar geometry for pose estimation
  • Loop closure detection corrects accumulated drift by recognizing revisited locations
  • Semantic SLAM attaches category labels to map elements, improving data association and enabling task-level planning
  • ORB-SLAM is the standard open-source feature-based visual SLAM system
  • Semantic labels are fused probabilistically over time, so individual detection errors are tolerable

Exercises

ExerciseCore

Problem

A monocular visual odometry system estimates translation only up to scale. Explain why, and describe two ways to recover absolute scale.

ExerciseAdvanced

Problem

In a semantic SLAM system, you observe an object from $T$ viewpoints. Each viewpoint $t$ produces a categorical distribution $q_t \in \Delta^K$ over $K$ classes from a segmentation network. Assuming observations are conditionally independent given the true class, write the Bayesian update for the posterior $P(c \mid q_1, \ldots, q_T)$ and show how it simplifies to a product of likelihoods.

References

Canonical:

  • Mur-Artal, Montiel, Tardos, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, IEEE T-RO (2015)
  • Hartley and Zisserman, Multiple View Geometry in Computer Vision (2004), Chapters 9-11

Current:

  • Campos et al., ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM, IEEE T-RO (2021)
  • Labbe and Michaud, RTAB-Map as an Open-Source Lidar and Visual SLAM Library for Large-Scale and Long-Term Online Operation, Journal of Field Robotics (2019)


Last reviewed: April 2026
