
ML Methods

Object Detection and Segmentation

Localizing and classifying objects in images: two-stage (R-CNN), one-stage (YOLO, SSD), anchor-free (CenterNet, FCOS) detectors, semantic and instance segmentation, and the IoU/mAP evaluation framework.


Why This Matters

Object detection and segmentation extend image classification from "what is in this image?" to "where is each object, and what is it?" Building on convolutional neural networks, this is the foundation of autonomous driving, medical imaging, robotics, and visual understanding systems.

The core challenge: outputting a variable number of bounding boxes (or pixel masks) from a fixed-size input. Different architectures trade off speed, accuracy, and design complexity to solve this.

Mental Model

Image classification assigns one label per image. Object detection assigns multiple bounding boxes, each with a class label and confidence score. Segmentation assigns a label to each pixel. As you move from classification to detection to segmentation, the output becomes more spatially precise and the problem becomes harder.

Evaluation Metrics

Definition

Intersection over Union

For a predicted bounding box $B_p$ and ground truth box $B_{gt}$:

$$\text{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$

IoU ranges from 0 (no overlap) to 1 (perfect overlap). A detection is considered correct if $\text{IoU} \geq \tau$ for some threshold $\tau$ (commonly $\tau = 0.5$).
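The definition above translates directly into code. A minimal sketch, assuming boxes in corner format `(x1, y1, x2, y2)`:

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

The `max(0.0, ...)` clamps are what make non-overlapping boxes yield an intersection of zero rather than a negative area.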

Proposition

IoU as a Metric

Statement

IoU satisfies: (1) $0 \leq \text{IoU} \leq 1$, (2) $\text{IoU}(A, B) = 1$ if and only if $A = B$, (3) $\text{IoU}(A, B) = \text{IoU}(B, A)$. Moreover, $1 - \text{IoU}$ (the Jaccard distance) is a valid metric, satisfying the triangle inequality, on the space of compact sets.

Intuition

IoU measures the fraction of the combined area that is shared. It penalizes both missed regions and extra predicted regions equally. The metric property of $1 - \text{IoU}$ means you can use it as a proper distance between shapes.

Proof Sketch

Symmetry and boundedness follow from the definition. For the triangle inequality of $1 - \text{IoU}$, use the fact that $|A \cup B| \leq |A \cup C| + |B \cup C| - |C|$ and relate this to the intersection terms.

Why It Matters

IoU is the standard evaluation metric for detection and segmentation. Understanding that it penalizes both false positive area and false negative area equally explains why models must be precise in both box size and location.

Failure Mode

IoU is zero when boxes do not overlap at all, even if they are very close. This makes IoU a poor loss function for training (zero gradient for non-overlapping predictions). Generalized IoU (GIoU) and Distance IoU (DIoU) address this by incorporating the gap between non-overlapping boxes.
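A sketch of the GIoU fix mentioned above: subtract from IoU the fraction of the smallest enclosing box not covered by the union, so separated boxes get a negative score that grows with the gap (this follows the standard GIoU formula; the box format is an assumption of this sketch).

```python
def giou(box_a, box_b):
    """Generalized IoU for corner-format boxes (x1, y1, x2, y2).

    GIoU = IoU - |C \ (A union B)| / |C|, where C is the smallest
    axis-aligned box enclosing both inputs. Ranges over (-1, 1].
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    return inter / union - (c_area - union) / c_area
```

Unlike plain IoU, two non-overlapping boxes produce a nonzero (negative) value, so $1 - \text{GIoU}$ gives a useful gradient even before the predicted box touches the target.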

Definition

Mean Average Precision

For each class, compute precision-recall by varying the confidence threshold. Average Precision (AP) is the area under the precision-recall curve. mAP is the mean of AP across all classes. COCO mAP averages over multiple IoU thresholds (0.5 to 0.95 in steps of 0.05).
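The per-class AP computation can be sketched as follows, assuming the detections have already been matched to ground truth at a fixed IoU threshold (the function name and input format here are illustrative, not from any particular library):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point-interpolated AP for one class.

    scores: confidence of each detection; is_tp: 1 if the detection
    matched a ground truth box at the chosen IoU threshold, else 0;
    num_gt: number of ground truth boxes for this class.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Precision envelope: at each recall level, take the max precision
    # achieved at that recall or higher, then integrate the curve.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum(np.diff(recall) * precision))
```

mAP is then the mean of this quantity over classes; COCO additionally averages it over IoU thresholds from 0.5 to 0.95.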

Two-Stage Detectors: R-CNN Family

R-CNN (2014): Generate ~2000 region proposals (via selective search), run a CNN on each, classify with an SVM. Slow: one forward pass per region.

Fast R-CNN: Run the CNN once on the entire image, extract features for each region from the shared feature map using RoI pooling. Much faster: one CNN forward pass total. Training uses cross-entropy loss for classification and smooth L1 loss for bounding box regression.
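The smooth L1 loss used for box regression is quadratic near zero and linear for large residuals, which keeps gradients bounded when a regression target is far off. A scalar sketch (the `beta` transition point is the standard parameterization; frameworks apply this elementwise to the four box coordinates):

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber-style) loss for one box-regression residual x:
    0.5 * x^2 / beta for |x| < beta, |x| - 0.5 * beta otherwise."""
    ax = abs(x)
    return 0.5 * x * x / beta if ax < beta else ax - 0.5 * beta
```

Compared to plain L2, outlier residuals contribute linearly rather than quadratically, so a few badly mislocalized boxes do not dominate the gradient.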

Faster R-CNN: Replace selective search with a Region Proposal Network (RPN) that shares convolutional features with the detector. The RPN predicts region proposals from anchor boxes at each spatial location. End-to-end trainable.

The two-stage design (propose then classify) gives high accuracy but lower speed than one-stage methods.

One-Stage Detectors: YOLO and SSD

YOLO (You Only Look Once): Divide the image into an $S \times S$ grid. Each grid cell predicts $B$ bounding boxes and class probabilities directly. One forward pass predicts all objects. Fast enough for real-time inference.
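The grid design fixes the output size: each cell emits $B$ boxes with 5 values each (x, y, w, h, confidence) plus $C$ class probabilities shared across the cell. A sketch using the original YOLOv1 configuration ($S = 7$, $B = 2$, $C = 20$ for PASCAL VOC):

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Output tensor shape for the original YOLO: S*S grid cells, each
    predicting B boxes of (x, y, w, h, confidence) plus C class
    probabilities shared across that cell's boxes."""
    return (S, S, B * 5 + C)
```

This fixed shape is how a one-stage detector sidesteps the variable-output problem: the network always predicts the same tensor, and low-confidence slots are simply discarded at inference time.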

SSD (Single Shot MultiBox Detector): Predict boxes at multiple scales from feature maps at different resolutions in the CNN backbone. This handles objects of different sizes without the region proposal step.

The key tradeoff: one-stage detectors are faster but historically less accurate on small objects. YOLOv8 and later versions have largely closed this accuracy gap.

Anchor-Free Detectors

Anchor-based methods (Faster R-CNN, YOLO, SSD) predefine a set of reference boxes at each location. The model predicts offsets from these anchors. Choosing anchor sizes and aspect ratios requires dataset-specific tuning.

CenterNet: Predict the center point of each object as a heatmap peak, then regress the width and height. No anchors needed.

FCOS (Fully Convolutional One-Stage): At each spatial location, predict whether it is inside any ground truth box, and if so, predict the distances to the four box boundaries. Removes all anchor-related hyperparameters.
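The FCOS per-location target assignment can be sketched as follows (the function name and return format are illustrative; the real implementation also handles multiple overlapping boxes and feature-pyramid level assignment):

```python
def fcos_regression_targets(x, y, box):
    """FCOS regression targets at location (x, y): distances to the
    left, top, right, and bottom edges of a ground truth box
    (x1, y1, x2, y2). The location is a positive sample only if all
    four distances are positive, i.e. it lies strictly inside the box."""
    x1, y1, x2, y2 = box
    l, t = x - x1, y - y1
    r, b = x2 - x, y2 - y
    return (l, t, r, b), min(l, t, r, b) > 0
```

Because the targets are just four distances per location, there are no anchor sizes or aspect ratios to tune.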

Anchor-free methods simplify the pipeline and reduce the number of design choices.

Segmentation

Definition

Semantic Segmentation

Assign a class label to every pixel in the image. All pixels of the same class share one label, regardless of which object instance they belong to. Two cars next to each other are labeled identically.

Definition

Instance Segmentation

Detect each object and produce a per-pixel mask for it. Two adjacent cars get separate masks. Mask R-CNN extends Faster R-CNN by adding a mask prediction branch in parallel with the box and class branches.

Definition

Panoptic Segmentation

Combine semantic and instance segmentation: every pixel gets a class label, and every countable object gets a unique instance ID. Background classes (sky, road) get semantic labels only; foreground objects (cars, people) get both semantic and instance labels.

SAM: Segment Anything Model

SAM (2023) is a promptable segmentation foundation model. Given an image and a prompt (point, box, or text), SAM produces a segmentation mask. Trained on over 1 billion masks, SAM generalizes to objects and image types not seen during training.

The architecture: a heavyweight image encoder (ViT) runs once per image, producing image embeddings. A lightweight mask decoder takes the prompt and image embeddings and produces masks in real time. This design amortizes the expensive encoder across multiple prompts.

SAM demonstrates that segmentation can be treated as a foundation model problem rather than a task-specific one.

Common Confusions

Watch Out

IoU threshold changes what counts as correct

At IoU 0.5 (the PASCAL VOC standard), a box that covers half the object counts as correct. At IoU 0.75, the box must be much more precise. COCO mAP averages over thresholds from 0.5 to 0.95, which is a much stricter evaluation than single-threshold mAP. Always check which mAP variant a paper reports.

Watch Out

Non-maximum suppression is a post-processing step, not part of the model

Most detectors produce many overlapping predictions for the same object. NMS removes duplicates by keeping the highest-confidence box and suppressing any box with IoU above a threshold. NMS is a heuristic: it can suppress correct detections of nearby objects. End-to-end set prediction methods like DETR avoid NMS entirely.
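The greedy NMS procedure described above, as a minimal sketch for corner-format boxes:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: visit boxes in descending score
    order, keeping a box only if its IoU with every already-kept box is
    at most the threshold. Returns indices of kept boxes."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The failure mode from the text is visible in the code: two genuinely distinct but heavily overlapping objects will be merged into one detection, because suppression looks only at IoU and score, not at whether the boxes belong to different objects.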

Watch Out

Semantic segmentation does not separate instances

If two people stand next to each other, semantic segmentation labels all their pixels as "person" with no way to distinguish person 1 from person 2. Instance segmentation is required for counting objects or tracking individuals.

Key Takeaways

  • Object detection outputs a variable number of boxes with class labels
  • Two-stage detectors (R-CNN family) propose then classify; one-stage (YOLO, SSD) predict directly
  • Anchor-free methods (CenterNet, FCOS) remove anchor hyperparameters
  • IoU measures box overlap; mAP aggregates precision-recall across classes and IoU thresholds
  • Semantic segmentation labels pixels; instance segmentation separates objects; panoptic does both
  • SAM treats segmentation as a foundation model with promptable inference

Exercises

ExerciseCore

Problem

A predicted box has corners at $(10, 10)$ and $(50, 50)$. The ground truth box has corners at $(20, 20)$ and $(60, 60)$. Compute the IoU.

ExerciseAdvanced

Problem

Explain why using IoU directly as a training loss is problematic when the predicted and ground truth boxes do not overlap. What property does Generalized IoU (GIoU) add to fix this?

References

Canonical:

  • Ren et al., "Faster R-CNN: Towards Real-Time Object Detection" (NeurIPS 2015)
  • Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection" (CVPR 2016)

Current:

  • Kirillov et al., "Segment Anything" (ICCV 2023)

  • Tian et al., "FCOS: Fully Convolutional One-Stage Object Detection" (ICCV 2019)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
