Paper breakdown
SAM 3: Segment Anything with Concepts
Nicolas Carion et al. · 2025 · arXiv preprint
Generalizes the Segment Anything line from per-prompt single-object segmentation to Promptable Concept Segmentation: given a noun phrase or image exemplar, return masks for every matching instance across an image or short video. Adds a presence token that decouples recognition from localization, doubling open-vocabulary segmentation accuracy on the new SA-Co benchmark.
Overview
Carion and collaborators (2025) extend the Segment Anything line to the Promptable Concept Segmentation (PCS) task. The earlier SAM and SAM 2 papers fixed the visual prompt — a point, box, or coarse mask — and returned a single object instance. SAM 3 fixes a concept — a short noun phrase like "yellow school bus" or one or more image exemplars — and returns masks for every matching instance across the image, or a tracked masklet across a short video. The model also subsumes the prior promptable visual segmentation (PVS) capability, so a single SAM 3 checkpoint handles both interactive single-object refinement and concept-level mass segmentation.
The architectural shift behind PCS is mechanistic. SAM 1 and SAM 2 were prompt encoders followed by a single-object mask decoder; the search over instances had to be done outside the model with an open-vocabulary detector like OWLv2 feeding boxes back in. SAM 3 internalizes that detector. It is a DETR-style image-level model whose object queries cross-attend to text and exemplar embeddings; each query independently predicts a presence-conditioned match score and a mask. The novel architectural piece is a per-frame presence token that predicts globally, decoupling the "is this concept here at all" decision from the per-query localization decision. Empirically this lifts cgF1 by 6–8 points on long-tail concepts where the recognition signal is weak.
The headline empirical numbers: zero-shot LVIS mask AP improves over the strongest prior baseline, and cgF1 (the PCS evaluation metric) on SA-Co/Gold is roughly double that of OWLv2. On video PCS the model maintains near-real-time inference latency with multiple concurrently tracked objects on an H200 GPU. The accompanying SA-Co dataset has 5.2M images and 4M unique noun phrases, the largest open-vocabulary segmentation dataset to date.
Mathematical Contributions
The Promptable Concept Segmentation task
Given an image or short video and a concept prompt — a noun phrase, a set of positive and negative image exemplars, or both — the model must produce a set of instance masks together with stable instance identities preserved across video frames. Formally, the output is $\{(m_i, \mathrm{id}_i)\}_{i=1}^{N}$, where $N$ is the number of matching instances, $m_i$ is a binary mask, and $\mathrm{id}_i$ is an identity that persists across frames. The vocabulary is restricted to atomic visual concepts — simple noun phrases of a noun and optional modifiers — to keep the task well-defined; complex referring expressions are deferred to the MLLM-prompted variant.
This generalizes both PVS (where the prompt is a point, box, or mask and the output is a single instance, $N = 1$) and open-vocabulary detection (where the prompt is text and the output is boxes, not masks). It also generalizes single-image segmentation to video by carrying instance identities across frames, which is the SAM 2 contribution.
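To pin down the task contract, here is a minimal sketch of a PCS input/output interface consistent with the description above; the type names (ConceptPrompt, Masklet, segment_concept) are illustrative stand-ins, not the paper's API.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Exemplar:
    box_xyxy: tuple[float, float, float, float]  # bounding box on a single frame
    positive: bool                               # positive or negative exemplar
    frame_idx: int = 0                           # exemplar prompts are frame-local

@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None            # e.g. "yellow school bus"
    exemplars: list[Exemplar] = field(default_factory=list)

@dataclass
class Masklet:
    instance_id: int                             # stable identity across frames
    masks: dict[int, np.ndarray]                 # frame_idx -> binary HxW mask
    score: float                                 # presence-conditioned match score

def segment_concept(frames: list[np.ndarray], prompt: ConceptPrompt) -> list[Masklet]:
    """PCS signature: return every matching instance, identities kept across frames."""
    raise NotImplementedError  # stand-in for the SAM 3 model
```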
Architecture: detector + tracker over a shared backbone
The model has two computation paths sharing a Perception Encoder (PE) backbone:
- A detector that on each frame $t$ produces a set of image-level mask proposals matching the concept prompt.
- A tracker that takes the prior frame's masklets and propagates them to frame $t$ via the SAM 2 memory-bank mechanism.
- A matching step that fuses the two by IoU-based association between the detector's proposals and the propagated masklets.
Concretely, detector and tracker are trained separately. The detector learns identity-agnostic mask proposals; the tracker, with the PE backbone frozen after detector training, learns identity preservation across frames. The decoupling matters because the two objectives conflict: a detector wants to be invariant to instance identity within a class, while a tracker wants to discriminate identities. A sketch of the per-frame fusion step follows.
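A minimal sketch of that per-frame fusion, assuming masks are boolean numpy arrays and using greedy IoU association; the paper's actual matching function may differ in thresholds and tie-breaking.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean HxW masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def fuse_frame(propagated: dict[int, np.ndarray],
               detections: list[np.ndarray],
               iou_thresh: float = 0.5) -> dict[int, np.ndarray]:
    """Greedy association: each tracked masklet keeps its identity, snapping to the
    best-overlapping detector mask; unmatched detections start new masklets."""
    fused, used = {}, set()
    for obj_id, prop_mask in propagated.items():
        ious = [mask_iou(prop_mask, d) if j not in used else -1.0
                for j, d in enumerate(detections)]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh:
            fused[obj_id] = detections[best]   # prefer the detector's crisper mask
            used.add(best)
        else:
            fused[obj_id] = prop_mask          # keep the tracker's propagation
    next_id = max(propagated, default=-1) + 1
    for j, det in enumerate(detections):       # unmatched detections spawn new ids
        if j not in used:
            fused[next_id] = det
            next_id += 1
    return fused
```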
The presence token
A standard DETR-style detector ties recognition (is this query a match for the concept?) and localization (where is it?) into a single object-query prediction. For low-frequency concepts this overloads the query: it must both maintain global context (does this concept appear anywhere in the image?) and produce a localized box. The two demands pull in opposite directions because attention to the global frame hurts box-tightness, and attention to a local patch hurts global recognition.
SAM 3 introduces a learned global presence token whose only job is to predict $p(\text{concept present in the frame})$. Each per-query proposal $i$ then only solves the conditional localization problem $p(\text{match}_i \mid \text{present})$. The final score for query $i$ is the product $s_i = p(\text{present}) \cdot p(\text{match}_i \mid \text{present})$. The decomposition is mechanically clean: the presence token gets the full image's contextual cues, while the query keeps its localization-friendly attention pattern. Ablations in the paper show the presence token contributes a clear cgF1 gain on SA-Co/Gold, a sizable fraction of the gap to the strongest prior baseline.
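A toy sketch of the presence-conditioned scoring, assuming the presence token and each object query both emit a raw logit; the function name and shapes are illustrative.

```python
import torch

def presence_conditioned_scores(query_logits: torch.Tensor,
                                presence_logit: torch.Tensor) -> torch.Tensor:
    """Final per-query score = p(concept present in frame) * p(query matches | present).

    query_logits:   (num_queries,) raw match logits from the decoder queries
    presence_logit: scalar tensor, raw logit from the global presence token
    """
    p_present = torch.sigmoid(presence_logit)            # recognition, once per frame
    p_match_given_present = torch.sigmoid(query_logits)  # localization-only decision per query
    return p_present * p_match_given_present

# Example: weak global evidence suppresses every query at once.
scores = presence_conditioned_scores(torch.tensor([2.0, -1.0, 0.5]), torch.tensor(-3.0))
```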
Detector training losses
The detector inherits the DETR template — Hungarian matching plus a sum of classification, box, and mask losses — with two additions:
- DAC-DETR dual supervision (Hu et al., 2023): every decoder layer carries an auxiliary loss against the matched ground truth, so gradients propagate before the final layer.
- Align loss (Cai et al., 2024): aligns the classification logit with the predicted box quality, so high-confidence boxes are forced to also be tight.
The mask head is adapted from MaskFormer; a separate semantic-segmentation head predicts a binary "is this pixel part of the prompt concept" mask for every pixel. Together with the per-instance mask head this gives both an instance-level and a class-level segmentation output.
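A simplified sketch of the two structural ideas above (per-decoder-layer auxiliary supervision and an IoU-aligned classification target), assuming Hungarian matching has already paired queries with ground truth; the actual loss weights and formulations follow the cited papers, not this code.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def detector_loss(per_layer_outputs, matched_gt_boxes, matched_gt_masks):
    """Illustrative DETR-style loss: every decoder layer is supervised (aux/dual
    supervision), and the classification target is tied to box quality (align-style).

    per_layer_outputs: list of dicts per decoder layer with
        'cls_logits' (M,), 'boxes' (M,4 xyxy), 'mask_logits' (M,H,W)
        for the M matched queries.
    matched_gt_boxes:  (M,4) float, matched_gt_masks: (M,H,W) float 0/1.
    """
    total = 0.0
    for out in per_layer_outputs:                              # aux loss at every layer
        iou = box_iou(out["boxes"], matched_gt_boxes).diag()   # box quality per query
        cls_loss = F.binary_cross_entropy_with_logits(out["cls_logits"], iou.detach())
        box_loss = F.l1_loss(out["boxes"], matched_gt_boxes)
        mask_loss = F.binary_cross_entropy_with_logits(out["mask_logits"], matched_gt_masks)
        total = total + cls_loss + box_loss + mask_loss
    return total
```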
Image exemplars as prompt tokens
Image exemplars are pairs $(b, \ell)$: a bounding box $b$ and a binary positive/negative label $\ell$. The exemplar encoder takes the box position embedding, the label embedding, and ROI-pooled visual features from inside the box, runs them through a small transformer, and produces a token sequence. These tokens are concatenated with the text-prompt tokens, so the prompt can be text-only, exemplar-only, or hybrid. Critically, exemplar prompts are frame-local (added on a single frame) while text prompts are globally applied to every frame; this matches the way a human annotator wants to refine a video — fix one frame, generalize the rest.
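A minimal sketch of an exemplar encoder in this spirit: ROI-pooled features plus a box position embedding plus a label embedding, passed through a small transformer. Dimensions, pooling size, and layer counts are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ExemplarEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256, pool: int = 7):
        super().__init__()
        self.box_embed = nn.Linear(4, feat_dim)        # box position embedding
        self.label_embed = nn.Embedding(2, feat_dim)   # 0 = negative, 1 = positive
        self.roi_proj = nn.Linear(feat_dim * pool * pool, feat_dim)
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pool = pool

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor, labels: torch.Tensor):
        """feats: (1,C,H,W) backbone features; boxes: (K,4) float xyxy in feature coords;
        labels: (K,) long with 0/1. Returns (K, feat_dim) exemplar prompt tokens."""
        rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)  # (K,5): batch idx + box
        pooled = roi_align(feats, rois, output_size=self.pool)            # (K,C,pool,pool)
        tok = self.roi_proj(pooled.flatten(1)) + self.box_embed(boxes) + self.label_embed(labels)
        return self.encoder(tok.unsqueeze(0)).squeeze(0)                  # (K, feat_dim)
```

The returned tokens would be concatenated with the text-prompt tokens before the decoder's cross-attention, matching the text-only / exemplar-only / hybrid prompting described above.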
Video tracking via SAM 2 propagation
The tracker reuses the SAM 2 architecture: a memory encoder that produces a key-value cache from past frames, a memory attention layer that conditions the current-frame mask decoder on the cache, and an occlusion head that predicts whether the tracked object is visible. The mask decoder is the same two-way attention transformer as in SAM 2, predicting three candidate masks per object per frame and selecting the highest-confidence one.
Two new disambiguation strategies handle the merge of detector outputs and tracker predictions:
- Masklet detection score. A masklet whose IoU with detector outputs falls below a threshold for too many recent frames is suppressed. This kills tracker hallucinations that drift off the actual object.
- Periodic re-prompting. Every few frames, the tracker is re-anchored on high-confidence detector masks rather than its own propagated mask. The memory bank is refreshed with detector outputs to prevent error accumulation. Both strategies are sketched below.
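A toy sketch of the two rules, assuming access to the current frame's detector masks and a short per-masklet IoU history; the thresholds and window sizes are placeholder assumptions.

```python
import numpy as np
from collections import deque

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter, union = np.logical_and(a, b).sum(), np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

class MaskletState:
    def __init__(self, mask: np.ndarray, history: int = 8):
        self.mask = mask
        self.recent_ious = deque(maxlen=history)   # IoU with best detector mask, per frame

def update_masklet(state: MaskletState,
                   detector_masks: list[np.ndarray],
                   frame_idx: int,
                   iou_thresh: float = 0.3,
                   max_bad_frames: int = 5,
                   reprompt_every: int = 10) -> bool:
    """Returns False if the masklet should be suppressed (likely tracker hallucination)."""
    ious = [mask_iou(state.mask, d) for d in detector_masks] or [0.0]
    best = int(np.argmax(ious))
    state.recent_ious.append(ious[best])

    # Masklet detection score: too many recent frames without detector support -> suppress.
    bad = sum(1 for v in state.recent_ious if v < iou_thresh)
    if bad > max_bad_frames:
        return False

    # Periodic re-prompting: re-anchor on a supporting detector mask to stop drift.
    if frame_idx % reprompt_every == 0 and ious[best] >= iou_thresh:
        state.mask = detector_masks[best]
    return True
```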
Calibration-aware evaluation: cgF1
Standard detection metrics (AP, mask AP) integrate over the full precision-recall curve and reward models that produce many low-confidence predictions. For an interactive segmentation tool that is the wrong incentive: in deployment a model needs to know when not to predict. SAM 3 evaluates only predictions above a fixed confidence threshold, then computes:
- Localization as positive micro F1 (pmF1) restricted to image-NP pairs containing at least one ground-truth instance.
- Classification as the image-level Matthews Correlation Coefficient for the binary "is the concept present in this image" task.
The composite metric, classification-gated F1, is the product $\mathrm{cgF1} = \mathrm{pmF1} \times \mathrm{IL\_MCC}$. A model with poor classification gets its localization score zeroed out. This penalizes systems that hallucinate the concept on every image, which standard AP would not.
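A sketch of the metric as described above, with instance matching reduced to per-pair true/false positive counts after thresholding; this is illustrative, not the official evaluation code.

```python
import math

def cgf1(pairs: list[dict]) -> float:
    """pairs: one dict per (image, noun phrase) with
       'tp', 'fp', 'fn' instance counts (after confidence thresholding and matching)
       and 'gt_present', 'pred_present' booleans for image-level presence."""
    # Localization: positive micro F1 over pairs with at least one ground-truth instance.
    pos = [p for p in pairs if p["tp"] + p["fn"] > 0]
    tp = sum(p["tp"] for p in pos)
    fp = sum(p["fp"] for p in pos)
    fn = sum(p["fn"] for p in pos)
    pmf1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Classification: Matthews correlation on the binary "concept present in image" task.
    TP = sum(p["gt_present"] and p["pred_present"] for p in pairs)
    TN = sum(not p["gt_present"] and not p["pred_present"] for p in pairs)
    FP = sum(not p["gt_present"] and p["pred_present"] for p in pairs)
    FN = sum(p["gt_present"] and not p["pred_present"] for p in pairs)
    denom = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    il_mcc = (TP * TN - FP * FN) / denom if denom else 0.0  # in [-1, 1]

    # Classification-gated F1: poor recognition drives the product toward zero.
    return il_mcc * pmf1
```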
The data engine: AI verifiers double throughput
Training PCS at scale required 5.2M annotated images and 52.5K video clips. Pure human annotation is infeasible; SAM 3 builds a four-phase data engine where AI verifiers progressively replace humans:
- Phase 1. Human-only verification of mask-quality (MV) and exhaustivity (EV). Initial 4.3M image–NP pairs collected with SAM 2 as the proposal model.
- Phase 2. Llama 3.2, fine-tuned on the Phase 1 labels, becomes the AI verifier; it auto-labels routine cases while humans focus on hard ones (see the routing sketch after this list). Throughput roughly doubles. Adds 122M image–NP pairs.
- Phase 3. Domain expansion via a 22.4M-node Wikidata-derived ontology for concept coverage; 15 visual domains from medical to natural imagery. Adds 19.5M image–NP pairs.
- Phase 4. Video annotation using the trained SAM 3 image model as the proposal source. Humans focus on tracking-failure clips selected by the model's own uncertainty.
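The Phase 2 routing pattern can be sketched as a confidence-gated queue: the AI verifier handles cases it is confident about and defers the rest to humans. The verifier interface and threshold below are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Iterable

def route_verification(candidates: Iterable[dict],
                       ai_verifier: Callable[[dict], tuple[bool, float]],
                       human_verify: Callable[[dict], bool],
                       confidence_threshold: float = 0.9) -> list[dict]:
    """Each candidate is an (image, noun phrase, masks) proposal awaiting verification.
    The AI verifier returns (accept, confidence); low-confidence cases go to humans."""
    verified = []
    for cand in candidates:
        accept, conf = ai_verifier(cand)
        if conf < confidence_threshold:
            accept = human_verify(cand)        # reserve human effort for the hard cases
            cand["source"] = "human"
        else:
            cand["source"] = "ai_verifier"
        if accept:
            verified.append(cand)
    return verified
```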
The general pattern — AI annotators verifying or generating labels, humans focused on the model's failure modes — follows the same self-improvement template that RLHF and constitutional AI use for language data, applied to segmentation.
What SAM 3 does not do
SAM 3 does not handle complex referring expressions ("the third person from the left wearing red"). The vocabulary is restricted to atomic noun phrases; for compositional language the paper proposes coupling SAM 3 with an MLLM that decomposes the query, but that pipeline is outside the core model.
It does not segment audio events, document regions, or 3D point clouds. The output is per-frame 2D masks tied to RGB pixels; extending to 3D is mentioned as future work. Inference latency scales with the number of tracked objects; with many objects tracked concurrently, real-time performance breaks down on a single GPU. The model is also not a generalist: it is a segmentation specialist, not an open-ended VLM.
Connections to TheoremPath Topics
- Object detection and segmentation — the broader task family; SAM 3 sits alongside Mask R-CNN, DINO, OWLv2.
- Vision transformer lineage — the Perception Encoder backbone is in the ViT family.
- Florence and vision foundation models — peer-class generalist vision-language model.
- CLIP and OpenCLIP in practice — the contrastive image-text alignment that the text-encoding side of SAM 3 builds on.
- Self-supervised vision — the pretraining family the PE backbone comes from.
- Attention mechanism theory — the cross-attention that fuses prompt tokens with image features.
- Multimodal RAG — the deployment surface that benefits from open-vocabulary mask retrieval.
Why It Matters Now
The "find every X in this image" problem has been awkwardly split between two pipelines for a decade: detection (gives you boxes for the things in a fixed vocabulary) and segmentation (gives you masks once you point at the thing). Open-vocabulary detectors (OWL-ViT, OWLv2, GroundingDINO) close one half; promptable segmenters (SAM 1, SAM 2) close the other. SAM 3 is the first system to unify them under one model with one set of weights, and the unification matters: a frozen detector feeding a frozen segmenter has compounding errors, identity-preservation gaps, and inconsistent prompt semantics. A single model with a presence token and shared attention has none of those.
The roughly twofold cgF1 gain over the strongest prior baselines on SA-Co/Gold (and the sizable zero-shot AP gain on LVIS) is large for this benchmark family. Most recent vision-language detection results have been incremental, often achieved with training-data overlap against the evaluation set. The gap here is plausibly real because LVIS is held out from SAM 3's training data and is annotated in a different schema than SA-Co.
The data-engine story is also worth flagging. The standard refrain that "data is the bottleneck for vision foundation models" is true but unhelpful; SAM 3's recipe is more concrete. Use a proposal model (SAM 2). Use AI verifiers fine-tuned on the human-only labels from the prior phase. Reserve human effort for the edge cases the verifiers are uncertain about. Iterate. The SA-Co dataset has orders of magnitude more unique concepts than LVIS and a larger mask count, produced by roughly the same annotator headcount that produced LVIS, so the engine multiplied effective annotation throughput well beyond what human-only labeling delivers.
The deployment angle. Robotics, AR, and content-creation systems have wanted open-vocabulary segmentation for years; the prior options were either closed-vocabulary detectors retrained per task, or chained pipelines with brittle interfaces. A single SAM 3 checkpoint that takes a noun phrase or an exemplar and runs at near-real-time on an H200 is the first credible production-grade option for these workloads. Expect the next year of robotics and AR demos to lean on it heavily.
References
Canonical:
- Carion, N. et al. (2025). "SAM 3: Segment Anything with Concepts." arXiv preprint. arXiv:2511.16719. Project page: ai.meta.com/sam3.
Direct precursors (SAM family):
- Kirillov, A. et al. (2023). "Segment Anything." ICCV 2023. arXiv:2304.02643. The original SAM with point/box/mask prompts.
- Ravi, N. et al. (2024). "SAM 2: Segment Anything in Images and Videos." ICLR 2025. arXiv:2408.00714. Adds video segmentation with memory attention; tracker inherited by SAM 3.
Detection architecture:
- Carion, N. et al. (2020). "End-to-End Object Detection with Transformers." ECCV 2020. arXiv:2005.12872. DETR — the architectural template SAM 3 inherits.
- Zhu, X. et al. (2021). "Deformable DETR: Deformable Transformers for End-to-End Object Detection." ICLR 2021. arXiv:2010.04159. The bbox-delta refinement SAM 3 uses.
- Hu, X. et al. (2023). "DAC-DETR: Divide the Attention Layers and Conquer." NeurIPS 2023. arXiv:2304.07849. Source of the dual-supervision auxiliary loss.
Open-vocabulary detection (the prior SOTA family):
- Minderer, M. et al. (2023). "Scaling Open-Vocabulary Object Detection (OWLv2)." NeurIPS 2023. arXiv:2306.09683.
- Liu, S. et al. (2023). "Grounding DINO: Marrying DINO with Grounded Pre-Training." ECCV 2024. arXiv:2303.05499.
- Cheng, B. et al. (2021). "Per-Pixel Classification is Not All You Need for Semantic Segmentation (MaskFormer)." NeurIPS 2021. arXiv:2107.06278. Source of SAM 3's mask head.
Backbone:
- Bolya, D. et al. (2025). "Perception Encoder: The best visual embeddings are not at the output of the network." arXiv:2504.13181. The PE backbone shared between detector and tracker.
Standard textbook:
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 18 — vision; Chapter 19 — generative and segmentation models.
Last reviewed: May 6, 2026