
Beyond LLMs

Florence and Vision Foundation Models

Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.


Why This Matters

For decades, computer vision was a collection of separate problems solved by separate models: one model for classification, another for detection, another for segmentation, another for captioning. Each required its own architecture, its own training data format, and its own evaluation protocol.

Vision foundation models collapse this into a single pretrained backbone that handles all vision tasks. Train once on massive data, adapt to any task with minimal fine-tuning or just a text prompt. This parallels how large language models unified NLP tasks, and it represents the same kind of consolidation happening in vision.

Mental Model

Think of a vision transformer pretrained on billions of image-text pairs as having learned a general-purpose "visual language." Given an image, it produces rich representations that encode objects, their locations, their relationships, and their semantic meaning. Different tasks are just different questions asked of these representations: "What is in this image?" (classification), "Where are the objects?" (detection), "Describe this image" (captioning).

Florence-2 takes this further by formulating every vision task as a sequence-to-sequence problem: image in, text out, where the text format encodes the task-specific answer.

Formal Setup and Notation

Definition

Vision Foundation Model

A vision foundation model is a pretrained model $f_\theta: \mathcal{I} \to \mathcal{R}$ that maps images to representations in $\mathcal{R}$ useful across multiple downstream tasks. Formally, for a set of vision tasks $\{\tau_1, \ldots, \tau_K\}$ with task-specific heads $\{h_1, \ldots, h_K\}$, the foundation model minimizes:

$$\sum_{k=1}^{K} \lambda_k \mathcal{L}_k(h_k(f_\theta(I)), y^{(k)})$$

where $\mathcal{L}_k$ is the loss for task $\tau_k$ and $\lambda_k$ is a task weight. The key property: $f_\theta$ is shared across all tasks and pretrained on data far larger than any single-task dataset.
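The weighted objective above can be sketched in a few lines. The scalar "representation," the heads, and the absolute-error losses here are toy stand-ins chosen for clarity, not Florence's actual components:

```python
def multitask_loss(shared_repr, heads, losses, weights, targets):
    """Weighted multi-task objective: sum_k lambda_k * L_k(h_k(f(I)), y^(k)).

    shared_repr : the shared backbone output f_theta(I)
    heads[k]    : task head h_k, mapping shared_repr to a prediction
    losses[k]   : task loss L_k, scoring a prediction against targets[k]
    weights[k]  : task weight lambda_k
    """
    return sum(
        weights[k] * losses[k](heads[k](shared_repr), targets[k])
        for k in heads
    )

# Toy example: one scalar "representation", two tasks with absolute-error losses.
heads = {"cls": lambda r: 2 * r, "reg": lambda r: r + 1}
losses = {"cls": lambda pred, y: abs(pred - y), "reg": lambda pred, y: abs(pred - y)}
weights = {"cls": 1.0, "reg": 0.5}
targets = {"cls": 5.0, "reg": 4.0}

print(multitask_loss(3.0, heads, losses, weights, targets))  # 1.0*|6-5| + 0.5*|4-4| = 1.0
```

The point of the structure is that gradients from every task's loss flow into the single shared `shared_repr` producer, while each head stays small and task-specific.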

Definition

Sequence-to-Sequence Vision (Florence-2)

Florence-2 formulates all vision tasks as text generation. Given image $I$ and task prompt $p$ (e.g., "Detect all objects"), the model generates output text $s$ that encodes the answer:

$$\hat{s} = \arg\max_{s} \prod_{t=1}^{T} p_\theta(s_t \mid s_{<t}, \text{Enc}(I), p)$$

Task-specific output formats:

  • Classification: "cat" (a single label)
  • Detection: "cat [x1,y1,x2,y2]; dog [x1,y1,x2,y2]" (labels with coordinates)
  • Captioning: "A cat sitting on a red couch" (natural language)
  • Segmentation: polygon coordinate sequences encoding region boundaries

All tasks share the same encoder-decoder architecture with no task-specific heads.
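Because the answer is just text, consuming it is a parsing problem. A small parser for the detection-style format above (the string serialization here is illustrative; the real model emits dedicated location tokens that post-processing code decodes):

```python
import re

def parse_detection(text):
    """Parse a detection string like 'cat [120,340,450,520]; dog [10,20,30,40]'
    into a list of (label, (x1, y1, x2, y2)) pairs."""
    results = []
    for chunk in text.split(";"):
        m = re.match(r"\s*(.+?)\s*\[(\d+),(\d+),(\d+),(\d+)\]\s*$", chunk)
        if m:
            label = m.group(1)
            box = tuple(int(g) for g in m.groups()[1:])
            results.append((label, box))
    return results

print(parse_detection("cat [120,340,450,520]; dog [10,20,30,40]"))
# [('cat', (120, 340, 450, 520)), ('dog', (10, 20, 30, 40))]
```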

Core Definitions

Florence (Microsoft, 2021) is a vision foundation model pretrained on 900M image-text pairs using a contrastive objective similar to CLIP. It uses a hierarchical ViT (CoSwin Transformer) as the image encoder. Florence demonstrated strong transfer to classification, retrieval, detection, and segmentation using the same pretrained backbone with task-specific adapters.

Florence-2 (2024) reformulates all vision tasks as sequence-to-sequence generation. The architecture is a vision encoder (DaViT) plus a text decoder (transformer). The training data consists of 5.4 billion annotations across 126 million images, generated by a combination of specialized models and human annotation. Florence-2 comes in two sizes: 0.23B and 0.77B parameters.

Multimodal foundation models (GPT-4V, Gemini Vision, Claude Vision) extend this concept by integrating the vision encoder into a large language model. The LLM serves as both the task decoder and the reasoning engine. These models can handle open-ended visual questions that Florence-2 cannot, but they are much larger and more expensive to run.

Main Theorems

Proposition

Unified Task Formulation via Sequence Generation

Statement

Let $\mathcal{T} = \{\tau_1, \ldots, \tau_K\}$ be a set of vision tasks. If each task $\tau_k$ has an output that can be serialized as a text sequence $s^{(k)} \in \mathcal{V}^*$ (using coordinate tokens for spatial outputs), then a single encoder-decoder model can be trained on all tasks jointly:

$$\mathcal{L}(\theta) = -\sum_{k=1}^{K} \sum_{i \in D_k} \sum_{t=1}^{T_i} \log p_\theta(s_t^{(i,k)} \mid s_{<t}^{(i,k)}, \text{Enc}(I^{(i)}), p_k)$$

where $D_k$ is the dataset for task $k$ and $p_k$ is the task-specific text prompt. The model shares all parameters across tasks; the prompt $p_k$ alone determines the output format.

Florence-2 empirically shows that this joint training improves performance on each individual task compared to training separate models on the same data, suggesting positive transfer across vision tasks.
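The joint loss is just a token-level negative log-likelihood summed over tasks. A minimal sketch, where per-token probabilities are supplied as toy numbers standing in for real decoder outputs:

```python
import math

def joint_seq2seq_nll(batches):
    """Joint training loss: -sum over tasks k, examples i, and positions t
    of log p_theta(s_t | s_<t, Enc(I), p_k).

    `batches` maps each task prompt p_k to per-example lists of the model's
    probability for the correct next token (toy stand-ins here)."""
    return -sum(
        math.log(p)
        for examples in batches.values()
        for probs in examples
        for p in probs
    )

# One detection example (3 output tokens) and one captioning example
# (2 output tokens), all scored by the same shared parameters theta:
batches = {
    "Detect all objects": [[0.9, 0.8, 0.7]],
    "Caption this image": [[0.95, 0.6]],
}
loss = joint_seq2seq_nll(batches)
```

Nothing in the loss distinguishes tasks except which prompt conditioned the decoder, which is exactly why adding a task requires only new data, not new parameters.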

Intuition

Detection requires understanding what objects look like and where they are. Captioning requires understanding what objects look like and how they relate. Segmentation requires precise spatial understanding. By training on all tasks simultaneously, the shared encoder learns representations that capture appearance, location, and relationships jointly. Each task provides a different supervisory signal that enriches the shared representation.

Proof Sketch

No formal proof of positive transfer. The empirical evidence from Xiao et al. (2024) shows that Florence-2 trained on all tasks simultaneously outperforms single-task baselines by 2-5% across benchmarks. The hypothesized mechanism is multi-task regularization: detection annotations provide localization signal that helps segmentation, and captioning annotations provide semantic signal that helps classification.

Why It Matters

This formulation eliminates task-specific architecture design. Adding a new vision task requires only defining its text output format and collecting training data. No new model heads, no architecture changes. This is the same simplification that sequence-to-sequence brought to NLP (translation, summarization, and QA all became text generation problems).

Failure Mode

Serializing spatial outputs as text token sequences introduces quantization error. Bounding box coordinates are discretized to a fixed vocabulary (e.g., 1000 location bins), limiting spatial precision. For tasks requiring sub-pixel accuracy (medical imaging, satellite imagery), this quantization can be a bottleneck. The approach also scales poorly with output complexity: generating polygon coordinates for dense segmentation produces very long output sequences.

Florence-2 Architecture Details

The Florence-2 encoder (DaViT) processes the input image at multiple scales, producing a feature pyramid. These features are flattened into a sequence of visual tokens. The decoder is a standard transformer that generates output tokens with cross-attention to the visual token sequence.

Coordinate representation: spatial coordinates are encoded as discrete tokens from a vocabulary of 1000 bins per axis. A bounding box $(x_1, y_1, x_2, y_2)$ is encoded as four tokens. This means the model's spatial resolution is limited to $1/1000$ of the image dimension.
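The binning scheme can be sketched as a quantize/dequantize round trip. Decoding to the bin center is an assumption for illustration; the exact decoding convention may differ:

```python
def quantize(coord, size, bins=1000):
    """Map a continuous pixel coordinate in [0, size) to one of `bins`
    discrete location tokens."""
    b = int(coord / size * bins)
    return min(b, bins - 1)

def dequantize(bin_idx, size, bins=1000):
    """Recover the bin-center pixel coordinate from a location token
    (bin-center decoding is an assumption)."""
    return (bin_idx + 0.5) / bins * size

# On a 1024-pixel-wide image each bin spans 1024/1000 = 1.024 px, so the
# worst-case round-trip error is half a bin width.
x = 517.3
x_hat = dequantize(quantize(x, 1024), 1024)
print(abs(x - x_hat))  # well under one bin width (1.024 px)
```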

Florence-2 is pretrained in two phases:

  1. Image-level pretraining: contrastive learning on image-text pairs (similar to CLIP) plus image captioning
  2. Region-level pretraining: detection, region captioning, and referring expression grounding on 5.4B region annotations
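Phase 1's contrastive objective is the standard symmetric InfoNCE over a batch of image-text pairs. A minimal numpy sketch of that loss, not Florence's exact implementation:

```python
import numpy as np

def _log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs (the diagonal of the
    similarity matrix) are positives; all other pairs in the batch are
    negatives. Averages the image-to-text and text-to-image directions."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # scaled cosine similarities
    diag = np.arange(len(logits))
    loss_i2t = -_log_softmax(logits)[diag, diag].mean()
    loss_t2i = -_log_softmax(logits.T)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Perfectly aligned orthonormal embeddings give near-zero loss;
# shuffling the text side makes the loss large.
emb = np.eye(4)
print(contrastive_loss(emb, emb))
```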

From Florence to GPT-4V

Vision foundation models exist on a spectrum of capability and cost:

  • Florence-2 (0.23B-0.77B params): Fast, cheap, strong on standard vision tasks. Cannot do open-ended reasoning.
  • LLaVA, InternVL (7B-34B params): Open-source vision-language models. Can reason about images and follow complex instructions.
  • GPT-4V, Gemini, Claude (100B+ params): Full multimodal reasoning. Can handle arbitrary visual questions, multi-step reasoning, and tool use. Expensive per image.

The trade-off is clear: larger models handle more open-ended tasks but cost orders of magnitude more per inference. For structured extraction (detect objects, read text, segment regions), Florence-2 is sufficient and runs on a single consumer GPU. For "what is the architectural style of this building and how does it relate to the surrounding neighborhood," you need a multimodal LLM.

Canonical Examples

Example

Florence-2 multi-task inference

Given a single image of a street scene with the prompt "Detect all objects": the model outputs "car [120,340,450,520]; person [500,200,580,480]; traffic light [250,50,290,120]". With the prompt "Caption this image": the model outputs "A person crossing a street near a parked car with a traffic light overhead." Same encoder, same decoder, same weights. Only the prompt changes.

Common Confusions

Watch Out

Florence-2 is not a multimodal LLM

Florence-2 is a vision model that outputs text-formatted answers to predefined task types. It cannot hold a conversation, reason about abstract concepts, or follow arbitrary instructions. It is a vision specialist, not a general-purpose assistant. The text decoder is small (a few hundred million parameters) and optimized for structured outputs, not open-ended language generation.

Watch Out

More pretraining data does not always help

Florence-2 uses 5.4 billion annotations, but many are machine-generated by specialist models. These pseudo-labels contain errors that the foundation model can memorize. The quality of pretraining data matters at least as much as quantity. Noisy labels on rare object categories can actually hurt performance on those categories compared to smaller, cleaner datasets.

Watch Out

Zero-shot does not mean no training data is needed

Florence-2 can perform tasks it was trained on without task-specific fine-tuning (zero-shot with respect to downstream data). But it was trained on massive annotated data for those exact task types. It cannot perform genuinely novel tasks it has never seen during pretraining. The "zero-shot" label refers to the downstream dataset, not the pretraining data.

Exercises

Exercise (Core)

Problem

Florence-2 uses 1000 coordinate bins per axis. What is the maximum localization error (in pixels) for a bounding box prediction on a $1024 \times 768$ image?

Exercise (Advanced)

Problem

Explain why multi-task pretraining (detection + captioning + segmentation jointly) might improve detection accuracy compared to single-task detection pretraining, even when evaluated only on detection.

References

Canonical:

  • Yuan et al., Florence: A New Foundation Model for Computer Vision (2021), Microsoft Research, arXiv:2111.11432
  • Xiao et al., Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (2024), CVPR

Current:

  • Liu et al., Visual Instruction Tuning (LLaVA) (2023), NeurIPS

  • OpenAI, GPT-4V Technical Report (2023), for comparison with LLM-based multimodal models

  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17

Last reviewed: April 2026
