Beyond LLMs
Florence and Vision Foundation Models
Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.
Why This Matters
For decades, computer vision was a collection of separate problems solved by separate models: one model for classification, another for detection, another for segmentation, another for captioning. Each required its own architecture, its own training data format, and its own evaluation protocol.
Vision foundation models collapse this into a single pretrained backbone that handles all vision tasks. Train once on massive data, adapt to any task with minimal fine-tuning or just a text prompt. This parallels how large language models unified NLP tasks, and it represents the same kind of consolidation happening in vision.
Mental Model
Think of a vision transformer pretrained on billions of image-text pairs as having learned a general-purpose "visual language." Given an image, it produces rich representations that encode objects, their locations, their relationships, and their semantic meaning. Different tasks are just different questions asked of these representations: "What is in this image?" (classification), "Where are the objects?" (detection), "Describe this image" (captioning).
Florence-2 takes this further by formulating every vision task as a sequence-to-sequence problem: image in, text out, where the text format encodes the task-specific answer.
Formal Setup and Notation
Vision Foundation Model
A vision foundation model is a pretrained model $f_\theta$ that maps images to representations useful across multiple downstream tasks. Formally, for a set of vision tasks $\{T_1, \dots, T_K\}$ with task-specific heads $\{h_k\}$, the foundation model minimizes:
$$\min_{\theta, \{h_k\}} \sum_{k=1}^{K} \lambda_k \, \mathcal{L}_k\big(h_k(f_\theta(x)), y\big)$$
where $\mathcal{L}_k$ is the loss for task $k$ and $\lambda_k$ is a task weight. The key property: $f_\theta$ is shared across all tasks and pretrained on data far larger than any single task dataset.
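A toy sketch of this shared-backbone objective: one backbone pass feeds several task heads, and the total loss is a weighted sum of per-task losses. The function names, losses, and numbers below are illustrative stand-ins, not Florence's actual implementation.

```python
# Toy multi-task objective: a single shared "backbone" feature feeds
# multiple task losses, combined by per-task weights (the lambda_k above).

def backbone(x):
    """Stand-in for f_theta: maps an 'image' to a shared feature vector."""
    return [v * 0.5 for v in x]

def classification_loss(features, label):
    # Pretend loss: distance between a pooled feature and a label score.
    return abs(sum(features) - label)

def detection_loss(features, box):
    # Pretend loss: sum of per-coordinate errors.
    return sum(abs(f - b) for f, b in zip(features, box))

def multitask_loss(x, targets, weights):
    """Weighted sum of task losses over one shared backbone pass."""
    features = backbone(x)
    losses = {
        "classification": classification_loss(features, targets["classification"]),
        "detection": detection_loss(features, targets["detection"]),
    }
    return sum(weights[k] * losses[k] for k in losses)

total = multitask_loss(
    x=[2.0, 4.0],
    targets={"classification": 2.0, "detection": [0.0, 1.0]},
    weights={"classification": 1.0, "detection": 0.5},
)
# total == 1.0 * 1.0 + 0.5 * 2.0 == 2.0
```

The point of the sketch is only the structure: every task loss consumes the same `backbone` output, so gradients from all tasks shape one shared representation.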
Sequence-to-Sequence Vision (Florence-2)
Florence-2 formulates all vision tasks as text generation. Given image $I$ and task prompt $p$ (e.g., "Detect all objects"), the model generates output text $y$ that encodes the answer:
$$y = \arg\max_{y'} P_\theta(y' \mid I, p)$$
Task-specific output formats:
- Classification: "cat" (a single label)
- Detection: "cat [x1,y1,x2,y2]; dog [x1,y1,x2,y2]" (labels with coordinates)
- Captioning: "A cat sitting on a red couch" (natural language)
- Segmentation: polygon coordinate sequences encoding region boundaries
All tasks share the same encoder-decoder architecture with no task-specific heads.
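The text-serialization idea can be sketched for detection outputs. The string format below mirrors the schematic above ("label [x1,y1,x2,y2]; ..."), not Florence-2's literal location-token vocabulary.

```python
# Serialize structured detections into a single text sequence, so that
# detection becomes an ordinary text-generation target.

def serialize_detections(detections):
    """detections: list of (label, (x1, y1, x2, y2)) tuples."""
    parts = []
    for label, (x1, y1, x2, y2) in detections:
        parts.append(f"{label} [{x1},{y1},{x2},{y2}]")
    return "; ".join(parts)

text = serialize_detections([
    ("cat", (120, 340, 450, 520)),
    ("dog", (500, 200, 580, 480)),
])
# text == "cat [120,340,450,520]; dog [500,200,580,480]"
```

Once every task's answer is a string like this, a single decoder trained with cross-entropy covers all of them.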
Core Definitions
Florence (Microsoft, 2021) is a vision foundation model pretrained on 900M image-text pairs using a contrastive objective similar to CLIP. It uses a hierarchical ViT (CoSwin Transformer) as the image encoder. Florence demonstrated strong transfer to classification, retrieval, detection, and segmentation using the same pretrained backbone with task-specific adapters.
Florence-2 (2024) reformulates all vision tasks as sequence-to-sequence generation. The architecture is a vision encoder (DaViT) plus a text decoder (transformer). The training data consists of 5.4 billion annotations across 126 million images, generated by a combination of specialized models and human annotation. Florence-2 comes in two sizes: 0.23B and 0.77B parameters.
Multimodal foundation models (GPT-4V, Gemini Vision, Claude Vision) extend this concept by integrating the vision encoder into a large language model. The LLM serves as both the task decoder and the reasoning engine. These models can handle open-ended visual questions that Florence-2 cannot, but they are much larger and more expensive to run.
Main Theorems
Unified Task Formulation via Sequence Generation
Statement
Let $\mathcal{T} = \{T_1, \dots, T_K\}$ be a set of vision tasks. If each task $T_k$ has an output that can be serialized as a text sequence $y$ (using coordinate tokens for spatial outputs), then a single encoder-decoder model $P_\theta$ can be trained on all tasks jointly:
$$\min_\theta \sum_{k=1}^{K} \mathbb{E}_{(I, y) \sim D_k}\big[-\log P_\theta(y \mid I, p_k)\big]$$
where $D_k$ is the dataset for task $T_k$ and $p_k$ is the task-specific text prompt. The model shares all parameters $\theta$ across tasks. The prompt alone determines the output format.
Florence-2 empirically shows that this joint training improves performance on each individual task compared to training separate models on the same data, suggesting positive transfer across vision tasks.
Intuition
Detection requires understanding what objects look like and where they are. Captioning requires understanding what objects look like and how they relate. Segmentation requires precise spatial understanding. By training on all tasks simultaneously, the shared encoder learns representations that capture appearance, location, and relationships jointly. Each task provides a different supervisory signal that enriches the shared representation.
Proof Sketch
No formal proof of positive transfer. The empirical evidence from Xiao et al. (2024) shows that Florence-2 trained on all tasks simultaneously outperforms single-task baselines by 2-5% across benchmarks. The hypothesized mechanism is multi-task regularization: detection annotations provide localization signal that helps segmentation, and captioning annotations provide semantic signal that helps classification.
Why It Matters
This formulation eliminates task-specific architecture design. Adding a new vision task requires only defining its text output format and collecting training data. No new model heads, no architecture changes. This is the same simplification that sequence-to-sequence brought to NLP (translation, summarization, and QA all became text generation problems).
Failure Mode
Serializing spatial outputs as text token sequences introduces quantization error. Bounding box coordinates are discretized to a fixed vocabulary (e.g., 1000 location bins), limiting spatial precision. For tasks requiring sub-pixel accuracy (medical imaging, satellite imagery), this quantization can be a bottleneck. The approach also scales poorly with output complexity: generating polygon coordinates for dense segmentation produces very long output sequences.
Florence-2 Architecture Details
The Florence-2 encoder (DaViT) processes the input image at multiple scales, producing a feature pyramid. These features are flattened into a sequence of visual tokens. The decoder is a standard transformer that generates output tokens with cross-attention to the visual token sequence.
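A shape-level sketch of the flattening step: each pyramid level contributes one token per spatial position, and the decoder cross-attends to the concatenated sequence. The level sizes below are illustrative, not DaViT's actual configuration.

```python
# Count the visual tokens produced by flattening a feature pyramid:
# each (H, W) feature map contributes H * W tokens.

def flatten_pyramid(levels):
    """levels: list of (height, width) feature-map sizes.
    Returns the total visual-token sequence length."""
    return sum(h * w for h, w in levels)

# e.g. three pyramid levels at strides 8, 16, 32 of a 256x256 input
levels = [(32, 32), (16, 16), (8, 8)]
num_tokens = flatten_pyramid(levels)  # 1024 + 256 + 64 == 1344
```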
Coordinate representation: spatial coordinates are encoded as discrete tokens from a vocabulary of 1000 bins per axis. A bounding box is encoded as four location tokens $(x_1, y_1, x_2, y_2)$. This means the model's spatial resolution is limited to $1/1000$ of the image dimension.
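The binning scheme and its worst-case error can be sketched as follows. The rounding convention and the 4000-pixel example are illustrative; Florence-2's exact tokenization may differ in detail.

```python
# Quantize pixel coordinates into 1000 discrete location bins per axis.
# Recovering a coordinate from its bin center loses at most half a bin
# width of precision.

NUM_BINS = 1000

def quantize(coord, size):
    """Map a pixel coordinate in [0, size) to a bin index in [0, 999]."""
    return min(coord * NUM_BINS // size, NUM_BINS - 1)

def dequantize(bin_index, size):
    """Map a bin index back to its bin-center pixel coordinate."""
    return (bin_index + 0.5) * size / NUM_BINS

def max_quantization_error(size):
    """Worst-case round-trip error: half a bin width, in pixels."""
    return size / NUM_BINS / 2

# On a 4000-pixel-wide image each bin spans 4 px, so a recovered
# coordinate can be off by up to 2 px.
assert quantize(3999, 4000) == 999
assert dequantize(999, 4000) == 3998.0
assert max_quantization_error(4000) == 2.0
```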
Florence-2 is pretrained in two phases:
- Image-level pretraining: contrastive learning on image-text pairs (similar to CLIP) plus image captioning
- Region-level pretraining: detection, region captioning, and referring expression grounding on 5.4B region annotations
From Florence to GPT-4V
Vision foundation models exist on a spectrum of capability and cost:
- Florence-2 (0.23B-0.77B params): Fast, cheap, strong on standard vision tasks. Cannot do open-ended reasoning.
- LLaVA, InternVL (7B-34B params): Open-source vision-language models. Can reason about images and follow complex instructions.
- GPT-4V, Gemini, Claude (100B+ params): Full multimodal reasoning. Can handle arbitrary visual questions, multi-step reasoning, and tool use. Expensive per image.
The trade-off is clear: larger models handle more open-ended tasks but cost orders of magnitude more per inference. For structured extraction (detect objects, read text, segment regions), Florence-2 is sufficient and runs on a single consumer GPU. For "what is the architectural style of this building and how does it relate to the surrounding neighborhood," you need a multimodal LLM.
Canonical Examples
Florence-2 multi-task inference
Given a single image of a street scene with the prompt "Detect all objects": the model outputs "car [120,340,450,520]; person [500,200,580,480]; traffic light [250,50,290,120]". With the prompt "Caption this image": the model outputs "A person crossing a street near a parked car with a traffic light overhead." Same encoder, same decoder, same weights. Only the prompt changes.
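A sketch of parsing that detection string back into structured records. The regex assumes the schematic "label [x1,y1,x2,y2]; ..." format shown in this section, not Florence-2's literal token stream.

```python
# Parse a serialized detection string into (label, box) pairs.
import re

def parse_detections(text):
    """Parse 'label [x1,y1,x2,y2]; ...' into (label, [x1, y1, x2, y2]) pairs."""
    pattern = r"([\w ]+?)\s*\[(\d+),(\d+),(\d+),(\d+)\]"
    return [
        (label.strip(), [int(x1), int(y1), int(x2), int(y2)])
        for label, x1, y1, x2, y2 in re.findall(pattern, text)
    ]

output = "car [120,340,450,520]; person [500,200,580,480]; traffic light [250,50,290,120]"
detections = parse_detections(output)
# detections[0] == ("car", [120, 340, 450, 520])
```

In practice a thin parsing layer like this sits between the model's text output and any downstream consumer that expects structured boxes.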
Common Confusions
Florence-2 is not a multimodal LLM
Florence-2 is a vision model that outputs text-formatted answers to predefined task types. It cannot hold a conversation, reason about abstract concepts, or follow arbitrary instructions. It is a vision specialist, not a general-purpose assistant. The text decoder is small (a few hundred million parameters) and optimized for structured outputs, not open-ended language generation.
More pretraining data does not always help
Florence-2 uses 5.4 billion annotations, but many are machine-generated by specialist models. These pseudo-labels contain errors that the foundation model can memorize. The quality of pretraining data matters at least as much as quantity. Noisy labels on rare object categories can actually hurt performance on those categories compared to smaller, cleaner datasets.
Zero-shot does not mean no training data is needed
Florence-2 can perform tasks it was trained on without task-specific fine-tuning (zero-shot with respect to downstream data). But it was trained on massive annotated data for those exact task types. It cannot perform genuinely novel tasks it has never seen during pretraining. The "zero-shot" label refers to the downstream dataset, not the pretraining data.
Exercises
Problem
Florence-2 uses 1000 coordinate bins per axis. What is the maximum localization error (in pixels) for a bounding box prediction on an image of width $W$ pixels? Express your answer in terms of $W$.
Problem
Explain why multi-task pretraining (detection + captioning + segmentation jointly) might improve detection accuracy compared to single-task detection pretraining, even when evaluated only on detection.
References
Canonical:
- Yuan et al., Florence: A New Foundation Model for Computer Vision (2021), Microsoft Research, arXiv:2111.11432
- Xiao et al., Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (2024), CVPR
Current:
- Liu et al., Visual Instruction Tuning (LLaVA) (2023), NeurIPS
- OpenAI, GPT-4V Technical Report (2023), for comparison with LLM-based multimodal models
- Zhang et al., Dive into Deep Learning (2023), Chapters 14-17
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Vision Transformer Lineage (Layer 4)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Convolutional Neural Networks (Layer 3)
- Vectors, Matrices, and Linear Maps (Layer 0A)
- Self-Supervised Vision (Layer 4)