

CLIP and OpenCLIP in Practice

CLIP learns a shared embedding space for images and text via contrastive learning on 400M pairs. A practical guide to zero-shot classification, image search, OpenCLIP variants, embedding geometry, and known limitations.


Why This Matters

Before CLIP, using a vision model on a new task required collecting labeled images for that specific task and fine-tuning. CLIP changed this: by learning a joint embedding space for images and text, it enables zero-shot image classification. Describe any category in words, and CLIP can classify images into it without ever seeing a labeled example of that category.

This is the same shift that happened in NLP with large language models: from task-specific supervised models to general-purpose pretrained models that adapt via natural language prompts. CLIP brought this capability to vision.

Mental Model

CLIP trains two encoders in parallel: one for images, one for text. Both encoders map their inputs to the same vector space. An image of a dog and the text "a photo of a dog" end up near each other. An image of a dog and the text "a photo of a car" end up far apart.

At inference, classification works by comparing: encode the image, encode each candidate label as text, pick the label whose text embedding is closest to the image embedding. No fine-tuning, no labeled data for the target task.

Formal Setup and Notation

Definition

CLIP Dual Encoder

CLIP consists of two encoders:

  • Image encoder $f_I: \mathcal{I} \to \mathbb{R}^d$ (either a ResNet or a Vision Transformer)
  • Text encoder $f_T: \mathcal{V}^* \to \mathbb{R}^d$ (a Transformer)

Both produce $\ell_2$-normalized embeddings of the same dimension $d$. The similarity between image $I$ and text $T$ is the cosine similarity:

$$\text{sim}(I, T) = f_I(I)^\top f_T(T)$$

since both vectors are unit-normalized.

Definition

Zero-Shot Classification with CLIP

Given an image $I$ and $K$ class names $\{c_1, \ldots, c_K\}$, construct a text prompt $t_k$ = "a photo of a [class name]" for each class. The predicted class is:

$$\hat{y} = \arg\max_{k \in \{1,\ldots,K\}} f_I(I)^\top f_T(t_k)$$

No task-specific training data is needed. The "training" is implicit in the choice of class names and prompt template.
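The argmax rule above can be sketched with simulated embeddings. The stand-in vectors below play the role of $f_T(t_k)$ and $f_I(I)$; a real pipeline would obtain them from CLIP's encoders, but the decision rule is identical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 512, 3  # embedding dimension and number of classes (illustrative)

def normalize(x):
    # Project rows onto the unit sphere, as CLIP does before comparison.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the K prompt embeddings f_T(t_k), one row per class.
text_emb = normalize(rng.standard_normal((K, d)))

# Stand-in for f_I(I): an image embedding near class 1's prompt.
image_emb = normalize(text_emb[1] + 0.02 * rng.standard_normal(d))

logits = text_emb @ image_emb  # cosine similarities (all vectors unit-norm)
pred = int(np.argmax(logits))  # \hat{y} = argmax_k f_I(I)^T f_T(t_k)
print(pred)  # → 1, the class whose prompt the image embedding was built near
```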

Core Definitions

The CLIP training data consists of 400 million (image, text) pairs scraped from the internet. Each pair is an image and its associated caption or alt text. This is noisy, diverse, and massive. The scale and diversity of the data are what give CLIP its generalization ability.

OpenCLIP is an open-source reproduction of CLIP by LAION. It provides models trained on different datasets (LAION-400M, LAION-2B, DataComp) with various architectures (ViT-B/32, ViT-L/14, ViT-H/14, ViT-G/14) and training recipes. OpenCLIP allows researchers to study how data composition and model scale affect CLIP-style models.

Prompt engineering for CLIP means choosing the text template that best activates the relevant features. "a photo of a dog" works better than just "dog" because CLIP was trained on natural image-caption pairs, where bare nouns are rare. Ensembling multiple prompts ("a photo of a dog", "a picture of a dog", "a dog in a photograph") and averaging their text embeddings improves accuracy by 1-3% on standard benchmarks.
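Prompt ensembling reduces to averaging per-template text embeddings and re-normalizing, so the class vector lies on the unit sphere like any single-prompt embedding. The sketch below uses random stand-ins for the encoder outputs; a real pipeline would encode each template string with CLIP's text encoder.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical templates for the class "dog"; real per-template embeddings
# would come from the text encoder, one per prompt string.
templates = ["a photo of a dog", "a picture of a dog", "a dog in a photograph"]
per_template = normalize(rng.standard_normal((len(templates), d)))

# Ensemble: average the unit embeddings, then re-normalize so the
# class vector is directly comparable to image embeddings.
class_emb = normalize(per_template.mean(axis=0))
print(np.linalg.norm(class_emb))  # → 1.0
```

The re-normalization step matters: without it, classes with more mutually consistent templates would get systematically longer (and thus higher-scoring) vectors.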

Main Theorems

Proposition

CLIP Symmetric Contrastive Loss

Statement

Given a batch of $N$ (image, text) pairs, let $\mathbf{v}_i = f_I(I_i)$ and $\mathbf{u}_i = f_T(T_i)$ be the normalized embeddings. The CLIP loss is a symmetric cross-entropy over the $N \times N$ similarity matrix:

$$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N}\left[-\log \frac{e^{\tau \cdot \mathbf{v}_i^\top \mathbf{u}_i}}{\sum_{j=1}^{N} e^{\tau \cdot \mathbf{v}_i^\top \mathbf{u}_j}} - \log \frac{e^{\tau \cdot \mathbf{u}_i^\top \mathbf{v}_i}}{\sum_{j=1}^{N} e^{\tau \cdot \mathbf{u}_i^\top \mathbf{v}_j}}\right]$$

where $\tau$ is a learned logit scale (an inverse temperature), initialized to $1/0.07 \approx 14.3$; in the reference implementation the learned parameter is its logarithm, $\log(1/0.07) \approx 2.66$. The first term treats each image as a query and finds its matching text among $N$ candidates. The second term does the reverse. The diagonal entries of the similarity matrix are the positive pairs.

Intuition

This is contrastive learning applied across modalities. Each image should be most similar to its own caption and dissimilar to all other captions in the batch. The symmetric formulation ensures both the image encoder and text encoder learn compatible representations. The scale $\tau$ controls how peaked the similarity distribution is: a larger $\tau$ sharpens the softmax, so the model must produce more confident, well-separated embeddings.

Proof Sketch

There is no formal proof of generalization. The objective is a multi-class cross-entropy where the "classes" are the $N$ items in the batch. By the standard analysis of softmax cross-entropy, the gradient pushes positive pairs together and negative pairs apart in embedding space. The key empirical insight is that with $N = 32{,}768$ (CLIP's batch size), each image has 32,767 negative texts, providing a rich contrastive signal that forces the model to learn fine-grained semantic distinctions.

Why It Matters

This objective is why CLIP generalizes to unseen tasks. By learning from 400M diverse image-text pairs with a contrastive objective, the model develops an embedding space where semantic similarity aligns with geometric proximity. Any concept expressible in text (including concepts not in any predefined label set) can be matched to images by computing cosine similarity.

Failure Mode

The contrastive objective assumes that non-matching pairs in the batch are true negatives. In large batches, this is approximately true but not guaranteed. Two different images in the batch could both depict "a dog on a beach," making them false negatives for each other's text. This noise is tolerable at scale but can hurt performance on fine-grained distinctions (e.g., distinguishing dog breeds). The scale parameter $\tau$ is also sensitive: too large and the model overfits to batch statistics; too small and the gradients for hard negatives vanish.
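The symmetric loss above can be sketched directly in NumPy. This is a minimal, illustrative implementation operating on simulated unit-norm embeddings rather than real encoder outputs; the batch size, dimension, and seed are arbitrary choices for the demo.

```python
import numpy as np

def clip_loss(v, u, logit_scale=1 / 0.07):
    """Symmetric contrastive loss over an N x N similarity matrix.

    v: (N, d) image embeddings, u: (N, d) text embeddings, both unit-norm.
    """
    logits = logit_scale * (v @ u.T)  # scaled cosine similarities

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)  # numerical stability
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Image-to-text: softmax over each row; text-to-image: over each column.
    # The diagonal holds the positive (matching) pairs in both directions.
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)

rng = np.random.default_rng(2)
N, d = 8, 32  # toy sizes; CLIP uses N = 32,768

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

u = normalize(rng.standard_normal((N, d)))
loss_aligned = clip_loss(u, u)  # perfectly matched pairs: loss near 0
loss_random = clip_loss(normalize(rng.standard_normal((N, d))), u)
print(loss_aligned, loss_random)
```

As expected, perfectly aligned pairs drive the loss toward zero, while unrelated image and text embeddings leave it high.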

CLIP Embedding Space Geometry

The CLIP embedding space has notable geometric properties:

Modality gap: Image embeddings and text embeddings occupy different regions of the unit hypersphere. There is a measurable angular gap between the centroid of all image embeddings and the centroid of all text embeddings. This gap does not prevent retrieval (matching still works) but means that image-to-image similarity and text-to-text similarity are not directly comparable to image-to-text similarity.
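The gap can be measured as the angle between the two modality centroids. The sketch below simulates the cone structure with synthetic anchors and noise (real CLIP embeddings cluster similarly, per Liang et al. 2022); only the measurement code is the point.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 128, 1000

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Simulated embeddings: each modality scatters around its own anchor
# direction, mimicking the cone structure seen in real CLIP spaces.
img_anchor = normalize(rng.standard_normal(d))
txt_anchor = normalize(rng.standard_normal(d))
img = normalize(img_anchor + 0.3 * rng.standard_normal((n, d)))
txt = normalize(txt_anchor + 0.3 * rng.standard_normal((n, d)))

# Modality gap: angle between the normalized centroids of each modality.
c_img = normalize(img.mean(axis=0))
c_txt = normalize(txt.mean(axis=0))
gap_deg = float(np.degrees(np.arccos(np.clip(c_img @ c_txt, -1.0, 1.0))))
print(round(gap_deg, 1))
```

On real CLIP embeddings the same two lines of centroid math reveal a nonzero angular gap, which is why absolute image-text similarity scores should not be compared against image-image scores.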

Hubness: some embeddings become "hubs" that are nearest neighbors to many queries. A generic-looking image embedding might be the nearest neighbor for dozens of unrelated text queries. This distorts retrieval rankings and is a known problem in high-dimensional nearest neighbor search.

Prompt sensitivity: the text embedding for "dog" differs from the one for "a photo of a dog," which differs again from "a cute dog." The choice of prompt template shifts the text embedding in the space, and the best template depends on the image distribution being classified.

Practical Uses

Image search: encode a text query and a database of images. Rank images by cosine similarity to the query. No labeled data or fine-tuning needed. Works for open-vocabulary queries that a fixed label set cannot cover.
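With a pre-encoded database, search is a single matrix-vector product followed by a sort. The database and query below are simulated stand-ins for CLIP encoder outputs; the ranking logic is what a real system would run.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_images = 256, 10_000

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in for a pre-encoded image database (one unit-norm row per image).
db = normalize(rng.standard_normal((n_images, d)))

# Stand-in for an encoded text query that happens to resemble image 42.
query = normalize(db[42] + 0.05 * rng.standard_normal(d))

scores = db @ query             # cosine similarity to every image at once
top5 = np.argsort(-scores)[:5]  # indices of the five best matches
print(top5[0])                  # → 42
```

At millions of images, the exhaustive `db @ query` scan is typically replaced by an approximate nearest-neighbor index, but the scoring function is unchanged.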

Content moderation: encode images and compare against text descriptions of prohibited content. CLIP can flag images matching "violent content" or "explicit material" without training a specific classifier.

Zero-shot tagging: assign multiple tags to an image by setting a similarity threshold. All text labels above the threshold are assigned. This is multi-label classification without any labeled examples.
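Thresholded tagging is a small variation on the classification rule: keep every tag above a cutoff instead of taking the argmax. The tag set, embeddings, and threshold value below are all illustrative; in practice the threshold is tuned on a small validation set.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 256

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

tags = ["outdoor", "dog", "beach", "night", "food"]  # hypothetical tag set
tag_emb = normalize(rng.standard_normal((len(tags), d)))  # stand-in text embeddings

# Simulated image embedding resembling the "dog" and "beach" tags.
image = normalize(tag_emb[1] + tag_emb[2] + 0.05 * rng.standard_normal(d))

scores = tag_emb @ image
threshold = 0.3  # illustrative; tune on held-out data in practice
assigned = [t for t, s in zip(tags, scores) if s > threshold]
print(assigned)  # → ['dog', 'beach']
```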

Embedding for RAG: use CLIP image embeddings alongside text embeddings in a retrieval-augmented generation system. Users can query a knowledge base with either text or images, and the retriever finds relevant content in either modality.

CLIP vs Fine-Tuned Classifiers

On ImageNet zero-shot classification, CLIP ViT-L/14 achieves 75.3% accuracy. A supervised ViT-L fine-tuned on ImageNet training data achieves 87.8%. The 12.5-point gap is the cost of zero-shot generality versus task-specific optimization.

When to use CLIP: (1) you do not have labeled data for your specific task, (2) the label set changes frequently or is open-ended, (3) you need to prototype quickly. When to fine-tune: (1) you have sufficient labeled data, (2) accuracy is critical, (3) the label set is fixed and well-defined.

Linear probing (training a linear classifier on frozen CLIP features) is a middle ground: it uses CLIP's representations but adapts to task-specific labels with minimal labeled data. This often closes half the gap between zero-shot and full fine-tuning.
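The probe itself is just a linear map trained on frozen features. The sketch below simulates frozen "CLIP" features as class anchors plus noise and fits the head by least squares on one-hot targets, a simple stand-in for the logistic regression probe used in the CLIP paper; only the head's weights are learned.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, K = 64, 500, 3  # feature dim, labeled examples, classes (illustrative)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Simulated frozen features: class anchor + noise. A real probe would use
# f_I(I) from a pretrained CLIP encoder, with the encoder kept fixed.
anchors = normalize(rng.standard_normal((K, d)))
labels = rng.integers(0, K, size=n)
feats = normalize(anchors[labels] + 0.2 * rng.standard_normal((n, d)))

# Linear head via least squares on one-hot targets (stand-in for
# logistic regression; the "encoder" is untouched).
Y = np.eye(K)[labels]
W, *_ = np.linalg.lstsq(feats, Y, rcond=None)
preds = np.argmax(feats @ W, axis=1)
acc = float((preds == labels).mean())
print(acc)
```

Because only a $d \times K$ weight matrix is fit, a few labeled examples per class are often enough, which is exactly why linear probing sits between zero-shot use and full fine-tuning.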

Common Confusions

Watch Out

CLIP is not an object detector

CLIP classifies whole images. Given an image with a dog and a cat, CLIP embeds the entire scene. It cannot tell you where the dog is or draw a bounding box. For localization, you need models like OWL-ViT or GLIP that extend CLIP-style representations with detection heads. CLIP tells you what is in the image, not where.

Watch Out

CLIP struggles with counting and spatial reasoning

Ask CLIP to distinguish "two dogs" from "three dogs" and it often fails. Ask it to distinguish "a cat on a mat" from "a mat on a cat" and it often fails. The contrastive objective on image-caption pairs does not force the model to learn precise numeracy or spatial relationships. Captions rarely describe exact counts or spatial arrangements in sufficient detail.

Watch Out

Larger batch size is not always better

CLIP's original batch size of 32,768 is important for providing enough negative examples. But increasing beyond this point has diminishing returns and can hurt performance if the batch contains many false negatives. OpenCLIP experiments show that batch sizes of 32K-65K are optimal for most model sizes, with larger batches providing no benefit.

Watch Out

CLIP embeddings are not interchangeable across model variants

A CLIP ViT-B/32 embedding is not compatible with a CLIP ViT-L/14 embedding. They have different dimensions and different learned spaces. You cannot mix embeddings from different CLIP models in the same index or comparison. If you change the model, you must re-encode your entire database.

Canonical Examples

Example

Zero-shot classification on a custom domain

Suppose you have factory images and want to classify defects. Define text labels: "a photo of a scratch defect," "a photo of a dent defect," "a photo of a clean product." Encode each label with CLIP's text encoder. For each factory image, encode it and find the most similar label. This provides an immediate baseline without collecting or labeling defect images. Accuracy will be lower than a fine-tuned model, but development time is minutes instead of weeks.

Exercises

ExerciseCore

Problem

CLIP uses a batch size of $N = 32{,}768$. The similarity matrix is $N \times N$. How many scalar similarity computations are performed per batch, and how many of these are positive pairs?

ExerciseAdvanced

Problem

The learned temperature $\tau$ in CLIP's loss function is initialized to $1/0.07 \approx 14.3$. Explain what happens if $\tau$ is too large (say, $\tau = 100$) and what happens if $\tau$ is too small (say, $\tau = 1$). Why is learning $\tau$ better than fixing it?

ExerciseCore

Problem

You have a database of 1 million product images encoded with CLIP ViT-B/32 (512-dimensional embeddings). You want to switch to CLIP ViT-L/14 (768-dimensional embeddings) for better accuracy. Can you incrementally re-encode only new images with the new model and mix them with existing embeddings?

References

Canonical:

  • Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP) (2021), OpenAI, ICML, Sections 2-3
  • Cherti et al., Reproducible Scaling Laws for Contrastive Language-Image Learning (OpenCLIP) (2023), CVPR

Current:

  • Gadre et al., DataComp: In Search of the Next Generation of Multimodal Datasets (2023), NeurIPS

  • Liang et al., Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning (2022), NeurIPS

  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17

Last reviewed: April 2026
