CLIP: Contrastive Language-Image Pre-training

Tags: generative-models · CLIP · contrastive-learning · zero-shot · vision-transformers · multi-modal
An introduction to CLIP (Contrastive Language-Image Pre-training), covering the limitations of traditional vision models, natural language as a training signal, the contrastive pre-training objective, architecture choices, zero-shot transfer via prompt engineering, and CLIP’s robustness to distribution shift.
Published: March 22, 2026

Abstract

This lesson introduces CLIP (Contrastive Language-Image Pre-training), a model that learns visual representations from natural language supervision rather than fixed class labels. We begin by examining the limitations of traditional vision models that are trained on predetermined object categories and contrast them with the flexibility of NLP models that learn from raw text. We then develop the CLIP approach: jointly training an image encoder and a text encoder with a contrastive objective to align images and text in a shared embedding space. We walk through the architecture choices (ResNet and Vision Transformer image encoders, Transformer text encoder), the symmetric cross-entropy loss function, and the construction of the 400-million-pair WebImageText dataset. We show how CLIP enables powerful zero-shot transfer to new classification tasks through prompt engineering and ensembling, and conclude with CLIP’s remarkable robustness to distribution shift and its broader limitations.


Introduction

In the previous lesson, we studied diffusion models — generative models that learn to reverse a gradual noising process to produce high-quality samples. Diffusion models can generate impressive images, but a natural question arises: how do we control what these models generate? How do we connect visual content to the rich semantics of natural language?

This lesson introduces CLIP (Contrastive Language-Image Pre-training), developed by Radford et al. at OpenAI. CLIP represents a paradigm shift in computer vision: rather than training models to predict a fixed set of class labels, CLIP learns visual representations directly from natural language descriptions paired with images. The result is a model that can perform zero-shot classification on arbitrary datasets — without ever seeing a single labelled example from those datasets — simply by being told, in natural language, what categories to look for.

CLIP is also a critical building block for modern text-to-image generation systems. Models like DALL-E 2 and Stable Diffusion use CLIP’s shared image-text embedding space to guide diffusion models toward generating images that match a given text prompt. Understanding CLIP is therefore essential for understanding the full pipeline of modern generative AI.


Part 1 — Motivation and Background

1.1 The Limitations of Traditional Vision Models

State-of-the-art computer vision systems have traditionally been trained to predict a fixed set of predetermined object categories. For example, the canonical ImageNet benchmark defines exactly 1,000 classes. A model trained on ImageNet learns to distinguish between these 1,000 categories — and nothing else.

This restricted form of supervision creates a fundamental limitation in generality and usability:

  • To recognise any new visual concept, additional labelled data must be collected and the model must be retrained or fine-tuned.
  • Crowd-sourced annotation is expensive and slow, requiring the canonical “1-of-N majority vote” format to produce gold labels.
  • The resulting models are narrow: they can only classify inputs into the categories they were trained on.

Training these models also requires enormous compute. For instance, Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d, and Xie et al. (2020) required 33 TPUv3 core-years to train Noisy Student EfficientNet-L2. Both systems were still limited to predicting just 1,000 ImageNet classes.

Contrast this with the field of natural language processing (NLP), where task-agnostic objectives such as autoregressive and masked language modeling have scaled successfully across many orders of magnitude in compute, model capacity, and data. The development of standardised text-to-text interfaces (e.g., GPT-3) has enabled task-agnostic architectures to zero-shot transfer to downstream tasks without any task-specific training data. Could a similar breakthrough happen in computer vision?

1.2 Natural Language as a Training Signal

The core idea behind CLIP is to learn visual representations directly from raw text paired with images, rather than from curated class labels. This approach offers several key advantages:

  1. Abundance: Natural language supervision is vastly more available on the internet than curated label datasets. Hundreds of millions of images exist with associated captions, titles, alt-text, and descriptions — all freely available without any annotation effort.

  2. No annotation cost: Unlike crowd-sourced labelling, natural language supervision can be collected passively from existing data on the web.

  3. Flexible transfer: Because the learned representations are connected to language, they naturally enable zero-shot transfer. Instead of defining a fixed label set, we can describe new categories in natural language and use the model immediately.

  4. Richer supervision: Natural language can express a much wider set of visual concepts than any finite label set. A caption like “a golden retriever playing fetch on a sunny beach” conveys far more information than the label “dog.”

The idea of learning from text paired with images is not new. Over 20 years ago, Mori et al. (1999) explored training models to predict nouns and adjectives in captions paired with images. More recent work includes bag-of-words contrastive models, and modern approaches such as VirTex (Desai & Johnson, 2020), ICMLM (Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020), which demonstrated the potential of transformer-based language modeling and contrastive objectives for learning image representations from text. However, these prior methods were trained on relatively small datasets and achieved performance well below the state of the art. CLIP closes this gap by scaling to a much larger dataset and using an efficient contrastive objective.

Key Insight

The shift from fixed class labels to natural language supervision is a paradigm change in computer vision. Instead of building models that answer “which of these 1,000 categories does this image belong to?”, we build models that understand the relationship between images and arbitrary text descriptions. This enables open-vocabulary visual understanding.


Part 2 — The CLIP Approach

2.1 The Core Idea: Contrastive Pre-training

CLIP jointly trains an image encoder and a text encoder to produce embeddings in a shared multi-modal embedding space. The key insight is that the training objective does not try to predict the exact words of a caption given an image (a very hard task). Instead, CLIP solves a much simpler proxy task: given a batch of image-text pairs, determine which image goes with which text.

Formally, given a batch of \(N\) (image, text) pairs, CLIP learns to:

  • Maximise the cosine similarity of the \(N\) correct (image, text) pairings.
  • Minimise the cosine similarity of the \(N^2 - N\) incorrect pairings.

This is a contrastive objective: the model learns by contrasting correct pairs against incorrect ones within each training batch.

The switch from a predictive objective (predicting exact caption words) to a contrastive objective (predicting which text goes with which image) is critical. The authors found that a 63-million-parameter transformer language model, which already uses twice the compute of a ResNet-50 image encoder, learned to recognise ImageNet classes three times slower than a simpler bag-of-words baseline that predicts which text is paired with which image. Swapping to the contrastive objective provided a further 4x improvement in training efficiency.

2.2 The Training Dataset: WebImageText (WIT)

Existing datasets were insufficient for CLIP’s ambitions:

  • MS-COCO and Visual Genome are high-quality but small (approximately 100,000 training images each).
  • YFCC100M has 100 million images, but the metadata is sparse and low-quality. After filtering for English-language titles and descriptions, only about 15 million images remained — roughly the same size as ImageNet.

To address this, the authors constructed a new dataset called WebImageText (WIT): 400 million (image, text) pairs collected from publicly available internet sources. To cover a broad set of visual concepts, they searched for image-text pairs whose text included one of a set of 500,000 queries. The queries were constructed from all words occurring at least 100 times in English Wikipedia, augmented with bi-grams and WordNet synsets. Results were balanced to include up to 20,000 pairs per query.

The resulting dataset has a similar total word count to the WebText dataset used to train GPT-2.

2.3 Architecture Choices

CLIP’s architecture consists of two parallel encoders that project images and text into a shared embedding space.

Image Encoder

The authors explored two families of image encoders:

  1. Modified ResNet: Starting from ResNet-50, the authors applied several improvements:
    • ResNet-D modifications (He et al., 2019) for improved downsampling.
    • Antialiased rect-2 blur pooling (Zhang, 2019) for shift invariance.
    • Replacement of the global average pooling layer with an attention pooling mechanism — a single layer of transformer-style multi-head QKV attention where the query is conditioned on the global average-pooled representation.
    • Scaling follows EfficientNet-style compound scaling across width, depth, and resolution.
  2. Vision Transformer (ViT): The authors closely followed the ViT architecture (Dosovitskiy et al., 2020) with only minor modifications: an additional layer normalisation is applied to the combined patch and position embeddings before the transformer. The ViT models are approximately 3x more compute-efficient than the ResNet models for the same performance level.

The models trained range from a ResNet-50 up to a ResNet-50x64 (64x the compute of ResNet-50), and from ViT-B/32 up to ViT-L/14.

Text Encoder

The text encoder is a Transformer (Vaswani et al., 2017) with the following specifications:

  • 63 million parameters
  • 12 layers, 512-wide, with 8 attention heads
  • Operates on byte pair encoding (BPE) tokenised text with a 49,152 token vocabulary
  • Maximum sequence length of 76 tokens
  • Text sequences are bracketed with [SOS] and [EOS] tokens
  • The activation at the [EOS] token position in the highest layer serves as the text feature representation
  • This representation is layer-normalised and then linearly projected into the shared embedding space
  • Uses masked self-attention (causal attention) to preserve the ability to initialise with a pre-trained language model
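The [EOS]-pooling step can be sketched in a few lines of NumPy. This is an illustrative stand-in, not CLIP's actual implementation: `pool_eos_features`, its arguments, and the toy token ids below are all hypothetical names chosen for this example.

```python
import numpy as np

def pool_eos_features(hidden_states, token_ids, eos_id):
    """Select the top-layer activation at each sequence's [EOS] position.

    hidden_states: [n, seq_len, width] transformer outputs (hypothetical)
    token_ids:     [n, seq_len] BPE token ids
    eos_id:        integer id of the [EOS] token
    """
    eos_positions = (token_ids == eos_id).argmax(axis=1)  # first [EOS] per row
    return hidden_states[np.arange(len(token_ids)), eos_positions]  # [n, width]

# Toy check: batch of 2 sequences, width 3.
h = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
ids = np.array([[1, 5, 2, 0],    # [EOS] (id 2) at position 2
                [1, 2, 0, 0]])   # [EOS] at position 1
feats = pool_eos_features(h, ids, eos_id=2)
print(feats.shape)  # (2, 3)
```

In CLIP, this pooled vector is then layer-normalised and linearly projected into the shared embedding space.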

Projection to Shared Space

Both encoder outputs are linearly projected into the shared multi-modal embedding space. Importantly, the authors did not use non-linear projections (as in some prior work). They found no difference in training efficiency between linear and non-linear projections and speculate that non-linear projections may be co-adapted with details of the image encoder in self-supervised methods. In the shared space, cosine similarity is used to compare image and text embeddings.

2.4 The Symmetric Loss Function

The training procedure can be described concisely in pseudocode. Given a batch of \(N\) (image, text) pairs:

  1. Extract features from both encoders:

    • \(\mathbf{I}_f = \text{image\_encoder}(I)\) with shape \([N, d_i]\)
    • \(\mathbf{T}_f = \text{text\_encoder}(T)\) with shape \([N, d_t]\)
  2. Project into the shared embedding space via learned linear projections \(W_i\) and \(W_t\):

    • \(\mathbf{I}_e = \ell_2\text{-normalize}(\mathbf{I}_f \cdot W_i)\) with shape \([N, d_e]\)
    • \(\mathbf{T}_e = \ell_2\text{-normalize}(\mathbf{T}_f \cdot W_t)\) with shape \([N, d_e]\)
  3. Compute the scaled pairwise cosine similarity matrix: \[\text{logits} = \mathbf{I}_e \cdot \mathbf{T}_e^\top \cdot \exp(\tau)\] where \(\tau\) is a learned temperature parameter (log-parameterised to ensure positivity), initialised to the equivalent of 0.07.

  4. Optimise a symmetric cross-entropy loss. The correct pairings form the diagonal of the \(N \times N\) similarity matrix, so the labels are simply \([0, 1, 2, \ldots, N-1]\): \[\mathcal{L} = \frac{1}{2}\left(\text{CE}_{\text{rows}}(\text{logits}, \text{labels}) + \text{CE}_{\text{cols}}(\text{logits}, \text{labels})\right)\]

The row-wise cross-entropy treats each image as a query and asks “which text matches this image?” The column-wise cross-entropy treats each text as a query and asks “which image matches this text?” Averaging both ensures the loss is symmetric.

This loss is equivalent to the multi-class N-pair loss (Sohn, 2016) and the InfoNCE loss (Oord et al., 2018) from contrastive representation learning.

# Pseudocode for the core of CLIP's training loop (NumPy-like)

# image_encoder  - ResNet or Vision Transformer
# text_encoder   - Transformer
# I[n, h, w, c]  - minibatch of aligned images
# T[n, l]        - minibatch of aligned texts
# W_i[d_i, d_e]  - learned projection from image features to shared embedding
# W_t[d_t, d_e]  - learned projection from text features to shared embedding
# t              - learned (log) temperature parameter

# Step 1: Extract feature representations from each modality
I_f = image_encoder(I)    # [n, d_i]
T_f = text_encoder(T)     # [n, d_t]

# Step 2: Project into shared multi-modal embedding space and L2-normalise
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)   # [n, d_e]
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)   # [n, d_e]

# Step 3: Compute scaled pairwise cosine similarities
logits = np.dot(I_e, T_e.T) * np.exp(t)        # [n, n]

# Step 4: Symmetric cross-entropy loss
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)  # image-to-text
loss_t = cross_entropy_loss(logits, labels, axis=1)  # text-to-image
loss   = (loss_i + loss_t) / 2
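The pseudocode above can be turned into a runnable NumPy sketch. The encoders are replaced with random embeddings, and `l2_normalize` and `cross_entropy` are simple illustrative implementations of the operations the paper's pseudocode assumes.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # Row-wise softmax cross-entropy, numerically stabilised.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

n, d_e = 8, 16
# Stand-ins for the projected, normalised encoder outputs (random here).
I_e = l2_normalize(rng.normal(size=(n, d_e)))
T_e = l2_normalize(rng.normal(size=(n, d_e)))

t = np.log(1 / 0.07)                      # log-parameterised temperature
logits = I_e @ T_e.T * np.exp(t)          # [n, n] scaled cosine similarities

labels = np.arange(n)                     # correct pairs lie on the diagonal
loss_i = cross_entropy(logits, labels)    # image -> text direction
loss_t = cross_entropy(logits.T, labels)  # text -> image direction
loss = (loss_i + loss_t) / 2
```

Transposing the logits for the second term is equivalent to taking the cross-entropy along the other axis: each text, rather than each image, acts as the query.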
Note on Training Scale

CLIP was trained with a very large minibatch size of 32,768. The largest ResNet model (RN50x64) took 18 days to train on 592 V100 GPUs, while the largest Vision Transformer (ViT-L/14) took 12 days on 256 V100 GPUs. All models were trained for 32 epochs. Mixed-precision training, gradient checkpointing, and half-precision Adam statistics were used to manage memory. The best-performing model, ViT-L/14@336px, was additionally fine-tuned at a higher resolution of 336 pixels for one extra epoch.


Part 3 — Zero-Shot Transfer

3.1 The Zero-Shot Classification Pipeline

CLIP’s contrastive pre-training enables a remarkably simple approach to image classification on any dataset, without any task-specific training. The pipeline has three steps:

Step 1: Pre-train CLIP contrastively on (image, text) pairs, as described above.

Step 2: Embed class names. At test time, take all class names of the target dataset and pass them through the text encoder using prompt templates. For example, for ImageNet’s 1,000 classes, encode the texts “A photo of a {class name}.” for each class. This produces a set of \(K\) text embeddings, one per class.

Step 3: Classify. For a given test image, compute its image embedding, then compute the cosine similarity between the image embedding and all \(K\) text embeddings. Predict the class with the highest similarity.

Crucially, CLIP effectively synthesises a zero-shot linear classifier at test time. The text encoder acts as a hypernetwork that generates classifier weights from natural language descriptions of the classes. The cosine similarities, scaled by the temperature parameter, are passed through a softmax to produce a probability distribution over classes — this is mathematically equivalent to a multinomial logistic regression classifier with L2-normalised inputs, L2-normalised weights, no bias, and temperature scaling.
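The three-step pipeline can be sketched as follows. The `text_encoder` here is a random stand-in (real CLIP would run its Transformer), so the prediction is arbitrary; the point is the shape of the computation: one text embedding per class, then an argmax over cosine similarities.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical encoder: random embeddings standing in for CLIP's networks.
def text_encoder(prompts, d_e=32):
    return rng.normal(size=(len(prompts), d_e))

class_names = ["dog", "cat", "crane"]
prompts = [f"A photo of a {name}." for name in class_names]

# Step 2: embed one prompt per class -> the zero-shot classifier weights.
W = l2_normalize(text_encoder(prompts))          # [K, d_e]

# Step 3: classify an image by cosine similarity against all class embeddings.
image_embedding = l2_normalize(rng.normal(size=32))
similarities = W @ image_embedding               # [K] cosine similarities
prediction = class_names[int(np.argmax(similarities))]
```

Because both sets of vectors are L2-normalised, the matrix product `W @ image_embedding` is exactly the vector of cosine similarities, matching the multinomial-logistic-regression view described above.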

3.2 Prompt Engineering and Ensembling

Using bare class names (e.g., just the word “crane”) as input to the text encoder leads to two problems:

  1. Polysemy: Many words have multiple meanings. “Crane” could refer to a bird or a piece of construction equipment. “Boxer” could be a dog breed or an athlete. Without context, the text encoder cannot disambiguate.

  2. Distribution gap: During pre-training, the text encoder sees full sentences describing images. A bare class name like “crane” looks very different from the training distribution, which consists of natural language sentences.

Prompt engineering addresses both problems by wrapping class names in descriptive templates that provide context:

  • Default template: "A photo of a {label}."
  • For Oxford-IIIT Pets: "A photo of a {label}, a type of pet."
  • For Food101: "A photo of {label}, a type of food."
  • For FGVC Aircraft: "A photo of a {label}, a type of aircraft."
  • For satellite imagery: "A satellite photo of a {label}."
  • For OCR tasks: putting quotes around the text or number to be recognised

Just using the default prompt "A photo of a {label}." improves accuracy on ImageNet by 1.3% over using bare class names.

Prompt ensembling further improves performance by computing predictions from 80 different prompt templates and averaging the resulting text embeddings in the embedding space. On ImageNet, ensembling over 80 prompts provides an additional 3.5% accuracy improvement. The combined effect of prompt engineering and ensembling improves ImageNet accuracy by approximately 5% over the bare class-name baseline.

Importantly, the ensemble is constructed over the embedding space rather than probability space. This means the averaged text embeddings can be cached as a single set of classifier weights, making the compute cost of the ensemble the same as using a single prompt when amortised over many predictions.
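Embedding-space ensembling can be sketched like so. The templates and the random `text_encoder` are illustrative placeholders (CLIP's ImageNet ensemble used 80 templates); the key operations are averaging the per-prompt embeddings and re-normalising, yielding a single cached weight vector per class.

```python
import numpy as np

rng = np.random.default_rng(2)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-in text encoder: one random embedding per prompt (illustrative only).
def text_encoder(prompts, d_e=32):
    return rng.normal(size=(len(prompts), d_e))

templates = [
    "A photo of a {}.",
    "A blurry photo of a {}.",
    "A sketch of a {}.",
]  # CLIP used 80 such templates for ImageNet

def ensemble_class_embedding(class_name):
    prompts = [t.format(class_name) for t in templates]
    embeddings = l2_normalize(text_encoder(prompts))  # [num_prompts, d_e]
    mean = embeddings.mean(axis=0)                    # average in embedding space
    return l2_normalize(mean)                         # one cached weight vector

w_dog = ensemble_class_embedding("dog")  # [d_e]; test-time cost of a single prompt
```

Averaging in embedding space (rather than averaging softmax probabilities) is what allows the whole ensemble to collapse into one classifier weight per class.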

Prompt Engineering is “Free” Performance

Prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4x more compute during pre-training, but it is effectively “free” when amortised over many predictions — the text embeddings only need to be computed once per dataset.

3.3 Benchmark Results

The zero-shot transfer results are striking:

  • ImageNet: Zero-shot CLIP matches the accuracy of a fully supervised ResNet-50 at 76.2% top-1 accuracy, without using any of the 1.28 million ImageNet training examples. CLIP’s top-5 accuracy reaches 95%, matching Inception-V4.

  • Comparison to prior zero-shot methods: The previous best zero-shot method, Visual N-Grams (Li et al., 2017), achieved only 11.5% on ImageNet. CLIP improves this to 76.2% — a massive leap.

  • Comparison to supervised linear probes: Across a 27-dataset evaluation suite, zero-shot CLIP outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 out of 27 datasets. Zero-shot CLIP is particularly strong on action recognition datasets (outperforming ResNet-50 by 14.5% on Kinetics700 and 7.7% on UCF101), likely because natural language provides wider supervision for visual concepts involving verbs.

  • Comparison to few-shot methods: Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on CLIP’s own feature space, and nearly matches the best results of a 16-shot classifier across publicly available models.

However, zero-shot CLIP struggles on several kinds of tasks:

  • Satellite imagery (EuroSAT, RESISC45)
  • Medical imaging (PatchCamelyon — lymph node tumour detection)
  • Counting (CLEVRCounts)
  • Fine-grained classification of cars, flowers, and aircraft (Stanford Cars, Flowers102, FGVCAircraft)
  • Self-driving related tasks (GTSRB — German traffic signs, KITTI Distance)
  • Handwritten digits (MNIST — where simple logistic regression on raw pixels outperforms zero-shot CLIP)

These failures highlight that CLIP’s pre-training data, while vast, does not adequately cover all visual domains.


Part 4 — Robustness and Broader Implications

4.1 Distribution Shift Robustness

A well-documented problem in computer vision is the robustness gap: models trained on ImageNet suffer large accuracy drops when evaluated on naturally shifted versions of the dataset, even when human accuracy remains high. For example, a ResNet-101 trained on ImageNet achieves 76.2% on the standard validation set, but only:

  • 64.3% on ImageNetV2 (a faithful reproduction of the original test set)
  • 37.7% on ImageNet-R (renditions: art, cartoons, sculptures)
  • 32.6% on ObjectNet (objects in unusual poses and contexts)
  • 25.2% on ImageNet Sketch
  • 2.7% on ImageNet-A (adversarially filtered natural images)

The common explanation is that deep learning models exploit spurious correlations — patterns that hold in the training distribution but do not generalise to other distributions.

Zero-shot CLIP tells a very different story. With the same 76.2% accuracy on standard ImageNet, zero-shot CLIP achieves:

  • 70.1% on ImageNetV2 (+5.8%)
  • 88.9% on ImageNet-R (+51.2%)
  • 72.3% on ObjectNet (+39.7%)
  • 60.2% on ImageNet Sketch (+35.0%)
  • 77.1% on ImageNet-A (+74.4%)

Overall, zero-shot CLIP models shrink the robustness gap by up to 75% compared to standard ImageNet models.

The intuition is clear: a zero-shot model cannot exploit spurious correlations specific to a training distribution, because it was never trained on that distribution. Its classifier weights are generated entirely from natural language descriptions of the classes, not from statistical patterns in training images.

Key Finding: Robustness

Zero-shot CLIP models are dramatically more robust to distribution shift than supervised ImageNet models of equivalent accuracy. On ImageNet-R, zero-shot CLIP achieves 88.9% compared to 37.7% for a ResNet-101 — a difference of over 50 percentage points. This suggests that the robustness gap observed in supervised models is largely an artefact of training on a specific distribution, not a fundamental limitation of deep learning.

4.2 Limitations and Ethical Considerations

Despite its impressive capabilities, CLIP has several important limitations:

Performance limitations:

  • Poor data efficiency in few-shot settings: While CLIP excels at zero-shot transfer, its few-shot performance (using a linear classifier on CLIP features with a small number of labelled examples) does not improve as rapidly as humans learning from examples. Humans go from 54% to 76% accuracy with just one example per class, while CLIP’s few-shot improvement is more gradual.

  • Weak on specialised tasks: CLIP struggles with satellite imagery, medical imaging, counting, fine-grained classification, and tasks not well-represented in its pre-training data.

  • Brittle generalisation: CLIP still generalises poorly to data that is truly out-of-distribution for its pre-training set. A striking example is MNIST: despite being one of the simplest benchmarks in computer vision, CLIP achieves only 88% accuracy, because handwritten digits are rare in internet image-text data.

  • Limited to classification: CLIP can only choose among provided concepts. It cannot generate novel outputs like a captioning model or describe previously unseen visual phenomena.

Ethical considerations:

  • Social biases: CLIP is trained on unfiltered internet data, which contains many social biases. These biases transfer into the model’s representations and can affect downstream applications.

  • Surveillance potential: CLIP’s flexibility makes it easy to create custom classifiers for any visual concept, including potentially harmful applications in surveillance. The same capability that makes CLIP useful for benign tasks (identifying plant species, classifying art styles) could be repurposed for tracking individuals or monitoring activities.

  • Unfiltered training data: The image-text pairs are collected from the internet without curation, raising concerns about the quality and biases present in the training signal.


Summary

In this lesson, we have studied CLIP (Contrastive Language-Image Pre-training), a model that represents a paradigm shift in computer vision:

  1. Traditional vision models are limited by fixed class labels, expensive annotation, and narrow generality. NLP, by contrast, has shown that learning from raw text at scale enables flexible zero-shot transfer.

  2. CLIP’s approach is to jointly train an image encoder and a text encoder with a contrastive objective, learning to match images with their corresponding text descriptions in a shared embedding space. The contrastive objective (predicting which text goes with which image) is far more efficient than predicting exact caption words.

  3. The architecture consists of a ResNet or Vision Transformer image encoder and a Transformer text encoder, with linear projections into a shared space. A symmetric cross-entropy loss over the cosine similarity matrix (equivalent to InfoNCE) is used for training, with a learned temperature parameter.

  4. Zero-shot transfer is achieved by embedding class names through the text encoder using prompt templates, then classifying images by cosine similarity. Prompt engineering and ensembling provide significant accuracy gains. Zero-shot CLIP matches a fully supervised ResNet-50 on ImageNet (76.2%) and outperforms supervised linear probes on 16 out of 27 benchmarks.

  5. Robustness: Zero-shot CLIP is dramatically more robust to distribution shift than supervised models, shrinking the robustness gap by up to 75%. This robustness arises because zero-shot classifiers cannot exploit dataset-specific spurious correlations.

  6. Limitations include poor few-shot data efficiency, weakness on specialised tasks, brittle generalisation to truly out-of-distribution data, and social biases inherited from unfiltered internet training data.

CLIP’s shared image-text embedding space is a critical component in modern text-to-image generation systems, where it provides the bridge between natural language prompts and the visual content produced by diffusion models.

References

  1. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, “Learning Transferable Visual Models From Natural Language Supervision”, in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.