This article delves into the workings of OpenAI’s CLIP model, drawing insights from the OpenAI Engineering Team’s publicly shared information. The analysis and interpretations presented here are the author’s own, with full credit for the technical details belonging to the original OpenAI researchers. Links to the original resources are provided in the references section. The goal is to offer a clear explanation of CLIP, and any inaccuracies or omissions are unintentional and will be promptly addressed upon notification.
Imagine a computer that learns to identify objects not through vast collections of meticulously labeled images, but by exploring the internet and gleaning knowledge from the natural language used to describe visuals. This is the essence of OpenAI’s CLIP, a model that signifies a paradigm shift in how machines are taught to comprehend visual information.
CLIP, short for Contrastive Language-Image Pre-training, is a neural network that bridges the gap between vision and language. Introduced in January 2021, its remarkable ability lies in classifying images into user-defined categories without requiring specific training for each task. By simply providing descriptions in plain English, CLIP can recognize the desired elements within an image. This “zero-shot” capability distinguishes CLIP from most preceding computer vision systems.
This article explores the functionality of CLIP and the challenges it aims to overcome.
Traditional computer vision followed a rigid recipe. Training a model to differentiate between cats and dogs required thousands of labeled images of each, and recognizing various car models demanded yet another extensive dataset. To illustrate the scale involved, ImageNet, a prominent image dataset, required over 25,000 workers to label its 14 million images.
This conventional approach presented three significant drawbacks: labeled datasets are expensive to build, each model is narrowly specialized to the single task it was trained for, and strong benchmark scores often fail to carry over to data that looks even slightly different.
For instance, a model achieving 76% accuracy on ImageNet might drop to around 37% on artistic renditions (sketches, paintings, cartoons) of the same objects, or plummet to 2.7% on naturally occurring adversarial images. This indicated that models were learning the quirks of ImageNet rather than developing a genuine understanding of visual concepts.
CLIP adopts a fundamentally different approach. Instead of relying on carefully curated labeled datasets, it learns from 400 million image-text pairs gathered from across the internet. These pairings are ubiquitous online, found in Instagram photos with captions, news articles with accompanying images, product listings with descriptions, and Wikipedia entries with illustrations. The natural tendency for people to describe, explain, or comment on images provides a vast source of training data.
However, CLIP does not aim to predict specific category labels. Instead, it learns to associate images with their corresponding textual descriptions. During training, CLIP sees a batch of 32,768 image-text pairs at a time, and for each image its task is to identify which of the 32,768 text snippets in the batch actually belongs with it.
Consider it a large-scale matching game. For example, if the system is shown a photo of a golden retriever playing in a park, it must select the correct description, such as “a golden retriever playing fetch in the park,” from a pool of 32,768 options. The remaining options might include descriptions like “a black cat sleeping,” “a mountain landscape at sunset,” “a person eating pizza,” and countless others. To consistently identify the correct match across millions of such examples, CLIP must learn to recognize objects, scenes, actions, and attributes, and understand their linguistic representations.
By repeatedly engaging in this matching task with diverse internet data, CLIP develops a profound understanding of visual concepts and their linguistic counterparts. For example, it might learn that furry, four-legged animals with wagging tails are associated with words like “dog” and “puppy,” or that orange and pink skies over water are linked to “sunset” and “beach.” In essence, it constructs a comprehensive mental model that connects the visual and linguistic realms.
At its core, CLIP employs two distinct neural networks that operate in conjunction: an image encoder and a text encoder.
The image encoder processes raw pixels and transforms them into a numerical vector, known as an embedding. Similarly, the text encoder processes words and sentences, also producing a vector. A crucial aspect is that both encoders generate vectors within the same dimensional space, enabling direct comparison.
Initially, these encoders may produce random, meaningless vectors. For instance, an image of a dog might be represented as [0.2, -0.7, 0.3, …], while the text “dog” is represented as [-0.5, 0.1, 0.9, …]. These numbers have no inherent relationship. However, the training process introduces a transformative effect.
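To make the two-encoder idea concrete, here is a minimal PyTorch sketch. The architectures, layer sizes, and the 512-dimensional shared space are illustrative assumptions, not OpenAI's actual design; the essential point is simply that both networks end with a projection into the same vector space.

```python
import torch
import torch.nn as nn

# A minimal sketch of the two-encoder setup -- NOT OpenAI's actual architecture.
EMBED_DIM = 512  # assumed shared embedding dimensionality

class TinyImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(32, EMBED_DIM)  # project image features into the shared space

    def forward(self, images):
        return self.proj(self.backbone(images))

class TinyTextEncoder(nn.Module):
    def __init__(self, vocab_size=10_000):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, 128)  # averages token embeddings
        self.proj = nn.Linear(128, EMBED_DIM)          # same shared space as the image encoder

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids))

images = torch.randn(4, 3, 224, 224)        # a batch of 4 fake RGB images
texts = torch.randint(0, 10_000, (4, 16))   # a batch of 4 fake token-id sequences

image_embeddings = TinyImageEncoder()(images)  # shape (4, 512)
text_embeddings = TinyTextEncoder()(texts)     # shape (4, 512) -- directly comparable
```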
The training process uses a contrastive loss function, a mathematical measure of how wrong the model currently is. For correct image-text pairs (such as a dog image with “dog playing fetch”), the loss is low when the two embeddings are highly similar. For incorrect pairs (such as a dog image with “cat sleeping”), the loss is low when the embeddings are far apart. The loss function produces a single value representing the overall error across all images and texts within a batch.

Backpropagation, the fundamental learning mechanism in neural networks, then calculates how each weight in both encoders should be adjusted to minimize this error. The weights are subtly updated, and the process is repeated millions of times with different batches of data. Gradually, both encoders learn to produce similar vectors for matching concepts. For example, images of dogs begin to generate vectors that are close to where the text encoder places the word “dog.”
In other words, through the constant pressure to match correct pairs and differentiate incorrect ones across millions of diverse examples, the encoders evolve to communicate in a common language.
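The sketch below shows how one training step of this contrastive objective can be written in PyTorch, following the symmetric cross-entropy formulation described in the CLIP paper. The encoder objects, batch tensors, and the temperature value are placeholders for illustration, not OpenAI's training code.

```python
import torch
import torch.nn.functional as F

def contrastive_step(image_encoder, text_encoder, images, token_ids, temperature=0.07):
    """One CLIP-style training step for a batch of N matching image-text pairs."""
    # Encode both modalities and L2-normalize, so dot products become cosine similarities
    img_emb = F.normalize(image_encoder(images), dim=-1)    # (N, D)
    txt_emb = F.normalize(text_encoder(token_ids), dim=-1)  # (N, D)

    # N x N similarity matrix: entry (i, j) scores image i against text j
    logits = img_emb @ txt_emb.t() / temperature

    # The correct pairing is the diagonal: image i belongs with text i
    targets = torch.arange(images.shape[0], device=logits.device)

    # Symmetric loss: pick the right text for each image, and the right image for each text
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

Calling .backward() on the returned loss and taking an optimizer step is the weight adjustment described above; repeating this over millions of batches is what pulls matching image and text embeddings together.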
Once CLIP is trained, its zero-shot capabilities become readily apparent. Consider the task of classifying images as containing either dogs or cats. There is no need to retrain CLIP or provide it with labeled examples.
Instead, the image is passed through the image encoder to obtain an embedding. The text “a photo of a dog” is passed through the text encoder to produce a second embedding, and “a photo of a cat” to produce a third. Whichever text embedding lies closer to the image embedding gives the answer.

CLIP essentially asks: “Based on everything learned from the internet, is this image more likely to appear with text about dogs or text about cats?”
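In code, the whole procedure takes only a few lines. The sketch below assumes the open-source clip package from OpenAI's GitHub repository (github.com/openai/CLIP); the image path is a placeholder.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "pet.jpg" is a placeholder path for whatever image you want to classify
image = preprocess(Image.open("pet.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)

    # Cosine similarity between the image and each candidate description
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)

print(probs)  # the higher score marks the description CLIP thinks fits best
```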
Since it has learned from such a diverse dataset, this approach is applicable to almost any classification task that can be described in words.
Need to classify types of food? Use “a photo of pizza,” “a photo of sushi,” or “a photo of tacos” as your categories. Want to analyze satellite imagery? Try “a satellite photo of a forest,” “a satellite photo of a city,” or “a satellite photo of farmland.” Working with medical images? You could use “an X-ray showing pneumonia” versus “an X-ray of healthy lungs.” Simply change the text descriptions, and no retraining is required.
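One practical pattern, sketched below under the same assumptions as the previous example, is to encode the text prompts once and reuse the resulting embeddings as a ready-made “classifier head” for whatever categories the task needs.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Swap in whatever categories the task needs -- no retraining involved.
class_names = ["pizza", "sushi", "tacos"]
prompts = clip.tokenize([f"a photo of {name}" for name in class_names]).to(device)

with torch.no_grad():
    class_embeddings = model.encode_text(prompts)
    class_embeddings = class_embeddings / class_embeddings.norm(dim=-1, keepdim=True)

# class_embeddings now acts as a reusable "classifier head": score any image
# embedding against it to classify into the new categories, exactly as above.
```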
This flexibility is transformative. Traditional models required extensive labeled datasets for each new task, while CLIP can tackle new tasks immediately, limited only by the ability to describe categories in natural language.
CLIP’s success was not solely attributable to its core concept. OpenAI made two critical technical decisions that rendered training computationally feasible.
First, they opted for contrastive learning over the more intuitive approach of training the model to generate image captions. Initial experiments involved teaching systems to analyze images and produce full text descriptions word by word, similar to how language models generate text. While seemingly logical, this approach proved to be extremely slow and computationally demanding. Generating entire sentences requires significantly more computation than simply learning to match images with text. Contrastive learning proved to be 4 to 10 times more efficient in achieving satisfactory zero-shot performance.
Second, they adopted Vision Transformers for the image encoder. Transformers, the architecture behind GPT and BERT, had already revolutionized natural language processing. Applying them to images (treating image patches like words in a sentence) provided another 3x computational efficiency gain over traditional convolutional neural networks like ResNet.
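The phrase “treating image patches like words” can be made concrete with a small PyTorch sketch; the 224x224 input size, 32x32 patches, and 768-dimensional width below are typical ViT-B/32-style values chosen here for illustration.

```python
import torch

# Patchifying an image the way a Vision Transformer does.
image = torch.randn(1, 3, 224, 224)   # a batch of one RGB image
patch_size = 32

# Cut the image into non-overlapping 32x32 patches and flatten each one
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# Linearly project each flattened patch, just like embedding a word
embed = torch.nn.Linear(3 * patch_size * patch_size, 768)
tokens = embed(patches)               # (1, 49, 768): a "sentence" of 49 patch tokens
print(tokens.shape)
```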

These combined choices enabled CLIP to be trained on 256 GPUs for two weeks, a similar timeframe to other large-scale vision models of the time, rather than requiring an astronomically larger amount of computing power.
OpenAI evaluated CLIP on over 30 different datasets encompassing various tasks, including fine-grained classification, optical character recognition, action recognition, geographic localization, and satellite imagery analysis.
The results validated CLIP’s approach. While matching ResNet-50’s 76.2% accuracy on the standard ImageNet dataset, CLIP outperformed the best publicly available ImageNet model on 20 out of 26 transfer learning benchmarks. More importantly, CLIP maintained robust performance on stress tests where traditional models faltered. On ImageNet Sketch, CLIP achieved 60.2% accuracy compared to ResNet’s 25.2%. On naturally occurring adversarial images, CLIP scored 77.1% compared to ResNet’s 2.7%.

However, the model still encounters challenges in certain areas, such as:
When tested on handwritten digits from the MNIST dataset (a task considered trivial in computer vision), CLIP achieved only 88% accuracy, significantly below the 99.75% human performance.
CLIP is sensitive to how text prompts are phrased, and finding effective wording can require trial and error (“prompt engineering”); one common mitigation is sketched after these limitations.
CLIP inherits biases from its internet training data. The way categories are phrased can significantly influence model behavior in problematic ways.
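One common way to reduce the prompt sensitivity noted above, described in the CLIP paper as prompt ensembling, is to average the text embeddings of several phrasings of the same category. A minimal sketch, reusing the clip package assumed earlier (the templates are illustrative):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Several phrasings of the same category
templates = ["a photo of a {}", "a blurry photo of a {}", "a sketch of a {}"]

def ensembled_embedding(class_name):
    prompts = clip.tokenize([t.format(class_name) for t in templates]).to(device)
    with torch.no_grad():
        emb = model.encode_text(prompts)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    # Average over templates, then re-normalize to get one more robust class embedding
    mean = emb.mean(dim=0)
    return mean / mean.norm()

dog_embedding = ensembled_embedding("dog")   # compare image embeddings against this vector
```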
Despite these limitations, CLIP demonstrates that the approach driving recent breakthroughs in natural language processing (learning from massive amounts of internet text) can be successfully applied to computer vision. Just as GPT models have learned to perform diverse language tasks by training on internet text, CLIP has learned diverse visual tasks by training on internet image-text pairs.
Since its release, CLIP has become a foundational component across the AI industry. Its open-source nature has fostered widespread adoption. Modern text-to-image systems like Stable Diffusion and DALL-E utilize CLIP-like models to interpret text prompts. Companies leverage it for image search, content moderation, and recommendations.