This chapter covers
- Compressing a text description and an image into the same latent space
- Building and training a CLIP model to match text–image pairs
- Measuring text–image similarity
- Using the trained CLIP model to select an image based on a text prompt
State-of-the-art text-to-image models such as DALL-E 2, Google’s Imagen, and Stable Diffusion are built on three foundational components: (1) a text encoder to convert language into a latent representation, (2) a mechanism for injecting text information into the image-generation process, and (3) a diffusion model to generate realistic images from noise.
In previous chapters, we explored how diffusion models generate images and how to encode text information for machine learning. Now we turn to the key bridge that connects text and vision: understanding how a model can “see” an image through the lens of natural language. This is where contrastive language-image pretraining (CLIP) comes in.
Released by OpenAI in 2021, CLIP is a multimodal model that pairs an image encoder with a text encoder and learns to align the two modalities in a shared latent space [1]. Unlike traditional classifiers that rely on a fixed set of explicit image labels, CLIP is trained on enormous datasets of real-world image–caption pairs, making it highly effective at associating images with their textual descriptions.
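To make the idea of a shared latent space concrete, the sketch below stands in two toy linear "encoders" for CLIP's real image and text encoders and computes the symmetric contrastive loss on a small batch of matched pairs. The feature dimensions, the fixed 0.07 temperature, and the random inputs are illustrative assumptions, not the chapter's actual implementation, which we build step by step later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for CLIP's encoders: real CLIP uses a vision transformer
# (or ResNet) for images and a transformer for text. Simple linear layers
# are enough to illustrate projecting both modalities into one space.
image_encoder = nn.Linear(2048, 512)   # image features -> 512-d embedding
text_encoder = nn.Linear(768, 512)     # text features  -> 512-d embedding

# A batch of 4 matched image-caption pairs (random features for illustration)
image_features = torch.randn(4, 2048)
text_features = torch.randn(4, 768)

# Project both modalities into the shared latent space and L2-normalize,
# so a dot product between embeddings equals their cosine similarity
image_emb = F.normalize(image_encoder(image_features), dim=-1)
text_emb = F.normalize(text_encoder(text_features), dim=-1)

# Pairwise similarity matrix: entry (i, j) scores image i against caption j
logits = image_emb @ text_emb.t() / 0.07   # 0.07: a typical temperature value

# Contrastive objective: each image should match its own caption (the diagonal)
labels = torch.arange(4)
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```

During training, this objective pulls each image embedding toward the embedding of its own caption and pushes it away from the other captions in the batch, which is exactly the alignment we exploit later when selecting an image that best matches a text prompt.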