
8 CLIP: A model to measure the similarity between image and text

This chapter covers

  • Compressing a text description and an image into the same latent space
  • Building and training a CLIP model to match text-image pairs
  • Measuring text-image similarity
  • Using the trained CLIP model to select an image based on a text prompt

State-of-the-art text-to-image models such as DALL-E 2, Google’s Imagen, and Stable Diffusion are built on three foundational components: (1) a text encoder to convert language into a latent representation, (2) a mechanism for injecting text information into the image-generation process, and (3) a diffusion model to generate realistic images from noise.

In previous chapters, we explored how diffusion models generate images and how to encode text information for machine learning. Now we turn to the key bridge that connects text and vision: understanding how a model can “see” an image through the lens of natural language. This is where contrastive language-image pretraining (CLIP) comes in.

Released by OpenAI in 2021, CLIP is a multimodal transformer that learns to align images and text in a shared latent space [1]. Unlike traditional models that rely on explicit image labels, CLIP uses enormous datasets of real-world image–caption pairs, making it incredibly effective for associating images and their textual descriptions.
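To make this concrete before we build anything ourselves, the short sketch below scores how well a few candidate captions match a single image by projecting both into CLIP's shared latent space. It loads OpenAI's pretrained checkpoint through the Hugging Face transformers library, which is one convenient way to access it (we return to the pretrained model in section 8.4.3; the rest of the chapter builds a smaller CLIP from scratch). The image path dog.jpg and the candidate captions are placeholders for your own data.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained CLIP model and its matching preprocessor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # placeholder: any local photo
captions = ["a dog playing in the grass",
            "a plate of food",
            "a city skyline at night"]

# Encode the image and the candidate captions into the shared latent space
inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled similarities between the image embedding
# and each text embedding; softmax converts them into relative probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{p:.3f}  {caption}")

The caption whose embedding lies closest to the image embedding receives the highest probability, which is exactly the behavior we train our own CLIP model to reproduce in this chapter.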

8.1 The CLIP model

8.1.1 How the CLIP model works

8.1.2 Selecting an image from Flickr 8k based on a text description

8.2 Preparing the training dataset

8.2.1 Image-caption pairs in Flickr 8k

8.2.2 The DistilBERT tokenizer

8.2.3 Preprocessing captions and images for training

8.3 Creating a CLIP model

8.3.1 Creating a text encoder

8.3.2 Creating an image encoder

8.3.3 Building a CLIP model

8.4 Training and using the CLIP model

8.4.1 Training the CLIP model

8.4.2 Using the trained CLIP model to select images

8.4.3 Using the OpenAI pretrained CLIP model to select images