8 CLIP: a model to measure the similarity between image and text
This chapter covers
- Compressing a text description and an image into the same latent space
- Building and training a contrastive language-image pretraining (CLIP) model to match text-image pairs
- Measuring the similarity between an image and a text description
- Using the trained CLIP model to select an image based on a text prompt
- Using OpenAI's pretrained CLIP model to select an image based on a text prompt
State-of-the-art text-to-image models such as DALL-E 2, Google’s Imagen, and Stable Diffusion are built on three foundational components (see the sketch after this list):
- A text encoder to convert language into a latent representation
- A mechanism for injecting text information into the image generation process
- A diffusion model to generate realistic images from noise
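To make the three roles concrete, here is a minimal sketch (assuming the diffusers package is installed and the CompVis/stable-diffusion-v1-4 weights can be downloaded; the model ID is an assumption, not the one used in this book) that loads a Stable Diffusion pipeline and prints the module that plays each role:

```python
# Illustration only: inspect the three components inside a Stable Diffusion
# pipeline (assumes `pip install diffusers transformers` and a several-GB
# download of the CompVis/stable-diffusion-v1-4 weights).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

print(type(pipe.text_encoder).__name__)  # text encoder (a CLIP text model)
print(type(pipe.unet).__name__)          # U-Net that injects text via cross-attention
print(type(pipe.scheduler).__name__)     # scheduler driving the denoising diffusion loop
```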
In previous chapters, we explored how diffusion models generate images and how to encode text information for machine learning. Now, we turn to the key bridge that connects text and vision: understanding how a model can “see” an image through the lens of natural language.
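Concretely, that bridge is a shared latent space: one encoder maps a text description to a vector, another maps an image to a vector of the same dimension, and the similarity between the two is simply a cosine. The toy sketch below (random, untrained stand-in encoders, not the model we build later in this chapter) shows only the mechanics:

```python
# Toy sketch: project text features and image features into one shared
# 512-dimensional space and score them with cosine similarity.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
text_encoder = torch.nn.Linear(300, 512)    # stand-in for a real text encoder
image_encoder = torch.nn.Linear(2048, 512)  # stand-in for a real image encoder

text_features = torch.randn(1, 300)         # placeholder text representation
image_features = torch.randn(1, 2048)       # placeholder image representation

# Normalize so the dot product equals the cosine of the angle between them
text_emb = F.normalize(text_encoder(text_features), dim=-1)
image_emb = F.normalize(image_encoder(image_features), dim=-1)

similarity = (text_emb @ image_emb.T).item()
print(f"image-text similarity: {similarity:.3f}")  # a value in [-1, 1]
```

With random, untrained encoders this score is meaningless; the real question is how to train the two encoders so that matching image-text pairs score high and mismatched pairs score low.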
This is where CLIP (Contrastive Language-Image Pretraining) comes in.
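As a preview of what that enables, here is a minimal sketch (assuming the transformers and Pillow packages, the openai/clip-vit-base-patch32 checkpoint, and a local image file cat.png; the Hugging Face wrapper is an assumption, not necessarily how this chapter loads the weights) that uses OpenAI's pretrained CLIP to score how well two candidate captions describe an image:

```python
# Minimal sketch: score candidate captions against an image with OpenAI's
# pretrained CLIP, accessed through the Hugging Face transformers wrapper
# (the model ID and the local image path "cat.png" are assumptions here).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.png")
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image embedding
# and each text embedding; softmax turns them into matching probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.3f}")
```

The rest of the chapter builds up to this: we construct and train a CLIP model ourselves, measure image-text similarity with it, and then use OpenAI's pretrained CLIP to select images from text prompts.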