This chapter covers
- Compressing a text description and an image into the same latent space
- Building and training a CLIP model to match text–image pairs
- Measuring text–image similarity
- Using the trained CLIP model to select an image based on a text prompt
State-of-the-art text-to-image models such as DALL-E 2, Google’s Imagen, and Stable Diffusion are built on three foundational components: (1) a text encoder to convert language into a latent representation, (2) a mechanism for injecting text information into the image-generation process, and (3) a diffusion model to generate realistic images from noise.
In previous chapters, we explored how diffusion models generate images and how to encode text information for machine learning. Now we turn to the key bridge that connects text and vision: understanding how a model can “see” an image through the lens of natural language. This is where contrastive language-image pretraining (CLIP) comes in.
Released by OpenAI in 2021, CLIP is a multimodal model that pairs an image encoder with a text encoder and learns to align the two modalities in a shared latent space [1]. Unlike traditional classifiers that rely on a fixed set of explicit image labels, CLIP is trained on enormous datasets of real-world image–caption pairs, making it highly effective at associating images with their textual descriptions.
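To make the idea of a shared latent space concrete, the sketch below stands in two toy linear "encoders" for CLIP's real image and text encoders and computes the symmetric contrastive loss on a small batch of matched pairs. The feature dimensions, the fixed 0.07 temperature, and the random inputs are illustrative assumptions, not the chapter's actual implementation, which we build step by step later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for CLIP's encoders: real CLIP uses a vision transformer
# (or ResNet) for images and a transformer for text. Simple linear layers
# are enough to illustrate projecting both modalities into one space.
image_encoder = nn.Linear(2048, 512)   # image features -> 512-d embedding
text_encoder = nn.Linear(768, 512)     # text features  -> 512-d embedding

# A batch of 4 matched image-caption pairs (random features for illustration)
image_features = torch.randn(4, 2048)
text_features = torch.randn(4, 768)

# Project both modalities into the shared latent space and L2-normalize,
# so a dot product between embeddings equals their cosine similarity
image_emb = F.normalize(image_encoder(image_features), dim=-1)
text_emb = F.normalize(text_encoder(text_features), dim=-1)

# Pairwise similarity matrix: entry (i, j) scores image i against caption j
logits = image_emb @ text_emb.t() / 0.07   # 0.07: a typical temperature value

# Contrastive objective: each image should match its own caption (the diagonal)
labels = torch.arange(4)
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```

During training, this objective pulls each image embedding toward the embedding of its own caption and pushes it away from the other captions in the batch, which is exactly the alignment we exploit later when selecting an image that best matches a text prompt.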