
Part 3 Text-to-image generation with diffusion models


Now that you’ve mastered the basics of transformers and diffusion models, in this part we’ll show you how they come together for text-to-image generation. In chapter 8, you’ll learn how to build and train a contrastive language-image pretraining (CLIP) model from scratch. A trained CLIP model lets you measure the similarity between a text prompt and an image, so you can perform an image selection task: given a text description, the model identifies the image in a large pool that best matches it. Tasks like this highlight just a few of the many real-world applications of text-to-image generation models and the valuable skill sets they offer.
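To make the image selection task concrete, here is a minimal sketch of CLIP-style retrieval. It assumes you already have embeddings for the prompt and for each candidate image (in chapter 8 these would come from the text and image encoders you build); random tensors stand in for them here so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

# Stand-ins for embeddings produced by a trained CLIP model, e.g.:
#   text_emb   = clip_model.encode_text(tokenized_prompt)   # hypothetical call
#   image_embs = clip_model.encode_image(image_batch)        # hypothetical call
text_emb = torch.randn(1, 512)      # embedding of the text prompt
image_embs = torch.randn(100, 512)  # embeddings of 100 candidate images

# CLIP-style retrieval: cosine similarity between L2-normalized embeddings
text_emb = F.normalize(text_emb, dim=-1)
image_embs = F.normalize(image_embs, dim=-1)
similarity = image_embs @ text_emb.T          # shape: (100, 1)

best_match = similarity.squeeze(-1).argmax().item()
print(f"Image {best_match} best matches the prompt")
```

Ranking candidates by the cosine similarity of normalized embeddings is exactly the comparison a CLIP model is trained to make, which is why image selection comes almost for free once the two encoders are trained.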

In chapter 9, we move on to latent diffusion, a more efficient variant of diffusion models that generates images in a compressed latent space rather than in pixel space. This sets the stage for a deep dive into Stable Diffusion (in chapter 10), one of the most influential open source models in generative AI. By the end of this part, you’ll have implemented the building blocks of modern diffusion pipelines and gained insight into how text prompts are translated into compelling, high-quality images.
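As a rough illustration of why working in latent space is cheaper, the sketch below compares tensor sizes using the compression Stable Diffusion applies (an 8x reduction in each spatial dimension, with 4 latent channels). The autoencoder call is left as a comment, since the actual encoder is the subject of chapter 9; a random tensor of the right shape stands in for its output.

```python
import torch

# Pixel space: one 512x512 RGB image
pixel_image = torch.randn(1, 3, 512, 512)

# Latent space: the autoencoder downsamples each spatial dimension by 8x
# and uses 4 latent channels, so the diffusion model denoises a 4x64x64
# tensor instead of a 3x512x512 one.
# latents = vae.encode(pixel_image)   # hypothetical encoder, for illustration
latents = torch.randn(1, 4, 64, 64)

print(pixel_image.numel())                    # 786,432 values per image
print(latents.numel())                        # 16,384 values per image
print(pixel_image.numel() / latents.numel())  # ~48x fewer values per denoising step
```

Every denoising step therefore operates on roughly 48 times fewer values, which is the main reason latent diffusion models train and sample so much faster than pixel-space diffusion models at the same output resolution.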