15 Diffusion models and text-to-image Transformers
- How forward diffusion and reverse diffusion work (see the sketch after this list)
- How to build and train a denoising U-Net model
- Using the trained U-Net to generate flower images
- Concepts behind text-to-image Transformers
- Writing a Python program to generate an image from text with DALL-E 2
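The first bullet refers to the forward-diffusion process developed in section 15.1. As a quick preview, the sketch below applies the standard DDPM closed-form noising step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, to an image array. The linear beta schedule and 1,000 timesteps are common defaults assumed here for illustration, not necessarily the exact settings used later in this chapter.

```python
# Minimal NumPy sketch of the closed-form forward-diffusion (noising) step.
# The schedule (1,000 steps, linear betas from 1e-4 to 0.02) is a common DDPM
# default, assumed here for illustration only.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative products, i.e. alpha-bar_t

def forward_diffuse(x0, t, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)  # the noise a denoising U-Net learns to predict
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

# Example: noise a placeholder 64x64 RGB image (values in [-1, 1]) at step t = 500.
x0 = np.random.default_rng(0).uniform(-1, 1, size=(64, 64, 3))
xt, eps = forward_diffuse(x0, t=500)
```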
In recent years, multimodal large language models (LLMs) have gained significant attention for their ability to handle content in many formats, such as text, images, video, audio, and code. Notable examples are text-to-image Transformers such as OpenAI’s DALL-E 2, Google’s Imagen, and Stability AI’s Stable Diffusion, which generate high-quality images from textual descriptions.
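Section 15.5.2 walks through generating an image from text with DALL-E 2. As a rough preview, here is a minimal sketch assuming the openai Python SDK (version 1 or later) and an OPENAI_API_KEY environment variable; the chapter's own program may use a different client setup, prompt, or parameters.

```python
# Minimal text-to-image sketch with DALL-E 2, assuming the openai Python SDK (v1+)
# and that OPENAI_API_KEY is set in the environment. The prompt and image size
# are illustrative choices, not the chapter's exact ones.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-2",
    prompt="a watercolor painting of a field of yellow daffodils",
    n=1,              # number of images to generate
    size="512x512",   # DALL-E 2 supports 256x256, 512x512, and 1024x1024
)

print(response.data[0].url)  # URL of the generated image
```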
15.1 Introduction to denoising diffusion models
15.1.1 The forward diffusion process
15.1.2 Using the U-Net model to denoise images
15.1.3 A blueprint to train the denoising U-Net model
15.2 Preparing the training data
15.2.1 Flower images as the training data
15.2.2 Visualizing the forward diffusion process
15.3 Building a denoising U-Net model
15.3.1 The attention mechanism in the denoising U-Net model
15.3.2 The denoising U-Net model
15.4 Training and using the denoising U-Net model
15.4.1 Training the denoising U-Net model
15.4.2 Using the trained model to generate flower images
15.5 Text-to-image Transformers
15.5.1 CLIP: A multimodal Transformer
15.5.2 Text-to-image generation with DALL-E 2
Summary