15 Diffusion Models and Text-to-Image Transformers

This chapter covers

  • How forward diffusion and reverse diffusion work
  • How to build and train a denoising U-Net model
  • Using the trained U-Net to generate flower images
  • Concepts behind text-to-image Transformers
  • Writing a Python program to generate an image from text with DALL-E 2

In recent years, multimodal large language models (LLMs) have gained significant attention for their ability to handle various content formats, such as text, images, video, audio, and code. Notable examples are text-to-image Transformers such as OpenAI's DALL-E 2, Google's Imagen, and Stability AI's Stable Diffusion, which generate high-quality images from textual descriptions.

15.1 Introduction to denoising diffusion models

15.1.1 The forward diffusion process

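In the notation of the original DDPM paper (Ho et al., 2020), which the formulas below follow, each forward step mixes a small amount of Gaussian noise into the image:

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right)

where \beta_1, \dots, \beta_T is a small, increasing variance schedule. Defining \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s, the noisy image at any step t can be sampled directly from the clean image x_0 in one shot:

x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I})

This closed form is what makes training practical: there is no need to simulate t noising steps one by one.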

15.1.2 Use the U-Net model to denoise images

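As a minimal sketch of what "using the U-Net to denoise" means in code, the function below performs one reverse-diffusion step, assuming a DDPM-style model that predicts the noise added at step t. The helper name denoise_step and its argument layout are illustrative, not the chapter's actual code.

import torch

@torch.no_grad()
def denoise_step(model, x_t, t, betas, alpha_bars):
    # One reverse step: estimate x_{t-1} from x_t (t is a plain int)
    t_batch = torch.full((x_t.size(0),), t, device=x_t.device,
                         dtype=torch.long)
    eps = model(x_t, t_batch)                  # the U-Net's predicted noise
    alpha_t = 1.0 - betas[t]
    # DDPM posterior mean: strip the predicted noise, then rescale
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                            # the last step adds no noise
    return mean + betas[t].sqrt() * torch.randn_like(x_t)

Repeating this step from t = T - 1 down to 0 turns pure noise into an image; section 15.4.2 sketches that loop.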

15.1.3 A blueprint to train the denoising U-Net model

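The blueprint boils down to a noise-prediction regression. In the standard DDPM formulation (the chapter's exact code may differ in detail), the U-Net \epsilon_\theta is trained to recover the noise that was injected:

\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big]

Each training step samples a clean image x_0 from the data, a uniform random timestep t, and Gaussian noise \epsilon, produces x_t with the closed form from section 15.1.1, and minimizes the mean squared error between \epsilon and the U-Net's prediction.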

15.2 Prepare the training data

15.2.1 Flower images as the training data

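If you want to follow along without the book's data files, torchvision's built-in Flowers102 dataset is one convenient stand-in; the 64 x 64 resolution and [-1, 1] pixel scaling below are illustrative assumptions, not necessarily the chapter's settings.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Resize, center-crop, and scale pixels to [-1, 1], a symmetric range
# that matches the zero-mean Gaussian noise the diffusion process adds
transform = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),                       # [0, 1]
    transforms.Lambda(lambda x: x * 2.0 - 1.0),  # [-1, 1]
])

dataset = datasets.Flowers102(root="data", split="train",
                              transform=transform, download=True)
loader = DataLoader(dataset, batch_size=64, shuffle=True)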

15.2.2 Visualize the forward diffusion process

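One way to see the forward process at work is to apply the closed form from section 15.1.1 to a single image at several timesteps, reusing the dataset object from the previous sketch; the linear beta schedule and T = 1000 steps are the DDPM defaults, assumed here for illustration.

import matplotlib.pyplot as plt
import torch

T = 1000                                         # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha products

x0, _ = dataset[0]                               # one clean image in [-1, 1]
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, t in zip(axes, [0, 249, 499, 749, 999]):
    eps = torch.randn_like(x0)                   # fresh Gaussian noise
    # Jump straight to step t with the closed-form forward process
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
    ax.imshow(((xt + 1) / 2).clamp(0, 1).permute(1, 2, 0).numpy())
    ax.set_title(f"t = {t}")
    ax.axis("off")
plt.show()

By t = 999 the flower should be visually indistinguishable from pure noise, which is exactly the starting point the reverse process works from.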

15.3 Build a denoising U-Net model

15.3.1 The attention mechanism in the denoising U-Net model

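To make the idea concrete, here is a minimal sketch of a self-attention layer for 2-D feature maps, built on PyTorch's nn.MultiheadAttention; the chapter's own implementation may differ in detail.

import torch
from torch import nn

class SelfAttention2d(nn.Module):
    """Self-attention over the spatial positions of a feature map."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        # Assumes channels is divisible by 8 (GroupNorm) and by num_heads
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        # Flatten the H x W grid into a sequence of h*w tokens of size c
        seq = self.norm(x).flatten(2).transpose(1, 2)    # (b, h*w, c)
        out, _ = self.attn(seq, seq, seq)
        # Residual connection, reshaped back into a feature map
        return x + out.transpose(1, 2).reshape(b, c, h, w)

Flattening the grid lets every spatial position attend to every other one, capturing long-range structure that a stack of local convolutions picks up only slowly.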

15.3.2 The denoising U-Net model

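The following is a deliberately tiny U-Net sketch for 64 x 64 RGB images, showing the three ingredients a denoising U-Net needs: a downsampling path, an upsampling path with skip connections, and a timestep embedding injected into every block. It is an illustrative skeleton, not the chapter's model, which would also add residual connections, normalization, and the attention layers from section 15.3.1.

import math
import torch
from torch import nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, as in Transformers."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000) * torch.arange(half, dtype=torch.float32) / half
    ).to(t.device)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)   # (batch, dim)

class Block(nn.Module):
    """Two convolutions plus an additive time-embedding projection."""
    def __init__(self, in_ch, out_ch, t_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, out_ch)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.t_proj(t_emb)[:, :, None, None]   # broadcast over H, W
        return self.act(self.conv2(h))

class TinyUNet(nn.Module):
    def __init__(self, t_dim=128):
        super().__init__()
        self.t_dim = t_dim
        self.down1 = Block(3, 64, t_dim)
        self.down2 = Block(64, 128, t_dim)
        self.mid = Block(128, 128, t_dim)
        self.up1 = Block(128 + 128, 64, t_dim)    # skip connection from down2
        self.up2 = Block(64 + 64, 64, t_dim)      # skip connection from down1
        self.out = nn.Conv2d(64, 3, 1)
        self.pool = nn.AvgPool2d(2)
        self.upsample = nn.Upsample(scale_factor=2)

    def forward(self, x, t):
        t_emb = timestep_embedding(t, self.t_dim)
        d1 = self.down1(x, t_emb)                 # 64 x 64
        d2 = self.down2(self.pool(d1), t_emb)     # 32 x 32
        m = self.mid(self.pool(d2), t_emb)        # 16 x 16
        u1 = self.up1(torch.cat([self.upsample(m), d2], 1), t_emb)
        u2 = self.up2(torch.cat([self.upsample(u1), d1], 1), t_emb)
        return self.out(u2)                       # predicted noise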

15.4 Train and use the denoising U-Net model

15.4.1 Train the denoising U-Net model

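A minimal training loop following the blueprint of section 15.1.3; it assumes the loader, T, and alpha_bars objects from section 15.2 and the TinyUNet sketch from section 15.3, and its hyperparameters are illustrative rather than tuned.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyUNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
a_bars = alpha_bars.to(device)

for epoch in range(100):
    for x0, _ in loader:                        # labels are unused
        x0 = x0.to(device)
        t = torch.randint(0, T, (x0.size(0),), device=device)
        eps = torch.randn_like(x0)
        # Jump straight to step t with the closed-form forward process
        a_bar = a_bars[t][:, None, None, None]
        xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
        loss = torch.nn.functional.mse_loss(model(xt, t), eps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()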

15.4.2 Use the trained model to generate flower images

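Generation simply runs the reverse step from section 15.1.2 in a loop, starting from pure Gaussian noise; this sketch reuses the hypothetical denoise_step helper and the schedule tensors defined earlier.

import torch

@torch.no_grad()
def sample(model, n=16, size=(3, 64, 64)):
    betas_d, a_bars_d = betas.to(device), alpha_bars.to(device)
    x = torch.randn(n, *size, device=device)     # start from pure noise
    for t in reversed(range(T)):                 # T-1 down to 0
        x = denoise_step(model, x, t, betas_d, a_bars_d)
    return ((x + 1) / 2).clamp(0, 1)             # rescale to [0, 1] for display

images = sample(model)

Because every image requires T forward passes through the U-Net, sampling is much slower than a single GAN or VAE forward pass; faster samplers such as DDIM reduce the number of steps.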

15.5 Text-to-image Transformers

15.5.1 CLIP: A multimodal Transformer

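As a hands-on illustration, the snippet below scores one image against two candidate captions using Hugging Face's transformers port of CLIP (an assumption for illustration; the chapter may use a different library, and flower.jpg stands in for any local image).

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("flower.jpg")
texts = ["a photo of a flower", "a photo of a dog"]
inputs = processor(text=texts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# A higher logit means the image and caption are closer in CLIP's
# shared embedding space
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))

Note that CLIP generates nothing itself; it only measures image-text agreement, and it is this shared embedding space that DALL-E 2 builds on.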

15.5.2 Text-to-image generation with DALL-E 2

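A minimal sketch of generating an image from text through OpenAI's Python SDK (version 1.x call style shown; the exact signature depends on your SDK version, and a valid API key must be set in the OPENAI_API_KEY environment variable).

from openai import OpenAI

client = OpenAI()   # reads the OPENAI_API_KEY environment variable

response = client.images.generate(
    model="dall-e-2",
    prompt="a watercolor painting of a sunflower in a vase",
    n=1,
    size="1024x1024",
)
print(response.data[0].url)   # temporary URL of the generated image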

15.6 Summary
