15 Diffusion Models and Text-to-Image Transformers

This chapter covers

  • How forward diffusion and reverse diffusion work
  • How to build and train a denoising U-Net model
  • Using the trained U-Net to generate flower images
  • Concepts behind text-to-image Transformers
  • Writing a Python program to generate an image from text with DALL-E 2

In recent years, multimodal large language models (LLMs) have gained significant attention for their ability to handle various content formats, such as text, images, video, audio, and code. Notable examples are text-to-image Transformers such as OpenAI's DALL-E 2, Google's Imagen, and Stability AI's Stable Diffusion, which generate high-quality images from textual descriptions.

15.1 Introduction to denoising diffusion models

15.1.1 The forward diffusion process

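In the notation of the original DDPM paper (Ho et al., 2020), which the formulas below follow, each forward step mixes a small amount of Gaussian noise into the image:

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right)

where \beta_1, \dots, \beta_T is a small, increasing variance schedule. Defining \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s, the noisy image at any step t can be sampled directly from the clean image x_0 in one shot:

x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I})

This closed form is what makes training practical: there is no need to simulate t noising steps one by one.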

15.1.2 Use the U-Net model to denoise images

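As a minimal sketch of what "using the U-Net to denoise" means in code, the function below performs one reverse-diffusion step, assuming a DDPM-style model that predicts the noise added at step t. The helper name denoise_step and its argument layout are illustrative, not the chapter's actual code.

import torch

@torch.no_grad()
def denoise_step(model, x_t, t, betas, alpha_bars):
    # One reverse step: estimate x_{t-1} from x_t (t is a plain int)
    t_batch = torch.full((x_t.size(0),), t, device=x_t.device,
                         dtype=torch.long)
    eps = model(x_t, t_batch)                  # the U-Net's predicted noise
    alpha_t = 1.0 - betas[t]
    # DDPM posterior mean: strip the predicted noise, then rescale
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                            # the last step adds no noise
    return mean + betas[t].sqrt() * torch.randn_like(x_t)

Repeating this step from t = T - 1 down to 0 turns pure noise into an image; section 15.4.2 sketches that loop.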

15.1.3 A blueprint to train the denoising U-Net model

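The blueprint boils down to a noise-prediction regression. In the standard DDPM formulation (the chapter's exact code may differ in detail), the U-Net \epsilon_\theta is trained to recover the noise that was injected:

\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big]

Each training step samples a clean image x_0 from the data, a uniform random timestep t, and Gaussian noise \epsilon, produces x_t with the closed form from section 15.1.1, and minimizes the mean squared error between \epsilon and the U-Net's prediction.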

15.2 Prepare the training data

15.2.1 Flower images as the training data

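If you want to follow along without the book's data files, torchvision's built-in Flowers102 dataset is one convenient stand-in; the 64 x 64 resolution and [-1, 1] pixel scaling below are illustrative assumptions, not necessarily the chapter's settings.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Resize, center-crop, and scale pixels to [-1, 1], a symmetric range
# that matches the zero-mean Gaussian noise the diffusion process adds
transform = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),                       # [0, 1]
    transforms.Lambda(lambda x: x * 2.0 - 1.0),  # [-1, 1]
])

dataset = datasets.Flowers102(root="data", split="train",
                              transform=transform, download=True)
loader = DataLoader(dataset, batch_size=64, shuffle=True)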

15.2.2 Visualize the forward diffusion process

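One way to see the forward process at work is to apply the closed form from section 15.1.1 to a single image at several timesteps, reusing the dataset object from the previous sketch; the linear beta schedule and T = 1000 steps are the DDPM defaults, assumed here for illustration.

import matplotlib.pyplot as plt
import torch

T = 1000                                         # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha products

x0, _ = dataset[0]                               # one clean image in [-1, 1]
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, t in zip(axes, [0, 249, 499, 749, 999]):
    eps = torch.randn_like(x0)                   # fresh Gaussian noise
    # Jump straight to step t with the closed-form forward process
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
    ax.imshow(((xt + 1) / 2).clamp(0, 1).permute(1, 2, 0).numpy())
    ax.set_title(f"t = {t}")
    ax.axis("off")
plt.show()

By t = 999 the flower should be visually indistinguishable from pure noise, which is exactly the starting point the reverse process works from.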

15.3 Build a denoising U-Net model

15.3.1 The attention mechanism in the denoising U-Net model

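To make the idea concrete, here is a minimal sketch of a self-attention layer for 2-D feature maps, built on PyTorch's nn.MultiheadAttention; the chapter's own implementation may differ in detail.

import torch
from torch import nn

class SelfAttention2d(nn.Module):
    """Self-attention over the spatial positions of a feature map."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        # Assumes channels is divisible by 8 (GroupNorm) and by num_heads
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        # Flatten the H x W grid into a sequence of h*w tokens of size c
        seq = self.norm(x).flatten(2).transpose(1, 2)    # (b, h*w, c)
        out, _ = self.attn(seq, seq, seq)
        # Residual connection, reshaped back into a feature map
        return x + out.transpose(1, 2).reshape(b, c, h, w)

Flattening the grid lets every spatial position attend to every other one, capturing long-range structure that a stack of local convolutions picks up only slowly.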

15.3.2 The denoising U-Net model

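The following is a deliberately tiny U-Net sketch for 64 x 64 RGB images, showing the three ingredients a denoising U-Net needs: a downsampling path, an upsampling path with skip connections, and a timestep embedding injected into every block. It is an illustrative skeleton, not the chapter's model, which would also add residual connections, normalization, and the attention layers from section 15.3.1.

import math
import torch
from torch import nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, as in Transformers."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000) * torch.arange(half, dtype=torch.float32) / half
    ).to(t.device)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)   # (batch, dim)

class Block(nn.Module):
    """Two convolutions plus an additive time-embedding projection."""
    def __init__(self, in_ch, out_ch, t_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, out_ch)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.t_proj(t_emb)[:, :, None, None]   # broadcast over H, W
        return self.act(self.conv2(h))

class TinyUNet(nn.Module):
    def __init__(self, t_dim=128):
        super().__init__()
        self.t_dim = t_dim
        self.down1 = Block(3, 64, t_dim)
        self.down2 = Block(64, 128, t_dim)
        self.mid = Block(128, 128, t_dim)
        self.up1 = Block(128 + 128, 64, t_dim)    # skip connection from down2
        self.up2 = Block(64 + 64, 64, t_dim)      # skip connection from down1
        self.out = nn.Conv2d(64, 3, 1)
        self.pool = nn.AvgPool2d(2)
        self.upsample = nn.Upsample(scale_factor=2)

    def forward(self, x, t):
        t_emb = timestep_embedding(t, self.t_dim)
        d1 = self.down1(x, t_emb)                 # 64 x 64
        d2 = self.down2(self.pool(d1), t_emb)     # 32 x 32
        m = self.mid(self.pool(d2), t_emb)        # 16 x 16
        u1 = self.up1(torch.cat([self.upsample(m), d2], 1), t_emb)
        u2 = self.up2(torch.cat([self.upsample(u1), d1], 1), t_emb)
        return self.out(u2)                       # predicted noise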

15.4 Train and use the denoising U-Net model

15.4.1 Train the denoising U-Net model

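A minimal training loop following the blueprint of section 15.1.3; it assumes the loader, T, and alpha_bars objects from section 15.2 and the TinyUNet sketch from section 15.3, and its hyperparameters are illustrative rather than tuned.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyUNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
a_bars = alpha_bars.to(device)

for epoch in range(100):
    for x0, _ in loader:                        # labels are unused
        x0 = x0.to(device)
        t = torch.randint(0, T, (x0.size(0),), device=device)
        eps = torch.randn_like(x0)
        # Jump straight to step t with the closed-form forward process
        a_bar = a_bars[t][:, None, None, None]
        xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
        loss = torch.nn.functional.mse_loss(model(xt, t), eps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()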

15.4.2 Use the trained model to generate flower images

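Generation simply runs the reverse step from section 15.1.2 in a loop, starting from pure Gaussian noise; this sketch reuses the hypothetical denoise_step helper and the schedule tensors defined earlier.

import torch

@torch.no_grad()
def sample(model, n=16, size=(3, 64, 64)):
    betas_d, a_bars_d = betas.to(device), alpha_bars.to(device)
    x = torch.randn(n, *size, device=device)     # start from pure noise
    for t in reversed(range(T)):                 # T-1 down to 0
        x = denoise_step(model, x, t, betas_d, a_bars_d)
    return ((x + 1) / 2).clamp(0, 1)             # rescale to [0, 1] for display

images = sample(model)

Because every image requires T forward passes through the U-Net, sampling is much slower than a single GAN or VAE forward pass; faster samplers such as DDIM reduce the number of steps.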

15.5 Text-to-image Transformers

15.5.1 CLIP: A multimodal Transformer

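As a hands-on illustration, the snippet below scores one image against two candidate captions using Hugging Face's transformers port of CLIP (an assumption for illustration; the chapter may use a different library, and flower.jpg stands in for any local image).

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("flower.jpg")
texts = ["a photo of a flower", "a photo of a dog"]
inputs = processor(text=texts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# A higher logit means the image and caption are closer in CLIP's
# shared embedding space
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))

Note that CLIP generates nothing itself; it only measures image-text agreement, and it is this shared embedding space that DALL-E 2 builds on.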

15.5.2 Text-to-image generation with DALL-E 2

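A minimal sketch of generating an image from text through OpenAI's Python SDK (version 1.x call style shown; the exact signature depends on your SDK version, and a valid API key must be set in the OPENAI_API_KEY environment variable).

from openai import OpenAI

client = OpenAI()   # reads the OPENAI_API_KEY environment variable

response = client.images.generate(
    model="dall-e-2",
    prompt="a watercolor painting of a sunflower in a vase",
    n=1,
    size="1024x1024",
)
print(response.data[0].url)   # temporary URL of the generated image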

15.6 Summary
