5 Generate images with diffusion models


This chapter covers

  • How the forward diffusion process gradually adds noise to images
  • How the reverse diffusion process iteratively removes noise
  • Training a denoising U-Net model from scratch
  • Using the trained model to generate new clothing-item images

Text-to-image generation has seen remarkable progress in recent years, largely thanks to two classes of models: vision transformers (ViTs) and diffusion models. Diffusion models create images through a two-stage process. First, random noise is gradually added to clean images, step by step, until the images become pure noise; this fixed procedure is called the forward diffusion process. The model is then trained to undo it in the reverse diffusion process: starting from pure noise, the diffusion model iteratively removes noise, guided by the patterns it learned during training, until a new, clean image emerges. By controlling each small denoising step, diffusion models can generate high-resolution images that surpass the quality of images produced by earlier approaches such as variational autoencoders (VAEs) and generative adversarial networks (GANs).
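
To make the forward process concrete, the short sketch below noises an image directly to an arbitrary step t using the standard closed-form expression x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. It assumes PyTorch, a linear beta schedule with 1,000 steps, and 28 x 28 grayscale inputs like the clothing-item images used later in the chapter; the names and settings here are illustrative assumptions, not the chapter's actual implementation.

import torch

# Minimal sketch of the forward diffusion process (assumed linear schedule).
T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products alpha_bar_t

def add_noise(x0, t):
    """Noise a clean batch x0 directly to step t in one shot."""
    eps = torch.randn_like(x0)             # Gaussian noise, same shape as x0
    a_bar = alpha_bars[t]
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps                         # eps becomes the training target

# Example: noise a batch of 28x28 grayscale images at step t = 500
x0 = torch.rand(8, 1, 28, 28) * 2 - 1      # fake images scaled to [-1, 1]
xt, eps = add_noise(x0, t=500)

The returned noise eps is what the denoising U-Net is later trained to predict; being able to predict it at every step is what makes the reverse, image-generating process possible.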

5.1 The forward diffusion process

5.1.1 How diffusion models work

5.1.2 Visualizing the forward diffusion process

5.1.3 Different diffusion schedules

5.2 The reverse diffusion process

5.3 A blueprint to train the U-Net model

5.3.1 Steps in training a denoising U-Net model

5.3.2 Preprocessing the training data

5.4 Training and using the diffusion model

5.4.1 The Denoising Diffusion Probabilistic Model noise scheduler

5.4.2 Inference using the U-Net denoising model

5.4.3 Training and using the denoising U-Net model

Summary