5 Generate images with diffusion models
This chapter covers
- How the forward diffusion process gradually adds noise to images (see the sketch after this list)
- How the reverse diffusion process iteratively removes noise to create a clean image
- Training a denoising U-Net model
- Using the trained U-Net to generate clothing-item images
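To make the forward process concrete before we dive in, here is a minimal sketch of the closed-form noising step used in the standard DDPM formulation. The schedule values (`T = 1000`, a linear beta range) are illustrative assumptions, not the settings this chapter settles on later:

```python
import torch

# Illustrative schedule parameters; the chapter's actual values may differ.
T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (DDPM-style)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products, one per step

def add_noise(x0, t):
    """Sample x_t from the forward process q(x_t | x_0) in closed form."""
    noise = torch.randn_like(x0)            # Gaussian noise, same shape as the image
    a_bar = alpha_bars[t]
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise                        # the noise is the U-Net's training target
```

Because the cumulative product `alpha_bars[t]` shrinks toward zero as `t` grows, `x_t` approaches pure Gaussian noise at the final steps, which is exactly what the reverse process will start from.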
This book focuses on two main approaches to text-to-image generation: vision transformers (ViTs) and diffusion models. In diffusion-based text-to-image generation, we start with an image of pure noise and ask the trained diffusion model to denoise it slightly, conditioned on the text prompt. The result is a slightly less noisy image, which we feed back to the diffusion model to remove more noise. After many repetitions of this process, the output is a clean image that matches the text prompt. Diffusion models have become the go-to generative models for images because their iterative denoising process produces higher-quality, more detailed images than earlier generative approaches.
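The denoising loop just described can be sketched in a few lines. This is a minimal, unconditional version, assuming the standard DDPM update rule and reusing the schedule from the previous sketch; `model` stands in for the trained denoising U-Net we build later in this chapter, and the image shape is illustrative:

```python
@torch.no_grad()
def sample(model, shape=(1, 1, 28, 28)):    # shape is illustrative (grayscale 28x28)
    """Start from pure noise and denoise step by step, t = T-1 down to 0."""
    x = torch.randn(shape)                              # pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                         # U-Net's noise prediction
        a, a_bar = alphas[t], alpha_bars[t]
        # DDPM posterior mean: subtract the predicted noise, then rescale
        x = (x - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()
        if t > 0:                                       # fresh noise except at the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                            # the clean generated image
```

A text-conditioned variant would follow the same loop but pass a prompt embedding to the model as an extra input at every step, so the predicted noise steers the image toward the prompt.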