5 Generate images with diffusion models
This chapter covers
- How the forward diffusion process gradually adds noise to images
- How the reverse diffusion process iteratively removes noise
- Training a denoising U-Net model from scratch
- Using the trained model to generate new clothing-item images
Text-to-image generation has seen remarkable progress in recent years, largely thanks to two classes of models: vision transformers (ViTs) and diffusion models. Diffusion models create images through a two-step process. First, a fixed forward diffusion process gradually adds random noise to clean images, step by step, until they become pure noise; nothing is learned in this stage. Then, in the reverse diffusion process, a model is trained to undo this corruption: starting from pure noise, it iteratively removes a little noise at each step, guided by the patterns it learned during training, until a new, clean image emerges. Because generation is broken into many small denoising steps, diffusion models can produce high-resolution images whose quality often surpasses that of earlier approaches such as variational autoencoders (VAEs) and generative adversarial networks (GANs).
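To make the forward process concrete before we build it properly later in the chapter, here is a minimal PyTorch sketch of noising an image directly to an arbitrary step t. The 1,000 steps and the linear beta schedule from 1e-4 to 0.02 are common illustrative choices, not values fixed by this chapter, and the `add_noise` helper and random 28 x 28 batch are stand-ins for the real dataset and utilities we develop later.

```python
import torch

# Illustrative schedule: how much noise each of the 1,000 steps adds
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)   # per-step noise variance
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)       # cumulative signal retained

def add_noise(x0: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Jump straight to step t of the forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over image dims
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise

# Usage: noise a batch of 28x28 grayscale images to random timesteps
x0 = torch.randn(8, 1, 28, 28)                  # stand-in for real images
t = torch.randint(0, num_steps, (8,))
xt, noise = add_noise(x0, t)
```

The one-shot jump works because a sum of independent Gaussian noise steps is itself Gaussian, so any intermediate noise level can be sampled directly instead of looping through every step, which is what makes training on random timesteps cheap.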