5 Generate images with diffusion models
This chapter covers
- How the forward diffusion process gradually adds noise to images
- How the reverse diffusion process iteratively removes noise to create a clean image
- Training a denoising U-Net model from scratch
- Using the trained model to generate new clothing-item images
Text-to-image generation has seen remarkable progress in recent years, largely thanks to two classes of models: vision transformers (ViTs) and diffusion models. In this chapter, we focus on the second approach, diffusion-based generative models, which have quickly become the gold standard for high-resolution image generation.
At their core, diffusion models create images through a two-part process. First, a forward diffusion process gradually corrupts clean images with random noise, step by step, until they become pure noise; this process follows a fixed schedule and involves no learning. Then a model is trained to reverse it: starting from pure noise, the model iteratively removes a small amount of noise at each step, guided by the patterns it learned during training, until a new, clean image emerges. Because each denoising step makes only a small change, diffusion models can generate high-resolution images whose quality surpasses that of other approaches.
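To make these two processes concrete, here is a minimal PyTorch sketch of DDPM-style diffusion. The step count `T`, the linear beta schedule, and the names `add_noise`, `sample`, and `model` are illustrative assumptions, not code from this chapter, which builds its own denoising U-Net later.

```python
import torch

# Illustrative DDPM-style setup; names and schedule values are assumptions,
# not this chapter's code.
T = 1000                                   # number of diffusion steps (a common choice)
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product: alpha_bar_t

def add_noise(x0, t):
    """Forward process: sample x_t ~ q(x_t | x_0) in closed form.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)
    """
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)           # broadcast over (B, C, H, W)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps                                     # eps is the training target

@torch.no_grad()
def sample(model, shape=(8, 1, 28, 28)):
    """Reverse process: start from pure noise and denoise step by step.

    `model(xt, t)` is a hypothetical trained network that predicts the
    noise eps that was added at step t.
    """
    xt = torch.randn(shape)                            # start from pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(xt, t_batch)                   # predicted noise
        a, a_bar, b = alphas[t], alpha_bars[t], betas[t]
        mean = (xt - b / (1.0 - a_bar).sqrt() * eps_hat) / a.sqrt()
        if t > 0:
            xt = mean + b.sqrt() * torch.randn_like(xt)  # add sampling noise
        else:
            xt = mean                                    # final step: clean image
    return xt

# Usage: noise a batch of 28x28 grayscale images at random timesteps
x0 = torch.randn(8, 1, 28, 28)            # stand-in for a batch of clothing images
t = torch.randint(0, T, (8,))             # one random timestep per image
xt, eps = add_noise(x0, t)
```

Note the asymmetry: the closed-form expression for x_t lets training corrupt an image to any timestep directly, without simulating all the intermediate noising steps, whereas the reverse loop must visit every step from T - 1 down to 0.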