9 Text-to-image generation with latent diffusion
This chapter covers
- Conducting forward and reverse diffusion processes in a lower-dimensional latent space
- Compressing an image into a lower-dimensional space
- Converting low-resolution images into high-resolution ones
- Generating high-resolution images based on text prompts using a latent diffusion model
- Modifying existing images based on text prompts
As we’ve explored so far, diffusion models can generate strikingly realistic images from random noise by gradually reversing a noising process. When conditioned on text prompts, these models become powerful text-to-image generators, capable of creating images that match detailed, open-ended descriptions. But there’s a challenge: generating high-resolution images directly with diffusion models is computationally demanding. A single 512 × 512 RGB image contains 786,432 pixel values, and every one of those values must be denoised at each of the hundreds or thousands of sampling steps, adding up to hundreds of millions of pixel-level calculations for every single image.
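To make that cost concrete, here is a back-of-the-envelope calculation in plain Python. The 512 × 512 resolution and the 1,000-step sampler are illustrative assumptions, not fixed properties of diffusion models:

```python
# Back-of-the-envelope cost of sampling one image in pixel space.
# Resolution and step count are illustrative assumptions.
height, width, channels = 512, 512, 3        # one RGB image
values_per_step = height * width * channels  # values denoised at each step
steps = 1000                                 # typical DDPM-style sampling

print(f"Values per denoising step:   {values_per_step:,}")          # 786,432
print(f"Values processed per image:  {values_per_step * steps:,}")  # 786,432,000
```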
How do state-of-the-art text-to-image generators overcome this barrier to produce high-quality results efficiently? The answer lies in latent diffusion models (LDMs), introduced by Rombach et al. in 2022.[1] LDMs move the heavy lifting away from the pixel space and into a compact, learned latent space, dramatically reducing both the memory footprint and computational cost of training and generation.
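As a minimal sketch of what working in latent space looks like in practice, the snippet below encodes an image into the latent space of a pretrained Stable Diffusion VAE and decodes it back, using Hugging Face's diffusers library. The stabilityai/sd-vae-ft-mse checkpoint and the dummy input image are assumptions chosen for illustration; the 4 × 64 × 64 latent it produces holds 48 times fewer values than the 3 × 512 × 512 pixel image:

```python
import torch
from diffusers import AutoencoderKL

# Pretrained VAE used with Stable Diffusion; this checkpoint is one common
# choice, assumed here for illustration.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

# A dummy 512 x 512 RGB image batch, scaled to [-1, 1] as the VAE expects.
image = torch.rand(1, 3, 512, 512) * 2 - 1

with torch.no_grad():
    # Encode: pixel space (1, 3, 512, 512) -> latent space (1, 4, 64, 64).
    latents = vae.encode(image).latent_dist.sample()
    # Decode: map the latent back to pixel space.
    reconstruction = vae.decode(latents).sample

print(latents.shape)                     # torch.Size([1, 4, 64, 64])
print(reconstruction.shape)              # torch.Size([1, 3, 512, 512])
print(image.numel() / latents.numel())   # 48.0: far fewer values to diffuse
```

Running diffusion over the 16,384 latent values instead of the 786,432 pixel values is what makes LDM training and sampling so much cheaper; the decoder maps the final denoised latent back to a full-resolution image only once, at the end.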