9 Text-to-image generation with latent diffusion

 

This chapter covers

  • Conducting forward and reverse diffusion processes in a lower-dimensional latent space
  • Compressing an image into a lower-dimensional space
  • Converting low-resolution images into high-resolution ones
  • Generating high-resolution images based on text prompts using a latent diffusion model
  • Modifying existing images based on text prompts

As we’ve explored so far, diffusion models can generate strikingly realistic images from random noise by gradually reversing a noising process. When conditioned on text prompts, these models become powerful text-to-image generators, capable of creating images that match detailed, open-ended descriptions. But there’s a challenge: generating high-resolution images directly with diffusion models is computationally demanding, often requiring tens or hundreds of millions of pixel-level calculations over thousands of steps for every single image.

How do state-of-the-art text-to-image generators overcome this barrier to produce high-quality results efficiently? The answer lies in latent diffusion models (LDMs), introduced by Rombach et al. in 2022.[1] LDMs move the heavy lifting away from the pixel space and into a compact, learned latent space, dramatically reducing both the memory footprint and computational cost of training and generation.
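To get a feel for the savings, consider a rough back-of-envelope sketch comparing the tensor a pixel-space model must denoise at every step with the smaller latent a VAE-compressed model works on. The 512 × 512 resolution, the 8× downsampling factor, the 4-channel latent, and the 1,000-step count below are illustrative assumptions (they happen to match the Stable Diffusion v1 configuration), not values specified in this chapter.

# Back-of-envelope comparison of per-step tensor sizes for pixel-space
# diffusion versus latent diffusion. All shapes and the step count are
# illustrative assumptions, not values prescribed by this chapter.
pixel_shape = (3, 512, 512)    # RGB image a pixel-space model denoises directly
latent_shape = (4, 64, 64)     # latent produced by an 8x-downsampling VAE
steps = 1000                   # a typical number of diffusion steps

pixel_elems = 3 * 512 * 512    # 786,432 values per denoising step
latent_elems = 4 * 64 * 64     # 16,384 values per denoising step

print(f"Values per step (pixel space):  {pixel_elems:,}")
print(f"Values per step (latent space): {latent_elems:,}")
print(f"Reduction factor: {pixel_elems / latent_elems:.0f}x")
print(f"Values processed over {steps} steps (pixel space): {pixel_elems * steps:,}")

Under these assumptions, each denoising step in latent space touches roughly 48 times fewer values than in pixel space, and a full 1,000-step pixel-space run processes on the order of hundreds of millions of values per image, which is exactly the cost that LDMs sidestep.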

9.1 What is a latent diffusion model?

9.1.1 How variational autoencoders (VAEs) work

9.1.2 Combine LDM with VAE

9.2 Compress and reconstruct images with VAEs

9.2.1 Download the pre-trained VAE

9.2.2 Encode and decode images with the pre-trained VAE

9.3 Text-to-image generation with latent diffusion

9.3.1 Guidance by the CLIP model

9.3.2 Diffusion in the latent space

9.3.3 Convert latent images to high-resolution ones

9.4 Modify existing images with text prompts

9.5 Summary