
9 Text-to-image generation with latent diffusion

 

This chapter covers

  • Conducting forward and reverse diffusion processes in a lower-dimensional latent space
  • Converting low-resolution images into high-resolution ones
  • Generating high-resolution images based on text prompts using a latent diffusion model
  • Modifying existing images based on text prompts

As we’ve explored so far, diffusion models can generate strikingly realistic images from random noise by gradually reversing a noising process. When conditioned on text prompts, these models become powerful text-to-image generators, capable of creating images that match detailed, open-ended descriptions. But there’s a challenge: generating high-resolution images directly in pixel space is computationally demanding, because the model must predict every pixel value at every one of the hundreds or thousands of denoising steps, which adds up to hundreds of millions of pixel-level calculations for every single image.
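To see why, consider a back-of-the-envelope count. The numbers below are illustrative assumptions (a 512 × 512 RGB image and a 1,000-step sampling schedule), not measurements from any particular model:

# Rough per-image cost of pixel-space diffusion (illustrative numbers only):
# a 512x512 RGB image denoised over a 1,000-step sampling schedule.
values_per_image = 512 * 512 * 3   # 786,432 pixel values to predict per step
denoising_steps = 1_000            # a typical DDPM-style schedule

total_predictions = values_per_image * denoising_steps
print(f"{total_predictions:,}")    # 786,432,000 pixel-level predictions per image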

How do state-of-the-art text-to-image generators overcome this barrier to produce high-quality results efficiently? The answer lies in latent diffusion models (LDMs), introduced by Rombach et al. in 2022 [1]. LDMs move the heavy lifting away from the pixel space and into a compact, learned latent space, dramatically reducing both the memory footprint and computational cost of training and generation.
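To make the idea concrete, here is a minimal sketch of the compression step at the heart of an LDM. It assumes the Hugging Face diffusers library and the publicly available stabilityai/sd-vae-ft-mse VAE checkpoint; section 9.2 walks through the pretrained VAE this chapter actually uses, which may differ. A 512 × 512 image is squeezed into a 4 × 64 × 64 latent, roughly 48 times fewer values for the diffusion model to denoise:

# A minimal sketch of pixel-to-latent compression with a pretrained VAE.
# Assumes the Hugging Face diffusers library and the "stabilityai/sd-vae-ft-mse"
# checkpoint; section 9.2 covers the VAE used in this chapter's code.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

x = torch.randn(1, 3, 512, 512)               # stand-in for a 512x512 RGB image
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()    # compact latent code
    x_rec = vae.decode(z).sample              # decode back to pixel space

print(x.shape)      # torch.Size([1, 3, 512, 512])  -> 786,432 values
print(z.shape)      # torch.Size([1, 4, 64, 64])    ->  16,384 values (~48x fewer)
print(x_rec.shape)  # torch.Size([1, 3, 512, 512])

Because the forward and reverse diffusion processes run entirely on tensors of the smaller latent shape, training and sampling become far cheaper; the VAE decoder is invoked only once at the end to turn the denoised latent back into a full-resolution image, as sections 9.3.2 and 9.3.3 will show.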

9.1 What is a latent diffusion model?

9.1.1 How variational autoencoders work

9.1.2 Combining a latent diffusion model with a variational autoencoder

9.2 Compressing and reconstructing images with VAEs

9.2.1 Downloading the pretrained VAE

9.2.2 Encoding and decoding images with the pretrained VAE

9.3 Text-to-image generation with latent diffusion

9.3.1 Guidance by the CLIP model

9.3.2 Diffusion in the latent space

9.3.3 Converting latent images to high-resolution ones

9.4 Modifying existing images with text prompts

Summary