
11 VQGAN: Convert images into sequences of integers

 

This chapter covers

  • Encoding images into continuous latent representations
  • Quantizing latent representations into discrete codes using a codebook
  • Reconstructing images from discrete sequences
  • Understanding perceptual loss, adversarial loss, and quantization loss

Modern transformer-based text-to-image models, such as DALL-E, rely on a crucial step: transforming images into sequences of discrete tokens, just as language models treat text as sequences of word tokens. Figure 11.1 shows how this step fits into the larger journey of building a text-to-image generator. In this chapter, we zero in on step 7, where a vector quantized generative adversarial network (VQGAN) bridges the gap between images and language-like data, making images accessible to transformers.

Figure 11.1 Eight steps for building a text-to-image generator from scratch. This chapter focuses on step 7: transforming an image into a sequence of integers using VQGAN. By achieving this, we unlock the ability to generate images sequentially with transformer models, a critical advance that powers state-of-the-art systems such as DALL-E.
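To make the core idea concrete before we dive in, here is a minimal sketch of the quantization step at the heart of VQGAN: each latent vector produced by the encoder is replaced by the index of its nearest codebook entry, so a grid of latents becomes a short sequence of integers. The tensor sizes, codebook, and latents below are made up for illustration; they are not taken from the pretrained model we load later in the chapter.

```python
import torch

torch.manual_seed(0)

# Hypothetical codebook: 1,024 learnable vectors of dimension 256.
codebook_size, embed_dim = 1024, 256
codebook = torch.randn(codebook_size, embed_dim)

# Pretend an encoder has already mapped an image to a 16 x 16 grid of latents.
latents = torch.randn(16 * 16, embed_dim)

# Quantize: for each latent, find the index of the closest codebook vector.
distances = torch.cdist(latents, codebook)   # pairwise distances, shape (256, 1024)
indices = distances.argmin(dim=1)            # one integer per latent position

print(indices.shape)   # torch.Size([256]) -- the image is now 256 integers

# The decoder side simply looks the indices back up in the codebook.
quantized = codebook[indices]                # shape (256, 256), quantized latents
```

Those 256 integers are exactly the "language-like data" a transformer can model, one token at a time, just as it models words in a sentence.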

11.1 Converting images into sequences of integers and back

11.2 Variational autoencoders

11.2.1 What is an autoencoder?

11.2.2 The need for VAEs and their training methodology

11.3 Vector quantized variational autoencoders

11.3.1 The need for VQ-VAEs

11.3.2 The VQ-VAE model architecture and training process

11.4 Vector quantized generative adversarial networks

11.4.1 Generative adversarial networks

11.4.2 VQGAN: A GAN with a VQ-VAE generator

11.5 A pretrained VQGAN model

11.5.1 Reconstructing images with the pretrained VQGAN

11.5.2 Converting images into sequences of integers

Summary