
11 VQGAN: Convert images into sequences of integers

 

This chapter covers

  • Encoding images into continuous latent representations
  • Quantizing latent representations into discrete codes using a codebook
  • Reconstructing images from discrete sequences
  • Understanding perceptual loss, adversarial loss, and quantization loss

Modern transformer-based text-to-image models, such as DALL-E, rely on a crucial step: transforming images into sequences of discrete tokens, just as language models treat text as sequences of word tokens. Figure 11.1 shows how this step fits into the larger journey of building a text-to-image generator. In this chapter, we zero in on step 7, where a vector quantized generative adversarial network (VQGAN) bridges the gap between images and language-like data, making images accessible to transformers.

Figure 11.1 Eight steps for building a text-to-image generator from scratch. This chapter focuses on step 7: transforming an image into a sequence of integers using VQGAN. By achieving this, we unlock the ability to generate images sequentially with transformer models, a critical advance that powers state-of-the-art systems such as DALL-E.
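To make the core idea concrete before we dive in, here is a minimal sketch of the quantization step at the heart of VQGAN: each latent vector produced by the encoder is replaced by the index of its nearest codebook entry, so a grid of latents becomes a short sequence of integers. The tensor sizes, codebook, and latents below are made up for illustration; they are not taken from the pretrained model we load later in the chapter.

```python
import torch

torch.manual_seed(0)

# Hypothetical codebook: 1,024 learnable vectors of dimension 256.
codebook_size, embed_dim = 1024, 256
codebook = torch.randn(codebook_size, embed_dim)

# Pretend an encoder has already mapped an image to a 16 x 16 grid of latents.
latents = torch.randn(16 * 16, embed_dim)

# Quantize: for each latent, find the index of the closest codebook vector.
distances = torch.cdist(latents, codebook)   # pairwise distances, shape (256, 1024)
indices = distances.argmin(dim=1)            # one integer per latent position

print(indices.shape)   # torch.Size([256]) -- the image is now 256 integers

# The decoder side simply looks the indices back up in the codebook.
quantized = codebook[indices]                # shape (256, 256), quantized latents
```

Those 256 integers are exactly the "language-like data" a transformer can model, one token at a time, just as it models words in a sentence.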

11.1 Converting images into sequences of integers and back

11.2 Variational autoencoders

11.2.1 What is an autoencoder?

11.2.2 The need for VAEs and their training methodology

11.3 Vector quantized variational autoencoders

11.3.1 The need for VQ-VAEs

11.3.2 The VQ-VAE model architecture and training process

11.4 Vector quantized generative adversarial networks

11.4.1 Generative adversarial networks

11.4.2 VQGAN: A GAN with a VQ-VAE generator

11.5 A pretrained VQGAN model

11.5.1 Reconstructing images with the pretrained VQGAN

11.5.2 Converting images into sequences of integers

Summary