12 A minimal implementation of DALL-E


This chapter covers

  • How DALL-E is trained to generate images from text descriptions
  • How a pretrained BART encoder transforms a text prompt into dense embeddings
  • How a BART decoder uses those embeddings to predict image tokens
  • How a VQGAN decoder converts image tokens into a high-resolution image

OpenAI’s DALL-E is one of the earliest and most influential text-to-image generators. It is a large-scale transformer that can generate high-resolution images from natural language prompts. DALL-E stands at the intersection of two technologies in modern AI: powerful autoregressive language models and vector-quantized representations of visual information. Yet, for many learners, practitioners, and researchers, the full DALL-E model remains out of reach due to its proprietary nature.

To foster deeper understanding and democratize access, the open-source community has stepped up, replicating DALL-E's core ideas with public tools and datasets. Notably, the DALL-E Mini project and its streamlined PyTorch reimplementation, min-DALL-E, have made it possible for anyone to explore the mechanics of transformer-based text-to-image generation and even build their own models from scratch. The sketch below previews the pipeline we will assemble piece by piece in this chapter.
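The following is a minimal sketch, not the exact min-DALL-E API: the component names passed in (tokenizer, encoder, decoder, detokenizer) are placeholders for the pieces built in sections 12.2 through 12.4, and the greedy token selection is just one possible decoding strategy.

```python
import torch

def generate_image(prompt, tokenizer, encoder, decoder, detokenizer,
                   num_image_tokens=256):
    """Hypothetical end-to-end pipeline; each argument stands in for a
    component constructed later in this chapter (sections 12.2-12.4)."""
    # 1. Tokenize and encode the text prompt (section 12.2).
    text_tokens = tokenizer(prompt)              # token IDs for the prompt
    text_embeddings = encoder(text_tokens)       # dense prompt embeddings

    # 2. Autoregressively predict image tokens one at a time (section 12.3).
    image_tokens = []
    for _ in range(num_image_tokens):
        logits = decoder(image_tokens, text_embeddings)  # condition on the prompt
        next_token = int(torch.argmax(logits[-1]))       # greedy pick; sampling also works
        image_tokens.append(next_token)

    # 3. Convert the image-token grid into pixels with the VQGAN detokenizer (section 12.4).
    image = detokenizer(torch.tensor(image_tokens))      # high-resolution RGB image
    return image
```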

12.1 How does DALL-E work?

12.1.1 Training min-DALL-E

12.1.2 From prompt to pixels: image generation at inference time

12.2 Tokenize and encode the text prompt

12.2.1 Tokenize the text prompt

12.2.2 Encode the text prompt

12.3 Iterative prediction of image tokens

12.3.1 Load the pretrained BART decoder

12.3.2 Predict image tokens using the BART decoder

12.4 Convert image tokens to high-resolution images

12.4.1 Load the pretrained VQGAN detokenizer

12.4.2 Visualize the intermediate and final high-resolution outputs

12.5 Summary