
1 A tale of two models: Transformers and diffusions


This chapter covers

  • The distinction between unimodal and multimodal models
  • How vision transformers use attention mechanisms from natural language processing to process images
  • The inner workings of diffusion models and how they generate images from noise
  • The challenges and limitations facing current text-to-image models

Generative artificial intelligence (generative AI) refers to a class of machine learning models designed to create new content—text, images, audio, or even video—that closely resembles real-world data. Unlike traditional AI systems that merely classify, predict, or retrieve information, generative AI models are creative: they “learn” patterns from massive datasets and then generate entirely new outputs based on those patterns. For example, ChatGPT can write essays and code, while DALL-E and Stable Diffusion can produce images from written descriptions.
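To make the idea of text-to-image generation concrete before we dig into the mechanics, here is a minimal sketch that generates an image from a written description using the Hugging Face diffusers library. This library is an assumption for illustration only, not the from-scratch approach this book develops; the model identifier and prompt are likewise illustrative.

```python
# A minimal, illustrative sketch of text-to-image generation with a
# pretrained Stable Diffusion pipeline (assumes the diffusers and torch
# packages are installed and a CUDA-capable GPU is available).
import torch
from diffusers import StableDiffusionPipeline

# The model identifier below is an example checkpoint, not the model
# built in this book.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Turn a written description into an image.
prompt = "a watercolor painting of a fox in a misty forest"
image = pipe(prompt).images[0]
image.save("fox.png")
```

A few lines of code like these hide a great deal of machinery: a text encoder that turns the prompt into numbers, and a generative model that turns noise into pixels guided by that encoding. The rest of this chapter, and the book, unpacks how that machinery works.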

1.1 What is a text-to-image generation model?

1.1.1 Unimodal vs. multimodal models

1.1.2 Practical use cases of text-to-image models

1.2 Transformer-based text-to-image generation

1.2.1 Converting an image into a sequence of integers and then back

1.2.2 Training and using a transformer-based text-to-image model

1.3 Text-to-image generation with diffusion models

1.3.1 Forward and reverse diffusions

1.3.2 Latent diffusion models and Stable Diffusion

1.4 How to build text-to-image models from scratch

1.5 Challenges for text-to-image generation models

1.5.1 Are generative AI models stealing from artists?

1.5.2 The geometric inconsistency problem

1.6 Social, environmental, and ethical concerns

Summary