
1 A tale of two models: Transformers and diffusions


This chapter covers

  • The distinction between unimodal and multimodal models
  • How vision transformers use attention mechanisms from natural language processing to process images
  • The inner workings of diffusion models and how they generate images from noise
  • The challenges and limitations facing current text-to-image models

Generative artificial intelligence (generative AI) refers to a class of machine learning models designed to create new content—text, images, audio, or even video—that closely resembles real-world data. Unlike traditional AI systems that merely classify, predict, or retrieve information, generative AI models are creative: they “learn” patterns from massive datasets and then generate entirely new outputs based on those patterns. For example, ChatGPT can write essays and code, while DALL-E and Stable Diffusion can produce images from written descriptions.
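To make the idea of text-to-image generation concrete before we dig into the mechanics, here is a minimal sketch that generates an image from a written description using the Hugging Face diffusers library. This library is an assumption for illustration only, not the from-scratch approach this book develops; the model identifier and prompt are likewise illustrative.

```python
# A minimal, illustrative sketch of text-to-image generation with a
# pretrained Stable Diffusion pipeline (assumes the diffusers and torch
# packages are installed and a CUDA-capable GPU is available).
import torch
from diffusers import StableDiffusionPipeline

# The model identifier below is an example checkpoint, not the model
# built in this book.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Turn a written description into an image.
prompt = "a watercolor painting of a fox in a misty forest"
image = pipe(prompt).images[0]
image.save("fox.png")
```

A few lines of code like these hide a great deal of machinery: a text encoder that turns the prompt into numbers, and a generative model that turns noise into pixels guided by that encoding. The rest of this chapter, and the book, unpacks how that machinery works.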

1.1 What is a text-to-image generation model?

1.1.1 Unimodal vs. multimodal models

1.1.2 Practical use cases of text-to-image models

1.2 Transformer-based text-to-image generation

1.2.1 Converting an image into a sequence of integers and then back

1.2.2 Training and using a transformer-based text-to-image model

1.3 Text-to-image generation with diffusion models

1.3.1 Forward and reverse diffusions

1.3.2 Latent diffusion models and Stable Diffusion

1.4 How to build text-to-image models from scratch

1.5 Challenges for text-to-image generation models

1.5.1 Are generative AI models stealing from artists?

1.5.2 The geometric inconsistency problem

1.6 Social, environmental, and ethical concerns

Summary