1 A tale of two models: transformers and diffusions


This chapter covers

  • The distinction between unimodal and multimodal models
  • How vision transformers use attention mechanisms from NLP to process images
  • The inner workings of diffusion models and how they generate images from noise
  • The challenges and limitations facing current text-to-image models

Generative artificial intelligence (generative AI) refers to a class of machine learning models designed to create new content that closely resembles real-world data, such as text, images, audio, and even video. Unlike traditional AI systems that merely classify, predict, or retrieve information, generative AI models are creative: they learn patterns from massive datasets and then generate entirely new outputs based on those patterns. For example, ChatGPT can write essays and code, while DALL-E and Stable Diffusion can produce images from written descriptions.
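
Later in the book we build such models from scratch, but it helps to see the end-user experience first. The following is a minimal sketch of text-to-image generation using the Hugging Face diffusers library with a pretrained Stable Diffusion checkpoint; the checkpoint name, prompt, and output filename are illustrative assumptions, not part of this book's code.

import torch
from diffusers import StableDiffusionPipeline

# Download a pretrained Stable Diffusion checkpoint (several GB on the
# first run). The checkpoint name is an example; other Stable Diffusion
# v1.x checkpoints on the Hugging Face Hub work the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # on a CPU-only machine, use "cpu" and torch.float32

# Turn a written description into an image.
prompt = "a watercolor painting of a lighthouse at sunrise"
image = pipe(prompt).images[0]  # a PIL.Image.Image
image.save("lighthouse.png")

Calling the pipeline runs the reverse diffusion process covered in section 1.3: starting from pure noise, the model iteratively denoises an image under the guidance of the text prompt.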

1.1 What are text-to-image generation models?

1.1.1 Unimodal versus multimodal models

1.1.2 Practical use cases of text-to-image models

1.2 Transformer-based text-to-image generation

1.2.1 Convert an image into a sequence of integers and then back

1.2.2 Train and use a transformer-based text-to-image model

1.3 Text-to-image generation with diffusion models

1.3.1 Forward and reverse diffusions

1.3.2 Latent diffusion models and Stable Diffusion

1.4 Build text-to-image models from scratch

1.5 Challenges faced by text-to-image generation models

1.5.1 The pink elephant problem

1.5.2 Is generative AI stealing from artists?

1.5.3 The geometric inconsistency problem

1.6 Social, environmental, and ethical concerns

1.7 Summary