1 A tale of two models: transformers and diffusions
This chapter covers
- The distinction between unimodal and multimodal models
- How vision transformers use attention mechanisms from NLP to process images
- The inner workings of diffusion models and how they generate images from noise
- The challenges and limitations facing current text-to-image models
Generative artificial intelligence (generative AI) refers to a class of machine learning models designed to create new content, such as text, images, audio, or video, that closely resembles real-world data. Unlike traditional AI systems that classify, predict, or retrieve information, generative AI models produce content that did not exist before: they learn patterns from massive datasets and then generate new outputs that follow those patterns. For example, ChatGPT can write essays and code, while DALL-E and Stable Diffusion can produce images from written descriptions.