
1 A tale of two models: transformers and diffusion models


This chapter covers

  • What text-to-image generation models are
  • Unimodal versus multimodal models
  • Two approaches to text-to-image generation: transformers and diffusion models
  • Challenges and limitations of text-to-image generation models

Generative AI is evolving rapidly, reshaping how we live and work. Text-to-image models in particular have drawn significant attention for their ability to translate natural language into visually rich, meaningful images. Models such as OpenAI’s DALL-E series, Google’s Imagen, and Stability AI’s Stable Diffusion have made remarkable advances, turning abstract text descriptions into detailed, highly creative images.

1.1 What are text-to-image generation models?

1.1.1 Unimodal versus multimodal models

1.1.2 Practical use cases of text-to-image models

1.2 Transformer-based text-to-image generation

1.2.1 Convert an image into a sequence of integers and back

1.2.2 Train and use a transformer-based text-to-image model

1.3 Text-to-image generation with diffusion models

1.3.1 Forward and reverse diffusion processes

1.3.2 Latent diffusion models and Stable Diffusion

1.4 Build text-to-image models from scratch

1.5 Challenges faced by text-to-image generation models

1.5.1 The pink elephant problem

1.5.2 Stealing from artists?

1.5.3 The geometric inconsistency problem

1.6 Social, environmental, and ethical concerns

1.7 Summary