9 Bridging Language and Vision with Transformers

 

This chapter covers

  • The Transformer architecture and its impact on natural language processing
  • How the Vision Transformer (ViT) adapts the Transformer architecture for image understanding
  • How CLIP links text and images in a shared representation space
  • How these advances paved the way for generating images from text

Throughout this book, we have focused primarily on generating images from random noise, sometimes conditioned on class labels or input images. However, one of the most exciting and rapidly developing areas of generative AI is the ability to integrate different modalities of data, particularly language and vision. Multimodal models are designed to understand and generate content that draws on both visual and textual information.

Why is this important? Consider the limitations of generating images from random noise alone. We can steer some aspects of the output with class labels or input images, but we lack fine-grained control over the content of the generated image; we can’t easily tell the model exactly what we want. Language offers a powerful way to provide precise instructions and descriptions, enabling the creation of images that match our intent.

9.1 Introduction to Multimodal Modeling and Transformers

9.1.1 Understanding Transformers

9.1.2 The Transformer Architecture: An Overview

9.2 Inside the Transformer: Key Components and Mechanisms

9.2.1 Word Embeddings

9.2.2 Positional Encodings

9.2.3 Understanding Attention Mechanisms

9.2.4 Types of Attention Mechanisms

9.2.5 Multi-Head Attention: Multiple Perspectives on the Same Text

9.3 The Complete Transformer Architecture: Putting It All Together

9.3.1 Input Processing

9.3.2 The Encoder Stack

9.3.3 The Decoder Stack

9.3.4 Final Output Layer

9.3.5 Auto-Regressive Generation

9.3.6 Evolution of Transformer Models

9.4 From NLP to Vision: The Vision Transformer (ViT)

9.4.1 Key Differences from Traditional Transformers

9.4.2 The ViT Model Architecture

9.4.3 Comparing ViTs and CNNs

9.5 CLIP: Bridging Vision and Language

9.5.1 Connecting Words and Images

9.5.2 The CLIP Approach: Learning from Internet-Scale Data

9.5.3 CLIP Architecture

9.5.4 CLIP Training Process

9.5.5 The Power of CLIP’s Representations

9.5.6 From CLIP to Text-to-Image Generation

9.6 Conclusion

9.7 Summary