4 Add captions to images
This chapter covers
- Similarities between image-to-text generation and text-to-image generation
- Building a transformer from scratch to add captions to images
- Training an image-to-text transformer with image-caption pairs
- Adding captions to images with the trained image-to-text transformer
This chapter marks an important step in our journey to create models that bridge vision and language. While the ultimate goal is to build systems that generate images from text prompts (text-to-image), it is just as essential to master the reverse process: generating descriptive captions for images (image-to-text). Despite working in opposite directions, both tasks rely on the same core principle of learning deep relationships between two fundamentally different modalities: visual data and natural language.
Training a model to add captions to images is conceptually and practically more accessible than generating a new image from a text prompt. Captioning forces the model to understand and encode visual features and then map them coherently to linguistic tokens. This not only builds intuition for handling multimodal data but also lays the groundwork for more ambitious tasks, such as producing realistic images from text prompts, which require the model to reverse that mapping with added complexity.
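To make this encode-then-map idea concrete before we build anything, the sketch below shows one common shape such a model can take: a transformer encoder attends over image patch embeddings, and a transformer decoder generates caption tokens conditioned on them. This is a minimal illustration assuming PyTorch, not the implementation we develop in this chapter; the class name `TinyCaptioner` and all dimensions are placeholders.

```python
# A minimal sketch of an image-to-text transformer, assuming PyTorch.
# Illustrative only: every name and dimension here is a placeholder.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=10_000, d_model=256,
                 num_patches=196, max_len=64):
        super().__init__()
        # Vision side: project flattened 16x16 RGB patches into d_model-dim
        # embeddings the encoder can attend over, plus learned positions.
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)
        self.src_pos = nn.Parameter(torch.zeros(1, num_patches, d_model))
        # Language side: embed caption token IDs, with their own positions.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.tgt_pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, captions):
        # patches: (batch, num_patches, 16*16*3); captions: (batch, seq_len)
        src = self.patch_embed(patches) + self.src_pos
        seq_len = captions.size(1)
        tgt = self.token_embed(captions) + self.tgt_pos[:, :seq_len]
        # Causal mask so each caption position attends only to earlier tokens.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.lm_head(out)  # next-token logits: (batch, seq_len, vocab)

model = TinyCaptioner()
patches = torch.randn(2, 196, 16 * 16 * 3)    # two fake images as patch vectors
captions = torch.randint(0, 10_000, (2, 12))  # two fake 12-token captions
logits = model(patches, captions)             # torch.Size([2, 12, 10000])
```

In training, such a model is typically fit with cross-entropy between these logits and the caption shifted by one position (teacher forcing), which is how image-caption pairs drive learning in this kind of architecture.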