This chapter covers
- Why image-to-text captioning is a stepping stone toward text-to-image generation
- How a captioning model encodes visual features and maps them to linguistic tokens
In the previous chapter, we took our first steps in connecting vision and language by learning how models can align these two very different modalities. Now we’ll build on that foundation. While our ultimate goal is text-to-image generation, we’ll first master the reverse process, image-to-text captioning, because both directions rely on the same underlying principle of learning deep cross-modal relationships.
Training a model to caption images is conceptually and practically more accessible than generating a new image from a text prompt. Captioning forces the model to understand and encode visual features and then map them coherently to linguistic tokens. This not only builds intuition for handling multimodal data but also lays the groundwork for more ambitious tasks, such as producing realistic images from text prompts, which require the model to run this mapping in reverse, with added complexity.
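To make the encode-then-map idea concrete, here is a minimal, hypothetical PyTorch sketch (not the chapter's actual model): flattened image patches are projected into visual features, and a small transformer decoder attends to those features while predicting the next caption token. All names and dimensions (TinyCaptioner, d_model, the patch size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Hypothetical minimal captioner: encode image patches, then decode tokens."""

    def __init__(self, vocab_size=10000, d_model=256, patch_dim=16 * 16 * 3):
        super().__init__()
        # Vision side: project flattened image patches into feature vectors.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        # Language side: embed previously generated caption tokens.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # The decoder attends to the visual features while predicting the next token.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, tokens):
        memory = self.patch_embed(patches)   # (batch, n_patches, d_model)
        tgt = self.token_embed(tokens)       # (batch, seq_len, d_model)
        # Causal mask so each position only sees earlier caption tokens.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)             # next-token logits over the vocabulary

# Toy forward pass with random data, just to show the shapes involved.
model = TinyCaptioner()
patches = torch.randn(2, 49, 16 * 16 * 3)       # 2 images, 49 flattened patches each
tokens = torch.randint(0, 10000, (2, 12))       # 2 partial captions, 12 tokens each
logits = model(patches, tokens)
print(logits.shape)                             # torch.Size([2, 12, 10000])
```

A real captioning model would typically start from a pretrained vision encoder (such as the aligned encoder from the previous chapter) and train the decoder on image-caption pairs, but the overall shape of the computation, visual features in, caption tokens out, is the same.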