6 Multimodal models
This chapter covers
- Introduction to multimodal LLMs
- Embeddings for text, image, audio, and video
- Example tasks for each modality
- Building an end-to-end multimodal RAG pipeline
Multimodal large language models (MLLMs) are systems that can process and reason over multiple types of input, such as text, images, or speech, by combining them into a shared representation. This enables them to answer questions, describe scenes, or take actions that depend on more than one kind of information. Instead of treating each modality in isolation, these models connect them so that features from one can inform the interpretation of another.
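To make the idea of a shared representation concrete, here is a minimal sketch using a CLIP-style dual encoder from the Hugging Face transformers library: an image and two candidate captions are embedded into the same vector space and compared by cosine similarity. The model name is the public openai/clip-vit-base-patch32 checkpoint; the file cat.jpg is a placeholder path, not part of the original text.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A CLIP-style dual encoder: one tower for images, one for text,
# trained so that matching image-text pairs land close together
# in a single shared embedding space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path; any local image works
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize, then compare: because both embeddings live in the same space,
# cosine similarity tells us which caption best matches the image.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(similarity)  # higher score for the caption that describes the image
```

The same pattern, a modality-specific encoder projecting into a common space, extends to audio and video, which is the foundation the rest of this chapter builds on.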
Bringing multiple streams of information together is both powerful and technically challenging. Each modality has its own structure — pixels, tokens, or waveforms — and aligning them requires careful design choices. Yet when integrated successfully, multimodal reasoning allows models to perform tasks that go far beyond the capabilities of text-only systems.
In the previous chapter, we focused on aligning LLMs with human preferences and extending their knowledge through external text-based sources. Those methods still operated within a single modality: text. Multimodality extends this foundation, broadening the scope to richer and more diverse forms of input.