6 Multimodal models

 

This chapter covers

  • Introducing multimodal large language models
  • Embeddings for text, image, audio, and video
  • Example tasks for each modality
  • Building an end-to-end multimodal retrieval-augmented generation pipeline

Multimodal large language models (MLLMs) are systems that can process and reason over multiple types of input, such as text, images, or speech, by combining them into a shared representation. This enables them to answer questions, describe scenes, or take actions that depend on more than one kind of information. Instead of treating each modality in isolation, these models connect them so that features from one can inform the interpretation of another.
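To make the idea of a shared representation concrete, the following sketch embeds an image and two candidate captions with a CLIP-style model and compares them directly. This is a minimal illustration, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the file name cat.jpg is a placeholder for any local image.

# Minimal sketch: project an image and text into one shared embedding space
# (assumes the transformers library and the openai/clip-vit-base-patch32 checkpoint)
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                     # placeholder image file
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both modalities end up as vectors of the same size in the same space,
# so they can be compared with a plain similarity measure.
image_emb = outputs.image_embeds                  # shape (1, 512)
text_emb = outputs.text_embeds                    # shape (2, 512)
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(similarity)                                 # higher score = closer caption

Because image and text land in the same vector space, a text query can retrieve matching images (and vice versa) with nothing more than a similarity search, which is the basic mechanism behind the multimodal retrieval pipeline built later in this chapter.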

Bringing multiple streams of information together is both powerful and technically challenging. Each modality has its own structure—pixels, tokens, or waveforms—and aligning them requires careful design choices. Yet when integrated successfully, multimodal reasoning allows models to perform tasks that go far beyond the capabilities of text-only systems.

In the previous chapter, we focused on aligning large language models (LLMs) with human preferences and extending their knowledge through external text-based sources. Those methods still operated within a single modality: text. Multimodality extends this foundation, broadening the scope to richer and more diverse forms of input.

6.1 Getting started with multimodal models

6.2 Combining modalities from different domains

6.3 Modality-specific tokenization

6.3.1 Images and visual embeddings

6.3.2 Image analysis with an MLLM

6.3.3 From image patches to video cubes

6.3.4 Video information extraction

6.3.5 Audio embeddings

6.3.6 Audio-only pipeline: Extraction and inference

6.4 Multimodal RAG: From PDF to images, tables, and cross-model comparison

Summary