6 Analyzing images and videos


This chapter covers

  • Analyzing images
  • Comparing images
  • Analyzing videos

In the previous chapters, we saw how to analyze text and structured data. Does that cover everything? Not even close! By far the largest portion of data out there comes in the form of images and videos: video alone is estimated to account for roughly two-thirds of all data exchanged over the internet. In this chapter, we will see how language models can help us extract useful insights from these data types as well.

The following sections introduce a couple of small projects that process image and video data. GPT-4o is a natively multimodal model, so we can use it for all of these tasks. First, we will use GPT-4o to answer free-form questions, posed in natural language, about images. Second, we will build a picture-tagging application that automatically tags our holiday pictures with the people who appear in them.
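To give a first taste of what multimodal input looks like, here is a minimal sketch of the message structure the OpenAI chat completions API expects when a user turn combines text with an image. The question and image URL are placeholders chosen for illustration; actually sending the request requires an API key, so the sketch only builds the payload and shows the call in comments.

```python
# A multimodal chat message for GPT-4o: one user turn combining a
# natural-language question with an image URL. The question and URL
# below are illustrative placeholders, not values from the book.
question = "How many people are in this picture?"
image_url = "https://example.com/holiday-photo.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

# With the openai package installed and OPENAI_API_KEY set, the payload
# would be sent like this (not executed here):
#
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
# print(response.choices[0].message.content)
```

The key point is that the `content` field of a message is not limited to a single string: it can be a list mixing text parts and image parts, which is what makes GPT-4o usable for the image and video projects in this chapter.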

Finally, we will use GPT-4o to automatically generate titles for video files. The goal of these mini-projects is to illustrate the visual data processing features offered by the latest generation of large language models. After working through them, you should be able to build your own applications for image and video processing in a variety of scenarios.

6.1 Setup

6.2 Answering questions about images

6.2.1 Specifying multimodal input

6.2.2 Code discussion

6.2.3 Trying it out

6.3 Tagging people in images

6.3.1 Overview

6.3.2 Encoding locally stored images

6.3.3 Sending locally stored images to OpenAI

6.3.4 The end-to-end implementation

6.3.5 Trying it out

6.4 Generating titles for videos

6.4.1 Overview

6.4.2 Encoding video frames

6.4.3 The end-to-end implementation

6.4.4 Trying it out