chapter five

5 Building multimodal crews

This chapter covers

Vision-capable LLMs in CrewAI
Passing images to agents with input_files
Multimodal embeddings for similarity search
A three-agent product catalog crew

If you have ever listed a product on an e-commerce marketplace, you know the drill: take photos, open an editor, squint at the image, type out a title, write a description that sounds compelling but honest, pick keywords, figure out a price, and repeat. Multiply that by a hundred products and you have a full-time job that most teams would rather not do manually.

In this chapter, we'll build a crew that automates most of that work. You hand it a product photo, and it gives you back a complete listing with a title, description, SEO keywords, and a competitive analysis against your existing catalog. The crew has three agents: an image analyzer that looks at the photo and extracts product details, a description writer that turns those details into a polished listing, and a catalog analyst that searches the existing product catalog to find similar items and suggest pricing.
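
To make the handoff between the three agents concrete, here is a minimal plain-Python sketch of the data flow. It does not use CrewAI. Every function is a stand-in for an agent, and the field names, the toy three-number vectors, the sample catalog, and the averaged price suggestion are all assumptions made up for illustration, not the implementation we build later in the chapter.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ProductDetails:
    """What the image analyzer hands to the writer (hypothetical fields)."""
    category: str
    color: str
    material: str
    vector: list  # toy stand-in for an image embedding


@dataclass
class Listing:
    """What the description writer produces (hypothetical fields)."""
    title: str
    description: str
    keywords: list


def analyze_image(image_path: str) -> ProductDetails:
    # Stand-in for the image analyzer agent. A real agent would send the
    # photo to a vision-capable LLM; here we return fixed, made-up details.
    return ProductDetails("backpack", "navy", "canvas", [1.0, 0.0, 0.0])


def write_listing(details: ProductDetails) -> Listing:
    # Stand-in for the description writer agent: turn details into listing text.
    words = [details.color, details.material, details.category]
    title = " ".join(word.title() for word in words)
    description = f"A {details.color} {details.category} made from {details.material}."
    return Listing(title, description, words)


def cosine(a, b) -> float:
    # Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(details: ProductDetails, catalog: list, top_k: int = 2) -> list:
    # Stand-in for the catalog analyst agent: rank catalog items by similarity.
    ranked = sorted(catalog, key=lambda item: cosine(details.vector, item["vector"]),
                    reverse=True)
    return ranked[:top_k]


def build_listing(image_path: str, catalog: list):
    # Run the three stages in order, passing each result to the next stage.
    details = analyze_image(image_path)
    listing = write_listing(details)
    similar = find_similar(details, catalog)
    # Assumption: suggest the average price of the most similar items.
    suggested_price = sum(item["price"] for item in similar) / len(similar)
    return listing, similar, suggested_price


# Made-up catalog with toy vectors; real embeddings have hundreds of dimensions.
CATALOG = [
    {"name": "Ceramic Mug", "vector": [0.0, 0.1, 0.9], "price": 12.0},
    {"name": "Leather Satchel", "vector": [0.5, 0.5, 0.0], "price": 120.0},
    {"name": "Canvas Daypack", "vector": [0.9, 0.1, 0.0], "price": 40.0},
]
```

Calling `build_listing("backpack.jpg", CATALOG)` returns a listing titled "Navy Canvas Backpack", the two closest catalog items (the daypack, then the satchel), and a suggested price of 80.0. The point of the sketch is the shape of the pipeline: each stage consumes the previous stage's structured output, which is the same pattern the crew follows once real agents replace the stubs.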

The agents we're building here work directly with images: they look at product photos and reason about what they see, much like a person would. This capability is known as multimodality: the ability of a model to work across different types of input, such as text and images, within the same system.

5.1 Working with product images

5.1.1 Setting up the project

5.1.2 The sample product catalog

5.2 The image analyzer agent

5.3 The description writer agent

5.3.1 Assembling the first two agents

5.4 The catalog analyst agent and multimodal embeddings

5.4.1 Understanding multimodal embeddings

5.4.2 Building the product catalog index

5.4.3 The CatalogSimilarityTool

5.4.4 Completing the crew

5.5 From prototype to production

5.6 Summary