5 Building multimodal crews
This chapter covers
- Vision-capable LLMs in CrewAI
- Passing images to agents with input_files
- Multimodal embeddings for similarity search
- A three-agent product catalog crew
If you have ever listed a product on an e-commerce marketplace, you know the drill: take photos, open an editor, squint at the image, type out a title, write a description that sounds compelling but honest, pick keywords, figure out a price, and repeat. Multiply that by a hundred products and you have a full-time job that most teams would rather not do manually.
In this chapter, we'll build a crew that automates most of that work. You hand it a product photo, and it gives you back a complete listing with a title, description, SEO keywords, and a competitive analysis against your existing catalog. The crew has three agents: an image analyzer that looks at the photo and extracts product details, a description writer that turns those details into a polished listing, and a catalog analyst that searches the existing product catalog to find similar items and suggest pricing.
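Before any CrewAI code appears, it can help to see the handoff between the three roles as plain data. The following is a minimal sketch in standard-library Python, not CrewAI's API. Every class, function, and field name is hypothetical and chosen only for illustration. The image analyzer is a stub that returns canned details instead of calling a vision model, and the catalog analyst matches on category rather than using multimodal embeddings.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ProductDetails:
    """What the image analyzer extracts from a photo (hypothetical fields)."""
    name: str
    category: str
    features: list[str]


@dataclass
class Listing:
    """What the description writer produces."""
    title: str
    description: str
    keywords: list[str]


@dataclass
class CatalogReport:
    """What the catalog analyst produces."""
    similar_items: list[str]
    suggested_price: Optional[float]


def analyze_image(image_path: str) -> ProductDetails:
    # Stub: a real image analyzer would send the photo at image_path to a
    # vision-capable LLM. Here we ignore the file and return canned details.
    return ProductDetails(
        name="ceramic mug", category="kitchen", features=["white", "350 ml"]
    )


def write_listing(details: ProductDetails) -> Listing:
    # Turn the extracted details into a title, description, and keywords.
    title = f"{details.name.title()} ({', '.join(details.features)})"
    description = f"A {details.category} item: {details.name}."
    keywords = [details.category, details.name, *details.features]
    return Listing(title, description, keywords)


def analyze_catalog(details: ProductDetails, catalog: list[dict]) -> CatalogReport:
    # Simplification: "similar" means "same category". Suggest the average
    # price of the similar items, or None when there are none.
    similar = [item for item in catalog if item["category"] == details.category]
    prices = [item["price"] for item in similar]
    return CatalogReport(
        similar_items=[item["name"] for item in similar],
        suggested_price=mean(prices) if prices else None,
    )


def run_pipeline(image_path: str, catalog: list[dict]) -> tuple[Listing, CatalogReport]:
    # The image analyzer runs first; both other roles consume its output.
    details = analyze_image(image_path)
    return write_listing(details), analyze_catalog(details, catalog)
```

The point of the sketch is the shape of the flow: one photo goes in, structured product details come out of the first step, and the writer and the analyst each build on those details.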
The agents we're building here work directly with images—looking at product photos and reasoning about what they see, much like a person would. This capability is known as multimodality: the ability of a model to work across different types of input, such as text and images, within the same system.