4 Working with Multimodal Foundational Models

 

This chapter covers

  • Overview of multimodal foundational models
  • Best practices for creating prompts for multimodal models
  • Enhancing context through multimodal foundational models
  • Working with Amazon SageMaker JumpStart
  • Evaluating multimodal foundational models

Amazon Bedrock extends the AI landscape with its support for multimodal foundational models. These models enable systems to process and understand several types of data at once: they can analyze, interpret, and respond to inputs that combine text, images, audio, and video, giving them a more holistic view than text-only models can achieve. That capability is essential wherever data types interact in complex ways, for example in a virtual assistant that interprets a user's spoken instructions and emotional tone alongside the visual context of an image or a live video feed.
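
To make this concrete, the short sketch below shows one way to send a combined image-and-text request to a multimodal model through the Amazon Bedrock Converse API. Treat it as a minimal sketch rather than a reference implementation: the file name, model ID, and inference settings are placeholders, and it assumes boto3 is installed and configured with access to Bedrock in your region.

import boto3

# Create a Bedrock Runtime client (assumes AWS credentials and Bedrock access
# are already configured; the region is an example).
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Read a local image to include alongside the text prompt ("poster.png" is a
# placeholder file name).
with open("poster.png", "rb") as f:
    image_bytes = f.read()

# Send one request that combines an image block and a text block.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example multimodal model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": "Describe this image and summarize any text that appears in it."},
            ],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

# The model's reply comes back as a list of content blocks.
print(response["output"]["message"]["content"][0]["text"])

The Converse API gives a consistent request and response shape across Bedrock models, so the same pattern applies when you swap in a different multimodal model.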

4.1 Overview of Multimodal Foundational Models

4.2 Best Practices for Creating Prompts for Multimodal Foundational Models

4.2.1 Key Differences Between Text and Multimodal Foundational Models

4.2.2 Best Practices in Working with Multimodal Foundational Models

4.2.3 Actionable Tips for Applying Model Architecture and Training Data in Prompts

4.3 Enhancing Context with Multimodal Foundational Models

4.3.1 Image to Image

4.3.2 Image Inpainting

4.3.3 Image Outpainting

4.3.4 Visual Question Answering

4.3.5 Image Captioning

4.4 Working with Amazon SageMaker JumpStart

4.4.1 Preparation

4.4.2 Configuration of Permissions and Variables

4.4.3 Model Retrieval and Endpoint Deployment

4.4.4 Endpoint Interaction and Response Handling

4.5 Practical Exercise: Creating a Movie Recognizer

4.5.1 Introduction

4.5.2 Data

4.5.3 Embeddings

4.5.4 Search Index

4.5.5 Retrieving Results

4.5.6 Demo Application

4.6 Summary