4 Working with Multimodal Foundational Models

 

This chapter covers

  • Overview of multimodal foundational models
  • Best practices for creating prompts for multimodal models
  • Enhancing context through multimodal foundational models
  • Working with Amazon SageMaker JumpStart
  • Evaluating multimodal foundational models

Amazon Bedrock extends the AI landscape with its support for multimodal foundational models. These models enable systems to process and understand several types of data at once: they can analyze, interpret, and respond to inputs that combine text, images, audio, and video, giving them a more holistic view than text-only models can achieve. That capability is essential wherever data types interact in complex ways, for example in a virtual assistant that interprets a user's spoken instructions and emotional tone alongside the visual context of an image or a live video feed.
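
To make this concrete, the short sketch below shows one way to send a combined image-and-text request to a multimodal model through the Amazon Bedrock Converse API. Treat it as a minimal sketch rather than a reference implementation: the file name, model ID, and inference settings are placeholders, and it assumes boto3 is installed and configured with access to Bedrock in your region.

import boto3

# Create a Bedrock Runtime client (assumes AWS credentials and Bedrock access
# are already configured; the region is an example).
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Read a local image to include alongside the text prompt ("poster.png" is a
# placeholder file name).
with open("poster.png", "rb") as f:
    image_bytes = f.read()

# Send one request that combines an image block and a text block.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example multimodal model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": "Describe this image and summarize any text that appears in it."},
            ],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

# The model's reply comes back as a list of content blocks.
print(response["output"]["message"]["content"][0]["text"])

The Converse API gives a consistent request and response shape across Bedrock models, so the same pattern applies when you swap in a different multimodal model.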

4.1 Overview of Multimodal Foundational Models

4.2 Best Practices for Creating Prompts for Multimodal Foundational Models

4.2.1 Key Differences Between Text and Multimodal Foundational Models

4.2.2 Best Practices in Working with Multimodal Foundational Models

4.2.3 Actionable Tips for Applying Model Architecture and Training Data in Prompts

4.3 Enhancing Context with Multimodal Foundational Models

4.3.1 Image to Image

4.3.2 Image Inpainting

4.3.3 Image Outpainting

4.3.4 Visual Question Answering

4.3.5 Image Captioning

4.4 Working with Amazon SageMaker JumpStart

4.4.1 Preparation

4.4.2 Configuration of Permissions and Variables

4.4.3 Model Retrieval and Endpoint Deployment

4.4.4 Endpoint Interaction and Response Handling

4.5 Practical Exercise: Creating a Movie Recognizer

4.5.1 Introduction

4.5.2 Data

4.5.3 Embeddings

4.5.4 Search Index

4.5.5 Retrieving Results

4.5.6 Demo Application

4.6 Summary