8 Generating with voice and pictures
This chapter covers
- Transcribing audio to text
- Generating audio from text
- Images as prompt context
- Generating images
Throughout history, we humans have developed several different ways of communicating with each other. Perhaps the oldest form of human communication is voice-based, where people speak and listen to each other. Text-based communication has taken many forms, from early hieroglyphs and the Phoenicians' development of the alphabet to letters, emails, and SMS text messages. And sometimes an image can, indeed, paint a thousand words: works of art and photographs can communicate in ways that text and voice struggle to match.
Thus far, our project has focused on text-based interaction with the Board Game Buddy application. Questions about games are sent in as text, and the answers come back as more text. Since humans will ultimately be the ones interacting with Board Game Buddy, it makes sense to offer more human-style communication with the application.
In this chapter, we’re going to leverage Spring AI to break away from text-based interaction, enabling speech-based and image-based communication in our application, both as input and output. Let’s start by seeing how Spring AI can enable us to add voice to an application.