Chapter 16. Cloud Speech: audio-to-text conversion

This chapter covers

An overview of speech recognition
How the Cloud Speech API works
How Cloud Speech pricing is calculated
An example of generating automated captions from audio content

When we talk about speech recognition, we generally mean taking an audio stream (for example, an MP3 file of a book on tape) and turning it into text (in this case, back into the actual written book). This process sounds straightforward, but as you may know, language is a particularly tricky human construct. For instance, the psychological phenomenon called the McGurk effect changes what we hear based on what we see. In one classic example, the sound “ba” can be perceived as “fa” so long as we see someone’s mouth forming an “f” sound. As you might expect, an audio track alone is not always enough to completely understand what was said.

This confusion might seem odd given that we’ve gotten by with audio-only phone calls all these years. It turns out that there is a difference between hearing and listening. When you hear something, you’re taking sounds and turning them into words. When you listen, you’re taking sounds and combining them with your context and understanding, so you can fill in the blanks when some sounds are ambiguous. For example, if you heard someone say, “I drove the -ar back,” even if you missed the first consonant of that “ar” sound, you could use the context of “drove” to guess that the word was “car.”
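
To make the hearing-versus-listening distinction concrete, here is a toy sketch in Python. It is not how Cloud Speech or any real recognizer works; the candidate word list, the `CONTEXT_HINTS` table, and the `fill_in` function are all invented for illustration. The sketch only shows the idea that surrounding words can break a tie between equally plausible sounds.

```python
# Toy illustration only: the word list and context associations are invented.
CANDIDATES = ["bar", "car", "far", "jar", "tar"]

# Hypothetical, hand-written links between a context word and likely candidates.
CONTEXT_HINTS = {
    "drove": {"car"},
    "opened": {"bar", "jar"},
}


def fill_in(fragment, context_words):
    """Return candidates ending in `fragment`, best context match first."""
    matches = [word for word in CANDIDATES if word.endswith(fragment)]

    def score(word):
        # Count how many context words are associated with this candidate.
        return sum(word in CONTEXT_HINTS.get(c, set()) for c in context_words)

    # sorted() is stable, so candidates with equal scores keep their order.
    return sorted(matches, key=score, reverse=True)


heard = "I drove the -ar back".split()
context = [word.lower() for word in heard if not word.startswith("-")]

print(fill_in("ar", []))          # "hearing": every candidate ties
print(fill_in("ar", context)[0])  # "listening": context favors one word
```

With no context, every word ending in “ar” is equally likely, which is the position a recognizer is in when it has only the raw sound. Once “drove” is taken into account, “car” moves to the front of the list.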

This chapter covers

16.1. Simple speech recognition

16.2. Continuous speech recognition

16.3. Hinting with custom words and phrases

16.4. Understanding pricing

16.5. Case study: InstaSnap video captions

Summary