Chapter 16. Cloud Speech: audio-to-text conversion

This chapter covers

  • An overview of speech recognition
  • How the Cloud Speech API works
  • How Cloud Speech pricing is calculated
  • An example of generating automated captions from audio content

When we talk about speech recognition, we generally mean taking an audio stream (for example, an MP3 file of a book on tape) and turning it into text (in this case, back into the actual written book). This process sounds straightforward, but as you may know, language is a particularly tricky human construct. For instance, the psychological phenomenon called the McGurk effect changes what we hear based on what we see. In one classic example, the sound “ba” can be perceived as “fa” when we watch someone’s mouth form an “f” sound. In other words, an audio track alone is not always enough to completely understand what was said.

This limitation might seem strange given that we’ve managed with audio-only phone calls all these years. It turns out that there is a difference between hearing and listening. When you hear something, you’re taking sounds and turning them into words. When you listen, you’re taking sounds and combining them with your context and understanding, so you can fill in the blanks when some sounds are ambiguous. For example, if you heard someone say, “I drove the -ar back,” even if you missed the first consonant of that “ar” sound, you could use the context of “drove” to guess that the word was “car.”

16.1. Simple speech recognition

16.2. Continuous speech recognition

16.3. Hinting with custom words and phrases

16.4. Understanding pricing

16.5. Case study: InstaSnap video captions

Summary