7 Analyzing audio data

This chapter covers

Transcribing audio data
Translating audio data
Generating speech

Watch any credible science fiction TV show or movie, and you won’t see people typing to interact with their computers! Whether it’s Star Trek or 2001: A Space Odyssey (both released in the 1960s), people speak to (not type into) their machines. And there are good reasons for that! For most users, voice is the most natural form of communication (because it’s the first one they learn). No wonder people imagined speaking with computers long before it was technically feasible.

Reality has now caught up with science fiction, and voice assistants, including the likes of Amazon’s Alexa, Google Assistant, and Microsoft’s Cortana (among many others), are ubiquitous. The newest generation of speech recognition (and speech generation) models has reached near-human levels of proficiency. And voice-based interaction with computers is, of course, only one use case for this amazing technology.

7.1 Preliminaries

7.2 Transcribing audio files

7.2.1 Transcribing speech

7.2.2 End-to-end code

7.2.3 Trying it out

7.3 Querying relational data via voice

7.3.1 Preliminaries

7.3.2 Overview

7.3.3 Recording audio

7.3.4 End-to-end code

7.3.5 Trying it out

7.4 Speech-to-speech translation

7.4.1 Overview

7.4.2 Generating speech

7.4.3 End-to-end code

7.4.4 Trying it out

Summary