17 LSTMs and automatic speech recognition


This chapter covers

  • Preparing a dataset for automatic speech recognition using the LibriSpeech corpus
  • Training a long short-term memory (LSTM) RNN for converting speech to text
  • Evaluating the LSTM's performance during and after training

Talking to your electronic devices is commonplace nowadays. Years ago, on an early version of my smartphone, I tapped the microphone button and used its dictation function to try to speak an email into existence. The email my boss received was full of typos and phonetic errors, though, and he wondered whether I was mixing a little too much after-work activity with my official duties!

The world has evolved, and so has the accuracy of neural networks at automatic speech recognition (ASR): the process of transforming spoken audio into written text. Whether you are asking your phone's digital assistant to schedule a meeting, dictating that trusty email, or telling your smart device at home to order something, play background music, or even start your car, all of these tasks are powered by ASR.
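
To make the shape of that task concrete before we start, here is a minimal sketch (my own illustration, not the chapter's model; every size in it is an assumption) of the core mapping an LSTM-based ASR system learns: a sequence of audio feature frames goes in, and a score for every output character at every frame comes out.

import tensorflow as tf

# Toy dimensions, chosen only for illustration
batch, frames, n_features = 4, 100, 26   # e.g., 26 spectral features per audio frame
n_chars = 29                             # a-z, space, apostrophe, and a blank symbol

# Stand-in for real audio features extracted from recorded speech
audio_features = tf.random.normal([batch, frames, n_features])

# One LSTM hidden state per frame, then a score for each character at each frame
lstm_out = tf.keras.layers.LSTM(128, return_sequences=True)(audio_features)
char_logits = tf.keras.layers.Dense(n_chars)(lstm_out)

print(char_logits.shape)  # (4, 100, 29)

Turning those per-frame character scores into an actual transcript is the job of connectionist temporal classification (CTC), which section 17.2.4 covers; sections 17.2 and 17.3 build and train the full deep-speech version of this idea.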

17.1 Preparing the LibriSpeech corpus

17.1.1 Downloading, cleaning, and preparing LibriSpeech OpenSLR data

17.1.2 Converting the audio

17.1.3 Generating per-audio transcripts

17.1.4 Aggregating audio and transcripts

17.2 Using the deep-speech model

17.2.1 Preparing the input audio data for deep speech

17.2.2 Preparing the text transcripts as character-level numerical data

17.2.3 The deep-speech model in TensorFlow

17.2.4 Connectionist temporal classification in TensorFlow

17.3 Training and evaluating deep speech

Summary