17 LSTMs and automatic speech recognition

This chapter covers

  • Preparing a train, test, and evaluation dataset for automatic speech recognition using the LibriSpeech corpus
  • Training and building a long short-term memory (LSTM) recurrent neural network (RNN) for converting speech to text
  • Evaluating the LSTM performance during and after training

Speaking to your electronic devices is commonplace nowadays, but it wasn't always that way. Years ago, on an early version of my smartphone, I recall clicking the microphone button and using its dictation function to try to speak an email into existence. Let's just say the email my boss received at work had so many typos and phonetic errors that he wondered whether I was mixing a little too much after-work activity with my official duties!

The world has evolved, and so has the ability of neural networks to perform automatic speech recognition (ASR): the process of transforming spoken audio into written text. Think about it: whether you are asking your phone's intelligent digital assistant to schedule a meeting, dictating that trusty email, asking your smart device at home to order something off the web, playing background music, or even starting your car, it's all powered by ASR functionality!

17.1  Preparing the LibriSpeech corpus

17.1.1    Downloading, cleaning, and preparing LibriSpeech OpenSLR data

17.1.2    Converting the audio

17.1.3    Generating per audio transcripts

17.1.4    Aggregating audio and transcripts

17.2  The deep-speech model

17.2.1    Preparing the input audio data for deep speech

17.2.2    Preparing the text transcripts as character-level numerical data

17.2.3    The deep-speech model in TensorFlow

17.2.4    Connectionist temporal classification in TensorFlow

17.3  Training deep speech and evaluating it

17.4  Summary