10 Creating robust coverage for speech-to-text resolution

 

This chapter covers

  • In-depth understanding of speech-to-text (STT) components and how automatic speech recognition (ASR) works
  • How to create robust grammars that avoid the biggest pitfalls

Today’s automatic speech recognition (ASR) engines are better than ever, but they’re still not perfect. If you’ve ever used dictation on your mobile phone, you’ve seen the mistakes that can happen. Correctly turning someone’s spoken utterance into a text representation is very hard. Anything that goes wrong early in the recognition process is amplified in each later step, producing odd results that human listeners avoid thanks to aspects of cognition that statistically based computer implementations lack. If the ASR has difficulty, user utterances will be misrecognized. If words aren’t recognized correctly, the speech-to-text (STT) output will be wrong. And if STT produces misrecognized or unexpected words and phrases, the natural language (NL) processing and intent evaluation won’t have what they need to succeed.

10.1  Recognition is speech-to-text resolution

10.2  Inside the STT box

10.3  Recognition engines

10.4  Grammar concepts

10.4.1    Coverage

10.4.2    Recognition space

10.4.3    Static or dynamic, large or small

10.4.4    End-pointing

10.4.5    Multiple hypotheses

10.5  Types of grammars

10.5.1    Rule-based grammars

10.5.2    Statistical models

10.5.3    Hot words

10.5.4    Wake words and invocation names

10.6  Working with grammars

10.6.1    Writing regular expressions

10.7  How to succeed with grammars

10.7.1    Bootstrap

10.7.2    Normalize punctuation and spellings

10.7.3    Handle unusual pronunciations

10.7.4    Use dictionaries and domain knowledge

10.7.5    Understand the strengths and limitations of STT

10.8  Limitations on grammar creation and use

10.9  Summary