10 Creating robust coverage for speech-to-text resolution


This chapter covers

  • In-depth understanding of speech-to-text components and how ASR works
  • How to create robust grammars that avoid the biggest pitfalls
  • How modifying grammar coverage helps users succeed

Today’s automatic speech recognition (ASR) engines are better than ever, but they’re still not perfect. If you use dictation on your mobile phone, you’ve seen the mistakes that can happen. Correctly turning someone’s spoken utterance into text is a genuinely hard problem.

Adding dialog makes it even harder. Anything that goes wrong early in the recognition process is amplified in each later step, producing odd results that we human listeners avoid thanks to robust cognitive processing that today’s computer implementations don’t yet have. If sounds aren’t recognized, users’ words are misinterpreted, and the speech-to-text (STT) result is wrong. If STT produces incorrect or unexpected words and phrases, natural language (NL) processing and intent evaluation won’t have what they need to succeed.
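To see why an early STT error cascades, consider a minimal sketch in Python. The function names and the tiny keyword "grammar" here are purely illustrative assumptions for this example, not part of any real ASR or NL library: a single substituted word in the transcript is enough to leave intent evaluation with nothing to match.

```python
def classify_intent(transcript):
    """Toy NL step: map keywords in an STT transcript to an intent.

    This stands in for the intent-evaluation stage described above;
    the keywords and intent names are hypothetical.
    """
    words = transcript.lower().split()
    if "balance" in words:
        return "check_balance"
    if "transfer" in words:
        return "transfer_funds"
    return "unknown"

# Correct recognition yields the right intent.
print(classify_intent("what's my balance"))   # check_balance

# One STT substitution ("ballast" for "balance") and the NL step
# has nothing to work with, so intent evaluation fails.
print(classify_intent("what's my ballast"))   # unknown
```

The point of the sketch is only that the NL stage consumes whatever text STT emits; it has no access to the original audio, so a recognition error upstream cannot be repaired downstream.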

10.1  Recognition is speech-to-text interpretation

10.2  Inside the STT box

10.3  Recognition engines

10.4  Grammar concepts

10.4.1    Coverage

10.4.2    Recognition space

10.4.3    Static or dynamic, large or small

10.4.4    End-pointing

10.4.5    Multiple hypotheses

10.5  Types of grammars

10.5.1    Rule-based grammars

10.5.2    Statistical models

10.5.3    Hot words

10.5.4    Wake words and invocation names

10.6  Working with grammars

10.6.1    Writing rule-based regular expressions

10.7  How to succeed with grammars

10.7.1    Bootstrap

10.7.2    Normalize punctuation and spellings

10.7.3    Handle unusual pronunciations

10.7.4    Use domain knowledge

10.7.5    Understand the strengths and limitations of STT

10.8  A simple example

10.8.1    Sample phrases in Dialogflow

10.8.2    Regular expressions in the webhook

10.9  Limitations on grammar creation and use

10.10   What’s next

10.11   Summary