10 Creating robust coverage for speech-to-text resolution
This chapter covers
- In-depth understanding of speech-to-text components and how ASR works
- How to create robust grammars that avoid the biggest pitfalls
Today’s automatic speech recognition (ASR) engines are better than ever, but they’re still not perfect. If you’ve ever used dictation on your mobile phone, you’ve seen the mistakes that can happen. Correctly turning a spoken utterance into a text representation is very hard. Anything that goes wrong early in the recognition process is amplified in each later step, producing odd results that human listeners avoid thanks to aspects of cognition that statistically based computer implementations lack. If the ASR engine has difficulty, words in the user’s utterance are misrecognized, and the speech-to-text (STT) output is wrong. If the STT output contains misrecognized or unexpected words and phrases, natural language (NL) processing and intent evaluation won’t have what they need to succeed.
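To make the cascade concrete, here is a minimal sketch of how one misrecognized word can derail intent evaluation. It is a toy illustration, not a real ASR or NL engine: the intent names, the keyword sets, and the `evaluate_intent` function are all hypothetical, and real systems use far more sophisticated matching.

```python
# Toy illustration only: the intents and keywords below are hypothetical,
# and this keyword matcher stands in for a real NL/intent component.
INTENT_KEYWORDS = {
    "pay_bill": {"pay", "bill"},
    "check_balance": {"check", "balance"},
}


def evaluate_intent(transcript: str) -> str:
    """Return the first intent whose keywords all appear in the transcript."""
    words = set(transcript.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords <= words:  # every keyword is present in the transcript
            return intent
    return "no_match"


# The same spoken request, once transcribed correctly and once with a
# single word misrecognized ("bill" heard as "bell").
print(evaluate_intent("I want to pay my bill"))
print(evaluate_intent("I want to pay my bell"))
```

The first call finds the `pay_bill` intent, while the second falls through to `no_match`: the downstream step has no way to recover a word that the earlier step got wrong.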