chapter nine

9 Recording Audio and Transcribing Speech with MLX Whisper

This chapter covers

What speech recognition is and how Whisper approaches human-level accuracy
Why MLX Whisper runs dramatically faster than standard Whisper on Apple Silicon
How digital audio works -- sample rate, bit depth, and WAV format -- and how to record from your microphone using Python
Transcribing speech to text with a single function call and choosing the right Whisper model size
Building a reusable voice transcription script for the chatbot application in Chapter 10

You have spent the past eight chapters building the text component of your voice AI: the LLM backend, the Python layer, and the Streamlit web interface. In this chapter, you add the audio component. By the end, you will have a standalone Python script that listens to your microphone, converts your speech to text, and prints the transcription. That capability becomes the foundation for the voice-enabled chatbot application you will build in Chapter 10.

9.1 What Is Speech Recognition?

Speech recognition (also called automatic speech recognition, or ASR) is the task of converting spoken audio into the text it represents. It sounds simple, but it has long been one of the hardest problems in AI: speakers vary in accent, pace, and pronunciation; background noise corrupts recordings; words run together; and the same sound can mean different things depending on context.

9.1.1 How Whisper Changed Everything

9.2 Why MLX Whisper on Apple Silicon

9.3 Installing MLX Whisper

9.3.1 Verifying the Installation

9.4 Understanding Digital Audio

9.4.1 Sample Rate
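
As a rough illustration of what a sample rate means in practice, the sketch below computes how many samples a mono recording contains. The 16,000 Hz default is the rate Whisper models operate on; the helper name `num_samples` and the five-second duration are assumptions chosen for this example.

```python
SAMPLE_RATE = 16_000  # samples per second; Whisper models operate on 16 kHz audio


def num_samples(duration_seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Return how many samples a mono recording of the given length contains."""
    return int(round(duration_seconds * sample_rate))


# A five-second clip at 16 kHz.
five_seconds = num_samples(5)
print(five_seconds)  # 80000
```

Doubling the sample rate doubles the number of values the microphone produces per second, which is why higher rates cost more memory and disk space for the same clip length.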

9.4.2 Channels

9.4.3 Bit Depth and Data Type
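
To make bit depth concrete, here is a minimal sketch of the arithmetic behind it: the range of values a signed integer sample can hold, and the size of the raw audio data for a clip. The function names and default values are assumptions for illustration, and the size excludes the small WAV header.

```python
def int_range(bit_depth: int) -> tuple[int, int]:
    """Return the (min, max) values of a signed integer sample."""
    half = 2 ** (bit_depth - 1)
    return -half, half - 1


def pcm_size_bytes(duration_seconds: float, sample_rate: int = 16_000,
                   channels: int = 1, bit_depth: int = 16) -> int:
    """Return the size of the raw (uncompressed) sample data in bytes."""
    bytes_per_sample = bit_depth // 8
    return int(duration_seconds * sample_rate) * channels * bytes_per_sample


print(int_range(16))      # (-32768, 32767)
print(pcm_size_bytes(5))  # 160000
```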

9.5 Recording from the Microphone

9.5.1 `sd.rec()` -- The Recording Call

9.5.2 Squeezing the Channel Dimension
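
The sketch below shows the shape change that squeezing produces. It uses a stand-in NumPy array of zeros in place of a real recording, so it runs without a microphone. The assumption is that a one-channel recording arrives as a two-dimensional array of shape `(frames, 1)`, which is how `sounddevice` lays out recorded audio.

```python
import numpy as np

SAMPLE_RATE = 16_000
DURATION = 5  # seconds, an arbitrary length for this example

# Stand-in for a one-channel recording: shape is (frames, channels).
recording = np.zeros((SAMPLE_RATE * DURATION, 1), dtype=np.float32)
print(recording.shape)  # (80000, 1)

# Drop the channel axis so the audio becomes a flat, one-dimensional signal.
mono = recording.squeeze(axis=1)
print(mono.shape)  # (80000,)
```

Passing `axis=1` is a small safety choice: it raises an error if the recording unexpectedly has more than one channel, rather than silently leaving the array two-dimensional.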

9.5.3 Converting to int16 for WAV
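
One way to sketch the conversion is shown below, assuming the recorded audio is a one-dimensional float32 array with values between -1.0 and 1.0. The helper names `float_to_int16` and `write_wav`, the test tone, and the output file name are all hypothetical. The block writes the file with Python's standard `wave` module, which may differ from the approach used elsewhere in this chapter.

```python
import os
import tempfile
import wave

import numpy as np


def float_to_int16(audio: np.ndarray) -> np.ndarray:
    """Scale float samples in [-1.0, 1.0] to 16-bit signed integers."""
    clipped = np.clip(audio, -1.0, 1.0)  # guard against out-of-range samples
    return (clipped * 32767).astype(np.int16)


def write_wav(path: str, audio_int16: np.ndarray, sample_rate: int = 16_000) -> None:
    """Write mono 16-bit PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        # WAV stores samples as little-endian integers.
        wav_file.writeframes(audio_int16.astype("<i2").tobytes())


# Demo: one second of a 440 Hz tone instead of a real recording.
t = np.arange(16_000) / 16_000
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
demo_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
write_wav(demo_path, float_to_int16(tone))
```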

9.6 Your First Transcription

9.6.1 What `mlx_whisper.transcribe()` Returns
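
The sketch below shows one way to read a transcription result. It assumes the result follows the same dictionary layout as the original OpenAI Whisper package: a `"text"` string, a `"language"` code, and a `"segments"` list whose entries carry `"start"`, `"end"`, and `"text"`. The `sample_result` dictionary is hand-written for illustration and is not real model output. The helper `format_segments` is hypothetical.

```python
def format_segments(result: dict) -> list[str]:
    """Turn each timed segment into a readable line."""
    lines = []
    for segment in result.get("segments", []):
        text = segment["text"].strip()  # segment text often starts with a space
        lines.append(f"[{segment['start']:6.2f}s - {segment['end']:6.2f}s] {text}")
    return lines


# Hand-written stand-in for a transcription result (not real output).
sample_result = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.2, "text": " Hello there."},
        {"start": 1.2, "end": 2.5, "text": " How are you?"},
    ],
}

print(sample_result["text"].strip())
for line in format_segments(sample_result):
    print(line)
```

For most chatbot uses, the full `"text"` value is all you need. The per-segment timestamps matter mainly when you want captions or need to align text with the audio.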

9.7 Choosing the Right Whisper Model

9.7.1 Which Model Should You Use?

9.7.2 Trying a Larger Model

9.8 Building a Reusable Transcription Script

9.9 Troubleshooting

9.9.1 "No default input device"

9.9.2 Audio is captured but Whisper returns empty text

9.9.3 "Cannot import mlx_whisper"

9.9.4 First download is slow

9.9.5 "File not found" for ffmpeg

9.10 Summary

9.11 Exercises