8 Detecting voice faking using transformers

This chapter covers

  • A brief review of voice faking
  • Understanding the Fake-or-Real dataset
  • Extracting audio features from voice samples
  • Training a transformer model to detect fake voices
  • Testing model performance on a different fake voice dataset

“Please wait as we connect you to one of our representatives.” It’s a line most of us have heard at some point while calling customer service. The voice is computer-automated, meaning no actual person is speaking with you live; such lines are either pre-recorded by a human or generated entirely by a computer (think Apple’s Siri or Amazon’s Alexa). With increasingly sophisticated AI now widely available, computer-generated voices are becoming the norm, with use cases such as personal voice assistants (Siri, Alexa), customer service (Bland AI, Retell AI), and celebrity voices (ElevenLabs, Descript).
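As a quick preview of the feature extraction covered in section 8.1, the following is a minimal sketch of computing MFCCs from a voice clip with torchaudio. The file name sample.wav and the parameter values (16 kHz sample rate, 40 coefficients) are illustrative assumptions, not the chapter’s actual settings.

    import torchaudio

    # Load a voice clip; waveform has shape (channels, samples).
    waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder path

    # Resample to a fixed rate so every clip yields comparable features.
    target_rate = 16_000  # assumed rate, not necessarily the chapter's choice
    if sample_rate != target_rate:
        waveform = torchaudio.transforms.Resample(sample_rate, target_rate)(waveform)

    # 40 MFCC coefficients per frame is a common starting point.
    mfcc_transform = torchaudio.transforms.MFCC(
        sample_rate=target_rate,
        n_mfcc=40,
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 64},
    )
    features = mfcc_transform(waveform)  # shape: (channels, n_mfcc, time_frames)
    print(features.shape)

Each column of the resulting tensor summarizes the spectral shape of one short frame of audio, which is the representation the transformer model in section 8.2 consumes.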

8.1 Understanding the Fake-or-Real dataset

8.1.1 Extracting audio features from voice samples

8.1.2 How MFCCs work

8.1.3 Creating Torch datasets and dataloaders for the FoR dataset

8.1.4 Visualizing audio features

8.2 Training a transformer model for voice faking detection

8.2.1 Defining a transformer model for fake voice detection

8.2.2 Training and validating the fake voice detection model

8.2.3 Testing the trained model on the test dataset

8.3 Testing the trained model on the DEEP-VOICE dataset

8.3.1 Understanding the DEEP-VOICE dataset

8.3.2 Processing audio files from the DEEP-VOICE dataset

8.3.3 Running inference with the AudioTransformer model

8.4 Summary