Introduction to Deepfake Detection

Text-to-Speech (TTS) technology can clearly be misused. Deepfakes, i.e. computer-generated image or audio forgeries, allow an attacker to put arbitrary words into the mouth of a target person. Fraud, defamation, and fake news thus become possible at a quality and scale never seen before.

However, machine learning can be used to mitigate this problem. By training an artificial intelligence to identify such deepfakes, we can, in theory at least, take appropriate action: terminate a questionable call, display a warning message, or submit a dubious video for further human analysis.

AI-driven deepfake detection

To build an AI deepfake detection system, we require a suitable dataset of pairs \[ (audio, label) \] where \(audio\) is an audio recording and \(label\) represents the authenticity of the file, e.g. deepfake or authentic. We can then train a machine learning model to predict, for each \(audio\) in the training dataset, the corresponding \(label\).
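The training idea above can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the "audio" is a synthetic noisy sine wave rather than real speech, the two hand-picked features and the logistic-regression classifier are stand-ins for a real detection model, and the helper names (`make_clip`, `features`, `train`, `accuracy`) are hypothetical. It only shows the shape of learning from \((audio, label)\) pairs, not a usable detector.

```python
import math
import random

def make_clip(is_fake, rng, n=200):
    """Hypothetical stand-in for an audio clip: a noisy sine wave.
    'Fake' clips get less noise, purely so this toy task is learnable."""
    noise = 0.05 if is_fake else 0.4
    return [math.sin(0.2 * t) + rng.gauss(0.0, noise) for t in range(n)]

def features(audio):
    """Two hand-picked features: mean sample-to-sample change and zero-crossing rate."""
    pairs = list(zip(audio, audio[1:]))
    roughness = sum(abs(b - a) for a, b in pairs) / len(pairs)
    crossings = sum(1 for a, b in pairs if a * b < 0)
    return [roughness, crossings / len(audio)]

def predict(weights, bias, x):
    """Probability that the clip is a deepfake (label 1)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(dataset, epochs=200, lr=0.5):
    """Fit logistic regression on (audio, label) pairs with plain SGD."""
    examples = [(features(audio), label) for audio, label in dataset]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in examples:
            err = predict(weights, bias, x) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def accuracy(model, dataset):
    """Fraction of clips whose thresholded prediction matches the label."""
    weights, bias = model
    hits = sum((predict(weights, bias, features(audio)) >= 0.5) == bool(label)
               for audio, label in dataset)
    return hits / len(dataset)

# label 1 = deepfake, label 0 = authentic
rng = random.Random(0)
train_set = [(make_clip(i % 2 == 1, rng), i % 2) for i in range(40)]
test_set = [(make_clip(i % 2 == 1, rng), i % 2) for i in range(20)]
model = train(train_set)
print("held-out accuracy:", accuracy(model, test_set))
```

A real system would replace the synthetic clips with recorded audio and the hand-picked features with a learned representation; the loop of predicting, comparing against the label, and adjusting the model stays the same.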

The ASVspoof dataset

A popular dataset for audio deepfake detection is the ASVspoof dataset, which was published as part of the corresponding ASVspoof challenge.

This dataset consists of about 80,000 audio files, each either spoofed (i.e. fake, computer-generated speech) or authentic (also called bona-fide). Using this dataset, one can train a deepfake detection model such as RawNet2. The following table presents some examples from the ASVspoof 2019 dataset. The columns A01-A04 represent four different TTS systems, while the column 'bona-fide' represents authentic, human speech.

Have a listen, and compare the quality of the authentic speech against the various synthesized sentences. A machine-learning model will do the same: listen to audio files which it knows to be fake, compare them against audio files which it knows to be authentic, and then figure out the minute details which separate the two. For example, in the fake audio, you can hear that the speech is somewhat slurred and less clear in comparison to the authentic audio.

A01 (TTS, fake) | A02 (TTS, fake) | A03 (TTS, fake) | A04 (TTS, fake) | Bona-fide (authentic)
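To train on these files, each recording first has to be paired with its label. The sketch below shows one way to do that, assuming the five-column, whitespace-separated layout of the ASVspoof 2019 logical-access protocol files (speaker ID, file ID, an unused column, system ID such as A01, and a key of either `bonafide` or `spoof`). Check that layout against the README shipped with the dataset before relying on it. The sample lines and their IDs are made up for illustration, and `Trial` and `parse_protocol` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple

class Trial(NamedTuple):
    speaker_id: str
    file_id: str
    system_id: str  # e.g. "A01" for a spoofed clip, "-" for bona-fide speech
    label: int      # 1 = spoof (fake), 0 = bona-fide (authentic)

def parse_protocol(lines):
    """Turn protocol lines into Trial records (assumed five-column layout)."""
    trials = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        speaker_id, file_id, _unused, system_id, key = parts
        trials.append(Trial(speaker_id, file_id, system_id, int(key == "spoof")))
    return trials

# Illustrative lines only; these IDs are invented, not taken from the dataset.
sample = """\
LA_0001 LA_T_0000001 - - bonafide
LA_0001 LA_T_0000002 - A01 spoof
LA_0002 LA_T_0000003 - A03 spoof
LA_0002 LA_T_0000004 - - bonafide
""".splitlines()

trials = parse_protocol(sample)
print(Counter(t.system_id for t in trials))
```

Each `Trial` then supplies the label for the audio file named by `file_id`, which gives exactly the \((audio, label)\) pairs needed for training.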