Deepfake Detection Architectures

Audio deepfakes can be detected using machine learning as follows. Given a dataset of pairs \((x, y)\), where \(x\) is an audio recording and \(y\) is its label (either spoof or authentic), a popular approach is to train a model \[f: x \to y \] which takes the audio \(x\) as input and outputs a score in \([0, 1]\) that is thresholded to predict the label \(y\). A popular dataset for training is the ASVspoof dataset.
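To make the idea of training \(f\) concrete, here is a minimal, dependency-free sketch that fits a logistic-regression detector by gradient descent. Everything in it is an assumption made for illustration: the synthetic tones stand in for labeled recordings, the single zero-crossing-rate feature is a placeholder for real acoustic features, and names such as `sine_wave` and `train` are hypothetical. A real detector would be trained on a labeled corpus such as ASVspoof.

```python
import math

def sine_wave(freq_hz, sample_rate=8000, duration_s=0.1, phase=0.3):
    """Toy stand-in for a recording: a pure tone (NOT real speech)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate + phase)
            for i in range(n)]

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if a * b < 0)
    return crossings / (len(x) - 1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x, w, b):
    """Detector f: waveform -> score in [0, 1], where 1 means spoof."""
    return sigmoid(w * zero_crossing_rate(x) + b)

def train(dataset, lr=1.0, steps=2000):
    """Full-batch gradient descent on the binary cross-entropy loss."""
    feats = [(zero_crossing_rate(x), y) for x, y in dataset]
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for feat, y in feats:
            err = sigmoid(w * feat + b) - y  # d(loss)/d(logit)
            grad_w += err * feat
            grad_b += err
        w -= lr * grad_w / len(feats)
        b -= lr * grad_b / len(feats)
    return w, b

# Hypothetical toy data: low tones labeled authentic (0), high tones spoof (1).
toy_dataset = ([(sine_wave(hz), 0) for hz in (100, 150, 200, 250)]
               + [(sine_wave(hz), 1) for hz in (1500, 2000, 2500, 3000)])
w, b = train(toy_dataset)

# Score two unseen toy clips and threshold at 0.5 to get a label.
for hz in (120, 2200):
    score = f(sine_wave(hz), w, b)
    print(hz, "Hz ->", "spoof" if score > 0.5 else "authentic")
```

The same structure carries over to neural detectors: only the feature extractor and the form of \(f\) change, while the loop still minimizes a loss between the predicted score and the label.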

Such models can be classical models (SVM, kNN, etc.), but usually they are neural networks. In the following, we present some of the most popular deepfake detection architectures.

Mel-spectrogram based models

Spectrogram-based models are the first category of deepfake detection systems. They include a preprocessing step that converts the raw audio waveform into a mel-frequency spectrogram, as shown below:

The x-axis represents time, and the y-axis represents frequency (i.e., the pitch of the voice, ranging from very low (20 Hz) to very high (8000 Hz)). The color intensity (effectively a z-axis) represents the energy at that time and frequency. This spectrogram corresponds to the sentence 'Prince Percy was always mindful of this sense of passing the baton.' You can see the long pause at the beginning of the sentence and the shorter pause at the end.
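The conversion from waveform to mel-spectrogram can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the preprocessing of any particular detector: the frame size, hop, number of mel bands, and all function names are hypothetical choices, the mel formula is one common variant, and the naive DFT is far slower than the FFT-based routines found in audio libraries such as librosa or torchaudio. The 20 Hz and 8000 Hz limits match the frequency range described above.

```python
import cmath
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_points_hz(n_mels, f_min, f_max):
    """n_mels + 2 frequencies equally spaced on the mel scale (edges and centers)."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + i * (hi - lo) / (n_mels + 1)) for i in range(n_mels + 2)]

def mel_filterbank(n_mels, n_fft, sample_rate, f_min, f_max):
    """One triangular filter per mel band, evaluated at each DFT bin frequency."""
    pts = mel_points_hz(n_mels, f_min, f_max)
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, n_mels + 1):
        lo, mid, hi = pts[j - 1], pts[j], pts[j + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank

def power_spectrum(frame):
    """Power of a Hann-windowed frame at bins 0..N/2, via a naive DFT."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]

def mel_spectrogram(x, sample_rate=16000, n_fft=256, hop=128,
                    n_mels=20, f_min=20.0, f_max=8000.0):
    """Return a list of frames (time axis), each a list of mel-band energies in dB."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate, f_min, f_max)
    spec = []
    for start in range(0, len(x) - n_fft + 1, hop):
        power = power_spectrum(x[start:start + n_fft])
        spec.append([10.0 * math.log10(sum(w * p for w, p in zip(filt, power)) + 1e-10)
                     for filt in bank])
    return spec

# Usage: a 1000 Hz test tone should light up the mel band centered nearest 1000 Hz.
tone = [math.sin(2 * math.pi * 1000 * i / 16000) for i in range(800)]
spec = mel_spectrogram(tone)
centers = mel_points_hz(20, 20.0, 8000.0)[1:-1]
peak = max(range(20), key=lambda j: spec[2][j])
print(len(spec), "frames x", len(spec[0]), "mel bands; peak band center (Hz):",
      round(centers[peak]))
```

Each inner list is one column of the image described above: its position in the outer list is the time axis, the index within it is the mel-frequency axis, and the value is the energy that would be rendered as color.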

Deepfake detection models take an audio clip, convert it to an MFCC representation (mel-frequency cepstral coefficients, which are derived from the mel-spectrogram), and then process this representation much as one would process a conventional image. Popular architectures for this process include:

Raw models

MFCC-based models are popular because techniques from computer vision can be applied to them directly. The image representation is also computationally efficient (it eliminates long-term dependencies in the time-domain waveform) and affords some model explainability. However, MFCC-based models have a downside: information may be lost during the MFCC conversion. This makes raw models appealing, since they work directly on the raw, unprocessed waveform:

Here, the x-axis shows time, and the y-axis shows the pressure recorded by the microphone. Raw models have shown exceptional performance in deepfake detection. Some of the most popular models are: