# Deepfake Detection Architectures

Audio deepfakes can be detected using machine learning as follows: given a dataset of pairs $$(x, y)$$, where $$x$$ represents the audio and $$y$$ the label (either spoof or bona fide), a popular approach is to train a model $$f: x \to y$$ that takes the audio $$x$$ as input and outputs a score $$y \in [0, 1]$$ indicating how likely the audio is spoofed. A popular dataset for training is the ASVspoof dataset.

Such models can be classical classifiers (SVM, kNN, etc.), but are usually neural networks. In the following, we present some of the most popular deepfake detection architectures.
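To make the setup concrete, here is a minimal sketch of training such a classifier. The features are purely synthetic stand-ins (real systems would extract features from audio), and the SVM is just one of the classical models mentioned above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for real audio features: 8-dimensional vectors drawn from two
# different distributions, representing bona fide (y=0) and spoofed (y=1) clips.
rng = np.random.default_rng(0)
X_bona = rng.normal(0.0, 1.0, size=(50, 8))
X_spoof = rng.normal(2.0, 1.0, size=(50, 8))
X = np.vstack([X_bona, X_spoof])
y = np.array([0] * 50 + [1] * 50)

# A classical classifier; probability=True yields a score in [0, 1]
# rather than a hard spoof/bona-fide label.
clf = SVC(probability=True).fit(X, y)
spoof_score = clf.predict_proba(X_spoof[:1])[0, 1]
print(round(spoof_score, 3))
```

In practice the inputs $$x$$ would be spectrogram features or raw waveforms, and the classifier would usually be a neural network, but the train-then-score pattern is the same.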

#### Mel-spectrogram based models

Spectrogram-based models are the first category of deepfake detection systems. They include a preprocessing step that converts the raw audio waveform into a mel-frequency spectrogram, as shown below:

The x-axis represents time, and the y-axis represents frequency (i.e. the pitch of the voice, ranging from very low (20 Hz) to very high (8000 Hz)). The z-axis (color intensity) represents the energy at that time and that frequency. This spectrogram corresponds to the sentence: 'Prince Percy was always mindful of this sense of passing the baton.'. You can see the long pause at the beginning of the sentence and the shorter pause at the end.

Deepfake detection models take an audio clip, convert it to an MFCC representation, and then process this representation much like a conventional image. Popular architectures for this approach include:

• ResNet18, as used by Alzantot et al. Here, the authors take the popular ResNet18 architecture from the image classification domain and, by means of MFCC preprocessing, adapt it straightforwardly to audio spoof detection.
• Transformer networks, originally developed for natural language processing and later adopted in computer vision, have been adapted to audio spoof detection by Zhang et al.

#### Raw models

MFCC-based models are popular because techniques from computer vision can be applied directly. The image representation is also computationally efficient (it eliminates long-term dependencies in the time-domain waveform) and affords some model explainability. However, MFCC-based models have a downside: information is potentially lost during the MFCC conversion. This is why raw models, which work directly on the raw, unprocessed waveform, are appealing:

Here, the x-axis shows time, and the y-axis shows the sound pressure recorded by the microphone. Raw models have shown exceptional performance in deepfake detection. Some of the most popular models are:

• RawNet2, which uses sinc filters followed by residual blocks. The sinc filters emulate MFCC computation but are incorporated directly into the model. The paper can be found here, and corresponding source code here.
• RawGAT-ST, another end-to-end architecture, built around spectro-temporal graph attention. See the source code here, and read more about the model itself here.
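To see why sinc filters can stand in for spectrogram preprocessing, consider what one such filter is: a band-pass filter whose impulse response is the difference of two sinc low-pass filters, parameterized only by its two cutoff frequencies. In RawNet2 those cutoffs are learnable layer parameters; the NumPy sketch below shows a fixed (non-learnable) version of the same construction:

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_len=129, sr=16000):
    """Band-pass FIR kernel: difference of two ideal low-pass sinc filters,
    windowed to reduce ripple. f_low/f_high are cutoff frequencies in Hz."""
    t = np.arange(kernel_len) - (kernel_len - 1) / 2

    def lowpass(fc):
        # Ideal low-pass impulse response with cutoff fc.
        return 2 * fc / sr * np.sinc(2 * fc / sr * t)

    return (lowpass(f_high) - lowpass(f_low)) * np.hamming(kernel_len)

# One filter of a bank; a model would use many, covering different bands.
kernel = sinc_bandpass(300.0, 3400.0)
wave = np.random.default_rng(0).normal(size=16000)  # 1 s of toy "audio"
filtered = np.convolve(wave, kernel, mode="same")
```

A bank of such filters applied to the raw waveform plays the role of the mel filterbank in MFCC extraction, but since only the cutoffs are (in the learnable version) free parameters, the model can tune its own frequency decomposition end to end.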