This website presents several aspects of audio synthesis, deepfakes, and deepfake detection. Perhaps most importantly, we show results of current state-of-the-art audio synthesis algorithms:
We create a set of sentences which are spoken by an artificial intelligence (AI). However, the AI does not simply reproduce audio recordings from its training data; it creates entirely new speakers. Thus, for every sentence you hear, the corresponding speaker (along with their vocal tonality) does not exist!
To achieve this, we use deep learning. More precisely, we use a deep-learning text-to-speech (TTS) model. These models work by learning how written text relates to spoken audio from a dataset of <text, audio> pairs. Such datasets are presented to the learning model, which, after about 2-5 days of training, can speak new sentences.
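To make this training setup concrete, here is a minimal, illustrative sketch in Python/PyTorch of a model that maps characters to audio features (a mel-spectrogram). All names, layer sizes, and the toy data below are assumptions chosen for brevity; this is not the model behind the samples on this site, and real TTS systems are far larger:

```python
# Minimal sketch of learning from <text, audio> pairs.
# Everything here (TinyTTS, sizes, the random toy data) is hypothetical.
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    def __init__(self, vocab_size=64, emb_dim=32, hidden=64, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)    # characters -> vectors
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)           # hidden state -> mel frame

    def forward(self, char_ids):
        x = self.embed(char_ids)       # (batch, time, emb_dim)
        h, _ = self.encoder(x)         # (batch, time, hidden)
        return self.to_mel(h)          # predicted mel-spectrogram frames

# One toy <text, audio> pair: random character ids and a random target spectrogram.
chars = torch.randint(0, 64, (1, 20))    # stand-in for an encoded sentence
target_mel = torch.randn(1, 20, 80)      # stand-in for ground-truth audio features

model = TinyTTS()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):                  # real training runs for days, not 100 steps
    pred = model(chars)
    loss = nn.functional.mse_loss(pred, target_mel)  # match prediction to real audio
    optim.zero_grad()
    loss.backward()
    optim.step()
```

In a full pipeline, a separate vocoder model would then convert the predicted mel-spectrogram back into an audible waveform.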
If you listen closely, you can make out artefacts in the synthesized speech. For example, you may notice that the tonality or prosody of the sentences does not always match the semantics, i.e., the content. This is because, despite all advances in AI, machine learning is at its core 'pattern recognition', not true intelligence. The TTS model does learn the relationship between text and audio, but it does not 'understand' the meaning of the sentence.
Please feel free to contact me if you have any questions.