Text-to-Speech Introduction

Text-to-Speech (TTS) technology is a machine-learning technique that synthesizes human speech using a neural network. At its heart, a TTS system learns a mapping \[ \text{text} \to \text{audio}. \] We usually write this mapping as \[ f: x \to y \] Here, \(x\) represents text such as 'Hello, how are you?', and \(y\) represents the corresponding audio waveform, i.e., the spoken version of this sentence. In other words, the TTS system (here called \(f\)) can take any text and produce the corresponding audio waveform.
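To make the shapes of \(x\) and \(y\) concrete, here is a minimal sketch of the interface of \(f\). It is not a real TTS system: `toy_f`, the sample rate, and the rule of one short sine tone per character are all assumptions made purely for illustration. The point is only that the input is a string and the output is a sequence of audio samples.

```python
import math

SAMPLE_RATE = 16_000      # samples per second (assumed for illustration)
SAMPLES_PER_CHAR = 800    # 0.05 s of audio per character (assumed)

def toy_f(text: str) -> list[float]:
    """Toy stand-in for f: maps text x to a waveform y.

    Each character becomes a short sine tone with an arbitrary pitch.
    A trained TTS model would produce speech here instead.
    """
    waveform = []
    for ch in text:
        freq = 200.0 + 5.0 * (ord(ch) % 64)  # arbitrary pitch in Hz
        for n in range(SAMPLES_PER_CHAR):
            waveform.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return waveform

x = "Hello, how are you?"
y = toy_f(x)  # y is a list of amplitude values between -1 and 1
```

The waveform \(y\) is typically far longer than the text \(x\): even in this toy, each character expands into hundreds of samples.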

Now, the question arises of how such a model can be learned. Fundamentally, there are two requirements: a suitable dataset and a suitable model architecture.

Without going into details, these models learn to speak much like a baby does: they are shown example sentences from the training dataset and can 'listen' to the corresponding audio. Then, they are tasked with reproducing this audio, given only the text input. After training on many examples, they start to generalize, figuring out how to speak new sentences that are not present in the training data.
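The loop described above (see an example, try to reproduce it, adjust, repeat) can be sketched in a few lines. This is a deliberately tiny toy under loud assumptions: the "audio" is reduced to a single number per character, the hidden rule `true_pitch` is hypothetical, and the "model" is a lookup table trained by gradient descent rather than a neural network. It only illustrates the idea of learning from pairs and then generalizing to an unseen sentence.

```python
# Hypothetical ground truth the model has to discover: one value per character.
def true_pitch(ch: str) -> float:
    return (ord(ch) % 32) / 32.0

# Training dataset: (text, target) pairs, as in supervised learning.
train_texts = ["hello there", "how are you", "good day"]
dataset = [(t, [true_pitch(c) for c in t]) for t in train_texts]

params: dict[str, float] = {}  # one learnable number per character
LEARNING_RATE = 0.1

def predict(text: str) -> list[float]:
    return [params.get(c, 0.0) for c in text]

def loss(text: str, target: list[float]) -> float:
    """Mean squared error between prediction and target."""
    pred = predict(text)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# A sentence that is NOT in the training data (but uses only seen characters).
new_text = "who are they"
new_target = [true_pitch(c) for c in new_text]
initial_loss = loss(new_text, new_target)

for epoch in range(50):
    for text, target in dataset:
        for c, t in zip(text, target):
            p = params.get(c, 0.0)
            grad = 2.0 * (p - t)  # derivative of (p - t)^2 with respect to p
            params[c] = p - LEARNING_RATE * grad

final_loss = loss(new_text, new_target)
print(f"loss on unseen sentence: {initial_loss:.4f} -> {final_loss:.2e}")
```

The error on the unseen sentence drops because the model has learned something reusable from the training pairs, not because it memorized whole sentences. Real TTS models generalize in a far richer way, but the training principle is the same.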

Consider the following examples. They show a text-to-speech model's progress at early, intermediate, and later stages of training. Listen to how the speech sounds more fluent as the model's learning progresses.

[Audio samples: Sentence 1, Sentence 2, and Sentence 3, each at the early, intermediate, later, and final stages of training.]