# Text-to-Speech Introduction

Text-to-speech (TTS) is a machine-learning technique for synthesizing human speech using a neural network. At its heart, a TTS system learns a mapping $\text{text} \to \text{audio}$. This mapping is usually written as $f: x \to y$. Here, $x$ is a text such as 'Hello, how are you?', and $y$ is the corresponding audio waveform, i.e. the human speech for this sentence. The TTS system, denoted $f$, can thus take any text and produce the corresponding audio waveform.

This raises the question of how such a model can be learned. Fundamentally, there are two requirements: a suitable dataset and a suitable model architecture.

• The dataset is what allows the model to learn the relationship between text and speech. All machine-learning systems require a training dataset, and TTS is no exception. Here, the dataset consists of pairs $(x_i, y_i)$, where the index $i$ identifies a training sample: $x_i$ is a text and $y_i$ is a recording of that text being spoken. For example, we might have 10,000 sentences and 10,000 recordings of a human speaking these sentences. This would yield a dataset with which we could train a machine-learning model.
• The second requirement is a machine-learning model $f$ that is trained on this dataset. In TTS, such a model is usually a neural network. A popular TTS model is the Tacotron 2 architecture.
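To make the idea of paired samples $(x_i, y_i)$ concrete, here is a minimal sketch of how such a dataset might be represented in Python. The names `Sample` and `fake_recording` are hypothetical, and the "recordings" are just silent placeholders; a real dataset would load actual audio files instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text: str              # x_i: the input sentence
    waveform: List[float]  # y_i: audio samples of a speaker reading x_i


def fake_recording(text: str, samples_per_char: int = 4) -> List[float]:
    """Hypothetical stand-in for a real recording.

    Returns silence whose length grows with the length of the text.
    """
    return [0.0] * (len(text) * samples_per_char)


sentences = ["Hello, how are you?", "Nice to meet you."]

# The dataset is a list of (text, audio) pairs, one per training sample i.
dataset = [Sample(text=s, waveform=fake_recording(s)) for s in sentences]

for i, sample in enumerate(dataset):
    print(i, repr(sample.text), len(sample.waveform))
```

The only point of the sketch is the structure: every text is stored together with the audio that belongs to it, so a model can later be shown both.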

Without going into details, these models learn to speak much like a baby does: they are shown example sentences from the training dataset and can 'listen' to the corresponding audio. They are then tasked with reproducing this audio, given only the text input. After training on many such examples, they start to generalize, figuring out how to speak new sentences that are not present in the training data.
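The training procedure described above can be sketched with a deliberately tiny toy example. Everything here is an assumption made for illustration: the "audio" is one made-up number per character (produced by the hypothetical `record` function), and the "model" `f` is a lookup table rather than a neural network. It is not how Tacotron 2 or any real TTS system works, but it shows the same loop: predict, compare with the target, adjust, repeat.

```python
from typing import Dict, List, Tuple


def record(text: str) -> List[float]:
    """Hypothetical 'speaker': one made-up sound value per character."""
    return [(ord(c) % 7) / 7.0 for c in text]


train_texts = ["hello", "how are you", "we are here"]
dataset: List[Tuple[str, List[float]]] = [(t, record(t)) for t in train_texts]

weights: Dict[str, float] = {}  # the toy model's learnable parameters


def f(text: str) -> List[float]:
    """Toy model: map text to 'audio' using the current weights."""
    return [weights.get(c, 0.0) for c in text]


def mean_squared_error(pred: List[float], target: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


learning_rate = 0.5
losses: List[float] = []

for epoch in range(20):
    for text, audio in dataset:
        for c, target in zip(text, audio):
            pred = weights.get(c, 0.0)
            # Nudge the prediction for this character towards the target.
            weights[c] = pred + learning_rate * (target - pred)

    # Average error over the training set after this epoch.
    loss = sum(mean_squared_error(f(t), a) for t, a in dataset) / len(dataset)
    losses.append(loss)
    if epoch in (0, 9, 19):
        print(f"epoch {epoch + 1}: training loss = {loss:.2e}")

# A sentence that is not in the training data, built from seen characters.
new_text = "who are we"
new_error = mean_squared_error(f(new_text), record(new_text))
print(f"error on unseen sentence: {new_error:.2e}")
```

The training loss shrinks as the epochs go by, and the toy model also reproduces the unseen sentence, because it has learned something reusable (here, a value per character) rather than memorizing whole sentences. Real TTS models generalize in a far richer way, but the principle is the same.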

Consider the following examples. They show a text-to-speech model's progress at early, intermediate, and later stages of training. Listen to how the speech sounds more fluent as the model's learning progresses.

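| | Sentence 1 | Sentence 2 | Sentence 3 |
|---|---|---|---|
| Early stage | [audio] | [audio] | [audio] |
| Intermediate stage | [audio] | [audio] | [audio] |
| Later stage | [audio] | [audio] | [audio] |
| Final stage | [audio] | [audio] | [audio] |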