Multispeaker TTS

Traditional Text-to-Speech (TTS) learns to speak in exactly one voice, namely the one from the training dataset. However, it is often desirable to have a plethora of different speakers available (possibly even ones that were never included in the training dataset). This is the problem Multispeaker TTS aims to solve. In the following, I will present the basic idea as laid out in 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis' by Jia et al.

This figure shows the architecture of a multi-speaker system as proposed by Jia et al. We can identify three major components:

• Green: Training data
• Red: Neural networks which constitute the TTS model
• Yellow: Intermediate speaker representation

Training data

We see that the training data no longer consists of tuples $(\text{text}, \text{audio})$ as in conventional TTS. Instead, we have tuples $(\text{text}, \text{speaker}, \text{audio})$. This means that for each sentence, we know which speaker spoke it (encoded by a numerical id). We will use this to make the model learn to switch between speakers (and also make up new ones).
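To make the data format concrete, here is a minimal sketch of such a dataset in Python. The sentences, speaker ids, and file names are made up for illustration; real corpora store the same three pieces of information per utterance.

```python
# Each training sample is a (text, speaker_id, audio) triple
# rather than the (text, audio) pair of single-speaker TTS.
# All entries below are invented for illustration.
dataset = [
    ("The quick brown fox.",          0, "audio/0001.wav"),
    ("Jumps over the lazy dog.",      0, "audio/0002.wav"),
    ("A completely different voice.", 1, "audio/0003.wav"),
]

def samples_for_speaker(data, speaker_id):
    """Return all (text, audio) pairs spoken by one speaker."""
    return [(text, audio) for text, sid, audio in data if sid == speaker_id]
```

Grouping by speaker id like this is what lets the model associate acoustic characteristics with each speaker during training.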

Neural Architecture

For a multi-speaker architecture as presented here, we need two neural networks. First, a speaker encoder, which learns to map a speaker id to some vector representation (shown in yellow): $\text{speaker\_encoder}: id \to [v_1, v_2, \ldots, v_d].$ The neural TTS model then takes this speaker representation, or speaker embedding (yellow), together with the corresponding input text, and learns to produce the corresponding output audio.
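The interface of the speaker encoder can be sketched as a simple lookup table from speaker id to a $d$-dimensional vector. Note this is only a stand-in for the interface: in Jia et al., the encoder is a neural network trained on a speaker verification task, not a random table.

```python
import random

def make_speaker_encoder(num_speakers, dim, seed=0):
    """Toy speaker encoder: maps a speaker id to a fixed d-dim vector.
    A stand-in for the trained neural encoder of Jia et al."""
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_speakers)]

    def speaker_encoder(speaker_id):
        # Same id always yields the same embedding vector.
        return table[speaker_id]

    return speaker_encoder
```

The key property this captures is that the TTS model never sees the raw speaker id, only the embedding vector, which is what makes speaker manipulation possible later.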

Synthesizing new speakers

We can create arbitrary new speakers by manipulating the speaker representation $[v_1, v_2, \ldots, v_d]$. For example, we can combine two existing speakers $s_1, s_2$ by averaging their speaker representations via $s_{new} = \frac{s_1 + s_2}{2}$. Alternatively, we can sample a random speaker by drawing from, for example, a $d$-dimensional Gaussian distribution: $s_{rand} \sim \mathcal{N}(0, 1)^d$. During inference, we can then completely bypass the speaker encoder and directly supply a new speaker embedding to the TTS model, as shown below. The speakers presented on this website were created in a similar manner.
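Both manipulations are one-liners once embeddings are plain vectors. The following sketch implements the averaging and Gaussian-sampling formulas above (embedding dimension and seed are arbitrary choices for illustration):

```python
import random

def average_speakers(s1, s2):
    """Combine two speakers: s_new = (s_1 + s_2) / 2, element-wise."""
    return [(a + b) / 2 for a, b in zip(s1, s2)]

def random_speaker(dim, seed=None):
    """Sample a new speaker from a d-dimensional standard Gaussian."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

Either result can then be fed directly to the TTS model in place of an encoder output, bypassing the speaker encoder entirely.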

When we keep the speaker embedding fixed, we can generate a set of sentences, all spoken by the same (non-existing) speaker.

Limitations

While this approach is successful, it has its limitations:

• No control over speaker attributes. We cannot manually set speaker characteristics such as age, gender, or pitch. We can, however, find 'similar' speakers by modifying the speaker embedding only slightly, e.g. by adding a small perturbation.
• Target cloning. This approach does not allow cloning a specific target speaker, i.e. it is unsuitable for deepfake creation (since we do not understand how the speaker vector influences the resulting audio's characteristics).

Usages

Where might one use such a multi-speaker system? For example, in movie or video-game production, where one might require a large number of side characters, each of whom speaks only one or two sentences in their own voice. Currently, game designers need to cast a voice actor (i.e. a human) for each of these characters. Employing Multispeaker TTS technology could greatly reduce cost and increase diversity.