Variational Autoencoders (VAEs) are popular models in unsupervised learning. They were introduced by Kingma and Welling in 2013 and have sparked a surge of research interest.

Most often, people are interested in VAEs for discovering latent representations in images. The basic idea is to augment a traditional autoencoder's bottleneck with a stochastic sampling layer, as shown below. \(x\) represents the (5-pixel) input image, while \(\hat{x}\) represents the reconstruction. The yellow and green bottleneck layers compute the parameters \(\mu\) and \(\sigma\) of a multivariate, isotropic Gaussian distribution (i.e. a Gaussian distribution with a diagonal covariance matrix). From this distribution, we sample latent variables \(z\) (green box), which are then supplied to the neural decoder to compute the reconstruction.
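This forward pass can be sketched in a few lines of numpy. The layer sizes, weight matrices, and function names below are hypothetical illustrations (a real VAE learns its weights and uses deeper networks); the point is the "reparameterization trick", where sampling is written as \(z = \mu + \sigma \cdot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\) so that gradients can flow through \(\mu\) and \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 5-pixel input (as in the figure), 2-dimensional latents.
input_dim, latent_dim = 5, 2

# Toy linear encoder/decoder weights; a trained VAE would learn these.
W_mu = rng.normal(size=(latent_dim, input_dim))
W_logvar = rng.normal(size=(latent_dim, input_dim))
W_dec = rng.normal(size=(input_dim, latent_dim))

def encode(x):
    """Map input x to the parameters of an isotropic Gaussian q(z|x)."""
    mu = W_mu @ x
    logvar = W_logvar @ x  # log(sigma^2): one value per latent dimension
    return mu, logvar

def sample(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map a latent z back to a reconstruction x_hat."""
    return W_dec @ z

x = rng.normal(size=input_dim)   # stand-in for a 5-pixel image
mu, logvar = encode(x)
z = sample(mu, logvar)
x_hat = decode(z)
```

At generation time, the encoder is dropped entirely: one samples \(z\) from the prior and calls only `decode`.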

A trained VAE can be used for two purposes. First, to derive meaningful latents \(z\) from some input \(x\); this is useful when we want to better understand the input, e.g. in order to better control some downstream model. Second, to generate new data: sampling latents \(z\) (either randomly or in a controlled way) and decoding them with the decoder yields new data \(x\) whose latent attributes we can control.

This model is trained not via Mean Squared Error (as a traditional autoencoder would be), but by minimizing the negative ELBO, or 'Evidence Lower Bound'. An excellent explanation of this concept can be found in this Youtube Tutorial. To summarize: given a variational encoder \( q_\phi(z|x) \) which maps data \(x\) to a hidden representation \(z\), a decoder \( p_{\theta_a}(x|z) \) which maps a hidden representation \(z\) back to \(x\), and an isotropic Gaussian prior \( p_{\theta_b}(z) \), we maximize the following objective: \[ ELBO = \underbrace{\mathbb{E}_{z \sim q_\phi} \log p_{\theta_a}(x|z)}_{\text{Reconstruction quality}} - \underbrace{\mathbb{E}_{z \sim q_\phi} \log \frac{q_\phi (z|x)}{p_{\theta_b}(z)}}_{\text{KL between approximate posterior and prior}} \]
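For the common case of a diagonal-Gaussian posterior and a standard-normal prior, the KL term has a closed form, and the expectation over the reconstruction term is usually estimated with a single sample of \(z\). The sketch below (function names are my own; the unit-variance Gaussian decoder is an assumption that turns the reconstruction log-likelihood into a squared error up to a constant) shows how the two ELBO terms become a concrete loss:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian q
    with mean mu and log-variance logvar."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def negative_elbo(x, x_hat, mu, logvar):
    """Single-sample Monte Carlo estimate of -ELBO.

    Assumes a Gaussian decoder with unit variance, so the reconstruction
    log-likelihood reduces to a squared error (up to an additive constant).
    """
    recon = 0.5 * np.sum((x - x_hat) ** 2)
    return recon + kl_to_standard_normal(mu, logvar)

# Sanity check: when q(z|x) already equals the prior (mu = 0, sigma = 1),
# the KL term vanishes.
assert np.isclose(kl_to_standard_normal(np.zeros(2), np.zeros(2)), 0.0)
```

Minimizing this loss trades off reconstruction quality against keeping the approximate posterior close to the prior, which is exactly the tension the two underbraced terms above express.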

So what are Variational Autoencoders used for in Text-to-Speech? Fundamentally, for the same reasons as in image processing: to discover latent attributes \(z\) in the (audio) data in an unsupervised fashion. Here we list some of the most interesting TTS papers using VAEs:

- The paper Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech by Kim et al. uses a VAE together with normalizing flows to automatically discover alignments between input characters \(c\) and latent phonemes \(z\). This allows them to train a TTS model that supports explicit phoneme durations, without the need for explicit phoneme-character alignments.
- A similar approach is taken by Liu et al. in VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention. Using a very deep VAE (which they call VDVAE), they discover the acoustic-textual alignments automatically.

Discovering such alignments automatically is tremendously helpful, since annotating speech alignments manually is extremely costly and tedious.