 # Variational Auto Encoder for TTS

Variational Auto Encoders (VAEs) are popular models in unsupervised learning. They were introduced by Kingma and Welling in 2013 and sparked a surge of research interest.

Most often, people are interested in VAEs for discovering latent representations in images. The basic idea is that we augment a traditional auto-encoder's bottleneck with a stochastic sampling layer, as shown below. $$x$$ represents the (5-pixel) input image, while $$\hat{x}$$ represents the reconstruction. The yellow and green bottleneck layers compute the parameters $$\mu$$ and $$\sigma$$ of a multivariate Gaussian distribution with a diagonal covariance matrix. From this distribution, we sample latent variables $$z$$ (green box). These are then supplied to the neural decoder, which computes the reconstruction.

A trained VAE can be used for two purposes. The first is to derive meaningful latents $$z$$ from some input $$x$$. This is useful when we want to better understand the input, for example in order to better control some downstream model. The second is to generate new data: sampling latents $$z$$ (either randomly or in a controlled way) and passing them through the decoder yields new data $$x$$ whose latent attributes we control.
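The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the weights are random, and the single linear layers, the latent size of 2, and the names `encode`, `sample_latent`, and `decode` are all assumptions made for this example. Predicting $$\log \sigma$$ and exponentiating it is one common way to keep $$\sigma$$ positive.

```python
import numpy as np

rng = np.random.default_rng(0)

INPUT_DIM = 5   # the 5-pixel input image from the text
LATENT_DIM = 2  # hypothetical latent size, chosen for illustration

# Randomly initialised (untrained) weights; a real VAE would learn these.
W_mu = rng.normal(scale=0.1, size=(INPUT_DIM, LATENT_DIM))
W_log_sigma = rng.normal(scale=0.1, size=(INPUT_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM, INPUT_DIM))


def encode(x):
    """Map an input x to the parameters (mu, sigma) of q(z|x)."""
    mu = x @ W_mu
    sigma = np.exp(x @ W_log_sigma)  # exponentiate so that sigma > 0
    return mu, sigma


def sample_latent(mu, sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


def decode(z):
    """Map a latent z to a reconstruction with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))


# Use 1: derive latents from an input and reconstruct it.
x = rng.uniform(size=INPUT_DIM)
mu, sigma = encode(x)
z = sample_latent(mu, sigma, rng)
x_hat = decode(z)

# Use 2: generate new data by decoding a latent drawn from the prior N(0, I).
z_new = rng.standard_normal(LATENT_DIM)
x_new = decode(z_new)
```

Writing the sample as `mu + sigma * eps` keeps the randomness in `eps`, so that gradients can flow through `mu` and `sigma` when the model is trained with an automatic differentiation framework.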

This model is trained not with the mean squared error (as a traditional auto-encoder is), but by minimizing the negative ELBO, or 'Evidence Lower Bound'. An excellent explanation of this concept can be found in this YouTube tutorial. To summarize, given a variational encoder $$q_\phi(z|x)$$, which maps data $$x$$ to a hidden representation $$z$$, a decoder $$p_{\theta_a}(x|z)$$, which maps a hidden representation $$z$$ back to $$x$$, and an isotropic Gaussian prior $$p_{\theta_b}(z)$$, we maximize the following objective:

$$
\mathrm{ELBO} = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log p_{\theta_a}(x|z) \right]}_{\text{Reconstruction quality}} - \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log \frac{q_\phi(z|x)}{p_{\theta_b}(z)} \right]}_{\text{KL between approximate posterior and prior}}
$$
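To make the two ELBO terms concrete, here is a hedged sketch of a single-sample estimate of the negative ELBO. It rests on assumptions that the text does not fix: the decoder likelihood is taken to be a Gaussian with unit variance centred on the reconstruction, the prior is the standard normal $$\mathcal{N}(0, I)$$, and the encoder outputs a diagonal Gaussian, for which the KL term has a closed form. The function names are made up for this example.

```python
import numpy as np


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))


def negative_elbo(x, x_hat, mu, sigma):
    """Single-sample estimate of -ELBO.

    Assumes p(x|z) = N(x_hat, I), so the reconstruction term is half the
    squared error plus a constant that does not depend on the parameters.
    """
    recon_nll = 0.5 * np.sum((x - x_hat) ** 2) + 0.5 * x.size * np.log(2.0 * np.pi)
    return recon_nll + kl_to_standard_normal(mu, sigma)


# Toy values: a 5-pixel input, an imperfect reconstruction, a 2-d posterior.
x = np.array([0.1, 0.9, 0.5, 0.3, 0.7])
x_hat = np.array([0.2, 0.8, 0.5, 0.4, 0.6])
mu = np.array([1.0, 0.0])
sigma = np.array([1.0, 1.0])

loss = negative_elbo(x, x_hat, mu, sigma)
```

Under the unit-variance assumption the reconstruction term is a scaled squared error, so what distinguishes this loss from a plain auto-encoder objective is the KL term, which pulls the approximate posterior towards the prior.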

So what are Variational Auto Encoders used for in Text-to-Speech? Fundamentally, the same thing as in image processing: discovering latent attributes $$z$$ in the (audio) data in an unsupervised fashion. We list some of the most interesting TTS papers using VAEs here:

Discovering such alignments automatically is tremendously helpful, since annotating speech alignments manually is extremely costly and tedious.