Variational Autoencoders (VAEs) are popular models in unsupervised learning. They were introduced by Kingma and Welling in 2013 and have sparked a surge of research interest.

Most often, people are interested in VAEs for discovering latent representations in images. The basic idea is to augment a traditional autoencoder's bottleneck with a stochastic sampling layer, as shown below. \(x\) represents the (5-pixel) input image, while \(\hat{x}\) represents the reconstruction. The yellow and green bottleneck layers compute the parameters \(\mu\) and \(\sigma\) of a multivariate, isotropic Gaussian distribution (i.e. a Gaussian distribution with a diagonal covariance matrix). From this distribution, we sample latent variables \(z\) (green box), which are then supplied to the neural decoder to compute the reconstruction.
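This forward pass can be sketched in a few lines of numpy. The layer sizes, weight matrices, and function names below are hypothetical illustrations (a real VAE learns its weights and uses deeper networks); the point is the "reparameterization trick", where sampling is written as \(z = \mu + \sigma \cdot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\) so that gradients can flow through \(\mu\) and \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 5-pixel input (as in the figure), 2-dimensional latents.
input_dim, latent_dim = 5, 2

# Toy linear encoder/decoder weights; a trained VAE would learn these.
W_mu = rng.normal(size=(latent_dim, input_dim))
W_logvar = rng.normal(size=(latent_dim, input_dim))
W_dec = rng.normal(size=(input_dim, latent_dim))

def encode(x):
    """Map input x to the parameters of an isotropic Gaussian q(z|x)."""
    mu = W_mu @ x
    logvar = W_logvar @ x  # log(sigma^2): one value per latent dimension
    return mu, logvar

def sample(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map a latent z back to a reconstruction x_hat."""
    return W_dec @ z

x = rng.normal(size=input_dim)   # stand-in for a 5-pixel image
mu, logvar = encode(x)
z = sample(mu, logvar)
x_hat = decode(z)
```

At generation time, the encoder is dropped entirely: one samples \(z\) from the prior and calls only `decode`.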

A trained VAE can be used for two purposes. First, to derive meaningful latents \(z\) from some input \(x\); this is useful when we want to better understand the input, e.g. in order to better control some downstream model. Second, to generate new data: sampling latents \(z\) (either randomly or in a controlled way) and decoding them with the decoder yields new data \(x\) whose latent attributes we can control.

This model is trained not via Mean Squared Error (as a traditional autoencoder would be), but by minimizing the negative ELBO, or 'Evidence Lower Bound'. An excellent explanation of this concept can be found in this Youtube Tutorial. To summarize: given a variational encoder \( q_\phi(z|x) \) which maps data \(x\) to a hidden representation \(z\), a decoder \( p_{\theta_a}(x|z) \) which maps a hidden representation \(z\) back to \(x\), and an isotropic Gaussian prior \( p_{\theta_b}(z) \), we maximize the following objective: \[ ELBO = \underbrace{\mathbb{E}_{z \sim q_\phi} \log p_{\theta_a}(x|z)}_{\text{Reconstruction quality}} - \underbrace{\mathbb{E}_{z \sim q_\phi} \log \frac{q_\phi (z|x)}{p_{\theta_b}(z)}}_{\text{KL between approximate posterior and prior}} \]
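For the common case of a diagonal-Gaussian posterior and a standard-normal prior, the KL term has a closed form, and the expectation over the reconstruction term is usually estimated with a single sample of \(z\). The sketch below (function names are my own; the unit-variance Gaussian decoder is an assumption that turns the reconstruction log-likelihood into a squared error up to a constant) shows how the two ELBO terms become a concrete loss:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian q
    with mean mu and log-variance logvar."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def negative_elbo(x, x_hat, mu, logvar):
    """Single-sample Monte Carlo estimate of -ELBO.

    Assumes a Gaussian decoder with unit variance, so the reconstruction
    log-likelihood reduces to a squared error (up to an additive constant).
    """
    recon = 0.5 * np.sum((x - x_hat) ** 2)
    return recon + kl_to_standard_normal(mu, logvar)

# Sanity check: when q(z|x) already equals the prior (mu = 0, sigma = 1),
# the KL term vanishes.
assert np.isclose(kl_to_standard_normal(np.zeros(2), np.zeros(2)), 0.0)
```

Minimizing this loss trades off reconstruction quality against keeping the approximate posterior close to the prior, which is exactly the tension the two underbraced terms above express.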

So what are Variational Autoencoders used for in Text-to-Speech? Fundamentally, for the same reasons as in image processing: to discover latent attributes \(z\) in the (audio) data in an unsupervised fashion. Here we list some of the most interesting TTS papers using VAEs:

- The paper Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech by Kim et al. uses a VAE together with normalizing flows to automatically discover alignments between input characters \(c\) and latent phonemes \(z\). This allows them to train a TTS model that supports explicit phoneme durations, without the need for explicit phoneme-character alignments.
- A similar approach is taken by Liu et al. in VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention. Using a very deep VAE (which they call VDVAE), they discover the acoustic-textual alignments automatically.

Discovering such alignments automatically is tremendously helpful, since annotating speech alignments manually is extremely costly and tedious.