
Variational Autoencoders: Intuitions and Math

Cristian Cardellino

Updated: Apr 4, 2024

Continuing with my series of articles about generative models, in this post we will explore a type of generative model built on one of the oldest deep learning techniques: the Variational Autoencoder [1] (or VAE). The reason to explore these models is that, although their architecture is somewhat simple, I have found the intuition behind them helpful for understanding larger and more complex models such as Generative Adversarial Networks (GANs) [2] or Diffusion Models [3]. We will see what Variational Autoencoders are and how they differ from other autoencoder models [4], and we will go a little deeper into the math behind them. So, without further ado, I invite you to keep reading.


What is an Autoencoder?


Before diving into what variational autoencoders are, we need to look at what an autoencoder is, which is the core architecture behind VAEs.


Autoencoder Architecture. Source: "From Autoencoder to Beta-VAE". https://lilianweng.github.io/posts/2018-08-12-vae/

An autoencoder is a neural network whose objective is to reconstruct its input. Although there are different variations of the autoencoder architecture, in the classical scenario the network has a built-in bottleneck, which forces the autoencoder to learn a way to reduce the dimensionality of the input. It can be seen as a combination of two feed-forward neural networks: the encoder and the decoder.

  • The encoder part of the autoencoder learns a function g, parameterized by ϕ, that maps the input x to a lower dimension, usually called the "latent dimension" and generally denoted by the variable z.

  • The decoder part of the autoencoder learns a function f, parameterized by θ, that takes the latent representation z and expands it back to the original input dimension of x.

Since the two networks that form the autoencoder are independent of each other, there is freedom to build them however you want, but in general they are mirrors of each other. For example, if the encoder is a feed-forward network with one hidden layer, the decoder will be the same network reversed. The parameters of the two networks are learned together during the same training procedure. The loss function for training an autoencoder is the reconstruction error of the input, generally a mean squared error objective:

Autoencoder Loss Function
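In symbols (a reconstruction of the formula above rather than a copy of the original figure, using the encoder g_ϕ and decoder f_θ defined earlier):

L_{AE}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \left\| x^{(i)} - f_\theta\left(g_\phi\left(x^{(i)}\right)\right) \right\|^2

To make the architecture concrete, here is a minimal PyTorch sketch of such an autoencoder. This is an illustrative example, not the article's original code; the layer sizes, the fully connected layers, and the flattened 784-dimensional input are assumptions.

import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder g_phi: maps the input x down to the latent vector z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder f_theta: mirrors the encoder, mapping z back to the input space
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent representation
        return self.decoder(z)   # reconstruction of x

model = Autoencoder()
x = torch.randn(64, 784)                    # a batch of flattened inputs
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
loss.backward()                             # both networks are trained jointly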

What are autoencoders learning?


The intuition behind autoencoders is that they remove unnecessary information from the input when performing the reduction. In theory, this could be done with any other dimensionality reduction technique, like Principal Component Analysis (PCA) [5] or Matrix Factorization [6], but autoencoders have the added advantage that they can use non-linearities to find more complex mappings between the vector spaces, which should, in theory, give them more representational power (although this advantage doesn't always hold in practice).

In an ideal scenario, autoencoders should learn latent attributes from the data they are trained on. For example, if the input data are images, an autoencoder should learn to distinguish certain intrinsic attributes of those images.



Latent Attributes of an Autoencoder.

In the image example, specific latent attributes can be gathered from the image that help the decoder model reconstruct the picture. Of course, this is a gross simplification of an ideal scenario, since the actual values of the latent dimensions and what they represent are challenging to analyze and depend on the data used to train the autoencoder.


Denoising Autoencoders: Learning to Remove Noise


A common problem with autoencoders, as with any feed-forward neural network with enough parameters, is the tendency to overfit the training data, especially when the architecture is large enough. One solution the machine learning community found for this problem was the work by Vincent et al. (2008) [7], which presented the "Denoising Autoencoder" (DAE, sometimes written dAE).

DAEs introduce a simple modification to the autoencoder architecture: corrupting the input, either by adding Gaussian noise or by randomly masking inputs with Dropout [8]. The model is then trained to remove the noise (denoise) from the corrupted data in order to reconstruct the original input.


Denoising Autoencoder Architecture. Source: "From Autoencoder to Beta-VAE". https://lilianweng.github.io/posts/2018-08-12-vae/

The target to reconstruct is still the original x, but the model passes the corrupted version of x through the autoencoder network. The loss is slightly modified to reflect this:

Denoising Autoencoder Loss
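In symbols (again a standard formulation rather than a copy of the figure), with \tilde{x}^{(i)} denoting the corrupted version of x^{(i)} produced by a corruption process \mathcal{M}_D such as additive Gaussian noise or random masking:

L_{DAE}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \left\| x^{(i)} - f_\theta\left(g_\phi\left(\tilde{x}^{(i)}\right)\right) \right\|^2, \qquad \tilde{x}^{(i)} \sim \mathcal{M}_D\left(\tilde{x} \mid x^{(i)}\right)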

This form of regularization is particularly useful for inputs like images, which are high-dimensional but contain a lot of redundancy. Because the model has to remove the noise in order to reconstruct the input, it cannot rely on a small number of input dimensions and is forced to exploit that redundancy, which yields a more robust latent representation and helps avoid overfitting.


The Limit of Autoencoders


Denoising autoencoders take us a little further in the direction of variational autoencoders, in the sense that they impose a particular structure on the learning process that yields more general representations. However, if we want a generative model, autoencoders still lack the structure to act as one.

In last week's article, we explored what diffusion models are [9]. One key aspect of these models is that they learn how to generate something from noise, gradually learning to reduce the noise until, for example, an image emerges. The idea of denoising autoencoders is somewhat similar, but they lack the fundamental structure to do this from any noisy input. The way the model learns means the decoder expects its input to come from a latent space specialized in compressing the training data, without any extra conditions on how that space is organized. If we randomly sample a vector from the latent space, there's no guarantee that the decoder will be able to generate something intelligible from it, let alone something that resembles the distribution of the data the model was trained on.


Decoding a Random Sample from an Autoencoder. Source: "Understanding Variational Autoencoders". https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

The image above puts this into perspective: the random sample (the purple point) need not be anything the decoder can reconstruct into something useful. This is where variational autoencoders come in handy. The idea of a VAE is not just to learn a reduction of the training data but to do so in a way that makes the decoder useful for generating new data. To achieve this, the VAE learns to map each input not to a fixed latent vector but to a distribution over the latent space.


Intuition Behind Variational Autoencoders


Autoencoders have the objective of learning latent representations that ideally map to some latent attributes. In the case of variational autoencoders, we want those latent attributes to be distributions:


Latent Attributes: Autoencoder vs Variational Autoencoders.

In the image above, there are two versions of the same idea: the autoencoder one and the VAE one. In the left column, for each image, the latent attribute of "smile" is represented as a single fixed value. In the right column, that attribute is represented as a spectrum of values following a probability distribution. If we apply the same logic to each of the latent dimensions the autoencoder learns, we have the idea behind a VAE:


Latent Attributes of a Variational Autoencoder.

Because of the way VAEs are trained, we force the encoder to map each training input not to a single point but to a range of values following a specific distribution that represents the data. The latent space becomes "regular", and the decoder is now able to reconstruct meaningful outputs from different points sampled from this latent space:


Sampling from a Variational Autoencoder Latent Space.

VAEs aim to find an encoder that maps the training data to a complete and continuous latent space distribution.

  • A complete distribution means that if we sample a random point from the latent space and give it to the decoder, it should produce a "meaningful" decoded output in terms of the dataset distribution. For example, if we trained the model with human faces, the output should resemble a face, i.e., a pair of eyes over a nose over a mouth.

  • A continuous distribution means that if we sample two points that are close to each other in the latent space, the output of these two inputs should be similar when decoded.

These two conditions do not hold for a plain autoencoder, which therefore cannot generate data from an arbitrary random sample of the latent space.


Irregular vs. Regular Latent Spaces. Source: "Understanding Variational Autoencoders". https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

The image above graphically shows the behavior of an autoencoder vs a variational autoencoder.


Variational Autoencoders


Now that we have a better understanding of the intuition behind VAEs and how they differ from regular autoencoders, we can delve a little into how we can achieve this idea of a regular latent space.


Probabilistic Background


The latent space is represented by a latent variable z, which generates the observations x in the original data space from which we want to generate new samples. From the data, we can only observe x, and what we want to do is infer z from these observations. Thus, we want the probability of z given an observation of x, p(z|x), which by Bayes' theorem can be calculated as:

Probability of Z given X.
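That is:

p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}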

However, this has a problem: the denominator of the equation requires the probability of x, which is given by:

Probability of X
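That is, marginalizing over the latent variable z:

p(x) = \int p(x \mid z)\, p(z)\, dz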

The problem is that this is an intractable distribution, since it is computationally infeasible to integrate over all possible values of z to obtain the probability of x. The solution proposed by Kingma and Welling in their work [1] was to use variational inference to approximate it. The idea of variational inference is to approximate the intractable distribution p(z|x) with a tractable distribution q(z|x) that we define ourselves.

To measure how similar two distributions are, we can use the Kullback-Leibler divergence (or KL divergence) [10]; we fit our approximation by minimizing the KL divergence between q and p:

KL Divergence between q and p.
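By definition, this divergence is:

D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[ \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)} \right]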

We want to minimize this divergence with respect to ϕ and θ; following the derivation [11], we reach the following result:

KL Divergence Minimization
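Namely (the standard result, obtained by expanding p_θ(z|x) with Bayes' theorem inside the expectation):

D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) = \log p_\theta(x) + D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) - \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]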

We can rearrange this as:

KL Divergence Rearrangement
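That is:

\log p_\theta(x) - D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right)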

From this, the term on the left is the one we want to maximize: the log-likelihood of generating accurate data (i.e., log p(x)) while minimizing the difference between the real distribution p and our estimated distribution q. Thus, negating it, we obtain the loss function of VAEs:

ELBO Loss Function
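In its standard form:

L_{VAE}(\theta, \phi) = -\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right)

with -L_{VAE}(\theta, \phi) \le \log p_\theta(x).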

The negative of this loss is called the evidence lower bound (or ELBO), since it lower-bounds the log-likelihood log p(x); minimizing the VAE loss is therefore equivalent to maximizing the ELBO.


Implementation


So, how do we implement VAEs? As we said before, we are trying to learn the distribution of the latent representation, or in other words, its mean and standard deviation (assuming, as is standard, a Gaussian distribution). Thus, instead of the encoder outputting a fixed vector in the latent space, it returns the mean and standard deviation for each dimension of the latent space. During training, the decoder samples from the distribution produced by this probabilistic encoder and reconstructs the original input.


Variational Autoencoder Architecture. Source: "From Autoencoder to Beta-VAE". https://lilianweng.github.io/posts/2018-08-12-vae/

One problem when training VAEs is the need to sample values of z from q, which is a stochastic operation through which gradients cannot be backpropagated. Thus, we cannot train the model using classic backpropagation alone. The solution proposed by Kingma and Welling [1] was what they called the "reparameterization trick": to approximate the random variable z, we sample a value from a Gaussian distribution with mean 0 and standard deviation 1 and then shift and scale it with the mean and standard deviation produced by the probabilistic encoder:


Reparameterization Trick.
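In symbols, with μ and σ being the mean and standard deviation produced by the encoder:

z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

Since the randomness now lives entirely in ε, gradients can flow through μ and σ.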

Once we have the sampled value, we can backpropagate as usual:


Decoder Model Backpropagation.
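A minimal PyTorch sketch of this training step might look as follows. It is an illustrative example under the usual Gaussian assumptions (a standard normal prior and an encoder that outputs the mean and log-variance of q(z|x)), not the original paper's code; the layer sizes and the mean-squared-error reconstruction term are assumptions.

import torch
from torch import nn
from torch.nn import functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.decoder(z), mu, logvar

model = VAE()
x = torch.randn(64, 784)                       # a batch of flattened inputs
x_hat, mu, logvar = model(x)
recon = F.mse_loss(x_hat, x, reduction="sum")  # reconstruction term
# Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I)
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl                              # negative ELBO
loss.backward()                                # gradients flow through mu and logvar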

Visualization of VAEs Latent Dimensions


As we explained before, the idea of using VAEs is to regularize the latent space so that it is complete and continuous when the decoder samples from it. But how does the ELBO objective help with that? One way to see the KL divergence term in the ELBO is as a regularization term: the stronger it is, the more regularized the space becomes, at the cost of each encoding being less representative of its input. We can see this in the following image:


VAEs Latent Space Visualization. Source: "Intuitively Understanding Variational Autoencoders". https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf

As you can see from the image above, if we ignore the KL divergence and only focus on reconstruction, the latent space ends up distributed in isolated clusters, none of which has any direct connection to what the VAE is representing, and if we sample a random point from this space there is no guarantee that the decoder will be able to generate anything useful. On the other hand, if we only use the KL divergence, there is no way for the encoding to discern which data point each input corresponds to. Finally, the combination of both terms in the loss gives us a complete and continuous latent space, where sampled points can be fed to the probabilistic decoder to generate new data points that don't have to be present in the original training dataset.


Final Remarks


In this article, we explored Variational Autoencoders (or VAEs), starting with the autoencoders that provide their basis. VAEs are powerful tools and were among the first generative models successfully trained and used for image generation. They might not be as powerful as diffusion models, but they provide a nice background for understanding them better. Moreover, I feel they are among the simplest generative models one can fully grasp in terms of technical background.


References

[1] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114

[2] Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., & Bharath, A. A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), 53-65. https://arxiv.org/abs/1406.2661

[3] Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32. https://proceedings.neurips.cc/paper_files/paper/2019/hash/3001ef257407d5a371a96dcd947c7d93-Abstract.html

[4] Hinton, G. E., & Zemel, R. (1993). Autoencoders, minimum description length and Helmholtz free energy. Advances in Neural Information Processing Systems, 6. https://proceedings.neurips.cc/paper/1993/hash/9e3cfc48eccf81a0d57663e129aef3cb-Abstract.html

[5] Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical transactions. Series A, Mathematical, physical, and engineering sciences, 374(2065), 20150202. https://doi.org/10.1098/rsta.2015.0202

[6] Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37. https://ieeexplore.ieee.org/document/5197422

[7] Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008, July). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (pp. 1096-1103). https://dl.acm.org/doi/10.1145/1390156.1390294

[8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.

[9] Cardellino, C. (2024). "A Brief Overview of Diffusion Models and Their Applications" https://www.transcendent-ai.com/post/a-brief-overview-of-diffusion-models-and-their-applications

[10] Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86. https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-22/issue-1/On-Information-and-Sufficiency/10.1214/aoms/1177729694.full

[11] Weng, L. (2018). "From Autoencoder to Beta-VAE" https://lilianweng.github.io/posts/2018-08-12-vae/
