Ph.D. '17 | Variational Inference and Deep Learning: A New Synthesis. #21
Related Reference
Dimensionality reduction, PCA and autoencoders

What is dimensionality reduction?

In machine learning, dimensionality reduction is the process of reducing the number of features that describe some data. This reduction is done either by selection (only some of the existing features are kept) or by extraction (a reduced number of new features is created from the old ones).
This reduction can be useful in many situations that require low-dimensional data (data visualisation, data storage, heavy computation…). Although there exist many different methods of dimensionality reduction, we can set a global framework that is matched by most (if not all) of these methods. First, let's call the encoder the process that produces the "new features" representation from the "old features" representation, and the decoder the reverse process that reconstructs the old representation from the new one.
The main purpose of a dimensionality reduction method is then to find the best encoder/decoder pair among a given family: for a given set of possible encoders E and decoders D, we look for the pair that keeps the maximum of information when encoding and, hence, gives the minimum reconstruction error when decoding.
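Written out explicitly (my reconstruction of the standard formulation this framework refers to, with ε denoting the chosen reconstruction error measure):

$$(e^*, d^*) \;=\; \underset{(e,d)\,\in\,E \times D}{\arg\min}\ \epsilon\big(x,\, d(e(x))\big)$$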
Principal components analysis (PCA)

The idea of PCA is to build n_e new independent features that are linear combinations of the n_d old features, such that the projections of the data on the subspace defined by these new features (the best linear subspace of the initial space, described by an orthogonal basis) are as close as possible to the initial data in terms of Euclidean distance. Translated into our global framework, we are looking for an encoder in the family E of the n_e by n_d matrices (linear transformations) whose rows are orthonormal (feature independence), and for the associated decoder among the family D of n_d by n_e matrices.
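As an illustration, here is a minimal sketch (my own, not code from the thesis; the helper name `pca_encoder_decoder` is made up) of PCA written explicitly as such a linear encoder/decoder pair:

```python
import numpy as np

def pca_encoder_decoder(X, n_e):
    """Build the linear encoder/decoder pair given by PCA for an n_e-dimensional subspace."""
    mean = X.mean(axis=0)
    # Rows of Vt form an orthonormal basis of the feature space (principal directions).
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:n_e]                               # n_e x n_d encoding matrix with orthonormal rows
    encode = lambda x: (x - mean) @ W.T        # projection onto the best linear subspace
    decode = lambda z: z @ W + mean            # mapping back to the initial space
    return encode, decode

X = np.random.randn(500, 10)                   # 500 points described by n_d = 10 features
encode, decode = pca_encoder_decoder(X, n_e=3)
reconstruction_error = np.mean((X - decode(encode(X))) ** 2)
```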
Autoencoders

The general idea of autoencoders is pretty simple and consists in setting the encoder and the decoder as neural networks and learning the best encoding-decoding scheme through an iterative optimization process that minimizes the reconstruction error between the input and the encoded-decoded output. Thus, intuitively, the overall autoencoder architecture (encoder + decoder) creates a bottleneck for data that ensures only the main structured part of the information can go through and be reconstructed.

Let's first suppose that both our encoder and decoder architectures have only one layer without non-linearity (linear autoencoder). Such an encoder and decoder are then simple linear transformations that can be expressed as matrices. In such a situation, we can see a clear link with PCA in the sense that, just like PCA does, we are looking for the best linear subspace to project the data on with as little information loss as possible. The encoding and decoding matrices obtained with PCA naturally define one of the solutions we would be satisfied to reach by gradient descent, but we should outline that it is not the only one: several bases can be chosen to describe the same optimal subspace and, so, several encoder/decoder pairs can give the optimal reconstruction error.

Now, let's assume that both the encoder and the decoder are deep and non-linear. In that case, the more complex the architecture is, the stronger the dimensionality reduction the autoencoder can achieve while keeping the reconstruction loss low. Intuitively, if our encoder and our decoder have enough degrees of freedom, we can reduce any initial dimensionality to 1: an encoder with "infinite power" could theoretically take our N initial data points and encode them as 1, 2, 3, … up to N (or, more generally, as N integers on the real axis), and the associated decoder could make the reverse transformation with no loss during the process. Here, we should however keep two things in mind: first, an important dimensionality reduction with no reconstruction loss often comes at the price of a lack of interpretable and exploitable structure in the latent space (lack of regularity); second, most of the time the purpose of dimensionality reduction is not only to reduce the number of dimensions of the data but to do so while keeping the major part of the data structure information in the reduced representations.
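For concreteness, here is a minimal sketch of a small non-linear autoencoder trained by gradient descent on the reconstruction error (assuming PyTorch; the layer sizes and the dummy data are arbitrary choices of mine, not taken from the thesis):

```python
import torch
import torch.nn as nn

n_d, n_e = 784, 32                      # arbitrary input and latent dimensions

encoder = nn.Sequential(nn.Linear(n_d, 128), nn.ReLU(), nn.Linear(128, n_e))
decoder = nn.Sequential(nn.Linear(n_e, 128), nn.ReLU(), nn.Linear(128, n_d))

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, n_d)                 # dummy batch standing in for real data
for _ in range(100):                    # iterative optimization by gradient descent
    x_hat = decoder(encoder(x))         # encode then decode through the bottleneck
    loss = ((x - x_hat) ** 2).mean()    # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```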
For these two reasons, the dimension of the latent space and the "depth" of the autoencoder (which define the degree and quality of the compression) have to be carefully controlled and adjusted depending on the final purpose of the dimensionality reduction.
Definition of variational autoencoders

So, in order to be able to use the decoder of our autoencoder for generative purposes, we have to be sure that the latent space is regular enough. One possible solution to obtain such regularity is to introduce explicit regularization during the training process. Thus, a variational autoencoder can be defined as an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties enabling a generative process. Instead of encoding an input as a single point, a variational autoencoder encodes it as a distribution over the latent space, and the loss function that is minimized is composed of a reconstruction term and a regularization term on the distributions returned by the encoder.
That regularization term is expressed as the Kullback-Leibler divergence between the returned distribution and a standard Gaussian. We can notice that the Kullback-Leibler divergence between two Gaussian distributions has a closed form that can be directly expressed in terms of the means and the covariance matrices of the two distributions.
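For reference, when the returned distribution is a Gaussian with mean μ and diagonal covariance diag(σ²) and the target is a standard Gaussian, this closed form is the standard identity (not quoted from the thesis):

$$KL\Big(\mathcal{N}\big(\mu, \operatorname{diag}(\sigma^2)\big)\,\big\|\,\mathcal{N}(0, I)\Big) \;=\; \frac{1}{2}\sum_{j}\Big(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\Big)$$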
Mathematical details of VAEs

In the previous section we gave the following intuitive overview: VAEs are autoencoders that encode inputs as distributions instead of points, and whose latent space organization is regularized by constraining the distributions returned by the encoder to be close to a standard Gaussian. In this section we give a more mathematical view of VAEs that allows the regularization term to be justified more rigorously.
Probabilistic framework and assumptions

Let's begin by defining a probabilistic graphical model to describe our data. We denote by x the variable that represents our data, and we assume that x is generated from a latent variable z (the encoded representation) that is not directly observed.
At this point, we can already notice that the regularization of the latent space that we lacked in simple autoencoders naturally appears here in the definition of the data generation process: encoded representations z in the latent space are assumed to follow the prior distribution p(z). Let's now make the following assumptions (regularization): p(z) is a standard Gaussian distribution, and p(x|z) is a Gaussian distribution whose mean is defined by a deterministic function f of the variable z and whose covariance matrix is a positive constant c multiplied by the identity matrix.
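In symbols, these assumptions read (my transcription of the standard formulation; f and c are the function and constant just introduced):

$$p(z) = \mathcal{N}(0, I), \qquad p(x \mid z) = \mathcal{N}\big(f(z),\, c\,I\big), \quad c > 0$$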
The function f is assumed to belong to a family of functions denoted F that is left unspecified for the moment and will be chosen later. Let's consider, for now, that f is well defined and fixed. In theory, as we know p(z) and p(x|z), we can use the Bayes theorem to compute the posterior p(z|x): this is a classical Bayesian inference problem. However, this kind of computation is often intractable (because of the integral in the denominator) and requires the use of an approximation technique such as variational inference.

Note. Here we can mention that, although p(z) and p(x|z) are both Gaussian, the posterior p(z|x) has in general no simple closed form because the mean of p(x|z) depends on z through the non-linear function f; this is why we resort to an approximation technique such as variational inference.
Variational inference formulation

In statistics, variational inference (VI) is a technique to approximate complex distributions. The idea is to set a parameterized family of distributions (for example the family of Gaussians, whose parameters are the mean and the covariance) and to look for the best approximation of our target distribution among this family. The best element in the family is the one that minimizes a given approximation error measurement (most of the time the Kullback-Leibler divergence between approximation and target) and is found by gradient descent over the parameters that describe the family. Please refer to Variational Inference for more details.

Here, we choose to approximate the posterior p(z|x) by a Gaussian distribution q_x(z) whose mean and covariance are defined by two functions, g and h, of the variable x (with g and h belonging to two families of functions G and H, left unspecified for now). So, we have defined this way a family of candidates for variational inference and we now need to find the best approximation among this family by optimizing the functions g and h (in fact, their parameters) to minimize the Kullback-Leibler divergence between the approximation and the target p(z|x). In this optimization, we can observe the trade-off that exists, when approximating the posterior p(z|x), between maximizing the likelihood of the "observations" (the expected log-likelihood term) and staying close to the prior distribution (the Kullback-Leibler divergence between q_x(z) and p(z)).
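Concretely, the optimization this paragraph describes can be written as follows (my reconstruction of the standard derivation; q_x(z) is the approximation parameterized by g and h):

$$(g^*, h^*) \;=\; \underset{(g,h)\,\in\,G\times H}{\arg\min}\ KL\big(q_x(z)\,\|\,p(z\mid x)\big) \;=\; \underset{(g,h)\,\in\,G\times H}{\arg\max}\ \Big(\mathbb{E}_{z\sim q_x}\big[\log p(x\mid z)\big] \;-\; KL\big(q_x(z)\,\|\,p(z)\big)\Big)$$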
This trade-off is natural for Bayesian inference problems and expresses the balance that needs to be found between the confidence we have in the data and the confidence we have in the prior.

Up to now, we have assumed the function f known and fixed, and we have shown that, under such an assumption, we can approximate the posterior p(z|x) by variational inference. In practice, however, this function f, which defines the decoder, is not known and also needs to be chosen. Since our initial goal is to find a performant encoding-decoding scheme whose latent space is regular enough to be used for generative purposes, we want to choose the f that maximizes the expected log-likelihood of x given z when z is sampled from the approximate posterior q_x(z).
Gathering all the pieces together, we are therefore looking for the optimal f*, g* and h* such that

$$(f^*, g^*, h^*) \;=\; \underset{(f,g,h)\,\in\,F\times G\times H}{\arg\max}\ \Big(\mathbb{E}_{z\sim q_x}\big[\log p(x\mid z)\big] \;-\; KL\big(q_x(z)\,\|\,p(z)\big)\Big)$$

where q_x(z) is the approximate posterior defined by g and h. We can identify in this objective function the elements introduced in the intuitive description of VAEs given in the previous section: the reconstruction term (the expected log-likelihood of x given z, which, under our Gaussian assumption on p(x|z), is a squared error between x and f(z) up to a constant) and the regularization term (the Kullback-Leibler divergence between q_x(z) and the prior p(z)).
We can also notice the constant c that rules the balance between these two terms: the higher c is, the more variance we assume around f(z) for the probabilistic decoder and, so, the more weight is given to the regularization term over the reconstruction term (and the opposite when c is low).
Bringing neural networks into the model

Up to now, we have set a probabilistic model that depends on three functions, f, g and h, and have expressed, using variational inference, the optimization problem to solve in order to get f*, g* and h*. As we can't easily optimize over entire spaces of functions, we constrain the optimization domain and decide to express f, g and h as neural networks: F, G and H then correspond to the families of functions defined by the chosen network architectures, and the optimization is done over the parameters of these networks.

In practice, g and h are not defined by two completely independent networks but share a part of their architecture and their weights. As it defines the covariance matrix of q_x(z), h(x) should in theory be a square matrix; however, to simplify the computation and reduce the number of parameters, we make the additional assumption that q_x(z) is a multidimensional Gaussian with a diagonal covariance matrix, so that h(x) is simply the vector of the diagonal elements of that covariance matrix.

Contrarily to the encoder part that models q_x(z), a Gaussian whose mean and covariance are both functions of x, our model assumes for p(x|z) a Gaussian with fixed covariance; the function f of z that defines the mean of this distribution is modelled by another neural network, the decoder.

The overall architecture is then obtained by concatenating the encoder and the decoder parts. However, we still need to be very careful about the way we sample from the distribution returned by the encoder during training: the sampling process has to be expressed in a way that allows the error to be backpropagated through the network. A simple trick, called the reparameterization trick, is used to make the gradient descent possible despite the random sampling that occurs halfway through the architecture. It consists in using the fact that if z is a random variable following a Gaussian distribution with mean g(x) and diagonal covariance given by h(x), then it can be expressed as z = g(x) + σ(x) ⊙ ζ, where σ(x) is the vector of standard deviations deduced from h(x) and ζ is drawn from a standard Gaussian N(0, I).

Finally, the objective function of the variational autoencoder architecture obtained this way is given by the last equation of the previous subsection, in which the theoretical expectation is replaced by a more or less accurate Monte-Carlo approximation that consists, most of the time, of a single draw. So, considering this approximation and denoting C = 1/(2c), we recover a loss function composed of a reconstruction term, a regularization term and a constant that sets the relative weights of these two terms.
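Putting these pieces together, here is a minimal sketch of the reparameterization trick and of the resulting single-draw loss (assuming PyTorch; the layer sizes, the dummy data and the value of the constant C are arbitrary choices of mine, not values from the thesis):

```python
import torch
import torch.nn as nn

n_d, n_e = 784, 16                                 # arbitrary data and latent dimensions

encoder_body = nn.Sequential(nn.Linear(n_d, 128), nn.ReLU())   # shared part of g and h
g_head = nn.Linear(128, n_e)                       # mean of q_x(z)
h_head = nn.Linear(128, n_e)                       # log-variance of q_x(z) (diagonal covariance)
decoder = nn.Sequential(nn.Linear(n_e, 128), nn.ReLU(), nn.Linear(128, n_d))  # models f(z)

params = (list(encoder_body.parameters()) + list(g_head.parameters())
          + list(h_head.parameters()) + list(decoder.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
C = 1.0                                            # relative weight of the two terms (plays the role of 1/(2c))

x = torch.rand(64, n_d)                            # dummy batch standing in for real data
for _ in range(100):
    hidden = encoder_body(x)
    mu, log_var = g_head(hidden), h_head(hidden)
    # Reparameterization trick: z = g(x) + sigma(x) * zeta with zeta ~ N(0, I),
    # so gradients can flow through mu and log_var despite the random sampling.
    zeta = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * zeta       # single Monte-Carlo draw
    x_hat = decoder(z)
    reconstruction = ((x - x_hat) ** 2).sum(dim=1).mean()
    kl = 0.5 * (log_var.exp() + mu ** 2 - 1 - log_var).sum(dim=1).mean()
    loss = C * reconstruction + kl                 # reconstruction + regularization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```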
Takeaways

The main takeaways of this article are:

- dimensionality reduction is the process of reducing the number of features that describe some data;
- autoencoders are encoder/decoder architectures, trained by gradient descent to minimize a reconstruction error, that create an information bottleneck for the data;
- the latent space of a plain autoencoder can lack the regularity required for generative purposes;
- variational autoencoders are autoencoders whose training is regularized: the encoder returns a distribution instead of a point, and a Kullback-Leibler divergence to a standard Gaussian is added to the loss;
- this loss function can be derived rigorously from a probabilistic model and variational inference.
To conclude, we can note that, in recent years, GANs have benefited from many more scientific contributions than VAEs. Among other reasons, the higher interest shown by the community for GANs can be partly explained by the higher degree of complexity of the theoretical basis of VAEs (probabilistic model and variational inference) compared with the simplicity of the adversarial training concept that rules GANs.
Kingma, Diederik P. Variational Inference and Deep Learning: A New Synthesis. Ph.D. thesis, University of Amsterdam, 2017.