I. Introduction and Autoencoder

  • Many applications, such as image synthesis, denoising, super-resolution, speech synthesis, or compression, require going beyond classification and regression and explicitly modeling a high-dimensional signal.
  • This modeling consists of finding “meaningful degrees of freedom”, or “factors of variation”, that describe the signal and are of lower dimension.
  • Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of representation learning. Specifically, we design a neural network architecture that imposes a bottleneck in the network, forcing a compressed knowledge representation of the original input.

Mathematically, an Autoencoder is a composite function made of

  • an encoder f from the original space \mathcal{X} to a latent space \mathcal{Z},
  • a decoder g to map back to \mathcal{X},

such that  g \circ f is close to the identity on the data.

Let p(\mathbf{x}) be the data distribution over \mathcal{X}. A good auto-encoder can be characterized by the reconstruction loss

 \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[\|\mathbf{x}-g \circ f(\mathbf{x})\|^{2}\right] \approx 0

Given two parameterized mappings  f\left(\cdot ; \theta_{f}\right) \text { and } g\left(\cdot ; \theta_{g}\right), training consists of minimizing an empirical estimate of that loss:

 \theta_{f}, \theta_{g}=\arg \min _{\theta_{f}, \theta_{g}} \frac{1}{N} \sum_{i=1}^{N}\left\|\mathbf{x}_{i}-g\left(f\left(\mathbf{x}_{i}, \theta_{f}\right), \theta_{g}\right)\right\|^{2}

For example, when the auto-encoder is linear,
 \begin{aligned} &f: \mathbf{z}=\mathbf{U}^{T} \mathbf{x} \\ &g: \hat{\mathbf{x}}=\mathbf{U} \mathbf{z} \end{aligned}

with  \mathbf{U} \in \mathbb{R}^{p \times d}, the reconstruction error reduces to
 \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[\left\|\mathbf{x}-\mathbf{U} \mathbf{U}^{T} \mathbf{x}\right\|^{2}\right]
In this case, an optimal solution is given by PCA.
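
As an illustrative sketch, the snippet below computes the PCA projection \mathbf{U} with an SVD and evaluates the reconstruction error of the corresponding linear auto-encoder; the synthetic data and the dimensions p = 10, d = 2 are arbitrary assumptions.

# Hypothetical check: a linear auto-encoder built from the top-d principal
# components, evaluated on synthetic data (rows of `x` are samples).
import torch

p, d, N = 10, 2, 1000
x = torch.randn(N, p) @ torch.randn(p, p)      # correlated synthetic data
x = x - x.mean(dim=0)                          # center the data

_, _, Vt = torch.linalg.svd(x, full_matrices=False)
U = Vt[:d].T                                   # top-d principal directions, shape (p, d)

z = x @ U                                      # encoder f: z = U^T x (row convention)
x_hat = z @ U.T                                # decoder g: x_hat = U z
print(((x - x_hat) ** 2).sum(dim=1).mean())    # empirical reconstruction error
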
Deep auto-encoders

Better results can be achieved with more sophisticated classes of mappings than linear projections: use deep neural networks for f and g.
For instance,

  • by combining a multi-layer perceptron encoder  f: \mathbb{R}^{p} \rightarrow \mathbb{R}^{d} with a multi-layer perceptron decoder  g: \mathbb{R}^{d} \rightarrow \mathbb{R}^{p} (a minimal sketch of this variant follows the list);
  • by combining a convolutional network encoder  f: \mathbb{R}^{w \times h \times c} \rightarrow \mathbb{R}^{d} with a decoder  g: \mathbb{R}^{d} \rightarrow \mathbb{R}^{w \times h \times c} composed of transposed convolutional layers.
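
A minimal sketch of the MLP variant (the layer sizes, the activation, the optimizer, and the random stand-in minibatch are illustrative assumptions, not the exact architecture used for the MNIST comparison below):

# Hypothetical deep (MLP) auto-encoder trained on the reconstruction loss.
import torch
from torch import nn

p, d = 784, 16                                      # input and latent dimensions
f = nn.Sequential(nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, d))   # encoder
g = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p))   # decoder
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x = torch.rand(128, p)                              # stand-in for a minibatch of flattened images
for _ in range(100):
    loss = ((x - g(f(x))) ** 2).sum(dim=1).mean()   # empirical reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()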

Comparing an autoencoder with PCA on the MNIST dataset

Denoising Autoencoder
Since the autoencoder learns the identity function, it risks “overfitting” when the network has more parameters than there are data points.
To avoid overfitting and improve robustness, the Denoising Autoencoder (Vincent et al., 2008) modifies the basic autoencoder: the input is partially corrupted, by adding noise to or masking some values of the input vector in a stochastic manner,  \tilde{\mathbf{x}} \sim \mathcal{M}_{\mathcal{D}}(\tilde{\mathbf{x}} \mid \mathbf{x}) . The model is then trained to recover the original input (not the corrupted one).
 \begin{aligned} \tilde{\mathbf{x}}^{(i)} & \sim \mathcal{M}_{\mathcal{D}}\left(\tilde{\mathbf{x}}^{(i)} \mid \mathbf{x}^{(i)}\right) \\ L_{\mathrm{DAE}}\left(\theta_{f}, \theta_{g}\right) &=\frac{1}{N} \sum_{i=1}^{N}\left\|\mathbf{x}^{(i)}-g\left(f\left(\tilde{\mathbf{x}}^{(i)} ; \theta_{f}\right) ; \theta_{g}\right)\right\|^{2} \end{aligned}
where \mathcal{M}_{\mathcal{D}} defines the mapping from the true data samples to the noisy or corrupted ones.
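
A minimal sketch of this training scheme, assuming the corruption \mathcal{M}_{\mathcal{D}} is random masking of 30% of the input entries and reusing the illustrative MLP encoder/decoder from the sketch above (both choices are assumptions):

# Hypothetical denoising-auto-encoder step: corrupt the input, reconstruct the clean input.
import torch
from torch import nn

p, d = 784, 16
f = nn.Sequential(nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, d))   # encoder
g = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p))   # decoder
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x = torch.rand(128, p)                               # clean minibatch
mask = (torch.rand_like(x) > 0.3).float()            # keep each entry with probability 0.7
x_tilde = x * mask                                   # corrupted input  x~ ~ M_D(x~ | x)
loss = ((x - g(f(x_tilde))) ** 2).sum(dim=1).mean()  # compare against the *clean* x
opt.zero_grad()
loss.backward()
opt.step()
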
Sampling from an AE’s latent space
The generative capability of the decoder g in an auto-encoder can be assessed by introducing a (simple) density model q over the latent space \mathcal{Z}, sampling from it, and mapping the samples into the data space \mathcal{X} with g.

For instance, a factored Gaussian model with a diagonal covariance matrix,
 q(\mathbf{z})=\mathcal{N}(\hat{\mu}, \hat{\Sigma})
where both \hat{\mu} \text { and } \hat{\Sigma} are estimated on training data.
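
A minimal sketch of this procedure, assuming f and g are a trained encoder/decoder pair (e.g. the MLPs above) and x_train holds the training data:

# Fit a factored Gaussian on the latent codes, sample from it, and decode.
import torch

x_train = torch.rand(1000, 784)                   # stand-in for the training set
with torch.no_grad():
    z_train = f(x_train)                          # latent codes of the training data
    mu_hat = z_train.mean(dim=0)                  # estimated mean
    sigma_hat = z_train.std(dim=0)                # estimated (diagonal) standard deviation

    z = mu_hat + sigma_hat * torch.randn(16, z_train.shape[1])   # z ~ q(z)
    x_samples = g(z)                              # map the samples back to data space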

The resulting samples are usually not satisfactory, because the density model on the latent space is too simple and inadequate.
Building a good model in latent space amounts to our original problem of modeling an empirical distribution, although it may now be in a lower-dimensional space.

II. Variational Inference

Latent variable model

Consider for now a prescribed latent variable model that relates a set of observable variables  \mathbf{x} \in \mathcal{X} to a set of unobserved variables \mathbf{z} \in \mathcal{Z}.

The probabilistic model defines a joint probability distribution  p(\mathbf{x}, \mathbf{z}) , which decomposes as
 p(\mathbf{x}, \mathbf{z})=p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})
For a given model p(\mathbf{x}, \mathbf{z}), inference consists in computing the posterior
 p(\mathbf{z} \mid \mathbf{x})=\frac{p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})}{p(\mathbf{x})}
For most interesting cases, this is intractable since it requires evaluating the evidence
 p(\mathbf{x})=\int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) d \mathbf{z},
which itself has no tractable closed form in general.
Variational inference turns posterior inference into an optimization problem.

  • Consider a family of distributions q(\mathbf{z}|\mathbf{x}; \nu) that approximate the posterior p(\mathbf{z}|\mathbf{x}), where the variational parameters \nu index the family of distributions.
  • The parameters \nu are fit to minimize the KL divergence between the approximation q(\mathbf{z}|\mathbf{x};\nu) and the posterior p(\mathbf{z}|\mathbf{x}).
    Formally, we want to solve

 \begin{aligned} & \arg \min _{\nu} \mathrm{KL}(q(\mathbf{z} \mid \mathbf{x} ; \nu) \| p(\mathbf{z} \mid \mathbf{x})) \\ =& \arg \min _{\nu} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}\left[\log \frac{q(\mathbf{z} \mid \mathbf{x} ; \nu)}{p(\mathbf{z} \mid \mathbf{x})}\right] \\ =& \arg \min _{\nu} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}[\log q(\mathbf{z} \mid \mathbf{x} ; \nu)-\log p(\mathbf{x}, \mathbf{z})]+\log p(\mathbf{x}) \end{aligned}

For the same reason as before, the KL divergence cannot be directly minimized because of the \log p(\mathbf{x}) term.
However, we can write

 \begin{aligned} & \arg \min _{\nu} \mathrm{KL}(q(\mathbf{z} \mid \mathbf{x} ; \nu)|| p(\mathbf{z} \mid \mathbf{x})) \\ =& \arg \min _{\nu} \log p(\mathbf{x})-\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}[\log p(\mathbf{x}, \mathbf{z})-\log q(\mathbf{z} \mid \mathbf{x} ; \nu)] \\ =& \arg \max _{\nu} \underbrace{\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}[\log p(\mathbf{x}, \mathbf{z})-\log q(\mathbf{z} \mid \mathbf{x} ; \nu)]}_{\operatorname{ELBO}(\mathbf{x} ; \nu)} \end{aligned}
where \text{ELBO}(\mathbf{x};\nu) is called the evidence lower bound objective.

  • Since \log p(\mathbf{x}) does not depend on \nu, it can be considered as a constant, and minimizing the KL divergence is equivalent to maximizing the evidence lower bound, while being computationally tractable.
  • Given a dataset  \mathbf{d}=\left\{\mathbf{x}_{i} \mid i=1, \ldots, N\right\} , the final objective is the sum  \sum_{\left\{\mathbf{x}_{i} \in \mathbf{d}\right\}} \mathrm{ELBO}\left(\mathbf{x}_{i} ; \nu\right)

Remark that
 \begin{aligned} \operatorname{ELBO}(\mathbf{x} ; \nu) &=\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}[\log p(\mathbf{x}, \mathbf{z})-\log q(\mathbf{z} \mid \mathbf{x} ; \nu)] \\ &=\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}[\log p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})-\log q(\mathbf{z} \mid \mathbf{x} ; \nu)] \\ &=\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}[\log p(\mathbf{x} \mid \mathbf{z})]-\operatorname{KL}(q(\mathbf{z} \mid \mathbf{x} ; \nu) \| p(\mathbf{z})) \end{aligned}
Therefore, maximizing the ELBO:

  • encourages distributions to place their mass on configurations of latent variables that explain the observed data (first term);
  • encourages distributions close to the prior (second term).

Optimization
We want
 \begin{aligned} \nu^{*} &=\arg \max _{\nu} \mathrm{ELBO}(\mathbf{x} ; \nu) \\ &=\arg \max _{\nu} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \nu)}[\log p(\mathbf{x}, \mathbf{z})-\log q(\mathbf{z} \mid \mathbf{x} ; \nu)] \end{aligned}
We can proceed by gradient ascent, provided we can evaluate  \nabla_{\nu} \operatorname{ELBO}(\mathbf{x} ; \nu) .

Evidence Lower Bound (ELBO)

  • A VAE learns stochastic mappings between an observed \mathbf{x}-space, whose empirical distribution q_{\mathcal{D}}(\mathbf{x}) is typically complicated, and a latent \mathbf{z}-space, whose distribution can be relatively simple (such as a spherical Gaussian).
  • The generative model learns a joint distribution  p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) that is often (but not always) factorized as  p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})=p_{\boldsymbol{\theta}}(\mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z}) , with a prior distribution over latent space p_{\boldsymbol{\theta}}(\mathbf{z}) and a stochastic decoder p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z}). The stochastic encoder  q_{\phi}(\mathbf{z} \mid \mathbf{x}) , also called the inference model, approximates the true but intractable posterior  p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x}) of the generative model.

   \begin{aligned} \log p_{\boldsymbol{\theta}}(\mathbf{x}) &=\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x})\right] \\ &=\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})}\right]\right] \\ &=\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z} \mid \mathbf{x})} \frac{q_{\phi}(\mathbf{z} \mid \mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})}\right]\right] \\ &=\underbrace{\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\right]\right]}_{=\mathcal{L}_{\boldsymbol{\theta}, \phi}(\mathbf{x})}+\underbrace{\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{q_{\phi}(\mathbf{z} \mid \mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})}\right]\right]}_{=D_{K L}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})\right)} \end{aligned}
The second term is the Kullback-Leibler (KL) divergence between q_{\phi}(\mathbf{z} \mid \mathbf{x}) and p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x}), which is non-negative:
 D_{K L}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})\right) \geq 0
and zero if, and only if, q_{\phi}(\mathbf{z} \mid \mathbf{x}) equals the true posterior distribution.

The first term is the variational lower bound, also called the evidence lower bound (ELBO):
 \mathcal{L}_{\theta, \phi}(\mathbf{x})=\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})-\log q_{\phi}(\mathbf{z} \mid \mathbf{x})\right]

Due to the non-negativity of the KL divergence, the ELBO is a lower bound on the log-likelihood of the data.

 \begin{aligned} \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}) &=\log p_{\boldsymbol{\theta}}(\mathbf{x})-D_{K L}\left(q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})\right) \\ & \leq \log p_{\boldsymbol{\theta}}(\mathbf{x}) \end{aligned}

So, interestingly, the KL divergence  D_{K L}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})\right) determines two ’distances’:

  • By definition, the KL divergence of the approximate posterior from the true posterior;
  • The gap between the ELBO \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}) and the marginal likelihood  \log p_{\boldsymbol{\theta}}(\mathbf{x}) ; this is also called the tightness of the bound. The better q_{\phi}(\mathbf{z} \mid \mathbf{x}) approximates the true (posterior) distribution p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x}), in terms of the KL divergence, the smaller the gap.
Figure: Simple schematic of computational flow in a variational autoencoder.

Two for One
It can be understood that maximizing the ELBO \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}) w.r.t. the parameters \boldsymbol{\theta} and \boldsymbol{\phi} will concurrently optimize the two things we care about:

  • It will approximately maximize the marginal likelihood p_{\boldsymbol{\theta}}(\mathbf{x}). This means that our generative model will become better.
  • It will minimize the KL divergence of the approximation q_{\phi}(\mathbf{z} \mid \mathbf{x}) from the true posterior p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x}), so q_{\phi}(\mathbf{z} \mid \mathbf{x}) becomes better.

III. Variational Autoencoders

A variational auto-encoder is a deep latent variable model where:

  • The prior p(\mathbf{z}) is prescribed, and usually chosen to be Gaussian.
  • The likelihood p(\mathbf{x} \mid \mathbf{z} ; \theta) is parameterized with a generative network \text{NN}_\theta (or decoder) that takes as input \mathbf{z} and outputs parameters \phi = \text{NN}_\theta(\mathbf{z}) to the data distribution.
     \begin{aligned} \mu, \sigma &=\mathbf{N N}_{\theta}(\mathbf{z}) \\ p(\mathbf{x} \mid \mathbf{z} ; \theta) &=\mathcal{N}\left(\mathbf{x} ; \mu, \sigma^{2} \mathbf{I}\right) \end{aligned}

  • The approximate posterior q(\mathbf{z} \mid \mathbf{x} ; \varphi) is parameterized with an inference network \text{NN}_\varphi (or encoder) that takes as input \mathbf{x} and outputs parameters \nu = \text{NN}_\varphi(\mathbf{x}) to the approximate posterior.
     \begin{aligned} \mu, \sigma &=\operatorname{NN}_{\varphi}(\mathbf{x}) \\ q(\mathbf{z} \mid \mathbf{x} ; \varphi) &=\mathcal{N}\left(\mathbf{z} ; \mu, \sigma^{2} \mathbf{I}\right) \end{aligned}


Stochastic Gradient-Based Optimization of the ELBO

As before, we can use variational inference, but now to jointly optimize the generative and inference network parameters \theta and \varphi.
We want
 \begin{aligned} \theta^{*}, \varphi^{*} &=\arg \max _{\theta, \varphi} \operatorname{ELBO}(\mathbf{x} ; \theta, \varphi) \\ &=\arg \max _{\theta, \varphi} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \varphi)}[\log p(\mathbf{x}, \mathbf{z} ; \theta)-\log q(\mathbf{z} \mid \mathbf{x} ; \varphi)] \\ &=\arg \max _{\theta, \varphi} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \varphi)}[\log p(\mathbf{x} \mid \mathbf{z} ; \theta)]-\operatorname{KL}(q(\mathbf{z} \mid \mathbf{x} ; \varphi) \| p(\mathbf{z})) \end{aligned}

Given a dataset with i.i.d. data, the ELBO objective is the sum (or average) of individual-datapoint ELBO’s:

 \mathcal{L}_{\boldsymbol{\theta}, \phi}(\mathcal{D})=\sum_{\mathbf{x} \in \mathcal{D}} \mathcal{L}_{\boldsymbol{\theta}, \phi}(\mathbf{x})

The individual-datapoint ELBO and its gradient  \nabla_{\boldsymbol{\theta}, \phi} \mathcal{L}_{\boldsymbol{\theta}, \phi}(\mathbf{x}) are, in general, intractable. However, good unbiased estimators  \tilde{\nabla}_{\boldsymbol{\theta}, \phi} \mathcal{L}_{\boldsymbol{\theta}, \phi}(\mathbf{x}) exist, as we will show, such that we can still perform minibatch SGD.

Unbiased gradients of the ELBO w.r.t. the generative model parameters \boldsymbol{\theta} are simple to obtain:

 \begin{aligned} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}) &=\nabla_{\boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\right] \\ &=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})}\left[\nabla_{\boldsymbol{\theta}}\left(\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\right)\right] \\ & \simeq \nabla_{\boldsymbol{\theta}}\left(\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\right) \\ &=\nabla_{\boldsymbol{\theta}}\left(\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})\right) \end{aligned}
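
A sketch of this single-sample estimator, assuming a unit-variance Gaussian decoder and an arbitrary stand-in for q_{\phi}(\mathbf{z} \mid \mathbf{x}); additive constants that do not depend on \boldsymbol{\theta} are dropped:

# Sample z ~ q_phi(z|x), then differentiate log p_theta(x, z) w.r.t. theta only.
import torch
from torch import nn

p, d = 784, 16
decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p))  # parameters theta

x = torch.rand(1, p)
mu_q, sigma_q = torch.zeros(1, d), torch.ones(1, d)      # stand-in for q_phi(z|x)
z = mu_q + sigma_q * torch.randn(1, d)                   # z ~ q_phi(z|x)

log_p_z = -0.5 * (z ** 2).sum()                          # log p(z) with p(z) = N(0, I), up to a constant
log_p_x_given_z = -0.5 * ((x - decoder(z)) ** 2).sum()   # log p(x|z; theta), unit variance, up to a constant
(log_p_z + log_p_x_given_z).backward()                   # single-sample estimate of grad_theta, stored in .grad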

Unbiased gradients w.r.t. the variational parameters \boldsymbol{\phi} are more difficult to obtain, since the ELBO’s expectation is taken w.r.t. the distribution q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x}), which is a function of \boldsymbol{\phi} . I.e., in general:

 \begin{aligned} \nabla_{\boldsymbol{\phi}} \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}) &=\nabla_{\boldsymbol{\phi}} \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\right] \\ & \neq \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\nabla_{\boldsymbol{\phi}}\left(\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})\right)\right] \end{aligned}

In the case of continuous latent variables, we can use a reparameterization trick for computing unbiased estimates of \nabla_{\boldsymbol{\phi}} \mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\phi}}(\mathbf{x}), as we will now discuss.

Reparameterization Trick
For continuous latent variables and a differentiable encoder and generative model, the ELBO can be straightforwardly differentiated w.r.t. both \boldsymbol{\phi} and \boldsymbol{\theta} through a change of variables, also called the reparameterization trick.

Change of variables

First, we express the random variable  \mathbf{z} \sim q_{\phi}(\mathbf{z} \mid \mathbf{x}) as some differentiable (and invertible) transformation of another random variable \boldsymbol{\epsilon}, given \mathbf{x} and \boldsymbol{\phi}:
 \mathbf{z}=\mathbf{g}(\epsilon, \phi, \mathbf{x})
where the distribution of random variable \boldsymbol{\epsilon} is independent of \mathbf{x} or \boldsymbol{\phi}.

Gradient of expectation under change of variable
Given such a change of variable, expectations can be rewritten in terms of \boldsymbol{\epsilon}

 \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}[f(\mathbf{z})]=\mathbb{E}_{p(\boldsymbol{\epsilon})}[f(\mathbf{z})]

where  \mathbf{z}=\mathbf{g}(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x}) . The expectation and gradient operators then commute, and we can form a simple Monte Carlo estimator:

 \begin{aligned} \nabla_{\phi} \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}[f(\mathbf{z})] &=\nabla_{\phi} \mathbb{E}_{p(\boldsymbol{\epsilon})}[f(\mathbf{z})] \\ &=\mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\nabla_{\phi} f(\mathbf{z})\right] \\ & \simeq \nabla_{\phi} f(\mathbf{z}) \end{aligned}
where in the last line,  \mathbf{z}=\mathbf{g}(\phi, \mathbf{x}, \boldsymbol{\epsilon}) with random noise sample  \boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon}) .

Figure: Illustration of the reparameterization trick. The variational parameters \boldsymbol{\phi} affect the objective f through the random variable  \mathbf{z} \sim q_{\phi}(\mathbf{z} \mid \mathbf{x}). We wish to compute gradients \nabla_{\phi} f to optimize the objective with SGD. In the original form (left), we cannot differentiate f w.r.t. \boldsymbol{\phi}, because we cannot directly backpropagate gradients through the random variable \mathbf{z}. We can “externalize” the randomness in \mathbf{z} by re-parameterizing the variable as a deterministic and differentiable function of \boldsymbol{\phi}, \mathbf{x}, and a newly introduced random variable \mathbf{\epsilon}. This allows us to “backprop through \mathbf{z}”, and compute gradients \nabla_{\phi} f.

For example, if q(\mathbf{z}|\mathbf{x};\varphi) = \mathcal{N}(\mathbf{z}; \mu(\mathbf{x};\varphi), \sigma^2(\mathbf{x};\varphi)), where \mu(\mathbf{x};\varphi) and \sigma^2(\mathbf{x};\varphi) are the outputs of the inference network \text{NN}_\varphi, then a common reparameterization is:

 \begin{aligned} p(\epsilon) &=\mathcal{N}(\epsilon ; \mathbf{0}, \mathbf{I}) \\ \mathbf{z} &=\mu(\mathbf{x} ; \varphi)+\sigma(\mathbf{x} ; \varphi) \odot \epsilon \end{aligned}
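
A sketch of this reparameterization in PyTorch (the inference network and its sizes are illustrative assumptions); gradients reach the parameters \varphi because \mathbf{z} is a deterministic, differentiable function of \mu, \sigma and \epsilon:

import torch
from torch import nn

p, d = 784, 16
nn_phi = nn.Sequential(nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, 2 * d))  # inference network

x = torch.rand(8, p)
mu, log_var = nn_phi(x).chunk(2, dim=1)          # outputs of NN_phi(x)
sigma = torch.exp(0.5 * log_var)

eps = torch.randn_like(sigma)                    # eps ~ N(0, I), independent of x and phi
z = mu + sigma * eps                             # z ~ q(z|x; phi), differentiable w.r.t. phi

z.sum().backward()                               # gradients flow into the parameters of nn_phi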

Given such a change of variable, the ELBO can be rewritten as:

 \begin{aligned} \operatorname{ELBO}(\mathbf{x} ; \theta, \varphi) &=\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \varphi)}[f(\mathbf{x}, \mathbf{z} ; \varphi)] \\ &=\mathbb{E}_{p(\epsilon)}[f(\mathbf{x}, g(\varphi, \mathbf{x}, \epsilon) ; \varphi)] \end{aligned}

Therefore,

 \begin{aligned} \nabla_{\varphi} \mathrm{ELBO}(\mathbf{x} ; \theta, \varphi) &=\nabla_{\varphi} \mathbb{E}_{p(\epsilon)}[f(\mathbf{x}, g(\varphi, \mathbf{x}, \epsilon) ; \varphi)] \\ &=\mathbb{E}_{p(\epsilon)}\left[\nabla_{\varphi} f(\mathbf{x}, g(\varphi, \mathbf{x}, \epsilon) ; \varphi)\right] \end{aligned}

which we can now estimate with Monte Carlo integration.

The last required ingredient is the evaluation of the density q(\mathbf{z}|\mathbf{x};\varphi) given the change of variable g. As long as g is invertible, we have:

 \log q(\mathbf{z} \mid \mathbf{x} ; \varphi)=\log p(\epsilon)-\log \left|\operatorname{det}\left(\frac{\partial \mathbf{z}}{\partial \epsilon}\right)\right|

The Jacobian matrix contains all first derivatives of the transformation from \mathbf{\epsilon} to \mathbf{z}:

 \frac{\partial \mathbf{z}}{\partial \boldsymbol{\epsilon}}=\frac{\partial\left(z_{1}, \ldots, z_{k}\right)}{\partial\left(\epsilon_{1}, \ldots, \epsilon_{k}\right)}=\left(\begin{array}{ccc} \frac{\partial z_{1}}{\partial \epsilon_{1}} & \cdots & \frac{\partial z_{1}}{\partial \epsilon_{k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_{k}}{\partial \epsilon_{1}} & \cdots & \frac{\partial z_{k}}{\partial \epsilon_{k}} \end{array}\right)
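
For instance, with the Gaussian reparameterization  \mathbf{z}=\mu(\mathbf{x} ; \varphi)+\sigma(\mathbf{x} ; \varphi) \odot \epsilon , the Jacobian \frac{\partial \mathbf{z}}{\partial \boldsymbol{\epsilon}} is diagonal with entries \sigma_{j}(\mathbf{x} ; \varphi), so that

 \log q(\mathbf{z} \mid \mathbf{x} ; \varphi)=\log p(\boldsymbol{\epsilon})-\sum_{j=1}^{k} \log \sigma_{j}(\mathbf{x} ; \varphi)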

Example
Consider the following setup:

  • Generative model:

 \begin{aligned} \mathbf{z} & \in \mathbb{R}^{d} \\ p(\mathbf{z}) &=\mathcal{N}(\mathbf{z} ; \mathbf{0}, \mathbf{I}) \\ p(\mathbf{x} \mid \mathbf{z} ; \theta) &=\mathcal{N}\left(\mathbf{x} ; \mu(\mathbf{z} ; \theta), \sigma^{2}(\mathbf{z} ; \theta) \mathbf{I}\right) \\ \mu(\mathbf{z} ; \theta) &=\mathbf{W}_{2}^{T} \mathbf{h}+\mathbf{b}_{2} \\ \log \sigma^{2}(\mathbf{z} ; \theta) &=\mathbf{W}_{3}^{T} \mathbf{h}+\mathbf{b}_{3} \\ \mathbf{h} &=\operatorname{ReLU}\left(\mathbf{W}_{1}^{T} \mathbf{z}+\mathbf{b}_{1}\right) \\ \theta &=\left\{\mathbf{W}_{1}, \mathbf{b}_{1}, \mathbf{W}_{2}, \mathbf{b}_{2}, \mathbf{W}_{3}, \mathbf{b}_{3}\right\} \end{aligned}

  • Inference model

 \begin{aligned} q(\mathbf{z} \mid \mathbf{x} ; \varphi) &=\mathcal{N}\left(\mathbf{z} ; \mu(\mathbf{x} ; \varphi), \sigma^{2}(\mathbf{x} ; \varphi) \mathbf{I}\right) \\ p(\epsilon) &=\mathcal{N}(\epsilon ; \mathbf{0}, \mathbf{I}) \\ \mathbf{z} &=\mu(\mathbf{x} ; \varphi)+\sigma(\mathbf{x} ; \varphi) \odot \epsilon \\ \mu(\mathbf{x} ; \varphi) &=\mathbf{W}_{5}^{T} \mathbf{h}+\mathbf{b}_{5} \\ \log \sigma^{2}(\mathbf{x} ; \varphi) &=\mathbf{W}_{6}^{T} \mathbf{h}+\mathbf{b}_{6} \\ \mathbf{h} &=\operatorname{ReLU}\left(\mathbf{W}_{4}^{T} \mathbf{x}+\mathbf{b}_{4}\right) \\ \varphi &=\left\{\mathbf{W}_{4}, \mathbf{b}_{4}, \mathbf{W}_{5}, \mathbf{b}_{5}, \mathbf{W}_{6}, \mathbf{b}_{6}\right\} \end{aligned}

Note that there is no restriction on the generative and inference network architectures. They could as well be arbitrarily complex convolutional networks.
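
A sketch of these two networks in PyTorch (the hidden width 256 is an assumption; each nn.Linear layer holds one of the weight/bias pairs \mathbf{W}_1, \mathbf{b}_1, \ldots, \mathbf{W}_6, \mathbf{b}_6 from the equations above):

import torch
from torch import nn

p, d, h = 784, 16, 256

class Decoder(nn.Module):                 # generative model p(x|z; theta)
    def __init__(self):
        super().__init__()
        self.fc_h = nn.Linear(d, h)       # W_1, b_1
        self.fc_mu = nn.Linear(h, p)      # W_2, b_2
        self.fc_logvar = nn.Linear(h, p)  # W_3, b_3

    def forward(self, z):
        hidden = torch.relu(self.fc_h(z))
        return self.fc_mu(hidden), self.fc_logvar(hidden)   # mu(z; theta), log sigma^2(z; theta)

class Encoder(nn.Module):                 # inference model q(z|x; varphi)
    def __init__(self):
        super().__init__()
        self.fc_h = nn.Linear(p, h)       # W_4, b_4
        self.fc_mu = nn.Linear(h, d)      # W_5, b_5
        self.fc_logvar = nn.Linear(h, d)  # W_6, b_6

    def forward(self, x):
        hidden = torch.relu(self.fc_h(x))
        return self.fc_mu(hidden), self.fc_logvar(hidden)   # mu(x; varphi), log sigma^2(x; varphi)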

Plugging everything together, the objective can be expressed as:

 \begin{aligned} \mathrm{ELBO}(\mathbf{x} ; \theta, \varphi) &=\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \varphi)}[\log p(\mathbf{x}, \mathbf{z} ; \theta)-\log q(\mathbf{z} \mid \mathbf{x} ; \varphi)] \\ &=\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x} ; \varphi)}[\log p(\mathbf{x} \mid \mathbf{z} ; \theta)]-\mathrm{KL}(q(\mathbf{z} \mid \mathbf{x} ; \varphi) \| p(\mathbf{z})) \\ &=\mathbb{E}_{p(\epsilon)}[\log p(\mathbf{x} \mid \mathbf{z}=g(\varphi, \mathbf{x}, \epsilon) ; \theta)]-\mathrm{KL}(q(\mathbf{z} \mid \mathbf{x} ; \varphi) \| p(\mathbf{z})) \end{aligned}

where the KL divergence can be expressed analytically as

 \mathrm{KL}(q(\mathbf{z} \mid \mathbf{x} ; \varphi) \| p(\mathbf{z}))=-\frac{1}{2} \sum_{j=1}^{d}\left(1+\log \left(\sigma_{j}^{2}(\mathbf{x} ; \varphi)\right)-\mu_{j}^{2}(\mathbf{x} ; \varphi)-\sigma_{j}^{2}(\mathbf{x} ; \varphi)\right)

which allows its derivative to be evaluated exactly, without approximation.
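
As a sanity check (mu and log_var below are arbitrary stand-ins for the outputs of the inference network), the analytic expression can be compared against a Monte Carlo estimate of \mathbb{E}_{q}[\log q(\mathbf{z} \mid \mathbf{x} ; \varphi)-\log p(\mathbf{z})]:

import torch

d = 16
mu, log_var = torch.randn(1, d), torch.randn(1, d)
sigma = torch.exp(0.5 * log_var)

kl_analytic = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1)

eps = torch.randn(100_000, d)
z = mu + sigma * eps                                      # z ~ q(z|x; varphi)
log_q = (-0.5 * eps ** 2 - 0.5 * log_var).sum(dim=1)      # log N(z; mu, sigma^2 I), up to a constant
log_p = (-0.5 * z ** 2).sum(dim=1)                        # log N(z; 0, I), up to the same constant
print(kl_analytic.item(), (log_q - log_p).mean().item())  # the two values should be close
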

IV. Sample Python implementation

# Base Variational Autoencoder
from abc import abstractmethod
from typing import Any, List

import torch
from torch import nn, Tensor
from torch.nn import functional as F

class BaseVAE(nn.Module):
    
    def __init__(self) -> None:
        super(BaseVAE, self).__init__()

    def encode(self, input: Tensor) -> List[Tensor]:
        raise NotImplementedError

    def decode(self, input: Tensor) -> Any:
        raise NotImplementedError

    def sample(self, batch_size:int, current_device: int, **kwargs) -> Tensor:
        raise NotImplementedError

    def generate(self, x: Tensor, **kwargs) -> Tensor:
        raise NotImplementedError

    @abstractmethod
    def forward(self, *inputs: Tensor) -> Tensor:
        pass

    @abstractmethod
    def loss_function(self, *inputs: Any, **kwargs) -> Tensor:
        pass

class VanillaVAE(BaseVAE):


    def __init__(self,
                 in_channels: int,
                 latent_dim: int,
                 hidden_dims: List = None,
                 **kwargs) -> None:
        super(VanillaVAE, self).__init__()

        self.latent_dim = latent_dim

        modules = []
        if hidden_dims is None:
            hidden_dims = [32, 64, 128, 256, 512]

        # Build Encoder
        for h_dim in hidden_dims:
            modules.append(
                nn.Sequential(
                    nn.Conv2d(in_channels, out_channels=h_dim,
                              kernel_size=3, stride=2, padding=1),
                    nn.BatchNorm2d(h_dim),
                    nn.LeakyReLU())
            )
            in_channels = h_dim

        self.encoder = nn.Sequential(*modules)
        self.fc_mu = nn.Linear(hidden_dims[-1]*4, latent_dim)
        self.fc_var = nn.Linear(hidden_dims[-1]*4, latent_dim)


        # Build Decoder
        modules = []

        self.decoder_input = nn.Linear(latent_dim, hidden_dims[-1] * 4)

        hidden_dims.reverse()

        for i in range(len(hidden_dims) - 1):
            modules.append(
                nn.Sequential(
                    nn.ConvTranspose2d(hidden_dims[i],
                                       hidden_dims[i + 1],
                                       kernel_size=3,
                                       stride = 2,
                                       padding=1,
                                       output_padding=1),
                    nn.BatchNorm2d(hidden_dims[i + 1]),
                    nn.LeakyReLU())
            )



        self.decoder = nn.Sequential(*modules)

        self.final_layer = nn.Sequential(
                            nn.ConvTranspose2d(hidden_dims[-1],
                                               hidden_dims[-1],
                                               kernel_size=3,
                                               stride=2,
                                               padding=1,
                                               output_padding=1),
                            nn.BatchNorm2d(hidden_dims[-1]),
                            nn.LeakyReLU(),
                            nn.Conv2d(hidden_dims[-1], out_channels= 3,
                                      kernel_size= 3, padding= 1),
                            nn.Tanh())

    def encode(self, input: Tensor) -> List[Tensor]:
        """
        Encodes the input by passing through the encoder network
        and returns the latent codes.
        :param input: (Tensor) Input tensor to encoder [N x C x H x W]
        :return: (Tensor) List of latent codes
        """
        result = self.encoder(input)
        result = torch.flatten(result, start_dim=1)

        # Split the result into mu and var components
        # of the latent Gaussian distribution
        mu = self.fc_mu(result)
        log_var = self.fc_var(result)

        return [mu, log_var]

    def decode(self, z: Tensor) -> Tensor:
        """
        Maps the given latent codes
        onto the image space.
        :param z: (Tensor) [B x D]
        :return: (Tensor) [B x C x H x W]
        """
        result = self.decoder_input(z)
        result = result.view(-1, 512, 2, 2)
        result = self.decoder(result)
        result = self.final_layer(result)
        return result

    def reparameterize(self, mu: Tensor, logvar: Tensor) -> Tensor:
        """
        Reparameterization trick to sample from N(mu, var) from
        N(0,1).
        :param mu: (Tensor) Mean of the latent Gaussian [B x D]
        :param logvar: (Tensor) Log variance of the latent Gaussian [B x D]
        :return: (Tensor) [B x D]
        """
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return eps * std + mu

    def forward(self, input: Tensor, **kwargs) -> List[Tensor]:
        mu, log_var = self.encode(input)
        z = self.reparameterize(mu, log_var)
        return  [self.decode(z), input, mu, log_var]

    def loss_function(self,
                      *args,
                      **kwargs) -> dict:
        """
        Computes the VAE loss function.
        KL(N(\mu, \sigma), N(0, 1)) = \log \frac{1}{\sigma} + \frac{\sigma^2 + \mu^2}{2} - \frac{1}{2}
        :param args:
        :param kwargs:
        :return:
        """
        recons = args[0]
        input = args[1]
        mu = args[2]
        log_var = args[3]

        kld_weight = kwargs['M_N'] # Account for the minibatch samples from the dataset
        recons_loss = F.mse_loss(recons, input)


        kld_loss = torch.mean(-0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim = 1), dim = 0)

        loss = recons_loss + kld_weight * kld_loss
        return {'loss': loss, 'Reconstruction_Loss':recons_loss.detach(), 'KLD':-kld_loss.detach()}

    def sample(self,
               num_samples:int,
               current_device: int, **kwargs) -> Tensor:
        """
        Samples from the latent space and return the corresponding
        image space map.
        :param num_samples: (Int) Number of samples
        :param current_device: (Int) Device to run the model
        :return: (Tensor)
        """
        z = torch.randn(num_samples,
                        self.latent_dim)

        z = z.to(current_device)

        samples = self.decode(z)
        return samples

    def generate(self, x: Tensor, **kwargs) -> Tensor:
        """
        Given an input image x, returns the reconstructed image
        :param x: (Tensor) [B x C x H x W]
        :return: (Tensor) [B x C x H x W]
        """

        return self.forward(x)[0]
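
A quick usage sketch of the model above (the values are hypothetical; a 64x64 input is assumed so that the five stride-2 convolutions reduce the feature map to 2x2, matching the hidden_dims[-1] * 4 flattening):

model = VanillaVAE(in_channels=3, latent_dim=128)
x = torch.randn(8, 3, 64, 64)                           # a batch of 8 RGB images

recons, inputs, mu, log_var = model(x)
losses = model.loss_function(recons, inputs, mu, log_var,
                             M_N=8 / 10000)             # kld_weight, e.g. batch size / dataset size
print(losses['loss'].item(), recons.shape)              # scalar loss, torch.Size([8, 3, 64, 64])

samples = model.sample(num_samples=4, current_device='cpu')   # decode z ~ N(0, I)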

V. References

  1. Lilian Weng, From Autoencoder to Beta-VAE, https://lilianweng.github.io/posts/2018-08-12-vae/

  2. Diederik P. Kingma, Max Welling, An Introduction to Variational Autoencoders

  3. Diederik P. Kingma, Max Welling, Auto-Encoding Variational Bayes, ICLR 2014

  4. Lecture 10: Auto-encoders and variational auto-encoders, https://github.com/glouppe/info8010-deep-learning