Variational Autoencoder
Introduction
In my previous article “Expectation Maximization Algorithm”, we discussed how to optimize probabilistic models with latent variables using the Expectation Maximization (EM) algorithm.
In this article, I would like to discuss how to optimize probabilistic models with latent variables using Variational Autoencoder (VAE).
Variational Autoencoder
Similar to the maximization-maximization interpretation of the EM algorithm, the log evidence of a probabilistic model with parameters $\boldsymbol{\theta}$, $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, can be expressed as the sum of an evidence lower bound (ELBO) and a Kullback-Leibler (KL) divergence.
$$
\begin{align}
\log p_{\boldsymbol{\theta}}(\mathbf{x})
&= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}) \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) q(\mathbf{z})}{q(\mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} + \log \frac{q(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] + \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{q(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] - \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})}{q(\mathbf{z})} \right] \\
&= \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}(q(\mathbf{z}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x}))
\end{align}
$$
where $\text{ELBO}(q, \boldsymbol{\theta})$ is the evidence lower bound (ELBO) and $D_{\text{KL}}(q(\mathbf{z}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x}))$ is the Kullback-Leibler (KL) divergence between the approximate posterior $q(\mathbf{z})$ and the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$.
In the iterative EM algorithm, the approximate posterior $q(\mathbf{z})$ is not learned explicitly. In the VAE, it is learned explicitly using an inference model: the approximate posterior is parameterized by a neural network $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ conditioned on the observed data $\mathbf{x}$, where $\boldsymbol{\phi}$ are the variational parameters of the neural network (so called because they parameterize the variational distribution $q(\mathbf{z} | \mathbf{x})$). Concretely,
$$
\begin{align}
\log p_{\boldsymbol{\theta}}(\mathbf{x})
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \right] + \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log \frac{q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} \right] \\
&= \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}) + D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x}))
\end{align}
$$
where $\text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta})$ is the evidence lower bound (ELBO) and $D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x}))$ is the Kullback-Leibler (KL) divergence between the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ and the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$.
Model Parameters Optimization
In the $\text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta})$, $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$ is called the generative model and $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ is called the inference model. The marginal $p_{\boldsymbol{\theta}}(\mathbf{x})$ is also part of the generative model, even though it is not modeled explicitly.
$D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x}))$ is non-negative and it is zero if and only if the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ is equal to the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$.
If the parameters $\boldsymbol{\theta}$ are fixed and only the parameters $\boldsymbol{\phi}$ are optimized, maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ and the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$, making the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ as close as possible to the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$.
If the parameters $\boldsymbol{\phi}$ are fixed and only the parameters $\boldsymbol{\theta}$ are optimized, maximizing the ELBO is approximately equivalent to maximizing the log evidence $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, making the generative model $p_{\boldsymbol{\theta}}(\mathbf{x})$ better.
Therefore, the parameters $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ can be jointly optimized by maximizing the ELBO, so that the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ becomes close to the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$ and the generative model $p_{\boldsymbol{\theta}}(\mathbf{x})$ improves.
Gradient-based optimization algorithms can be used to optimize the parameters $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$. The target function is defined as follows.
$$
\begin{align}
\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta}) \\
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right]
\end{align}
$$
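Expanding $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z})$, the same target function can equivalently be written as a reconstruction term minus a KL divergence to the prior, which is the form most VAE implementations compute.
$$
\begin{align}
\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] - D_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) || p_{\boldsymbol{\theta}}(\mathbf{z}))
\end{align}
$$
Maximizing the first term encourages accurate reconstruction of $\mathbf{x}$, while the second term keeps the approximate posterior close to the prior.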
The gradients of the target function $\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})$ with respect to the generative model parameters $\boldsymbol{\theta}$ are straightforward to obtain.
$$
\begin{align}
\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \nabla_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \nabla_{\boldsymbol{\theta}} \left( \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right) \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \nabla_{\boldsymbol{\theta}} \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right] \\
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right]
\end{align}
$$
This $\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})$ can be approximated using an unbiased Monte Carlo estimate by sampling $\mathbf{z}^{(i)} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$.
$$
\begin{align}
\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&\approx \frac{1}{M} \sum_{i=1}^{M} \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(i)})
\end{align}
$$
where $M$ is the number of samples, which can be as small as $1$.
The gradients of the target function $\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})$ with respect to the inference model parameters $\boldsymbol{\phi}$ are somewhat more complicated to obtain.
$$
\begin{align}
\nabla_{\boldsymbol{\phi}} \mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \nabla_{\boldsymbol{\phi}} \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right]
\end{align}
$$
Here we encounter the same problem as in the EM algorithm. The distribution that the latent variables $\mathbf{z}$ follow is parameterized by $\boldsymbol{\phi}$, and $\boldsymbol{\phi}$ are also the parameters being optimized. Therefore, the Monte Carlo estimate of $\mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right]$ obtained by sampling $\mathbf{z}^{(i)} \sim q_{\boldsymbol{\phi}^{\text{old}}}(\mathbf{z} | \mathbf{x})$ from a fixed older posterior is biased. Moreover, even if the Monte Carlo estimate were unbiased, the sampling process itself is not differentiable with respect to $\boldsymbol{\phi}$, so gradients could not be backpropagated through it.
If the latent variables are instead obtained from auxiliary random variables, whose distribution does not depend on $\boldsymbol{\phi}$, via a deterministic transformation, then an unbiased Monte Carlo estimate can be obtained and the sampling process becomes differentiable with respect to $\boldsymbol{\phi}$. This is the idea of the reparameterization trick.
Concretely, we introduce a differentiable and invertible transformation $g$ that maps auxiliary random variables $\boldsymbol{\epsilon}$, whose distribution does not depend on $\boldsymbol{\phi}$, to the latent variables $\mathbf{z}$, which do depend on $\boldsymbol{\phi}$.
$$
\begin{align}
\mathbf{z} &= g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x}) \\
\boldsymbol{\epsilon} &\sim p(\boldsymbol{\epsilon})
\end{align}
$$
The distribution $p(\boldsymbol{\epsilon})$ and the transformation $g$ are chosen together such that the resulting $\mathbf{z}$ follows exactly $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$.
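For example, when the approximate posterior is chosen to be a diagonal Gaussian, $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) = \mathcal{N}\left(\mathbf{z}; \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}), \operatorname{diag}(\boldsymbol{\sigma}_{\boldsymbol{\phi}}^{2}(\mathbf{x}))\right)$, a natural choice is
$$
\begin{align}
\boldsymbol{\epsilon} &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
\mathbf{z} &= g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x}) = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}
\end{align}
$$
so that $\mathbf{z}$ indeed follows $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$.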
As a result, the target function $\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})$ can be expressed as an expectation over the latent variables $\boldsymbol{\epsilon}$ that are not parameterized by $\boldsymbol{\phi}$ and this target function is completely differentiable with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$.
$$
\begin{align}
\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right]
\end{align}
$$
where $\mathbf{z} = g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x})$.
The gradients of the target function $\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})$ with respect to the generative model parameters $\boldsymbol{\theta}$ and the inference model parameters $\boldsymbol{\phi}$ become straightforward to obtain.
$$
\begin{align}
\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \nabla_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \nabla_{\boldsymbol{\theta}} \left( \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right) \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right]
\end{align}
$$
$$
\begin{align}
\nabla_{\boldsymbol{\phi}} \mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \nabla_{\boldsymbol{\phi}} \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \nabla_{\boldsymbol{\phi}} \left( \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right) \right]
\end{align}
$$
Under the reparameterization trick, the change of variables formula for probability densities expresses the distribution $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ in terms of the distribution $p(\boldsymbol{\epsilon})$.
$$
\begin{align}
q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) &= \frac{p(\boldsymbol{\epsilon})}{\left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right|}
\end{align}
$$
where $\mathbf{J}_g(\boldsymbol{\epsilon})$ is the Jacobian matrix of the transformation $g$ at $\boldsymbol{\epsilon}$.
$$
\begin{align}
\mathbf{J}_g(\boldsymbol{\epsilon})
&= \frac{\partial \mathbf{z}}{\partial \boldsymbol{\epsilon}} \\
&= \begin{bmatrix}
\frac{\partial z_1}{\partial \epsilon_1} & \frac{\partial z_1}{\partial \epsilon_2} & \cdots & \frac{\partial z_1}{\partial \epsilon_n} \\
\frac{\partial z_2}{\partial \epsilon_1} & \frac{\partial z_2}{\partial \epsilon_2} & \cdots & \frac{\partial z_2}{\partial \epsilon_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial z_n}{\partial \epsilon_1} & \frac{\partial z_n}{\partial \epsilon_2} & \cdots & \frac{\partial z_n}{\partial \epsilon_n}
\end{bmatrix}
\end{align}
$$
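Continuing the diagonal Gaussian example above, the transformation $g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x}) = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}$ has the diagonal Jacobian $\mathbf{J}_g(\boldsymbol{\epsilon}) = \operatorname{diag}(\boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}))$, so that
$$
\begin{align}
\log \left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right| &= \sum_{j} \log \sigma_{\boldsymbol{\phi}, j}(\mathbf{x}) \\
\log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) &= \log p(\boldsymbol{\epsilon}) - \sum_{j} \log \sigma_{\boldsymbol{\phi}, j}(\mathbf{x})
\end{align}
$$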
Thus, the target function $\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})$ can be rewritten as follows.
$$
\begin{align}
\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x}) \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log \frac{p(\boldsymbol{\epsilon})}{\left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right|} \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) - \log p(\boldsymbol{\epsilon}) + \log \left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right| \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log \left( p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z}) \right) - \log p(\boldsymbol{\epsilon}) + \log \left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right| \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) + \log p_{\boldsymbol{\theta}}(\mathbf{z}) - \log p(\boldsymbol{\epsilon}) + \log \left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right| \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x} | g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x})) + \log p_{\boldsymbol{\theta}}(g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x})) - \log p(\boldsymbol{\epsilon}) + \log \left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right| \right] \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x} | g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x})) + \log p_{\boldsymbol{\theta}}(g(\boldsymbol{\epsilon}, \boldsymbol{\phi}, \mathbf{x})) + \log \left| \det \mathbf{J}_g(\boldsymbol{\epsilon}) \right| \right] + \text{constant}
\end{align}
$$
where $\text{constant}$ is just $\mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})} \left[- \log p(\boldsymbol{\epsilon})\right]$ which does not depend on $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$.
The gradients of the target function $\mathcal{L}_{\boldsymbol{\phi}, \boldsymbol{\theta}}(\mathbf{x})$ with respect to the generative model parameters $\boldsymbol{\theta}$ and the inference model parameters $\boldsymbol{\phi}$ can be approximated with an unbiased Monte Carlo estimate by sampling $\boldsymbol{\epsilon}^{(i)} \sim p(\boldsymbol{\epsilon})$, and they are easily computed in automatic differentiation frameworks such as PyTorch and TensorFlow.
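To make this concrete, the following is a minimal PyTorch-style sketch of a single-sample ($M = 1$) reparameterized ELBO estimate for a toy VAE, assuming a diagonal Gaussian inference model, a standard Gaussian prior, and a Bernoulli generative model; the architectures, dimensions, and names (`encoder`, `decoder`, `elbo`) are hypothetical and only meant to show how the gradients with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ flow through the reparameterized sample.

```python
import math

import torch
import torch.nn as nn

# Hypothetical dimensions and architectures, for illustration only.
x_dim, z_dim, h_dim = 784, 16, 256

# Inference model q_phi(z | x): outputs the mean and log-variance of a diagonal Gaussian.
encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, 2 * z_dim))
# Generative model p_theta(x | z): outputs Bernoulli logits, with prior p_theta(z) = N(0, I).
decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))


def elbo(x):
    # q_phi(z | x) parameters from the inference model (phi = encoder parameters).
    mu, log_var = encoder(x).chunk(2, dim=-1)
    # Reparameterization trick: z = g(eps, phi, x) = mu + sigma * eps, eps ~ N(0, I).
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    # log p_theta(x | z): Bernoulli log-likelihood (theta = decoder parameters).
    log_px_z = -nn.functional.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none"
    ).sum(-1)
    # log p_theta(z): standard Gaussian prior density evaluated at z.
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(-1)
    # log q_phi(z | x): diagonal Gaussian density evaluated at z.
    log_qz_x = -0.5 * (eps ** 2 + log_var + math.log(2 * math.pi)).sum(-1)
    # Single-sample (M = 1) Monte Carlo estimate of the ELBO, averaged over the batch.
    return (log_px_z + log_pz - log_qz_x).mean()


x = torch.rand(8, x_dim)  # toy batch with values in [0, 1]
loss = -elbo(x)           # maximizing the ELBO = minimizing the negative ELBO
loss.backward()           # gradients w.r.t. both theta and phi flow through z
```

Because $\mathbf{z}$ is computed as a deterministic, differentiable function of $\boldsymbol{\epsilon}$, $\boldsymbol{\phi}$, and $\mathbf{x}$, a single call to `loss.backward()` yields unbiased single-sample gradient estimates for both the inference and the generative model parameters.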
Generative Model Evaluation
To evaluate the performance of the generative model $p_{\boldsymbol{\theta}}(\mathbf{x})$, the log marginal likelihood $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ can be estimated via importance sampling.
$$
\begin{align}
\log p_{\boldsymbol{\theta}}(\mathbf{x})
&= \log \mathbb{E}_{\mathbf{z} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \left[ \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})} \right] \\
&\approx \log \frac{1}{M} \sum_{i=1}^{M} \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(i)})}{q_{\boldsymbol{\phi}}(\mathbf{z}^{(i)} | \mathbf{x})}
\end{align}
$$
where $\mathbf{z}^{(i)} \sim q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ and $M$ is the number of samples.
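Below is a hedged sketch of this importance-sampling estimator, reusing the hypothetical diagonal Gaussian encoder and Bernoulli decoder from the sketch above (redefined here so the snippet is self-contained); `torch.logsumexp` is used so that the average of the importance weights is computed in log space for numerical stability.

```python
import math

import torch
import torch.nn as nn

x_dim, z_dim, h_dim = 784, 16, 256  # hypothetical dimensions
encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, 2 * z_dim))
decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))


@torch.no_grad()
def log_marginal_likelihood(x, num_samples=64):
    # x: (batch, x_dim). Returns an importance-sampled estimate of log p_theta(x) per data point.
    mu, log_var = encoder(x).chunk(2, dim=-1)
    std = torch.exp(0.5 * log_var)
    # Draw M samples z^(i) ~ q_phi(z | x) per data point: shape (M, batch, z_dim).
    eps = torch.randn(num_samples, *mu.shape)
    z = mu + std * eps
    # log q_phi(z^(i) | x), log p_theta(z^(i)), and log p_theta(x | z^(i)).
    log_qz_x = -0.5 * (eps ** 2 + log_var + math.log(2 * math.pi)).sum(-1)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(-1)
    log_px_z = -nn.functional.binary_cross_entropy_with_logits(
        decoder(z), x.expand(num_samples, *x.shape), reduction="none"
    ).sum(-1)
    # log (1/M) sum_i p_theta(x, z^(i)) / q_phi(z^(i) | x), computed stably with logsumexp.
    return torch.logsumexp(log_px_z + log_pz - log_qz_x, dim=0) - math.log(num_samples)


x = torch.rand(8, x_dim)
print(log_marginal_likelihood(x))
```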
One can also sample from the generative model $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$ and evaluate the quality of the generated samples $\mathbf{x}$ by other means, such as visual inspection. To generate a sample, we first draw the latent variables $\mathbf{z}$ from the prior distribution $p_{\boldsymbol{\theta}}(\mathbf{z})$ and then draw the observed data $\mathbf{x}$ from the conditional distribution $p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z})$.
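A minimal sketch of this ancestral sampling procedure, again under the hypothetical standard Gaussian prior and Bernoulli decoder assumed in the earlier sketches:

```python
import torch
import torch.nn as nn

x_dim, z_dim, h_dim = 784, 16, 256  # hypothetical dimensions
decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))


@torch.no_grad()
def sample(num_samples=16):
    # Ancestral sampling: z ~ p_theta(z), then x ~ p_theta(x | z).
    z = torch.randn(num_samples, z_dim)  # prior p_theta(z) = N(0, I)
    probs = torch.sigmoid(decoder(z))    # Bernoulli means of p_theta(x | z)
    return torch.bernoulli(probs)        # sampled observations x


print(sample().shape)  # torch.Size([16, 784])
```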
Increasing the Inference Model Flexibility
What if the inference model $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ is not flexible enough to approximate the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$?
The tightness of the ELBO depends on the flexibility of the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$. If $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ is not flexible enough to approximate the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$, the ELBO may not be tight enough to approximate the log evidence $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ well. In this case, the generative model $p_{\boldsymbol{\theta}}(\mathbf{x})$ may not be learned well.
In many simple settings, the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ is just a simple multivariate Gaussian distribution, which is insufficient to approximate the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$ when the true posterior is more complex.
Similar to what we have discussed in the article “Expectation Maximization Algorithm”, having more hierarchical latent variables can make the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z} | \mathbf{x})$ more flexible and the ELBO tighter. In our case, we will introduce more hierarchical latent variables to both the generative model and the inference model.
Suppose we have $n$ sets of latent variables $\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}$ and the generative model $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n})$ and the inference model $q_{\boldsymbol{\phi}}(\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n} | \mathbf{x})$. The ELBO can be expressed as follows.
$$
\begin{align}
\text{ELBO}(\boldsymbol{\phi}, \boldsymbol{\theta})
&= \mathbb{E}_{\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n} \sim q_{\boldsymbol{\phi}}(\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}) - \log q_{\boldsymbol{\phi}}(\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n} | \mathbf{x}) \right]
\end{align}
$$
In the generative model, we can introduce more hierarchical latent variables by factorizing the joint distribution of the observed data $\mathbf{x}$ and the latent variables $\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}$.
$$
\begin{align}
p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n})
&= p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}) p_{\boldsymbol{\theta}}(\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}) \\
&= p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}) p_{\boldsymbol{\theta}}(\mathbf{z}_{1} | \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}) p_{\boldsymbol{\theta}}(\mathbf{z}_{2}, \mathbf{z}_{3}, \cdots, \mathbf{z}_{n}) \\
&\quad \vdots \\
&= p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}) p_{\boldsymbol{\theta}}(\mathbf{z}_{1} | \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}) p_{\boldsymbol{\theta}}(\mathbf{z}_{2} | \mathbf{z}_{3}, \cdots, \mathbf{z}_{n}) \cdots p_{\boldsymbol{\theta}}(\mathbf{z}_{n})
\end{align}
$$
Given the generative model, there can be different ways to introduce more hierarchical latent variables to the inference model.
One common way is to follow the same hierarchy as the generative model.
$$
\begin{align}
q_{\boldsymbol{\phi}}(\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n} | \mathbf{x})
&= q_{\boldsymbol{\phi}}(\mathbf{z}_{1} | \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}, \mathbf{x}) q_{\boldsymbol{\phi}}(\mathbf{z}_{2} | \mathbf{z}_{3}, \cdots, \mathbf{z}_{n}, \mathbf{x}) \cdots q_{\boldsymbol{\phi}}(\mathbf{z}_{n} | \mathbf{x})
\end{align}
$$
In the first step, the inference model will sample the latent variables $\mathbf{z}_{n}$ from the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}_{n} | \mathbf{x})$, and the sampled latent variables $\mathbf{z}_{n}$ can be used to compute the prior $p_{\boldsymbol{\theta}}(\mathbf{z}_{n})$.
In the second step, the inference model will sample the latent variables $\mathbf{z}_{n-1}$ from the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}_{n-1} | \mathbf{z}_{n}, \mathbf{x})$, and the sampled latent variables $\mathbf{z}_{n-1}$ and $\mathbf{z}_{n}$ can be used to compute the prior $p_{\boldsymbol{\theta}}(\mathbf{z}_{n-1} | \mathbf{z}_{n})$.
In the $n$-th step, the inference model will sample the latent variables $\mathbf{z}_{1}$ from the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}_{1} | \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}, \mathbf{x})$, and the sampled latent variables $\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}$ can be used to compute the prior $p_{\boldsymbol{\theta}}(\mathbf{z}_{1} | \mathbf{z}_{2}, \cdots, \mathbf{z}_{n})$.
This kind of inference model is sometimes also called the top-down inference model.
The other common way is to reverse the hierarchy of the generative model.
$$
\begin{align}
q_{\boldsymbol{\phi}}(\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n} | \mathbf{x})
&= q_{\boldsymbol{\phi}}(\mathbf{z}_{n} | \mathbf{x}, \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n-1}) q_{\boldsymbol{\phi}}(\mathbf{z}_{n-1} | \mathbf{x}, \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n-2}) \cdots q_{\boldsymbol{\phi}}(\mathbf{z}_{1} | \mathbf{x})
\end{align}
$$
In the first step, the inference model will sample the latent variables $\mathbf{z}_{1}$ from the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}_{1} | \mathbf{x})$.
In the second step, the inference model will sample the latent variables $\mathbf{z}_{2}$ from the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}_{2} | \mathbf{x}, \mathbf{z}_{1})$.
In the $n$-th step, the inference model will sample the latent variables $\mathbf{z}_{n}$ from the approximate posterior $q_{\boldsymbol{\phi}}(\mathbf{z}_{n} | \mathbf{x}, \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n-1})$.
Once all the latent variables $\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}$ are sampled, the sampled latent variables can be used to compute the prior $p_{\boldsymbol{\theta}}(\mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n})$ in the generative model.
This kind of inference model is sometimes also called the bottom-up inference model.
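For concreteness, with $n = 2$ groups of latent variables, the generative model and the two inference factorizations read
$$
\begin{align}
p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}_{1}, \mathbf{z}_{2}) &= p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}_{1}, \mathbf{z}_{2}) p_{\boldsymbol{\theta}}(\mathbf{z}_{1} | \mathbf{z}_{2}) p_{\boldsymbol{\theta}}(\mathbf{z}_{2}) \\
q_{\boldsymbol{\phi}}^{\text{top-down}}(\mathbf{z}_{1}, \mathbf{z}_{2} | \mathbf{x}) &= q_{\boldsymbol{\phi}}(\mathbf{z}_{1} | \mathbf{z}_{2}, \mathbf{x}) q_{\boldsymbol{\phi}}(\mathbf{z}_{2} | \mathbf{x}) \\
q_{\boldsymbol{\phi}}^{\text{bottom-up}}(\mathbf{z}_{1}, \mathbf{z}_{2} | \mathbf{x}) &= q_{\boldsymbol{\phi}}(\mathbf{z}_{2} | \mathbf{x}, \mathbf{z}_{1}) q_{\boldsymbol{\phi}}(\mathbf{z}_{1} | \mathbf{x})
\end{align}
$$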
In practice, the top-down inference model tends to be favored over the bottom-up inference model.