
Variational inference is a method of approximating a conditional density of latent variables given observed variables. It has also laid the foundation for Bayesian deep learning. There is a lot of literature and documentation on this topic online. However, I found that much of it omits important details, especially in the derivations. In this blog post, I have documented a full tutorial for variational inference with all the derivation details and a concrete example.

Bayesian Inference

If we have a set of observed random variables $\boldsymbol{x}$ = {$x_1, x_2, \cdots, x_n$} and a set of latent random variables $\boldsymbol{z}$ = {$z_1, z_2, \cdots, z_m$}, we are interested in computing the posterior $p(\boldsymbol{z} | \boldsymbol{x})$.

Using Bayes’ theorem, we have
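With the joint density $p(\boldsymbol{z}, \boldsymbol{x})$, Bayes' theorem reads

$$p(\boldsymbol{z} | \boldsymbol{x}) = \frac{p(\boldsymbol{z}, \boldsymbol{x})}{p(\boldsymbol{x})} = \frac{p(\boldsymbol{x} | \boldsymbol{z}) p(\boldsymbol{z})}{p(\boldsymbol{x})}$$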

$p(\boldsymbol{x})$ is the marginal density, which is also called the evidence.

However, in most models, computing $p(\boldsymbol{x})$ could be intractable. I have given a concrete example below showing that calculating the evidence is intractable even for a simple model.

Bayesian Inference Could Be Intractable

Consider a mixture of $K$ unit-variance univariate Gaussians with means $\boldsymbol{\mu}$ = { $\mu_1, \mu_2, \cdots, \mu_K $}. The means are independently drawn from a common prior $p(\mu_k)$, which we assume to be a Gaussian $\mathcal{N}(0, \sigma^2)$ (yes, another Gaussian). Here $\mu_k$ is a random variable, the prior variance $\sigma^2$ is a hyperparameter, and $\boldsymbol{\mu}$ is a column vector.

To generate $n$ observations {$x_1, x_2, \cdots, x_n$}, for each observation $x_i$, we first choose a cluster assignment $c_i$ which indicates which Gaussian $x_i$ is drawn from. Here $c_i$ is a random variable drawn from a categorical distribution (yes, one of the simplest discrete distributions) over {$1,2,\cdots,K$}. Without loss of generality, we also assume that it is a uniform categorical distribution, i.e., the prior $p(c_i) = 1/K$ for all $K$ clusters for all samples. $c_i$ is encoded as a one-hot vector of length $K$, where all the elements are zero except that the one corresponding to the sampled cluster assignment is one. For example, if $K=5$ and Gaussian #3 was sampled for observation $x_2$, we have $c_2$ = {$0,0,1,0,0$}. The mean of the sampled Gaussian could then be expressed as the dot product of $c_i$ and $\boldsymbol{\mu}$, i.e., $x_i | c_i, \boldsymbol{\mu} \sim \mathcal{N}(c_i^{\top} \boldsymbol{\mu}, 1)$.

Mathematically, we have the assumed model which contains the prior information
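Written out, the three sampling steps are

$$\begin{aligned} \mu_k &\sim \mathcal{N}(0, \sigma^2), & k &= 1, \cdots, K \\ c_i &\sim \text{Categorical}\big(\tfrac{1}{K}, \cdots, \tfrac{1}{K}\big), & i &= 1, \cdots, n \\ x_i | c_i, \boldsymbol{\mu} &\sim \mathcal{N}(c_i^{\top} \boldsymbol{\mu}, 1), & i &= 1, \cdots, n \end{aligned}$$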

To calculate the joint probability density of the latent variables ($\boldsymbol{\mu}$ = { $\mu_1, \mu_2, \cdots, \mu_K $}, $\boldsymbol{c}$ = { $c_1, c_2, \cdots, c_n $}) and the observations ($\boldsymbol{x}$ = { $x_1, x_2, \cdots, x_n $}), using the chain rule, we have
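Expanding over the independent priors and likelihood terms:

$$p(\boldsymbol{\mu}, \boldsymbol{c}, \boldsymbol{x}) = p(\boldsymbol{\mu}) p(\boldsymbol{c} | \boldsymbol{\mu}) p(\boldsymbol{x} | \boldsymbol{\mu}, \boldsymbol{c}) = \prod_{k=1}^{K} p(\mu_k) \prod_{i=1}^{n} p(c_i) p(x_i | c_i, \boldsymbol{\mu})$$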

Note that $\boldsymbol{\mu}, \boldsymbol{c}, \boldsymbol{x}$ are all random variables, and the latent variables are $\boldsymbol{z}$ = {$\boldsymbol{\mu}, \boldsymbol{c}$}. I should emphasize here again that this is the model (prior) we assumed for the observations. It is not necessarily true that the observations were actually generated from the model we assumed.

Our goal is to calculate the posterior $p(\boldsymbol{z} | \boldsymbol{x})$. Using Bayes’ theorem, we have
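In this model the evidence requires marginalizing over both $\boldsymbol{\mu}$ and $\boldsymbol{c}$:

$$p(\boldsymbol{z} | \boldsymbol{x}) = \frac{p(\boldsymbol{\mu}, \boldsymbol{c}, \boldsymbol{x})}{p(\boldsymbol{x})}, \quad p(\boldsymbol{x}) = \int_{\boldsymbol{\mu}} p(\boldsymbol{\mu}) \prod_{i=1}^{n} \sum_{c_i} p(c_i) p(x_i | c_i, \boldsymbol{\mu}) \, d\boldsymbol{\mu}$$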


Since the possible values of the random variable $c_i$ are {$1,2,\cdots,K$}, if we expand $\prod_{i=1}^{n} \sum_{c_i} p(c_i)p(x_i |c_i,\boldsymbol{\mu})$, there will be $K^n$ terms, so we will need to compute $K^n$ independent integrals in order to compute $p(\boldsymbol{x})$. Therefore, the complexity of computing $p(\boldsymbol{x})$ is $O(K^n)$, which is intractable.

It should also be noted that to compute $p(\boldsymbol{x})$ we could also reverse the order of the summation and the integral.

However, the complexity of computing $p(\boldsymbol{x})$ is still $O(K^n)$, because there are $K^n$ different configurations of $\boldsymbol{c}$ and we still have to evaluate $K^n$ terms.

Please think twice about the above derivations because sometimes it is brain-twisting.

The conclusion is that computing the posterior $p(\boldsymbol{z} | \boldsymbol{x})$ exactly using Bayesian inference is intractable in this case.

Variational Inference

Since directly computing $p(\boldsymbol{z} | \boldsymbol{x})$ might not be feasible, we have to do some approximate inference. Variational inference leverages optimization to find, among a family of distributions parameterized by free “variational parameters”, the distribution that best approximates the posterior $p(\boldsymbol{z} | \boldsymbol{x})$. The optimization finds the best distribution by finding the best settings of those parameters. It should be noted that there are other, non-optimization-based methods for such approximate inference, such as MCMC sampling. MCMC has its own advantages and drawbacks compared to variational inference, but we are not going to elaborate on them here.

Optimization Goal

In variational inference, we specify a family of distributions $\mathscr{L}$ over the latent random variables. Each $q(\boldsymbol{z}) \in \mathscr{L}$ is a candidate approximation to the posterior. Our goal is to find the best candidate, i.e., the one with the smallest KL divergence to the posterior we want to compute.

Mathematically, our optimization goal is
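$$q^{\ast}(\boldsymbol{z}) = \underset{q(\boldsymbol{z}) \in \mathscr{L}}{\mathrm{argmin}} \, D_{\text{KL}} \big( q(\boldsymbol{z}) || p(\boldsymbol{z} | \boldsymbol{x}) \big)$$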

where $q^{*}(\cdot)$ is the best approximation to the posterior in distribution family $\mathscr{L}$.

Let us take a look at the KL divergence properties first.
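Expanding the KL divergence by its definition, and using $\log p(\boldsymbol{z} | \boldsymbol{x}) = \log p(\boldsymbol{z}, \boldsymbol{x}) - \log p(\boldsymbol{x})$:

$$\begin{aligned} D_{\text{KL}} \big( q(\boldsymbol{z}) || p(\boldsymbol{z} | \boldsymbol{x}) \big) &= \mathbb{E}_{q} \big[ \log q(\boldsymbol{z}) \big] - \mathbb{E}_{q} \big[ \log p(\boldsymbol{z} | \boldsymbol{x}) \big] \\ &= \mathbb{E}_{q} \big[ \log q(\boldsymbol{z}) \big] - \mathbb{E}_{q} \big[ \log p(\boldsymbol{z}, \boldsymbol{x}) \big] + \mathbb{E}_{q} \big[ \log p(\boldsymbol{x}) \big] \end{aligned}$$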

Note that $\mathbb{E}_{q} \big[ \log p(\boldsymbol{x}) \big] = \log p(\boldsymbol{x})$ because $\log p(\boldsymbol{x})$ does not depend on the configuration of $\boldsymbol{z}$.

Because $D_{\text{KL}} \big( q(\boldsymbol{z}) || p(\boldsymbol{z} | \boldsymbol{x}) \big)$ contains the term $\log p(\boldsymbol{x})$, evaluating $D_{\text{KL}} \big( q(\boldsymbol{z}) || p(\boldsymbol{z} | \boldsymbol{x}) \big)$ directly is not tractable. However, because $\log p(\boldsymbol{x})$ is a constant with respect to $q$, we could simply ignore it during optimization.

We define evidence lower bound (ELBO) as
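$$\text{ELBO}(q) = \mathbb{E}_{q} \big[ \log p(\boldsymbol{z}, \boldsymbol{x}) \big] - \mathbb{E}_{q} \big[ \log q(\boldsymbol{z}) \big]$$

so that $D_{\text{KL}} \big( q(\boldsymbol{z}) || p(\boldsymbol{z} | \boldsymbol{x}) \big) = \log p(\boldsymbol{x}) - \text{ELBO}(q)$.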

This value is called the evidence lower bound because it is a lower bound on the logarithmic evidence $\log p(\boldsymbol{x})$.

Remember that KL divergence is strictly non-negative. There are many ways to prove this and you can find a lot of simple proofs online.
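Since $\log p(\boldsymbol{x}) = \text{ELBO}(q) + D_{\text{KL}} \big( q(\boldsymbol{z}) || p(\boldsymbol{z} | \boldsymbol{x}) \big)$ and the KL term is non-negative, we have

$$\log p(\boldsymbol{x}) \geq \text{ELBO}(q)$$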

Minimizing $D_{\text{KL}} \big( q(\boldsymbol{z}) || p(\boldsymbol{z} | \boldsymbol{x}) \big)$ is equivalent to maximizing $\text{ELBO}(q)$.

Mean-Field Variational Family

Before we start to do optimization, the variational family has to be chosen. One of the most generic variational families is the mean-field family.
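In the mean-field family, the variational distribution factorizes over the latent variables:

$$q(\boldsymbol{z}) = \prod_{j=1}^{m} q_j(z_j)$$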

Each latent variable $z_j$ is governed by its own variational factor, the distribution $q_j(z_j)$. That is to say, each latent variable is sampled from its corresponding distribution from the same variational family. All the factors are independent of each other, and we have to optimize each of them. The final optimal distribution that approximates the posterior $p(\boldsymbol{z} | \boldsymbol{x})$ is the product density of all the independent factors.

Taking the previous Bayesian mixture of Gaussians as an example, in that case, using mean-field, we have
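$$q(\boldsymbol{\mu}, \boldsymbol{c}) = \prod_{k=1}^{K} q(\mu_k; m_k, s_k^2) \prod_{i=1}^{n} q(c_i; \varphi_i)$$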

The factor $q(\mu_k;m_k,s_k^2)$ is a Gaussian distribution over $\mu_k$, the mean of the $k$th Gaussian in the mixture. The factor $q(c_i; \varphi_i)$ is a distribution over $c_i$, the cluster assignment for observation $x_i$. $\varphi_i$, a vector of length $K$, holds the probabilities of assigning observation $x_i$ to each of the $K$ Gaussians. In practice, we can usually only reach a local optimum, where $q(\boldsymbol{\mu}, \boldsymbol{c})$ is as close to $p(\boldsymbol{\mu}, \boldsymbol{c} | \boldsymbol{x})$ as possible.

One of the drawbacks of the mean-field variational approximation is that it cannot capture correlations between the latent variables, because of the independence assumptions in its definition.

Coordinate Ascent Optimization

Using mean-field variational approximation, we can decompose the terms in ELBO as
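Because $q(\boldsymbol{z}) = \prod_{j=1}^{m} q_j(z_j)$, the entropy term factorizes:

$$\text{ELBO}(q) = \mathbb{E}_{q} \big[ \log p(\boldsymbol{x}, \boldsymbol{z}) \big] - \sum_{j=1}^{m} \mathbb{E}_{q_j} \big[ \log q_j(z_j) \big]$$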

Because $p(\boldsymbol{x}, \boldsymbol{z}) = p(\boldsymbol{x}) \prod_{i=1}^{m} p(z_i | z_{1:(i-1)}, \boldsymbol{x})$ according to the conditional probability chain rule,

Therefore, ELBO could be written as

We further derive the coordinate ascent update for the distribution $q_j(z_j)$ of one latent variable $z_j$ at a time while keeping all the other distributions fixed. With respect to $q_j(z_j)$, we put $z_j$ at the end of the variable list, that is to say,

We rewrite ELBO with respect to $q_j(z_j)$,
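With $z_j$ placed last in the chain rule, the terms involving $q_j$ separate out:

$$\text{ELBO}(q_j) = \mathbb{E}_{q} \big[ \log p(z_j | z_{-j}, \boldsymbol{x}) \big] - \mathbb{E}_{q_j} \big[ \log q_j(z_j) \big] + \text{const}$$

where $\text{const}$ collects all the terms that do not involve $q_j(z_j)$.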

Because the rightmost terms have nothing to do with $q_j(z_j)$, we arrive at the following optimization problem.

It should be noted that some symbols used above are equivalent, such as $q(z_j) = q_j(z_j)$.

This optimization problem is subject to the constraint $\int \limits_{z_j} q_j(z_j)d z_j = 1$.

Using a Lagrange multiplier and the calculus of variations, it is equivalent to finding the local optimum of the following functional.
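With the Lagrange multiplier $\lambda$ enforcing the normalization constraint, the functional is

$$F(q_j) = \int_{z_j} q_j(z_j) \mathbb{E}_{q_{-j}} \big[ \log p(z_j | z_{-j}, \boldsymbol{x}) \big] d z_j - \int_{z_j} q_j(z_j) \log q_j(z_j) d z_j + \lambda \Big( \int_{z_j} q_j(z_j) d z_j - 1 \Big)$$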

Take the functional derivative of $F(q_j)$,

We want $\frac{\delta F(q_j)}{\delta q_j} = 0$,
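The stationarity condition reads

$$\mathbb{E}_{q_{-j}} \big[ \log p(z_j | z_{-j}, \boldsymbol{x}) \big] - \log q_j(z_j) - 1 + \lambda = 0$$

and therefore

$$q_j^{\ast}(z_j) \propto \exp \Big( \mathbb{E}_{q_{-j}} \big[ \log p(z_j | z_{-j}, \boldsymbol{x}) \big] \Big)$$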

We also want $\frac{d F(q_j)}{d \lambda} = 0$,


Because of the conditional probability,
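$$p(z_j | z_{-j}, \boldsymbol{x}) = \frac{p(z_j, z_{-j}, \boldsymbol{x})}{p(z_{-j}, \boldsymbol{x})} = \frac{p(\boldsymbol{z}, \boldsymbol{x})}{p(z_{-j}, \boldsymbol{x})}$$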

Because the denominator $p(z_{-j}, \boldsymbol{x})$ does not involve $z_j$, we could also equivalently derive
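That is, the update can use the joint instead of the conditional:

$$q_j^{\ast}(z_j) \propto \exp \Big( \mathbb{E}_{q_{-j}} \big[ \log p(z_j, z_{-j}, \boldsymbol{x}) \big] \Big)$$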


The above update formulas could also be derived without using the calculus of variations. Similar to how we derived $\text{ELBO}(q_j)$:

We define a new distribution
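Specifically, let

$$\tilde{p_j}(z_j, \boldsymbol{x}) \propto \exp \Big( \mathbb{E}_{q_{-j}} \big[ \log p(z_j, z_{-j}, \boldsymbol{x}) \big] \Big)$$

so that $\text{ELBO}(q_j) = -D_{\text{KL}} \big( q_j(z_j) || \tilde{p_j}(z_j, \boldsymbol{x}) \big) + \text{const}$.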


Because KL divergence is non-negative and equals zero if and only if the two distributions are identical, $\text{ELBO}(q_j)$ is maximized when $q_j(z_j) = \tilde{p_j}(z_j, \boldsymbol{x})$.

$q_j(z_j) = \tilde{p_j}(z_j, \boldsymbol{x})$ is equivalent to

Coordinate Ascent Pseudo-Code for Variational Inference

Coordinate Ascent for Variational Inference

In practice, when we update the parameters of distribution $q_j$ using the expression in the pseudo-code, we may have to normalize it so that $q_j$ integrates (or sums) to one over the support of $z_j$, i.e., the normalizer is the sum (or integral) of the exponential terms over all possible values of $z_j$.

Mean-Field Variational Inference for Mixture of Gaussians

Recall the definition of ELBO is:

To compute the ELBO for the mixture of Gaussians example we have with mean-field assumptions:
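With the mean-field factorization, the ELBO expands into per-variable terms (all expectations are taken under $q$):

$$\text{ELBO}(q) = \sum_{k=1}^{K} \mathbb{E} \big[ \log p(\mu_k) \big] + \sum_{i=1}^{n} \Big( \mathbb{E} \big[ \log p(c_i) \big] + \mathbb{E} \big[ \log p(x_i | c_i, \boldsymbol{\mu}) \big] \Big) - \sum_{k=1}^{K} \mathbb{E} \big[ \log q(\mu_k; m_k, s_k^2) \big] - \sum_{i=1}^{n} \mathbb{E} \big[ \log q(c_i; \varphi_i) \big]$$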

Each term in the ELBO has a closed-form expression. Note that $\log p(\mu_k)$ and $\log p(c_i)$ are the priors, which are independent of the variational family. In our case,
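Up to additive constants that do not depend on the variational parameters:

$$\log p(\mu_k) = -\frac{\mu_k^2}{2 \sigma^2} + \text{const}, \quad \log p(c_i) = -\log K$$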


In addition,



Please double-check our assumed model with the prior information to confirm the above settings.

So far we have found ways to calculate $\text{ELBO}(q)$ in the coordinate ascent algorithm to measure convergence. Next, we are going to see how to update the variational family instances. Recall the following formula for updating the parameters in the variational family.
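$$q_j^{\ast}(z_j) \propto \exp \Big( \mathbb{E}_{q_{-j}} \big[ \log p(z_j | z_{-j}, \boldsymbol{x}) \big] \Big)$$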

In our case, for latent variable $c_i$

From the above derivation, we could see that only the terms $\mathbb{E} \Big[ \log p(x_i | \boldsymbol{\mu}, c_i) \Big]$ and $\log p(c_i)$ depend on $c_i$; therefore, we have

According to the priors,


We have

All the terms that are independent of $c_i$ were absorbed into the const term.


Thus we have
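Using $\mathbb{E} \big[ \mu_k \big] = m_k$ and $\mathbb{E} \big[ \mu_k^2 \big] = m_k^2 + s_k^2$ under $q(\mu_k; m_k, s_k^2)$:

$$\varphi_{ik} \propto \exp \Big( \mathbb{E} \big[ \mu_k \big] x_i - \frac{\mathbb{E} \big[ \mu_k^2 \big]}{2} \Big) = \exp \Big( m_k x_i - \frac{m_k^2 + s_k^2}{2} \Big)$$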

However, the above update is not normalized. We still need a normalizer to update $\varphi_{ik}$ in practice.

Because $q^*(c_i = k; \varphi_i) = \varphi_{ik}$, to update each hyperparameter for $q^{\ast}(c_i)$, for $k = \{1,2,\cdots,K\}$

With normalizer, we have the final update equation for $\varphi_{ik}$, for $k = \{1,2,\cdots,K\}$
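$$\varphi_{ik} = \frac{\exp \big( m_k x_i - (m_k^2 + s_k^2)/2 \big)}{\sum_{k'=1}^{K} \exp \big( m_{k'} x_i - (m_{k'}^2 + s_{k'}^2)/2 \big)}$$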

Similarly, for latent variable $\mu_k$, we have

According to the priors,


Because $q(\mu_k)$ is Gaussian, it is not hard to see that to update the hyperparameters, we have
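Matching the Gaussian's natural parameters yields

$$m_k = \frac{\sum_{i=1}^{n} \varphi_{ik} x_i}{1/\sigma^2 + \sum_{i=1}^{n} \varphi_{ik}}, \quad s_k^2 = \frac{1}{1/\sigma^2 + \sum_{i=1}^{n} \varphi_{ik}}$$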

Coordinate Ascent Pseudo-Code for Gaussian Mixture Variational Inference

Coordinate Ascent for Gaussian Mixture Variational Inference
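The pseudo-code above can be sketched in Python. This is a minimal illustration under the assumptions of this example (unit-variance components, uniform cluster prior, prior $\mu_k \sim \mathcal{N}(0, \sigma^2)$); the function name `cavi_gmm`, the initialization scheme, and the fixed iteration count (instead of monitoring the ELBO for convergence) are my own choices:

```python
import numpy as np

def cavi_gmm(x, K, sigma2=1.0, n_iters=100, seed=0):
    """Coordinate ascent variational inference (CAVI) for a Bayesian
    mixture of K unit-variance univariate Gaussians with prior
    mu_k ~ N(0, sigma2) and a uniform categorical prior over clusters."""
    rng = np.random.default_rng(seed)
    n = len(x)
    m = rng.normal(size=K)            # variational means of q(mu_k)
    s2 = np.ones(K)                   # variational variances of q(mu_k)
    phi = np.full((n, K), 1.0 / K)    # cluster responsibilities q(c_i)

    for _ in range(n_iters):
        # Update q(c_i): phi_ik proportional to exp(E[mu_k] x_i - E[mu_k^2]/2).
        log_phi = np.outer(x, m) - 0.5 * (m ** 2 + s2)
        log_phi -= log_phi.max(axis=1, keepdims=True)  # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)          # normalize over k

        # Update q(mu_k): Gaussian with precision 1/sigma2 + sum_i phi_ik.
        denom = 1.0 / sigma2 + phi.sum(axis=0)
        m = phi.T @ x / denom
        s2 = 1.0 / denom
    return m, s2, phi
```

With well-separated data, the recovered $m_k$ should land near the true component means; in practice one would also track $\text{ELBO}(q)$ and stop when it plateaus.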

Variational Inference with Exponential Family

If the posterior distribution and the approximate distribution are both from the exponential family, deriving the update function mathematically will be much easier. This section assumes that you have read my blog post “Introduction to Exponential Family”.

Suppose each complete conditional is in the exponential family
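In the standard exponential family form, with base measure $h$, natural parameter $\eta_j$, sufficient statistic $T$, and log-normalizer $A$:

$$p(z_j | z_{-j}, \boldsymbol{x}) = h(z_j) \exp \Big( \eta_j(z_{-j}, \boldsymbol{x})^{\top} T(z_j) - A \big( \eta_j(z_{-j}, \boldsymbol{x}) \big) \Big)$$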


We have
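Applying the coordinate ascent update $q_j^{\ast}(z_j) \propto \exp \big( \mathbb{E}_{q_{-j}} [ \log p(z_j | z_{-j}, \boldsymbol{x}) ] \big)$ to this form:

$$\begin{aligned} q_j^{\ast}(z_j) &\propto \exp \Big( \log h(z_j) + \mathbb{E}_{q_{-j}} \big[ \eta_j(z_{-j}, \boldsymbol{x}) \big]^{\top} T(z_j) - \mathbb{E}_{q_{-j}} \big[ A \big( \eta_j(z_{-j}, \boldsymbol{x}) \big) \big] \Big) \\ &\propto h(z_j) \exp \Big( \mathbb{E}_{q_{-j}} \big[ \eta_j(z_{-j}, \boldsymbol{x}) \big]^{\top} T(z_j) \Big) \end{aligned}$$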

Therefore, $q_j^{\ast}(z_j)$ is in the same family as $p(z_j | z_{-j}, \boldsymbol{x})$, and the natural parameter for $q_j^{\ast}(z_j)$ should be updated to the expected value of the natural parameter of the complete conditional, $\mathbb{E}_{q_{-j}} \big[ \eta_j(z_{-j}, \boldsymbol{x}) \big]$.

Let’s further take a look at conditionally conjugate models. Let $\beta$ be a vector of global latent variables, which potentially govern any of the data. Let $\boldsymbol{z}$ be a vector of local latent variables, whose $i$th component only governs the $i$th observation. The joint density is
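$$p(\beta, \boldsymbol{z}, \boldsymbol{x}) = p(\beta) \prod_{i=1}^{n} p(z_i, x_i | \beta)$$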

If we assume $p(z_i, x_i | \beta)$ is in the exponential family.

I have shown in my “Introduction to Exponential Family” post that one of its conjugate priors from the exponential family must have the natural form

Here, $\alpha$ is the natural parameter for $p(\beta)$, $\beta$ and $\alpha_1$ have the same dimensionality, and $\alpha_2$ is a scalar. The sufficient statistic is $T(\beta) = [\beta, -A_l(\beta)]$, which is a column vector.

Then the complete likelihood

The complete posterior must have the form

where $\hat{\alpha} = [\hat{\alpha}_1, \hat{\alpha}_2]$ is the natural parameter for $p(\beta | \boldsymbol{z}, \boldsymbol{x})$
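Collecting the terms multiplying $[\beta, -A_l(\beta)]$ in the likelihood and the prior gives

$$\hat{\alpha} = \Big[ \alpha_1 + \sum_{i=1}^{n} T(z_i, x_i), \; \alpha_2 + n \Big]$$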

For $p(z_i | \boldsymbol{x}, z_{-i}, \beta)$, because of the independence assumption,

We also assume this is in the exponential family.


From the above derivation, we know how to update the natural parameters for $q_{z_i}^{\ast}(z_i)$ and $q_{\beta}^{\ast}(\beta)$ respectively.

We assume the natural parameters for $q_{z_i}^{\ast}(z_i)$ and $q_{\beta}^{\ast}(\beta)$ are $\varphi_i$ and $\lambda$ respectively.

To update $\lambda$,
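Taking the expectation of the complete posterior's natural parameter under the local factors (writing the complete posterior's natural parameter as $\hat{\alpha}$ above):

$$\lambda = \Big[ \alpha_1 + \sum_{i=1}^{n} \mathbb{E}_{\varphi_i} \big[ T(z_i, x_i) \big], \; \alpha_2 + n \Big]$$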

To update $\varphi_i$,
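Writing the natural parameter of the local complete conditional $p(z_i | x_i, \beta)$ as $\eta(x_i, \beta)$ (following the exponential family assumption above), the update is its expectation under the global factor:

$$\varphi_i = \mathbb{E}_{\lambda} \big[ \eta(x_i, \beta) \big]$$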

We will take a look at a concrete example of variational inference with exponential family, which is the Latent Dirichlet Allocation (LDA) model, using updates for natural parameters in one of my upcoming posts.

Final Remarks

There are a lot of missing details in the variational inference documentation and literature. I spent three days going through all the details and filling in all the missing pieces so that readers can study the topic more easily.