# Marginal Likelihood Estimation

## Introduction

In a probabilistic model with latent variables, computing the marginal likelihood exactly is often intractable. Estimating it is also not as straightforward as it might seem.

In this article, we will discuss the marginal likelihood in Bayesian inference, the caveat of the naive sampling method for estimating the marginal likelihood, and the importance sampling method for estimating the marginal likelihood.

## Marginal Likelihood

According to Bayes’ theorem, we have

$$

\underbrace{p(z|x)}_\text{posterior} = \frac{p(x|z) p(z)}{\underbrace{p(x)}_\text{marginal likelihood}} \propto \underbrace{p(x|z)}_\text{likelihood} \underbrace{p(z)}_\text{prior} = \underbrace{p(x,z)}_\text{joint}

$$

The marginal likelihood is the expected value of the likelihood function over the prior distribution of the conditional variable.

$$

\begin{align}

p(x) &= \mathbb{E}_{z \sim p(z)} [p(x|z)] \\

\end{align}

$$

The marginal likelihood can be computed by integrating the joint distribution over the conditional variable if the conditional variable is continuous or summing over the conditional variable if the conditional variable is discrete.

$$

\begin{align}

p(x) &= \int p(x | z) p(z) dz \\

p(x) &= \sum_{z} p(x | z) p(z)

\end{align}

$$
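As a sanity check on the integral form, consider a toy model that is an assumption made here for illustration and is not part of the discussion above: $z \sim \mathcal{N}(0, 1)$ and $x | z \sim \mathcal{N}(z, \sigma^2)$, for which the marginal likelihood has the closed form $p(x) = \mathcal{N}(x; 0, 1 + \sigma^2)$. The sketch below compares a simple grid-based numerical integral with that closed form. All function names are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def marginal_likelihood_grid(x, sigma, num_points=20001, z_max=10.0):
    """Approximate p(x) = integral of p(x | z) p(z) dz with a Riemann sum on a grid."""
    z = np.linspace(-z_max, z_max, num_points)
    dz = z[1] - z[0]
    # p(x | z) p(z) evaluated at every grid point (toy model assumption)
    integrand = normal_pdf(x, mean=z, std=sigma) * normal_pdf(z, mean=0.0, std=1.0)
    return float(np.sum(integrand) * dz)


def marginal_likelihood_exact(x, sigma):
    """Closed form for the toy model: p(x) = N(x; 0, 1 + sigma^2)."""
    return float(normal_pdf(x, mean=0.0, std=np.sqrt(1.0 + sigma ** 2)))


x_obs, sigma = 1.5, 0.5
print(marginal_likelihood_grid(x_obs, sigma))
print(marginal_likelihood_exact(x_obs, sigma))
```

A grid works here only because $z$ is one-dimensional. With $K$ grid points per dimension, a $d$-dimensional $z$ would need $K^d$ evaluations of the integrand.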

Computing the exact value of the marginal likelihood is often intractable. If the conditional variable $z$ is high-dimensional, the integral or sum over $z$ can be extremely expensive to compute.

## Marginal Likelihood Estimation

Because computing the exact value of the marginal likelihood is often intractable, we wonder if we can estimate the marginal likelihood using some approximation methods, such as Monte Carlo methods.

### Naive Sampling Method

One intuitive and naive way to estimate the marginal likelihood is to use the Monte Carlo method by drawing samples from the prior distribution of the conditional variable and computing the likelihood function at each sample. The average of the likelihood function over the samples is an estimate of the marginal likelihood.

$$

\begin{align}

p(x) &\approx \frac{1}{N} \sum_{i=1}^N p(x | z_i) \\

\end{align}

$$

where $z_i \sim p(z)$ and $N$ is the number of samples.
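A minimal sketch of this estimator, assuming a toy model chosen here for illustration ($z \sim \mathcal{N}(0, 1)$ and $x | z \sim \mathcal{N}(z, \sigma^2)$, so that $p(x) = \mathcal{N}(x; 0, 1 + \sigma^2)$ is available for comparison). The function names are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def naive_estimate(x, sigma, num_samples, rng):
    """Average the likelihood p(x | z_i) over samples z_i drawn from the prior."""
    z = rng.standard_normal(num_samples)  # z_i ~ p(z) = N(0, 1), toy model assumption
    return float(np.mean(normal_pdf(x, mean=z, std=sigma)))


rng = np.random.default_rng(0)
x_obs, sigma = 1.5, 0.5
# Closed-form marginal likelihood of the toy model, used only as a reference.
exact = float(normal_pdf(x_obs, mean=0.0, std=np.sqrt(1.0 + sigma ** 2)))
for n in (10, 1_000, 100_000):
    print(n, naive_estimate(x_obs, sigma, n, rng), exact)
```

In this one-dimensional example the estimate should approach the closed-form value as the number of samples grows.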

By the law of large numbers, the sample mean converges to the true mean of the likelihood under the prior, which is $p(x)$, as $N$ grows. Because the samples are independent, the variance of the estimate decreases in proportion to $1/N$.

This “simple” Monte Carlo method is an unbiased estimator of the marginal likelihood. However, the variance of the estimator can be very high if the dimension of the conditional variable is high and the number of samples is small.

Imagine the likelihood function is mostly zero except for a very small region. The samples drawn from the prior distribution of the conditional variable are unlikely to fall into the region where the likelihood function is non-zero, which means that the really useful samples are very few. This problem will become even more severe if the dimension of the conditional variable is high and the chance of the samples falling into the region where the likelihood function is non-zero is even smaller.

This means that the estimate of the marginal likelihood can be very inaccurate if the number of samples is small. This is especially problematic when the prior and likelihood distributions are being optimized by an iterative optimization algorithm, because the estimate of the marginal likelihood in each iteration can hardly be accurate.
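One way to see how few prior samples are really useful is to weight each sample by its likelihood and compute the Kish effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$. This diagnostic is not discussed above. It is used here only as an illustration, together with an assumed toy model $z \sim \mathcal{N}(0, I_d)$ and $x | z \sim \mathcal{N}(z, \sigma^2 I_d)$ and hypothetical function names.

```python
import numpy as np


def effective_sample_size(log_weights):
    """Kish effective sample size, (sum w)^2 / sum w^2, computed stably in log space."""
    shifted = log_weights - np.max(log_weights)  # rescaling does not change the ratio
    w = np.exp(shifted)
    return float(np.sum(w) ** 2 / np.sum(w ** 2))


def prior_sample_ess(dim, sigma, num_samples, rng):
    """ESS of prior samples weighted by the likelihood p(x | z) for x = (1, ..., 1)."""
    x = np.ones(dim)
    z = rng.standard_normal((num_samples, dim))  # z_i ~ N(0, I), toy model assumption
    # log N(x; z_i, sigma^2 I) up to an additive constant that cancels in the ESS ratio
    log_lik = -0.5 * np.sum((x - z) ** 2, axis=1) / sigma ** 2
    return effective_sample_size(log_lik)


rng = np.random.default_rng(0)
for dim in (1, 2, 5, 10, 20):
    print(dim, prior_sample_ess(dim, sigma=0.5, num_samples=10_000, rng=rng))
```

Under these assumptions the effective sample size should shrink quickly as the dimension grows, until the average is dominated by a handful of samples.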

### Importance Sampling Method

More generally, the marginal likelihood can be written as an expected value over a different distribution $q(z)$ of the conditional variable, where the quantity being averaged is the joint probability divided by $q(z)$ rather than the likelihood.

$$

\begin{align}

p(x) &= \mathbb{E}_{z \sim q(z)} \left[ \frac{p(x, z)}{q(z)} \right] \\

\end{align}

$$

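where $q(z)$ is the proposal distribution of the conditional variable. Note that this resembles the KL divergence formula, which is also an expectation under $q(z)$ of a quantity involving a density ratio with $q(z)$, but it is not the same: there is no logarithm here and the ratio is inverted.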
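The identity follows from multiplying and dividing the integrand by $q(z)$. This step assumes that $q(z) > 0$ wherever $p(x, z) > 0$, a standard support condition that is not stated above.

$$
\begin{align}
p(x) &= \int p(x, z) dz \\
&= \int q(z) \frac{p(x, z)}{q(z)} dz \\
&= \mathbb{E}_{z \sim q(z)} \left[ \frac{p(x, z)}{q(z)} \right]
\end{align}
$$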

The variance of the quantity inside the expectation can be written as follows. An average of $N$ independent samples has this variance divided by $N$.

$$

\begin{align}

\text{Var}_{z \sim q(z)} \left[ \frac{p(x, z)}{q(z)} \right]

&= \mathbb{E}_{z \sim q(z)} \left[ \left( \frac{p(x, z)}{q(z)} - \mathbb{E}_{z \sim q(z)} \left[ \frac{p(x, z)}{q(z)} \right] \right)^2 \right] \\

&= \mathbb{E}_{z \sim q(z)} \left[ \left( \frac{p(x, z)}{q(z)} - p(x) \right)^2 \right] \\

\end{align}

$$

When the proposal distribution $q(z)$ is the same as the posterior distribution $p(z|x)$, the variance of the estimator is minimized and is equal to zero.

$$

\begin{align}

\text{Var}_{z \sim p(z|x)} \left[ \frac{p(x, z)}{p(z|x)} \right]

&= \mathbb{E}_{z \sim p(z|x)} \left[ \left( \frac{p(x, z)}{p(z|x)} - p(x) \right)^2 \right] \\

&= \mathbb{E}_{z \sim p(z|x)} \left[ \left( p(x) - p(x) \right)^2 \right] \\

&= 0

\end{align}

$$
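The zero-variance case can be checked numerically on a toy model where the posterior is known. The sketch below assumes $z \sim \mathcal{N}(0, 1)$ and $x | z \sim \mathcal{N}(z, \sigma^2)$, a model chosen here for illustration, whose posterior is $\mathcal{N}(x / (1 + \sigma^2), \sigma^2 / (1 + \sigma^2))$. The variable names are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


x_obs, sigma = 1.5, 0.5
# Exact posterior of the toy model (conjugate Gaussian assumption).
post_mean = x_obs / (1.0 + sigma ** 2)
post_std = sigma / np.sqrt(1.0 + sigma ** 2)

rng = np.random.default_rng(0)
z = rng.normal(post_mean, post_std, size=5)  # z_i ~ p(z | x)

joint = normal_pdf(x_obs, mean=z, std=sigma) * normal_pdf(z, mean=0.0, std=1.0)  # p(x, z_i)
weights = joint / normal_pdf(z, mean=post_mean, std=post_std)  # p(x, z_i) / p(z_i | x)
exact = normal_pdf(x_obs, mean=0.0, std=np.sqrt(1.0 + sigma ** 2))  # closed-form p(x)

print(weights)  # every entry equals p(x) up to floating-point error
print(exact)
```

This is only a consistency check. In practice the exact posterior is unavailable, because evaluating $p(z|x)$ requires the very quantity $p(x)$ that we are trying to estimate.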

So if we can find a proposal distribution $q(z)$ that is close to the posterior distribution $p(z|x)$, the variance of the estimator can be made small, and the estimate of the marginal likelihood becomes more accurate. The importance sampling estimator can be expressed as follows.

$$

\begin{align}

p(x) &\approx \frac{1}{N} \sum_{i=1}^N \frac{p(x, z_i)}{q(z_i)} \\

\end{align}

$$

where $z_i \sim q(z)$ and $N$ is the number of samples.
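A minimal sketch of the importance sampling estimator, assuming the toy model $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(z, \sigma^2)$ and a Gaussian proposal whose mean and standard deviation are picked by hand. The model, the proposal parameters, and the function names are all assumptions made for illustration. The weights are handled in log space, which is a common numerical precaution rather than part of the method itself.

```python
import numpy as np


def normal_logpdf(x, mean, std):
    """Log-density of a univariate normal distribution N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)


def importance_sampling_log_estimate(x, sigma, q_mean, q_std, num_samples, rng):
    """Return log of (1/N) sum_i p(x, z_i) / q(z_i), with z_i ~ q = N(q_mean, q_std^2)."""
    z = rng.normal(q_mean, q_std, size=num_samples)
    log_joint = normal_logpdf(x, z, sigma) + normal_logpdf(z, 0.0, 1.0)  # log p(x, z_i)
    log_w = log_joint - normal_logpdf(z, q_mean, q_std)  # log importance weights
    # log-mean-exp: subtract the maximum before exponentiating to avoid underflow
    m = np.max(log_w)
    return float(m + np.log(np.mean(np.exp(log_w - m))))


rng = np.random.default_rng(0)
x_obs, sigma = 1.5, 0.5
# Closed-form log p(x) of the toy model, used only as a reference.
log_exact = float(normal_logpdf(x_obs, 0.0, np.sqrt(1.0 + sigma ** 2)))
# A proposal roughly centred on the posterior but deliberately wider than it.
log_est = importance_sampling_log_estimate(
    x_obs, sigma, q_mean=1.0, q_std=0.7, num_samples=10_000, rng=rng
)
print(log_est, log_exact)
```

Note that the logarithm of an unbiased estimate is not itself an unbiased estimate of $\log p(x)$. The log is used here only for numerical stability and for comparison with the reference value.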

The previous naive sampling method corresponds to the special case $q(z) = p(z)$, and the prior can be dramatically different from the posterior distribution $p(z|x)$. In this case, the variance becomes

$$

\begin{align}

\text{Var}_{z \sim p(z)} \left[ \frac{p(x, z)}{p(z)} \right]

&= \mathbb{E}_{z \sim p(z)} \left[ \left( \frac{p(x, z)}{p(z)} - p(x) \right)^2 \right] \\

&= \mathbb{E}_{z \sim p(z)} \left[ \left( p(x | z) - p(x) \right)^2 \right] \\

\end{align}

$$

Consequently, the variance of the naive sampling estimator can be very high.
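The difference in variance can be observed empirically by repeating both estimators many times and comparing their spread. The sketch below assumes the toy model $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(z, \sigma^2)$ with an observation placed in the tail of the prior, and a hand-picked Gaussian proposal centred on the known posterior mean. The model, the proposal, and the names are illustrative assumptions.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def estimate(x, sigma, q_mean, q_std, num_samples, rng):
    """Importance sampling estimate of p(x); q = N(0, 1) recovers the naive method."""
    z = rng.normal(q_mean, q_std, size=num_samples)
    joint = normal_pdf(x, z, sigma) * normal_pdf(z, 0.0, 1.0)  # p(x, z_i)
    return float(np.mean(joint / normal_pdf(z, q_mean, q_std)))


rng = np.random.default_rng(0)
x_obs, sigma = 3.0, 0.2  # an observation in the prior tail with a sharp likelihood
post_mean = x_obs / (1.0 + sigma ** 2)  # posterior mean of the toy model

# Repeat each estimator 500 times with 100 samples per run.
naive = [estimate(x_obs, sigma, 0.0, 1.0, 100, rng) for _ in range(500)]
near_posterior = [estimate(x_obs, sigma, post_mean, 0.3, 100, rng) for _ in range(500)]

print("naive std:", np.std(naive))
print("near-posterior std:", np.std(near_posterior))
```

With the same number of samples per run, the proposal close to the posterior should give a much smaller spread than the prior.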

## Conclusions

Samples drawn from the prior distribution are often not informative enough to estimate the marginal likelihood accurately. Importance sampling from a proposal distribution that is close to the posterior distribution can provide a more accurate estimate.

Many methods, such as variational inference and the variational autoencoder, learn an approximate distribution that is close to the posterior distribution, which can then be used as the proposal distribution for importance sampling to estimate the marginal likelihood.
