Conjugate Priors
Introduction
Conjugate priors are widely used in Bayesian inference, so it is important to understand their basic properties. In this blog post, I will introduce the concept of conjugate priors and walk through a concrete example of proving conjugacy.
Definition
According to Bayes’ theorem, it is universally true that
$$
\underbrace{p(\theta|x)}_\text{posterior} = \frac{p(x|\theta) p(\theta)}{p(x)} \propto \underbrace{p(x|\theta)}_\text{likelihood} \underbrace{p(\theta)}_\text{prior} \propto p(x,\theta)
$$
In Bayesian probability theory, if the posterior distributions $p(\theta|x)$ are in the same probability distribution family as the prior probability distribution $p(\theta)$, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function $p(x|\theta)$.
Example
Suppose we have a random variable $\mu$ drawn from the normal distribution $\mathcal{N}(\mu_0, \sigma_0^2)$, i.e.,
$$
p(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}} e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}
$$
We also have a random variable $x$ drawn from the normal distribution $\mathcal{N}(\mu, \sigma^2)$, i.e.,
$$
p(x|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
Note that $\mu_0$, $\sigma_0^2$, and $\sigma^2$ are not random variables but fixed parameters in this case.
Suppose we observe $N$ data points $X=\{x_1, x_2, \cdots, x_N\}$ independently sampled from the normal distribution $\mathcal{N}(\mu, \sigma^2)$. Then
$$
p(X|\mu) = \Bigg[ \frac{1}{\sqrt{2\pi\sigma^2}} \Bigg]^N e^{-\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{2\sigma^2}}
$$
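As a quick sanity check, here is a minimal Python sketch (parameter values are arbitrary, chosen only for illustration) that verifies this joint likelihood formula against the sum of per-sample log-densities computed with SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical parameter values, chosen only for illustration.
mu, sigma = 1.5, 2.0
rng = np.random.default_rng(0)
X = rng.normal(mu, sigma, size=10)
N = len(X)

# Log of the joint likelihood formula above.
log_joint = -N / 2 * np.log(2 * np.pi * sigma**2) \
    - np.sum((X - mu) ** 2) / (2 * sigma**2)

# Sum of per-sample normal log-densities; should match the formula.
log_product = stats.norm.logpdf(X, loc=mu, scale=sigma).sum()
assert np.isclose(log_joint, log_product)
```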
We will show that $p(\mu|X)$ is also a normal distribution; thus the posterior $p(\mu|X)$ and the prior $p(\mu)$ are in the same normal distribution family and are conjugate, and $p(\mu)$ is a conjugate prior for the likelihood $p(X|\mu)$.
$$
\begin{aligned}
p(\mu|X) &= \frac{p(X|\mu)p(\mu)}{p(X)} \\
&\propto p(X|\mu)p(\mu) \\
&\propto \exp\bigg\{ -\bigg[ \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{2\sigma^2} + \frac{(\mu -\mu_0)^2}{2\sigma_0^2} \bigg] \bigg\}
\end{aligned}
$$
where
$$
\begin{aligned}
&\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{2\sigma^2} + \frac{(\mu -\mu_0)^2}{2\sigma_0^2} \\
&= \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i^2 - 2x_i\mu + \mu^2) + \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\\
&= \frac{1}{2\sigma^2}\sum_{i=1}^{N}x_i^2 - \frac{1}{2\sigma^2} 2\mu \sum_{i=1}^{N} x_i + \frac{1}{2\sigma^2} N \mu^2 + \frac{1}{2\sigma_0^2}\mu^2 - \frac{1}{2\sigma_0^2} 2\mu\mu_0 + \frac{1}{2\sigma_0^2} \mu_0^2 \\
&= (\frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2}) \mu^2 - 2\mu ( \frac{\sum_{i=1}^{N} x_i}{2\sigma^2} + \frac{\mu_0}{2\sigma_0^2} ) + \text{const} \\
&= \frac{1}{(\frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2})^{-1}} \Bigg[ \mu - \frac{\frac{\sum_{i=1}^{N} x_i}{2\sigma^2} + \frac{\mu_0}{2\sigma_0^2}}{\frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2}} \Bigg]^2 + \text{const} \\
&= \frac{1}{2(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2})^{-1}} \Bigg[ \mu - \frac{\frac{N\bar{X}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}} \Bigg]^2 + \text{const} \\
\end{aligned}
$$
Here $\text{const}$ collects terms that do not depend on the random variable $\mu$, and $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$ denotes the sample mean. For simplicity, we define:
$$
\mu^{\prime} = \frac{\frac{N\bar{X}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}} \\
\sigma^{\prime2} = (\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2})^{-1}
$$
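To double-check the completing-the-square algebra, here is a small SymPy sketch; the symbol `S` stands in for $\sum_{i=1}^{N} x_i^2$ and $N\bar{X}$ replaces $\sum_{i=1}^{N} x_i$ (these substitutions and all variable names are mine, not part of the original derivation). It verifies that the exponent before and after completing the square differs only by a term independent of $\mu$:

```python
import sympy as sp

# S stands in for sum_i x_i^2; xbar is the sample mean.
mu, mu0, sigma, sigma0, N, xbar, S = sp.symbols(
    "mu mu_0 sigma sigma_0 N xbar S", positive=True
)

# Exponent before completing the square, with
# sum_i (x_i - mu)^2 expanded as S - 2*N*xbar*mu + N*mu^2.
lhs = (S - 2 * N * xbar * mu + N * mu**2) / (2 * sigma**2) \
    + (mu - mu0) ** 2 / (2 * sigma0**2)

# Completed square using the posterior parameters defined above.
sigma_p2 = 1 / (N / sigma**2 + 1 / sigma0**2)
mu_p = (N * xbar / sigma**2 + mu0 / sigma0**2) * sigma_p2
rhs = (mu - mu_p) ** 2 / (2 * sigma_p2)

# The difference must not depend on mu, so its derivative
# with respect to mu should simplify to zero.
assert sp.simplify(sp.diff(lhs - rhs, mu)) == 0
```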
Therefore, $p(\mu|X)$ follows the normal distribution $\mathcal{N}(\mu^{\prime}, \sigma^{\prime2})$. Since both the posterior $p(\mu|X)$ and the prior $p(\mu)$ belong to the normal distribution family, the prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$ is a conjugate prior for the likelihood $p(X|\mu)$. Note that $p(X|\mu)$ is itself a multivariate normal density, being a product of independent univariate normal densities.
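The posterior parameters can also be verified numerically. The following sketch (all parameter values are arbitrary illustrations) computes a brute-force grid posterior proportional to prior times likelihood and compares its mean and variance with the closed-form $\mu^{\prime}$ and $\sigma^{\prime2}$:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Hypothetical prior and noise parameters for illustration only.
mu0, sigma0, sigma = 0.0, 1.0, 0.5
rng = np.random.default_rng(42)
X = rng.normal(0.3, sigma, size=20)
N, xbar = len(X), X.mean()

# Closed-form posterior parameters from the derivation above.
post_var = 1 / (N / sigma**2 + 1 / sigma0**2)
post_mean = (N * xbar / sigma**2 + mu0 / sigma0**2) * post_var

# Brute-force posterior on a grid: prior times likelihood, renormalized.
grid = np.linspace(-2.0, 2.0, 4001)
log_post = stats.norm.logpdf(grid, mu0, sigma0)
log_post += np.array([stats.norm.logpdf(X, m, sigma).sum() for m in grid])
post = np.exp(log_post - log_post.max())
post /= trapezoid(post, grid)

grid_mean = trapezoid(grid * post, grid)
grid_var = trapezoid((grid - grid_mean) ** 2 * post, grid)
assert np.isclose(grid_mean, post_mean, atol=1e-5)
assert np.isclose(grid_var, post_var, atol=1e-5)
```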
Caveats
Incidentally, the distribution of the sample mean of $X$, $p(\bar{X}|\mu)$, is exactly normal. However, the proof I found was oversimplified.
Here is what they did. They first define the random variables:
$$
\bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i\\
s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X})^2
$$
Then they transform the probability density expression for $p(X|\mu)$:
$$
\begin{aligned}
p(X|\mu) &\propto \exp\bigg\{-\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{2\sigma^2}\bigg\} \\
&\propto \exp\bigg\{-\frac{\sum_{i=1}^{N} \big[ (x_i - \bar{X}) + (\bar{X} - \mu) \big]^2}{2\sigma^2}\bigg\}\\
&\propto \exp\bigg\{-\frac{\sum_{i=1}^{N} (x_i - \bar{X})^2 + 2\sum_{i=1}^{N} (x_i - \bar{X})(\bar{X} - \mu) + \sum_{i=1}^{N} (\bar{X} - \mu)^2 }{2\sigma^2}\bigg\}\\
\end{aligned}
$$
It is not hard to see that
$$
\sum_{i=1}^{N} (x_i - \bar{X})^2 = (N-1)s^2\\
\sum_{i=1}^{N} (\bar{X} - \mu)^2 = N(\bar{X} - \mu)^2
$$
Moreover,
$$
\begin{aligned}
\sum_{i=1}^{N} (x_i - \bar{X})(\bar{X} - \mu) &= \sum_{i=1}^{N}(x_i\bar{X} -x_i\mu - {\bar{X}}^2 + \bar{X}\mu)\\
&= \bar{X}\sum_{i=1}^{N} x_i - \mu \sum_{i=1}^{N} x_i - N {\bar{X}}^2 + N \bar{X}\mu\\
&= \bar{X}N\bar{X} - \mu N\bar{X} - N {\bar{X}}^2 + N \bar{X}\mu\\
&= 0
\end{aligned}
$$
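The whole decomposition $\sum_{i=1}^{N}(x_i-\mu)^2 = (N-1)s^2 + N(\bar{X}-\mu)^2$ is easy to spot-check numerically; here is a minimal sketch with made-up values:

```python
import numpy as np

# Numerical spot-check of
# sum_i (x_i - mu)^2 = (N - 1) s^2 + N (xbar - mu)^2,
# using arbitrary, made-up values.
rng = np.random.default_rng(7)
mu = 0.8
x = rng.normal(mu, 1.3, size=12)
xbar, s2 = x.mean(), x.var(ddof=1)  # ddof=1 gives the (N - 1) denominator
lhs = np.sum((x - mu) ** 2)
rhs = (len(x) - 1) * s2 + len(x) * (xbar - mu) ** 2
assert np.isclose(lhs, rhs)
```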
Therefore,
$$
\begin{aligned}
p(X|\mu) &\propto \exp\bigg\{-\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{2\sigma^2}\bigg\} \\
&\propto \exp\bigg\{-\frac{N(\bar{X} - \mu)^2}{2\sigma^2}\bigg\} \exp\bigg\{-\frac{(N-1)s^2}{2\sigma^2}\bigg\}\\
\end{aligned}
$$
This means
$$
\begin{aligned}
p(\bar{X}|\mu) &\propto \exp\bigg\{-\frac{N(\bar{X} - \mu)^2}{2\sigma^2}\bigg\} \exp\bigg\{-\frac{(N-1)s^2}{2\sigma^2}\bigg\}\\
\end{aligned}
$$
Is the random variable $s$ related to $\bar{X}$? It has been proved that $s$ and $\bar{X}$ are independent when the samples are normally distributed. One of the proofs can be found here; it surprised me, because I had never thought it would be non-trivial. We may come back to this proof in the future.
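This independence claim can at least be probed numerically. The Monte Carlo sketch below (sample sizes and parameters are arbitrary) checks that the empirical correlation between $\bar{X}$ and $s^2$ is near zero; keep in mind that zero correlation is necessary but not sufficient for independence, so this is evidence, not a proof:

```python
import numpy as np

# Monte Carlo probe (not a proof): for normal samples, the sample mean
# and the sample variance should be uncorrelated.
rng = np.random.default_rng(0)
draws = rng.normal(1.0, 2.0, size=(100_000, 10))  # 100k datasets of N=10
xbar = draws.mean(axis=1)
s2 = draws.var(axis=1, ddof=1)
corr = np.corrcoef(xbar, s2)[0, 1]
print(f"corr(xbar, s^2) = {corr:.4f}")  # prints a value close to 0
```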
Therefore, we can drop the factor $\exp\big\{-\frac{(N-1)s^2}{2\sigma^2}\big\}$ from the right-hand side, since $s$ and $\bar{X}$ are independent.
$$
\begin{aligned}
p(\bar{X}|\mu) &\propto \exp\bigg\{-\frac{N(\bar{X} - \mu)^2}{2\sigma^2}\bigg\} \exp\bigg\{-\frac{(N-1)s^2}{2\sigma^2}\bigg\}\\
&\propto \exp\bigg\{-\frac{N(\bar{X} - \mu)^2}{2\sigma^2}\bigg\}
\end{aligned}
$$
Therefore, $p(\bar{X}|\mu)$ is exactly the normal distribution $\mathcal{N}(\mu, \frac{\sigma^2}{N})$.
This is related to the central limit theorem. The central limit theorem states that as the sample size becomes sufficiently large, the distribution of the sample mean approaches a normal distribution, regardless of the distribution the samples come from. For finite samples this is only an approximation, and in general not an exact normal distribution. In our case, the derivation shows that the distribution of the sample mean is exactly normal whenever the samples themselves come from a normal distribution.
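A small simulation illustrates the distinction. Even for a tiny sample size, where the central limit theorem offers no asymptotic guarantee, the sample mean of normal data is consistent with the exact $\mathcal{N}(\mu, \frac{\sigma^2}{N})$ distribution under a Kolmogorov-Smirnov test (all parameter values below are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Simulation sketch with made-up parameters: N is deliberately tiny,
# so the CLT alone would not justify exact normality.
mu, sigma, N = 1.0, 2.0, 3
rng = np.random.default_rng(1)
means = rng.normal(mu, sigma, size=(50_000, N)).mean(axis=1)

# Kolmogorov-Smirnov test against the exact claimed distribution.
stat, pvalue = stats.kstest(means, "norm", args=(mu, sigma / np.sqrt(N)))
print(f"KS statistic = {stat:.4f}, p-value = {pvalue:.3f}")
```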
Final Remarks
Deriving and proving conjugate priors can be complex and tedious. Fortunately, Wikipedia documents the known conjugate priors for many common likelihoods.
One last important property: all members of the exponential family have conjugate priors. This property is very useful in variational inference. The normal distribution is a member of the exponential family; as we have shown, $p(x|\mu)$ (the case where $X$ contains only one sample) has a conjugate prior that is itself normal.
Acknowledgement
I would like to thank CrazyStat from Hacker News for providing critical comments and corrections to this post.
References
Conjugate Priors