Introduction to Exponential Family
Introduction
The exponential family is a set of probability distributions whose probability density function (or probability mass function, in the case of a discrete distribution) can be expressed in the form
$$
p(x|\eta) = h(x) \exp \big\{ T(x)^{\top} \eta - A(\eta) \big\}
$$
where $\eta$ is the parameter of the probability density function and is independent of $x$, and $A(\eta)$ is also independent of $x$. $\eta$ is called the natural parameter of the distribution, $T(x)$ is called the sufficient statistics, $A(\eta)$ is called the log normalizer (we will see why), $h(x)$ is called the base measure, and the above expression is called the natural form of the distribution.
Many common distributions, such as the normal, categorical, gamma, and Dirichlet distributions, belong to the exponential family.
In this blog post, I will use the normal distribution as an example to show how to derive $h(x)$, $T(x)$, $\eta$ and $A(\eta)$. I will also discuss some of the properties of the exponential family that we will use for variational inference.
Log Normalizer
Source of Name
Because $p(x|\eta)$ is a probability density, its integral over $x$ is 1.
$$
\begin{aligned}
\int\limits_{x} p(x|\eta) dx &= \int\limits_{x} h(x) \exp \big\{ T(x)^{\top} \eta - A(\eta) \big\} dx \\
&= \frac{ \int\limits_{x} h(x) \exp \big\{ T(x)^{\top} \eta \big\} dx}{\exp\{ A(\eta) \}} =1
\end{aligned}
$$
We then have
$$
A(\eta) = \log \bigg[ \int\limits_{x} h(x) \exp \big\{ T(x)^{\top} \eta \big\} dx \bigg]
$$
Therefore $A(\eta)$ is a log normalizer for $h(x) \exp\{ T(x)^{\top} \eta\}$.
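As a quick concrete example (not part of the original derivation), take $h(x) = 1$ and $T(x) = x$ on $x \geq 0$ with $\eta < 0$; this is the exponential distribution with rate $-\eta$, and the log normalizer can be evaluated in closed form:
$$
A(\eta) = \log \bigg[ \int_{0}^{\infty} \exp \big\{ x \eta \big\} dx \bigg] = \log \Big( -\frac{1}{\eta} \Big) = -\log(-\eta)
$$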
Derivative of Log Normalizer
The derivative of the log normalizer has an important property, which will be used in many other applications, such as variational inference.
$$
\frac{d}{d \eta} A(\eta) = \mathbb{E}_{x \sim p(x|\eta)} [T(x)]
$$
Let’s see how to derive this. Because
$$
\int\limits_{x} p(x|\eta) dx = \int\limits_{x} h(x) \exp \big\{ T(x)^{\top} \eta - A(\eta) \big\} dx = 1
$$
We take the derivative with respect to $\eta$,
$$
\frac{d}{d \eta} \int\limits_{x} p(x|\eta) dx = \frac{d}{d \eta} \int\limits_{x} h(x) \exp \big\{ T(x)^{\top} \eta - A(\eta) \big\} dx = 0
$$
We then use a special case of the Leibniz integral rule, which allows us to exchange the order of differentiation and integration.
$$
\begin{aligned}
\frac{d}{d \eta} \int\limits_{x} p(x|\eta) dx &= \frac{d}{d \eta} \int\limits_{x} h(x) \exp \big\{ T(x)^{\top} \eta - A(\eta) \big\} dx \\
&= \int\limits_{x} \frac{ \partial }{\partial \eta} \bigg[ h(x) \exp \big\{ T(x)^{\top} \eta - A(\eta) \big\} \bigg] dx\\
&= \int\limits_{x} \underbrace{ h(x) \exp \big\{ T(x)^{\top} \eta - A(\eta) \big\} }_{p(x|\eta)} [T(x) - \frac{d}{d \eta} A(\eta)] dx \\
&= \int\limits_{x} p(x|\eta) T(x) dx - \int\limits_{x} p(x|\eta) \frac{d}{d \eta} A(\eta) dx \\
&= \underbrace{ \int\limits_{x} p(x|\eta) T(x) }_{\mathbb{E}_{x \sim p(x|\eta)} [T(x)]} - \frac{d}{d \eta} A(\eta) \underbrace{ \int\limits_{x} p(x|\eta) dx }_{1} \\
&= \mathbb{E}_{x \sim p(x|\eta)} [T(x)] - \frac{d}{d \eta} A(\eta) \\
&= 0
\end{aligned}
$$
Therefore,
$$
\frac{d}{d \eta} A(\eta) = \mathbb{E}_{x \sim p(x|\eta)} [T(x)]
$$
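As a sanity check, here is a minimal numerical sketch of this property (my own illustration, not from the original derivation), using the exponential distribution example above with $h(x) = 1$ and $T(x) = x$: we compute $A(\eta)$ by numerical integration, take a finite difference with respect to $\eta$, and compare it with $\mathbb{E}_{x \sim p(x|\eta)} [T(x)]$. The function names and constants here are arbitrary.

```python
import numpy as np

def integrate(y, x):
    # Simple trapezoidal rule.
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Grid over x >= 0; the integrand decays quickly for eta < 0.
x = np.linspace(0.0, 50.0, 500_001)

def log_normalizer(eta):
    # A(eta) = log int_0^inf h(x) exp{T(x) * eta} dx with h(x) = 1, T(x) = x.
    return np.log(integrate(np.exp(eta * x), x))

eta = -2.0
eps = 1e-4

# Finite-difference approximation of dA/d(eta).
dA_deta = (log_normalizer(eta + eps) - log_normalizer(eta - eps)) / (2.0 * eps)

# E[T(x)] = E[x] under p(x|eta), computed by numerical integration.
p = np.exp(eta * x - log_normalizer(eta))
expected_T = integrate(x * p, x)

print(dA_deta, expected_T)  # Both should be close to -1 / eta = 0.5.
```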
Conjugate Priors
All members of the exponential family have conjugate priors. If you don’t remember what a conjugate prior is, please check my blog post on conjugate priors. Here I will give a proof of this theorem.
Given a likelihood $p(x|\beta)$ ($\beta$ is its natural parameter) from any member of the exponential family, we want to find a prior $p(\beta)$ that belongs to the same family as the posterior $p(\beta | x)$.
Because $p(x|\beta)$ is from the exponential family, we can write $p(x|\beta)$ in its natural form.
$$
p(x|\beta) = h(x) \exp \big\{ T(x)^{\top} \beta - A(\beta) \big\}
$$
Here we do not limit the dimension of $\beta$; it could be a column vector $\beta = [\beta_1, \beta_2, \cdots, \beta_N]^{\top}$.
We then assume that we can find a conjugate prior $p(\beta)$ for the likelihood from the exponential family, and that $p(\beta)$ has the following natural form.
$$
p(\beta) = h^{\prime}(\beta) \exp \big\{ T^{\prime}(\beta)^{\top} \alpha - A^{\prime}(\alpha) \big\}
$$
where
$$
T^{\prime}(\beta) = \begin{bmatrix} \beta \\ -A(\beta) \end{bmatrix}
$$
Note that because $\beta \in \mathbb{R}^{N}$ and $A(\beta)$ is a scalar, $T^{\prime}(\beta) \in \mathbb{R}^{N+1}$. $\alpha$ is the natural parameter for $p(\beta)$ and $\alpha \in \mathbb{R}^{N+1}$. We partition $\alpha$ into two parts
$$
\alpha = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}
$$
where $\alpha_1 \in \mathbb{R}^{N}$ and $\alpha_2 \in \mathbb{R}^{1}$.
According to Bayes’ theorem,
$$
p(\beta | x) = \frac{p(x|\beta) p(\beta)}{p(x)} \propto p(x | \beta) p(\beta)
$$
We use $\propto$ here because $p(x)$ is constant with respect to $\beta$ and thus does not change the family of $p(x | \beta) p(\beta)$.
$$
\begin{aligned}
p(\beta | x) &\propto p(x | \beta) p(\beta) \\
&\propto h(x) \exp \big\{ T(x)^{\top} \beta - A(\beta) \big\} h^{\prime}(\beta) \exp \big\{ T^{\prime}(\beta)^{\top} \alpha - A^{\prime}(\alpha)\big\} \\
&\propto h^{\prime}(\beta) \exp \big\{ T(x)^{\top} \beta - A(\beta) + T^{\prime}(\beta)^{\top} \alpha \big\} \\
&\propto h^{\prime}(\beta) \exp \Big\{ T(x)^{\top} \beta - A(\beta) + \big[ \beta^{\top}\alpha_1 - A(\beta)\alpha_2 \big] \Big\} \\
&\propto h^{\prime}(\beta) \exp \Big\{ T(x)^{\top} \beta - A(\beta) + \big[ \alpha_1^{\top}\beta - A(\beta)\alpha_2 \big]\Big\} \\
&\propto h^{\prime}(\beta) \exp \Big\{ \big[ T(x)^{\top} + \alpha_1^{\top} \big] \beta - A(\beta) (1 + \alpha_2) \Big\} \\
&\propto h^{\prime}(\beta) \exp \Big\{ \big[ T(x) + \alpha_1 \big]^{\top} \beta - (1 + \alpha_2) A(\beta) \Big\} \\
\end{aligned}
$$
We define
$$
\hat{\alpha}_1 = T(x) + \alpha_1 \\
\hat{\alpha}_2 = (1 + \alpha_2) \\
\hat{\alpha} = \begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix}
$$
Then we have
$$
\begin{aligned}
p(\beta | x) &\propto p(x | \beta) p(\beta) \\
&\propto h^{\prime}(\beta) \exp \Big\{ \hat{\alpha}_1^{\top} \beta - \hat{\alpha}_2 A(\beta) \Big\} \\
&\propto h^{\prime}(\beta) \exp \Big\{ \begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix}^{\top} T^{\prime}(\beta) \Big\} \\
&\propto h^{\prime}(\beta) \exp \Big\{ T^{\prime}(\beta)^{\top} \begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix} \Big\} \\
&\propto h^{\prime}(\beta) \exp \Big\{ T^{\prime}(\beta)^{\top} \hat{\alpha} \Big\} \\
\end{aligned}
$$
We can see that $p(\beta | x)$ has the same natural form as the prior $p(\beta)$, with the natural parameter updated from $\alpha$ to $\hat{\alpha}$, so it belongs to the same exponential family member. Therefore we can always find a conjugate prior for a likelihood from the exponential family.
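To make the update concrete, here is a minimal sketch (my own illustration; the function name and the example numbers are made up) of the posterior natural-parameter update derived above for a single observation $x$: $\hat{\alpha}_1 = T(x) + \alpha_1$ and $\hat{\alpha}_2 = 1 + \alpha_2$.

```python
import numpy as np

def conjugate_posterior_update(alpha, t_x):
    """Posterior natural parameter of the conjugate prior after one observation.

    alpha: prior natural parameter, shape (N + 1,).
    t_x:   sufficient statistics T(x) of the observation, shape (N,).
    """
    alpha_1, alpha_2 = alpha[:-1], alpha[-1]
    alpha_1_hat = t_x + alpha_1      # alpha_1_hat = T(x) + alpha_1
    alpha_2_hat = 1.0 + alpha_2      # alpha_2_hat = 1 + alpha_2
    return np.concatenate([alpha_1_hat, [alpha_2_hat]])

# Example with a 2-dimensional natural parameter beta (N = 2), so T(x) is
# 2-dimensional and the prior natural parameter alpha is 3-dimensional.
alpha = np.array([0.5, -1.0, 3.0])
t_x = np.array([1.2, 1.44])
print(conjugate_posterior_update(alpha, t_x))  # [1.7, 0.44, 4.0]
```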
Natural Form of Distribution
Normal Distribution
For the normal distribution, we have the probability density
$$
p(x|\theta) = p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^ { -\frac{(x - \mu)^2}{2\sigma^2} }
$$
Here $\theta$ is called the (ordinary) parameter of the distribution, as opposed to the natural parameter $\eta$ of the distribution. We will convert this form to the natural form of the normal distribution.
$$
\begin{aligned}
p(x | \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp \bigg\{ -\frac{(x - \mu)^2}{2\sigma^2} \bigg\} \\
&= \exp \bigg\{ -\frac{(x - \mu)^2}{2\sigma^2} + \frac{-1}{2} \log(2 \pi \sigma^2) \bigg\} \\
&= \exp \bigg\{ -\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \log(2 \pi \sigma^2) \bigg\} \\
&= \exp \bigg\{ -\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x \bigg\} \exp \bigg\{- \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \log(2 \pi \sigma^2) \bigg\} \\
&= \exp \Bigg\{ \begin{bmatrix} x\\ x^2 \end{bmatrix}^{\top} \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix} \Bigg\} \exp \bigg\{- \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \log(2 \pi \sigma^2) \bigg\} \\
&= \exp \Bigg\{ -\frac{1}{2} \log (2\pi) \Bigg\} \exp \Bigg\{ \begin{bmatrix} x\\ x^2 \end{bmatrix}^{\top} \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix} \Bigg\} \exp \bigg\{- \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \log \sigma^2 \bigg\} \\
&= \frac{1}{\sqrt{2\pi}} \exp \Bigg\{ \begin{bmatrix} x\\ x^2 \end{bmatrix}^{\top} \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix} - \bigg\{\frac{\mu^2}{2\sigma^2} + \frac{1}{2} \log \sigma^2 \bigg\} \Bigg\} \\
\end{aligned}
$$
So far, we can see that the base measure, the sufficient statistics, and the natural parameter for the normal distribution are
$$
h(x) = \frac{1}{\sqrt{2\pi}}\\
T(x) = \begin{bmatrix} x\\ x^2 \end{bmatrix}\\
\eta = \begin{bmatrix} \eta_1\\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}
$$
It is not hard to see that
$$
\theta = \begin{bmatrix} \mu\\ \sigma^2 \end{bmatrix} = \begin{bmatrix} -\frac{\eta_1}{2\eta_2}\\ -\frac{1}{2\eta_2} \end{bmatrix}
$$
Thus,
$$
\begin{aligned}
& \frac{\mu^2}{2\sigma^2} + \frac{1}{2} \log \sigma^2 \\
&= \bigg[ \frac{\big( -\frac{\eta_1}{2\eta_2} \big)^2}{-2\frac{1}{2\eta_2}} \bigg] + \frac{1}{2} \log \bigg[ - \frac{1}{2\eta_2} \bigg]\\
&= - \frac{\eta_1 ^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2) \\
\end{aligned}
$$
This term is actually the log normalizer of the distribution.
$$
A(\eta) = - \frac{\eta_1 ^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2)
$$
Therefore, the natural form of normal distribution is
$$
\begin{aligned}
p(x | \eta) &= \underbrace{ \frac{1}{\sqrt{2\pi}} }_{h(x)} \exp \Bigg\{ {\underbrace{ \begin{bmatrix} x\\ x^2 \end{bmatrix} }_{T(x)}} ^{\top} \underbrace{ \begin{bmatrix} \eta_1\\ \eta_2 \end{bmatrix} }_{\eta} - \underbrace{ \big[ -\frac{\eta_1 ^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2) \big] }_{A(\eta)} \Bigg\}\\
\end{aligned}
$$
Comparing this with the natural form of the normal distribution on Wikipedia, we see that they are exactly the same.
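As a quick numerical sanity check (my own sketch; the values of $\mu$, $\sigma^2$, and $x$ are arbitrary), we can convert $(\mu, \sigma^2)$ to the natural parameter $\eta$, evaluate the natural form $p(x | \eta) = h(x) \exp\{T(x)^{\top} \eta - A(\eta)\}$, and compare it with the standard normal density formula.

```python
import numpy as np

mu, sigma2 = 1.5, 0.8
eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])  # natural parameter

def natural_form_pdf(x, eta):
    h = 1.0 / np.sqrt(2.0 * np.pi)     # base measure h(x)
    t = np.array([x, x ** 2])          # sufficient statistics T(x)
    a = -eta[0] ** 2 / (4.0 * eta[1]) - 0.5 * np.log(-2.0 * eta[1])  # log normalizer A(eta)
    return h * np.exp(t @ eta - a)

def standard_pdf(x, mu, sigma2):
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

x = 0.3
print(natural_form_pdf(x, eta), standard_pdf(x, mu, sigma2))  # Should match.
```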
Maximum Likelihood Estimation
Original Form
We have a collection of samples $X = \{ x_1, x_2, \cdots, x_N \}$. Each sample $x_i$ is independently and identically drawn from a normal distribution $\mathcal{N}(\mu, \sigma^2)$.
$$
p(X | \theta) = p(X | \mu, \sigma^2) = \prod_{i=1}^{N} p(x_i | \mu, \sigma^2)
$$
Maximizing $p(X | \theta)$ is equivalent to maximizing $\log p(X | \theta)$, so we have
$$
\begin{aligned}
\DeclareMathOperator*{\argmax}{argmax}
\hat{\theta} &= \argmax_{\theta} \log p(X | \theta) \\
&= \argmax_{\mu, \sigma^2} \log p(X | \mu, \sigma^2) \\
&= \argmax_{\mu, \sigma^2} \Big\{ \sum_{i=1}^{N} \log p(x_i | \mu, \sigma^2) \Big\}\\
&= \argmax_{\mu, \sigma^2} \Bigg\{ \sum_{i=1}^{N} \log \bigg[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp \big\{ -\frac{(x_i-\mu)^2}{2\sigma^2} \big\} \bigg] \Bigg\}\\
&= \argmax_{\mu, \sigma^2} \Bigg\{ N\log \Big( \frac{1}{\sqrt{2\pi\sigma^2}} \Big) + \sum_{i=1}^{N} \bigg[ -\frac{(x_i-\mu)^2}{2\sigma^2} \bigg] \Bigg\}\\
&= \argmax_{\mu, \sigma^2} \Bigg\{ -\frac{N}{2} \log \Big( 2\pi\sigma^2 \Big) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigg[ (x_i-\mu)^2 \bigg] \Bigg\}\\
\end{aligned}
$$
We first take the derivative with respect to $\mu$.
$$
\begin{aligned}
\frac{\partial}{\partial \mu} \log p(X | \mu, \sigma^2) &= \frac{1}{2\sigma^2} \sum_{i=1}^{N} 2(x_i - \mu) \\
&= \frac{1}{\sigma^2} \bigg[ \Big( \sum_{i=1}^{N} x_i \Big) - N \mu
\bigg] \\
&= 0
\end{aligned}
$$
We then take the derivative with respect to $\sigma^2$. Note that in the second-to-last step below we use $\sum_{i=1}^{N} x_i = N \mu$, which holds at the optimum according to the previous equation.
$$
\begin{aligned}
\frac{\partial}{\partial \sigma^2} \log p(X | \mu, \sigma^2) &= -\frac{N}{2} \frac{1}{\sigma^2} - \frac{1}{2} \frac{-1}{\sigma^4} \sum_{i=1}^{N} \Big[ (x_i-\mu)^2 \Big] \\
&= \frac{1}{2\sigma^2} \bigg[ -N + \frac{1}{\sigma^2} \sum_{i=1}^{N} \Big[ (x_i-\mu)^2 \Big] \bigg] \\
&= \frac{1}{2\sigma^2} \bigg[ -N + \frac{1}{\sigma^2} \Big[ \sum_{i=1}^{N} x_i^2 - 2\mu \sum_{i=1}^{N} x_i + N \mu^2 \Big] \bigg] \\
&= \frac{1}{2\sigma^2} \bigg[ -N + \frac{1}{\sigma^2} \Big[ \sum_{i=1}^{N} x_i^2 - N\mu^2 \Big] \bigg] \\
&= 0
\end{aligned}
$$
We finally solve the above two equations and get
$$
\begin{aligned}
\hat{\mu} &= \frac{\sum_{i=1}^{N} x_i}{N} \\
\hat{\sigma}^2 &= \frac{\sum_{i=1}^{N} x_i^2}{N} - \hat{\mu}^2 \\
&= \frac{\sum_{i=1}^{N} x_i^2}{N} - \big(\frac{\sum_{i=1}^{N} x_i}{N}\big)^2
\end{aligned}
$$
Therefore,
$$
\hat{\theta} = \begin{bmatrix} \hat{\mu}\\ \hat{\sigma}^2 \end{bmatrix} = \begin{bmatrix} \frac{\sum_{i=1}^{N} x_i}{N} \\ \frac{ \sum_{i=1}^{N}x_i^2 }{N} - \big[ \frac{\sum_{i=1}^{N}x_i}{N} \big]^2 \end{bmatrix}
$$
We can see that these derivations are complicated. Moreover, if the distribution from which the samples are drawn changes, we need to derive everything again from scratch using the probability density of the new distribution.
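For completeness, here is a minimal numerical sketch (my own; the random seed and the true parameters are arbitrary) of the closed-form estimates derived above.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=10_000)  # true mu = 2.0, sigma^2 = 2.25

n = len(samples)
mu_hat = np.sum(samples) / n
sigma2_hat = np.sum(samples ** 2) / n - mu_hat ** 2

print(mu_hat, sigma2_hat)  # Should be close to 2.0 and 2.25.
```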
Natural Form
We now use the natural form of normal distribution to do maximum likelihood estimation for the same task.
$$
p(X | \eta) = \prod_{i=1}^{N} p(x_i | \eta)
$$
Maximizing $p(X | \eta)$ is equivalent to maximizing $\log p(X | \eta)$, so we have
$$
\begin{aligned}
\DeclareMathOperator*{\argmax}{argmax}
\hat{\eta} &= \argmax_{\eta} \log p(X | \eta) \\
&= \argmax_{\eta} \Big\{ \sum_{i=1}^{N} \log p(x_i | \eta) \Big\}\\
&= \argmax_{\eta} \bigg\{ \sum_{i=1}^{N} \log \Big[ h(x_i) \exp \big\{ T(x_i)^{\top} \eta - A(\eta) \big\} \Big] \bigg\}\\
&= \argmax_{\eta} \bigg\{ \sum_{i=1}^{N} \Big[ \log h(x_i) + T(x_i)^{\top} \eta - A(\eta) \Big] \bigg\}\\
&= \argmax_{\eta} \bigg\{ \sum_{i=1}^{N} \log h(x_i) + \sum_{i=1}^{N} T(x_i)^{\top} \eta - N A(\eta) \bigg\}\\
\end{aligned}
$$
We take the derivative with respect to $\eta$.
$$
\begin{aligned}
\frac{\partial}{\partial \eta} \log p(X | \eta) &= \sum_{i=1}^{N} T(x_i) - N \frac{d}{d\eta} A(\eta)\\
&= 0
\end{aligned}
$$
Therefore,
$$
\frac{d}{d\eta} A(\eta) \Bigr|_{\hat{\eta}} = A^{\prime}(\hat{\eta})= \frac{\sum_{i=1}^{N} T(x_i)}{N} = \mathbb{E}_{x \in X} [T(x)]
$$
This is a general equation for maximum likelihood estimation that applies to all distributions in the exponential family.
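To illustrate the generality (a hedged sketch of my own, not an example from the original post), the same moment-matching equation gives the maximum likelihood estimate for the Bernoulli distribution, whose natural form has $T(x) = x$, $\eta = \log \frac{p}{1-p}$, and $A(\eta) = \log(1 + e^{\eta})$, so that $A^{\prime}(\eta)$ is the sigmoid function.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=0.3, size=10_000)  # true p = 0.3

t_mean = np.mean(samples)                    # empirical mean of T(x) = x
eta_hat = np.log(t_mean / (1.0 - t_mean))    # solve sigmoid(eta_hat) = t_mean
p_hat = 1.0 / (1.0 + np.exp(-eta_hat))       # map back to the ordinary parameter

print(p_hat, t_mean)  # p_hat equals the sample mean, the usual Bernoulli MLE.
```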
Specifically, for the normal distribution,
$$
T(x_i) = \begin{bmatrix} x_i\\ x_i^2 \end{bmatrix}\\
\eta = \begin{bmatrix} \eta_1\\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2}\\ -\frac{1}{2\sigma^2} \end{bmatrix}\\
A(\eta) = - \frac{\eta_1 ^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2)
$$
$$
\begin{aligned}
\frac{d}{d\eta} A(\eta) \Bigr|_{\hat{\eta}} &= A^{\prime}(\hat{\eta}) \\
&= \begin{bmatrix} \frac{\partial}{\partial \eta_1} A(\eta)\\ \frac{\partial}{\partial \eta_2} A(\eta) \end{bmatrix}_{\hat{\eta}}\\
&= \begin{bmatrix} -\frac{\eta_1}{2\eta_2} \\ -\frac{\eta_1^2}{4} \frac{-1}{\eta_2^2} - \frac{1}{2} \frac{-2}{-2\eta_2} \end{bmatrix} _{\hat{\eta}} \\
&= \begin{bmatrix} -\frac{\eta_1}{2\eta_2} \\ \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2} \end{bmatrix}_{\hat{\eta}}\\
&= \begin{bmatrix} -\frac{\hat{\eta}_1}{2\hat{\eta}_2} \\ \frac{\hat{\eta}_1^2}{4\hat{\eta}_2^2} - \frac{1}{2\hat{\eta}_2} \end{bmatrix} \\
&= \frac{\sum_{i=1}^{N} T(x_i)}{N}\\
&= \begin{bmatrix} \frac{\sum_{i=1}^{N} x_i}{N}\\ \frac{\sum_{i=1}^{N} x_i^2}{N} \end{bmatrix}\\
\end{aligned}
$$
We solve the above equations and get
$$
\hat{\eta} = \begin{bmatrix} \hat{\eta}_1\\ \hat{\eta}_2 \end{bmatrix} = \begin{bmatrix} -\frac{\frac{\sum_{i=1}^{N}x_i}{N}}{\big[ \frac{\sum_{i=1}^{N}x_i}{N} \big]^2 - \frac{ \sum_{i=1}^{N}x_i^2 }{N} }\\ \frac{\frac{1}{2}}{\big[ \frac{\sum_{i=1}^{N}x_i}{N} \big]^2 - \frac{ \sum_{i=1}^{N}x_i^2 }{N} } \end{bmatrix}
$$
Let’s finally check whether the maximum likelihood estimates from the original form and the natural form are equivalent. Note that this check is not required in practice.
$$
\hat{\eta} = \begin{bmatrix} \hat{\eta}_1\\ \hat{\eta}_2 \end{bmatrix} = \begin{bmatrix} -\frac{\frac{\sum_{i=1}^{N}x_i}{N}}{\big[ \frac{\sum_{i=1}^{N}x_i}{N} \big]^2 - \frac{ \sum_{i=1}^{N}x_i^2 }{N} }\\ \frac{\frac{1}{2}}{\big[ \frac{\sum_{i=1}^{N}x_i}{N} \big]^2 - \frac{ \sum_{i=1}^{N}x_i^2 }{N} } \end{bmatrix} = \begin{bmatrix} \frac{\hat{\mu}}{\hat{\sigma}^2}\\ -\frac{1}{2\hat{\sigma}^2} \end{bmatrix}\\
$$
$$
\hat{\theta} = \begin{bmatrix} \hat{\mu}\\ \hat{\sigma}^2 \end{bmatrix} = \begin{bmatrix} \frac{\sum_{i=1}^{N} x_i}{N} \\ \frac{ \sum_{i=1}^{N}x_i^2 }{N} - \big[ \frac{\sum_{i=1}^{N}x_i}{N} \big]^2 \end{bmatrix}
$$
This is exactly the same as what we got from the original form, but the derivation is much simpler because we already know $T(x)$, $\eta$, and $A(\eta)$. For other distributions in the exponential family, the corresponding $T(x)$, $\eta$, and $A(\eta)$ can be found on Wikipedia.
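Here is a matching numerical sketch (again my own, with the same arbitrary seed and true parameters as the earlier snippet) that estimates $\eta$ by matching $A^{\prime}(\hat{\eta})$ to the empirical mean of $T(x)$ and then converts back to $(\hat{\mu}, \hat{\sigma}^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=10_000)  # true mu = 2.0, sigma^2 = 2.25

# Empirical mean of the sufficient statistics T(x) = [x, x^2].
t_mean = np.array([np.mean(samples), np.mean(samples ** 2)])

# Closed-form solution of A'(eta_hat) = t_mean from the derivation above.
denom = t_mean[0] ** 2 - t_mean[1]           # equals -sigma2_hat
eta_hat = np.array([-t_mean[0] / denom, 0.5 / denom])

# Convert the natural parameter back to the ordinary parameter theta.
mu_hat = -eta_hat[0] / (2.0 * eta_hat[1])
sigma2_hat = -1.0 / (2.0 * eta_hat[1])

print(mu_hat, sigma2_hat)  # Matches the original-form estimates.
```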
Final Remarks
The natural forms of all the exponential family members can be found on Wikipedia. Life becomes easier.