Maximum Likelihood Estimation VS Maximum A Posteriori Estimation
Introduction
In non-probabilistic machine learning, maximum likelihood estimation (MLE) is one of the most common methods for optimizing a model. In probabilistic machine learning, we often see maximum a posteriori estimation (MAP) rather than maximum likelihood estimation for optimizing a model.
In this blog post, I would like to discuss the connections between the MLE and MAP methods.
Bayes’ Theorem
Bayes’ theorem is stated mathematically as the following equation.
$$
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\underbrace{P(A|B)}_\text{posterior} = \frac{\underbrace{P(B|A)}_\text{likelihood} \underbrace{P(A)}_\text{prior}}{\underbrace{P(B)}_\text{marginal}}
$$
where $P(A|B)$ is the posterior probability, $P(B|A)$ is the likelihood probability, $P(A)$ and $P(B)$ are prior probabilities, and $P(B)$ is also often referred to as the marginal probability, since it is a marginalization of the joint probability $P(A, B)$ over the variable $A$.
If $A$ is a continuous variable,
$$
P(B) = \int_{A} P(A, B) \, dA = \int_{A} P(B | A) P(A) \, dA
$$
If $A$ is a discrete variable,
$$
P(B) = \sum_{A} P(A, B) = \sum_{A} P(B | A) P(A)
$$
Notice that $P(B)$ is a constant with respect to the variable $A$, so we could safely say $P(A|B)$ is proportional to $P(B|A) P(A)$ with respect to the variable $A$. Mathematically,
$$
P(A|B) \propto P(B|A) P(A)
$$
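As a concrete illustration, here is a minimal Python sketch of Bayes' theorem for a discrete variable $A$. The coin scenario and all of the numbers are hypothetical, chosen only to show the posterior, likelihood, prior, and marginal in code.

```python
# A hypothetical discrete example of Bayes' theorem: A is a coin type
# (fair or biased), B is the observed outcome "heads".
prior = {"fair": 0.5, "biased": 0.5}       # P(A)
likelihood = {"fair": 0.5, "biased": 0.9}  # P(B = heads | A)

# Marginal P(B = heads): marginalize the joint P(A, B) over A.
marginal = sum(likelihood[a] * prior[a] for a in prior)

# Posterior P(A | B = heads) via Bayes' theorem.
posterior = {a: likelihood[a] * prior[a] / marginal for a in prior}

print(posterior)
```

Note that the marginal only rescales the numerator so that the posterior sums to one; the ranking of the candidates is already determined by $P(B|A) P(A)$.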
Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation, as its name suggests, maximizes the likelihood probability $P(B|A)$ in Bayes' theorem with respect to the variable $A$, given that the variable $B$ is observed.
Mathematically, maximum likelihood estimation could be expressed as
$$
a^{\ast}_{\text{MLE}} = \argmax_{A} P(B = b | A)
$$
Because the logarithm is a monotonically increasing function, this is equivalent to optimizing in the log domain, assuming $P(B = b | A) > 0$.
$$
a^{\ast}_{\text{MLE}} = \argmax_{A} \log P(B = b | A)
$$
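The MLE recipe above can be sketched numerically. The following is a minimal, hypothetical example: Bernoulli coin-flip data and a grid search over candidate parameter values, maximizing the log-likelihood. The data and names are illustrative only, not from any real experiment.

```python
import math

# Hypothetical coin-flip data: 7 heads out of 10 flips.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def log_likelihood(p, data):
    # log P(B = b | A = p) for i.i.d. Bernoulli observations.
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

# Grid search over candidate parameter values in (0, 1).
candidates = [i / 100 for i in range(1, 100)]
p_mle = max(candidates, key=lambda p: log_likelihood(p, data))

print(p_mle)
```

The maximizer coincides with the closed-form Bernoulli MLE, the sample mean (here $7/10$), and maximizing the log-likelihood selects the same $p$ as maximizing the likelihood itself.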
Maximum A Posteriori Estimation (MAP)
Maximum a posteriori estimation, as its name suggests, maximizes the posterior probability $P(A | B)$ in Bayes' theorem with respect to the variable $A$, given that the variable $B$ is observed.
Mathematically, maximum a posteriori estimation could be expressed as
$$
a^{\ast}_{\text{MAP}} = \argmax_{A} P(A | B = b)
$$
Because the logarithm is a monotonically increasing function, this is equivalent to optimizing in the log domain, assuming $P(A | B = b) > 0$.
$$
a^{\ast}_{\text{MAP}} = \argmax_{A} \log P(A | B = b)
$$
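Analogously, MAP only adds a log-prior term to the objective. Here is a minimal sketch, assuming a hypothetical Beta(2, 2) prior on the same Bernoulli coin parameter as before; the prior choice is illustrative, not prescribed by the post.

```python
import math

# The same hypothetical coin-flip data: 7 heads out of 10 flips.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def log_likelihood(p):
    # log P(B = b | A = p) for i.i.d. Bernoulli observations.
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

def log_prior(p):
    # A Beta(2, 2) prior on p, up to an additive constant that does not
    # affect the argmax.
    return math.log(p) + math.log(1.0 - p)

# MAP maximizes log-likelihood plus log-prior over the candidate grid.
candidates = [i / 1000 for i in range(1, 1000)]
p_map = max(candidates, key=lambda p: log_likelihood(p) + log_prior(p))

print(p_map)
```

For a Beta($\alpha$, $\beta$) prior on a Bernoulli parameter, the MAP estimate has the closed form $(\text{heads} + \alpha - 1) / (N + \alpha + \beta - 2)$, here $8/12 \approx 0.667$, which the grid search approximates; the prior pulls the estimate away from the MLE of $0.7$ toward $0.5$.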
MLE and MAP Relationship
By applying Bayes’ theorem, we have
$$
\begin{align}
P(A | B = b) &= \frac{P(B = b | A)P(A)}{P(B = b)} \\
&\propto P(B = b|A) P(A)
\end{align}
$$
Therefore, maximum a posteriori estimation could be expanded as
$$
\begin{align}
a^{\ast}_{\text{MAP}} &= \argmax_{A} P(A | B = b) \\
&= \argmax_{A} \log P(A | B = b) \\
&= \argmax_{A} \log \frac{P(B = b | A)P(A)}{P(B = b)} \\
&= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) - \log P(B = b) \Big) \\
&= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) \Big) \\
\end{align}
$$
If the prior probability $P(A)$ is a uniform distribution, i.e., $P(A)$ is a constant, we further have
$$
\begin{align}
a^{\ast}_{\text{MAP}} &= \argmax_{A} P(A | B = b) \\
&= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) \Big) \\
&= \argmax_{A} \log P(B = b | A) \\
&= a^{\ast}_{\text{MLE}}
\end{align}
$$
Therefore, we can conclude that maximum likelihood estimation is a special case of maximum a posteriori estimation in which the prior probability is a uniform distribution.
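This special case is easy to verify numerically: with a uniform prior, the log-prior is the same constant for every candidate, so it cannot change the argmax. A minimal sketch with hypothetical data:

```python
import math

# Hypothetical coin-flip data: 4 heads out of 6 flips.
data = [1, 0, 1, 1, 0, 1]

def log_likelihood(p):
    # log P(B = b | A = p) for i.i.d. Bernoulli observations.
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

candidates = [i / 100 for i in range(1, 100)]

# MLE: maximize the log-likelihood alone.
p_mle = max(candidates, key=log_likelihood)

# MAP with a uniform prior on [0, 1]: log P(A) = log 1 = 0, the same
# constant for every candidate, so the winning candidate is unchanged.
log_prior_const = math.log(1.0)
p_map = max(candidates, key=lambda p: log_likelihood(p) + log_prior_const)

print(p_mle, p_map)
```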
Which One to Use
Whether to use maximum likelihood estimation or maximum a posteriori estimation in optimization really depends on the use case. If we know the probability distributions for both the likelihood probability $P(B | A)$ and the prior probability $P(A)$, we can use maximum a posteriori estimation.
However, in many practical optimization problems, we actually don’t know the distribution for the prior probability $P(A)$. Therefore, applying maximum a posteriori estimation is not possible, and we can only apply maximum likelihood estimation.
For example, suppose we are going to find the optimal parameters for a model. In the model, we have parameter variables $\theta$ and data variables $X$. Given some training data $\{x_1, x_2, \cdots, x_N \}$, we want to find the most likely parameter $\theta^{\ast}$ of the model given the training data. Mathematically, this is essentially maximum a posteriori estimation, expressed as
$$
\begin{align}
\theta^{\ast} &= \argmax_{\theta} \prod_{i=1}^{N} P(\theta | X = x_i) \\
&= \argmax_{\theta} \log \prod_{i=1}^{N} P(\theta | X = x_i) \\
&= \argmax_{\theta} \sum_{i=1}^{N} \log P(\theta | X = x_i) \\
&= \argmax_{\theta} \sum_{i=1}^{N} \log \frac{P(X = x_i | \theta ) P(\theta)}{P(X = x_i)} \\
&= \argmax_{\theta} \sum_{i=1}^{N} \Big( \log P(X = x_i | \theta ) + \log P(\theta) - \log P(X = x_i) \Big) \\
&= \argmax_{\theta} \sum_{i=1}^{N} \Big( \log P(X = x_i | \theta ) + \log P(\theta) \Big)\\
&= \argmax_{\theta} \Bigg( \bigg( \sum_{i=1}^{N} \log P(X = x_i | \theta ) \bigg) + N \log P(\theta) \Bigg) \\
\end{align}
$$
As discussed previously, in many models, especially conventional machine learning and deep learning models, we usually don't know the distribution of $P(\theta)$, so we cannot perform maximum a posteriori estimation exactly. However, we can still perform maximum likelihood estimation by assuming $P(\theta)$ is a uniform distribution. Then
$$
\begin{align}
\theta^{\ast} &= \argmax_{\theta} \sum_{i=1}^{N} \log P(X = x_i | \theta ) \\
&= \argmax_{\theta} \prod_{i=1}^{N} P(X = x_i | \theta ) \\
\end{align}
$$
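Although the last line rewrites the sum of logs as a product, in practice the log-domain form is strongly preferred numerically: a product of many probabilities underflows floating point, while the sum of log-probabilities stays well-behaved. A minimal sketch with hypothetical Gaussian data:

```python
import math
import random

random.seed(0)

# Hypothetical data: 2000 samples from a unit-variance Gaussian, and a
# candidate parameter theta (the mean) whose likelihood we evaluate.
data = [random.gauss(0.0, 1.0) for _ in range(2000)]
theta = 0.0

def likelihood(x, theta):
    # Gaussian density with unit variance, P(X = x | theta).
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

# The raw product of 2000 densities underflows to 0.0 in float64 ...
product = 1.0
for x in data:
    product *= likelihood(x, theta)

# ... while the equivalent sum of log-densities is an ordinary number.
log_sum = sum(math.log(likelihood(x, theta)) for x in data)

print(product, log_sum)
```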
This is why we often see maximum likelihood estimation, rather than maximum a posteriori estimation, in conventional non-probabilistic machine learning and deep learning models.
https://leimao.github.io/blog/Maximum-Likelihood-Estimation-VS-Maximum-A-Posteriori-Estimation/