# Maximum Likelihood Estimation VS Maximum A Posteriori Estimation

## Introduction

In non-probabilistic machine learning, maximum likelihood estimation (MLE) is one of the most common methods for optimizing a model. In probabilistic machine learning, we often see maximum a posteriori estimation (MAP) rather than maximum likelihood estimation for optimizing a model.

In this blog post, I would like to discuss the connections between the MLE and MAP methods.

## Bayes’ Theorem

Bayes’ theorem is stated mathematically as the following equation.

$$\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \underbrace{P(A|B)}_\text{posterior} = \frac{\underbrace{P(B|A)}_\text{likelihood} \underbrace{P(A)}_\text{prior}}{\underbrace{P(B)}_\text{marginal}}$$

where $P(A|B)$ is the posterior probability, $P(B|A)$ is the likelihood, $P(A)$ is the prior probability, and $P(B)$ is often referred to as the marginal probability, since it is a marginalization of the joint probability $P(A, B)$ over the variable $A$.

If $A$ is a continuous variable,

$$P(B) = \int_{A}^{} P(A, B) d A = \int_{A}^{} P(B | A) P(A) d A$$

If $A$ is a discrete variable,

$$P(B) = \sum_{A}^{} P(A, B) = \sum_{A}^{} P(B | A) P(A)$$

Notice that $P(B)$ is a constant with respect to the variable $A$, so we could safely say $P(A|B)$ is proportional to $P(B|A) P(A)$ with respect to the variable $A$. Mathematically,

$$P(A|B) \propto P(B|A) P(A)$$
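
To make the proportionality concrete, here is a minimal sketch for a discrete variable $A$. The three values of $A$ and all the probabilities are made-up numbers chosen only for illustration. It computes the marginal $P(B = b)$ by summation and checks that normalizing does not change which value of $A$ has the highest posterior.

```python
# Hypothetical discrete example: A takes three values, and we fix one
# observation B = b. All numbers below are invented for illustration.
prior = {"a1": 0.5, "a2": 0.3, "a3": 0.2}       # P(A = a)
likelihood = {"a1": 0.1, "a2": 0.4, "a3": 0.7}  # P(B = b | A = a)

# Unnormalized posterior: P(B = b | A = a) * P(A = a)
unnormalized = {a: likelihood[a] * prior[a] for a in prior}

# Marginal P(B = b): sum of the joint probability over all values of A
marginal = sum(unnormalized.values())

# Posterior P(A = a | B = b) by Bayes' theorem
posterior = {a: unnormalized[a] / marginal for a in prior}

# Dividing by the constant P(B = b) does not change the ranking over A
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)

print(posterior)
print(best_unnormalized, best_posterior)
```

Because `marginal` is the same number for every value of $A$, the posterior and the unnormalized product always rank the values of $A$ identically.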

## Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation, as its name suggests, maximizes the likelihood $P(B|A)$ in Bayes’ theorem with respect to the variable $A$, given that the variable $B$ is observed.

Mathematically, maximum likelihood estimation could be expressed as

$$a^{\ast}_{\text{MLE}} = \argmax_{A} P(B = b | A)$$

Because the logarithm is strictly increasing, and assuming $P(B = b | A) > 0$, this is equivalent to optimizing in the log domain.

$$a^{\ast}_{\text{MLE}} = \argmax_{A} \log P(B = b | A)$$
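
As a small sketch of this equivalence, consider a hypothetical coin whose unknown head probability `theta` plays the role of $A$, and an observation of 7 heads in 10 tosses that plays the role of $B = b$. The coin, the counts, and the grid search are all assumptions I made for illustration; a grid search is used only because it makes the $\argmax$ explicit.

```python
import math

# Hypothetical observation b: 7 heads in 10 tosses.
heads, tosses = 7, 10


def likelihood(theta):
    """P(B = b | A = theta) for one particular sequence of tosses."""
    return theta ** heads * (1.0 - theta) ** (tosses - heads)


def log_likelihood(theta):
    """log P(B = b | A = theta); requires 0 < theta < 1."""
    return heads * math.log(theta) + (tosses - heads) * math.log(1.0 - theta)


# Candidate values of theta, excluding 0 and 1 where the log is undefined.
grid = [i / 1000 for i in range(1, 1000)]

theta_mle_raw = max(grid, key=likelihood)
theta_mle_log = max(grid, key=log_likelihood)

print(theta_mle_raw, theta_mle_log)
```

Both searches select the same grid point, which is the observed fraction of heads, $7 / 10$.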

## Maximum A Posteriori Estimation (MAP)

Maximum a posteriori estimation, as its name suggests, maximizes the posterior probability $P(A | B)$ in Bayes’ theorem with respect to the variable $A$, given that the variable $B$ is observed.

Mathematically, maximum a posteriori estimation could be expressed as

$$a^{\ast}_{\text{MAP}} = \argmax_{A} P(A | B = b)$$

Because the logarithm is strictly increasing, and assuming $P(A | B = b) > 0$, this is equivalent to optimizing in the log domain.

$$a^{\ast}_{\text{MAP}} = \argmax_{A} \log P(A | B = b)$$
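
The following sketch compares the two estimates on a hypothetical discrete variable $A$ with three values. The prior and likelihood tables are invented numbers, picked so that the prior is strong enough to move the MAP estimate away from the MLE estimate.

```python
import math

# Hypothetical tables for one fixed observation B = b.
prior = {"a1": 0.2, "a2": 0.6, "a3": 0.2}       # P(A = a)
likelihood = {"a1": 0.1, "a2": 0.4, "a3": 0.7}  # P(B = b | A = a)

# Posterior P(A = a | B = b) via Bayes' theorem
marginal = sum(likelihood[a] * prior[a] for a in prior)
posterior = {a: likelihood[a] * prior[a] / marginal for a in prior}

# MLE maximizes the likelihood only
a_mle = max(likelihood, key=likelihood.get)

# MAP maximizes the posterior, in the raw domain and in the log domain
a_map_raw = max(posterior, key=posterior.get)
a_map = max(posterior, key=lambda a: math.log(posterior[a]))

print(a_mle, a_map_raw, a_map)
```

Here the likelihood alone favors `a3`, while the posterior favors `a2` because the prior puts most of its mass there.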

## MLE and MAP Relationship

By applying Bayes’ theorem, we have

\begin{align} P(A | B = b) &= \frac{P(B = b | A)P(A)}{P(B = b)} \\ &\propto P(B = b|A) P(A) \end{align}

Therefore, maximum a posteriori estimation could be expanded as

\begin{align} a^{\ast}_{\text{MAP}} &= \argmax_{A} P(A | B = b) \\ &= \argmax_{A} \log P(A | B = b) \\ &= \argmax_{A} \log \frac{P(B = b | A)P(A)}{P(B = b)} \\ &= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) - \log P(B = b) \Big) \\ &= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) \Big) \\ \end{align}

If the prior probability $P(A)$ is a uniform distribution, i.e., $P(A)$ is a constant, we further have

\begin{align} a^{\ast}_{\text{MAP}} &= \argmax_{A} P(A | B = b) \\ &= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) \Big) \\ &= \argmax_{A} \log P(B = b | A) \\ &= a^{\ast}_{\text{MLE}} \end{align}

Therefore, we can conclude that maximum likelihood estimation is a special case of maximum a posteriori estimation in which the prior probability is a uniform distribution.
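
This special case can be checked numerically. The sketch below is only an illustration under my own assumptions: a hypothetical coin with head probability `theta`, an observation of 7 heads in 10 tosses, and a Beta prior on `theta`. A Beta(1, 1) prior is the uniform distribution on $(0, 1)$, while Beta(5, 5) is a non-uniform prior that favors values near 0.5.

```python
import math

# Hypothetical observation b: 7 heads in 10 tosses.
heads, tosses = 7, 10


def log_likelihood(theta):
    """log P(B = b | A = theta) for one particular sequence of tosses."""
    return heads * math.log(theta) + (tosses - heads) * math.log(1.0 - theta)


def log_beta_prior(theta, alpha, beta):
    """log of a Beta(alpha, beta) prior density, up to an additive constant."""
    return (alpha - 1.0) * math.log(theta) + (beta - 1.0) * math.log(1.0 - theta)


# Candidate values of theta, excluding 0 and 1 where the log is undefined.
grid = [i / 1000 for i in range(1, 1000)]

theta_mle = max(grid, key=log_likelihood)

# Uniform prior: log P(A) is constant, so it cannot change the argmax.
theta_map_uniform = max(
    grid, key=lambda t: log_likelihood(t) + log_beta_prior(t, 1.0, 1.0)
)

# Non-uniform prior: log P(A) varies with theta and shifts the argmax.
theta_map_informative = max(
    grid, key=lambda t: log_likelihood(t) + log_beta_prior(t, 5.0, 5.0)
)

print(theta_mle, theta_map_uniform, theta_map_informative)
```

With the uniform prior, the MAP estimate coincides with the MLE estimate. With the Beta(5, 5) prior, the MAP estimate is pulled from $7/10$ toward $0.5$, landing near the analytical posterior mode $11/18$.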

## Which One to Use

Whether to use maximum likelihood estimation or maximum a posteriori estimation depends on the use case. If we know the probability distributions for both the likelihood $P(B | A)$ and the prior $P(A)$, we can use maximum a posteriori estimation.

However, in many practical optimization problems, we do not know the prior distribution $P(A)$. In that case, maximum a posteriori estimation cannot be applied directly, and we fall back to maximum likelihood estimation.

For example, suppose we want to find the optimal parameters for a model. The model has parameter variables $\theta$ and data variables $X$. Given some training data $\{x_1, x_2, \cdots, x_N \}$, assumed to be independent and identically distributed given $\theta$, we want to find the most likely parameter $\theta^{\ast}$ of the model given the training data. Mathematically, this is maximum a posteriori estimation, and it is expressed as

\begin{align} \theta^{\ast} &= \argmax_{\theta} P(\theta | X_1 = x_1, \cdots, X_N = x_N) \\ &= \argmax_{\theta} \log P(\theta | X_1 = x_1, \cdots, X_N = x_N) \\ &= \argmax_{\theta} \log \Big( P(X_1 = x_1, \cdots, X_N = x_N | \theta ) P(\theta) \Big) \\ &= \argmax_{\theta} \log \bigg( P(\theta) \prod_{i=1}^{N} P(X = x_i | \theta ) \bigg) \\ &= \argmax_{\theta} \Bigg( \bigg( \sum_{i=1}^{N} \log P(X = x_i | \theta ) \bigg) + \log P(\theta) \Bigg) \end{align}

As discussed previously, in many models, especially conventional machine learning and deep learning models, we usually don’t know the distribution $P(\theta)$, so we cannot do maximum a posteriori estimation exactly. However, we can still do maximum likelihood estimation by assuming that $P(\theta)$ is a uniform distribution. Then

\begin{align} \theta^{\ast} &= \argmax_{\theta} \sum_{i=1}^{N} \log P(X = x_i | \theta ) \\ &= \argmax_{\theta} \prod_{i=1}^{N} P(X = x_i | \theta ) \\ \end{align}
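
The two forms are mathematically equivalent, but the sum of logarithms is the one that behaves well in floating point. The sketch below illustrates this under assumptions of my own: the model is a unit-variance Gaussian whose mean is the only parameter $\theta$, the training data are synthetic, and the maximization is a simple grid search rather than the gradient-based optimization used in practice.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical training data: N draws from a Gaussian with mean 2.0 and
# standard deviation 1.0 (both values are made up for illustration).
N = 1000
data = [random.gauss(2.0, 1.0) for _ in range(N)]


def log_density(x, mu):
    """log P(X = x | theta = mu) for a unit-variance Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mu) ** 2


def log_likelihood(mu):
    """Sum of the per-sample log-likelihoods over the training data."""
    return sum(log_density(x, mu) for x in data)


# Grid search over candidate means 0.00, 0.01, ..., 4.00.
grid = [i / 100 for i in range(401)]
mu_mle = max(grid, key=log_likelihood)

# For this model the exact maximizer is the sample mean.
sample_mean = sum(data) / N

# The product of N densities underflows to zero in double precision,
# so it cannot be used to compare candidate parameters.
raw_product = math.prod(math.exp(log_density(x, mu_mle)) for x in data)

print(mu_mle, sample_mean, raw_product)
```

The grid estimate agrees with the sample mean up to the grid resolution, while the raw product of the per-sample likelihoods evaluates to `0.0`.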

This is why we often see maximum likelihood estimation, rather than maximum a posteriori estimation, in conventional non-probabilistic machine learning and deep learning models.


https://leimao.github.io/blog/Maximum-Likelihood-Estimation-VS-Maximum-A-Posteriori-Estimation/

Lei Mao

07-02-2021
