### Introduction

In non-probabilistic machine learning, maximum likelihood estimation (MLE) is one of the most common methods for fitting a model. In probabilistic machine learning, maximum a posteriori estimation (MAP) is often used instead.

In this blog post, I would like to discuss the connections between the MLE and MAP methods.

### Bayes’ Theorem

Bayes’ theorem is stated mathematically as the following equation.

\[\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \underbrace{P(A|B)}_\text{posterior} = \frac{\underbrace{P(B|A)}_\text{likelihood} \underbrace{P(A)}_\text{prior}}{\underbrace{P(B)}_\text{marginal}}\]

where $P(A|B)$ is the *posterior* probability, $P(B|A)$ is the *likelihood*, $P(A)$ is the *prior* probability, and $P(B)$ is often referred to as the *marginal* probability because it is a marginalization of the *joint* probability $P(A, B)$ over the variable $A$.
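As a concrete illustration, here is a minimal sketch of Bayes’ theorem for a discrete $A$. The two coins and their probabilities are made-up numbers chosen only for this example.

```python
# Hypothetical setup: A is which coin was picked, B is "the flip came up heads".
prior = {"fair": 0.5, "biased": 0.5}        # P(A)
likelihood = {"fair": 0.5, "biased": 0.9}   # P(B = heads | A)

# Marginal P(B): sum the joint P(A, B) = P(B | A) P(A) over every value of A.
marginal = sum(likelihood[a] * prior[a] for a in prior)

# Posterior P(A | B) from Bayes' theorem.
posterior = {a: likelihood[a] * prior[a] / marginal for a in prior}

for a, p in posterior.items():
    print(f"P(A = {a} | B = heads) = {p:.4f}")
```

Observing heads shifts belief toward the biased coin, and the posterior probabilities still sum to one because of the division by the marginal.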

If $A$ is a continuous variable,

\[P(B) = \int_{A} P(A, B) \, dA = \int_{A} P(B | A) P(A) \, dA\]

If $A$ is a discrete variable,

\[P(B) = \sum_{A} P(A, B) = \sum_{A} P(B | A) P(A)\]

Notice that $P(B)$ is a constant with respect to the variable $A$, so $P(A|B)$ is proportional to $P(B|A) P(A)$ with respect to $A$. Mathematically,

\[P(A|B) \propto P(B|A) P(A)\]

### Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation, as its name suggests, maximizes the likelihood $P(B|A)$ in Bayes’ theorem with respect to the variable $A$, given an observed value of the variable $B$.

Mathematically, maximum likelihood estimation can be expressed as

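\[a^{\ast}_{\text{MLE}} = \argmax_{A} P(B = b | A)\]

Because the logarithm is strictly increasing, taking the log does not change the maximizer. Since $P(B = b | A) \geq 0$, and assuming $P(B = b | A) \neq 0$, it is therefore equivalent to optimize in the log domain.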
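The equivalence can be checked numerically. The sketch below assumes a coin-flip model, which is not part of the original discussion: $A$ is the heads-probability, $B$ is an observed sequence of flips, and the maximizer is found by a simple grid search.

```python
import math

def likelihood(p, heads, tails):
    """P(B = b | A = p): probability of one specific flip sequence."""
    return p ** heads * (1.0 - p) ** tails

def log_likelihood(p, heads, tails):
    """log P(B = b | A = p), computed directly in the log domain."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Candidate values of A on a grid; 0 and 1 are excluded so the log is defined.
grid = [i / 1000 for i in range(1, 1000)]

heads, tails = 7, 3
a_mle = max(grid, key=lambda p: likelihood(p, heads, tails))
a_mle_log = max(grid, key=lambda p: log_likelihood(p, heads, tails))
print(a_mle, a_mle_log)  # both searches pick the same grid point

# With many observations the raw likelihood underflows to 0.0,
# while the log-likelihood stays finite.
print(likelihood(0.7, 7000, 3000), log_likelihood(0.7, 7000, 3000))
```

The log form is also the numerically safer one, since a product of many small probabilities underflows in double precision while the corresponding sum of logs does not. In symbols, the log-domain estimate is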

\[a^{\ast}_{\text{MLE}} = \argmax_{A} \log P(B = b | A)\]

### Maximum A Posteriori Estimation (MAP)

Maximum a posteriori estimation, as its name suggests, maximizes the posterior probability $P(A | B)$ in Bayes’ theorem with respect to the variable $A$, given an observed value of the variable $B$.

Mathematically, maximum a posteriori estimation can be expressed as

\[a^{\ast}_{\text{MAP}} = \argmax_{A} P(A | B = b)\]

As before, the logarithm is strictly increasing, so it is equivalent to optimize in the log domain, since $P(A | B = b) \geq 0$ and assuming $P(A | B = b) \neq 0$.

\[a^{\ast}_{\text{MAP}} = \argmax_{A} \log P(A | B = b)\]

### MLE and MAP Relationship

By applying Bayes’ theorem, we have

\[\begin{align} P(A | B = b) &= \frac{P(B = b | A)P(A)}{P(B = b)} \\ &\propto P(B = b|A) P(A) \end{align}\]

Therefore, maximum a posteriori estimation can be expanded as

\[\begin{align} a^{\ast}_{\text{MAP}} &= \argmax_{A} P(A | B = b) \\ &= \argmax_{A} \log P(A | B = b) \\ &= \argmax_{A} \log \frac{P(B = b | A)P(A)}{P(B = b)} \\ &= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) - \log P(B = b) \Big) \\ &= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) \Big) \end{align}\]

The last step drops $\log P(B = b)$ because it does not depend on $A$. If the prior probability $P(A)$ is a uniform distribution, i.e., $P(A)$ is a constant, we further have

\[\begin{align} a^{\ast}_{\text{MAP}} &= \argmax_{A} P(A | B = b) \\ &= \argmax_{A} \Big ( \log P(B = b | A) + \log P(A) \Big) \\ &= \argmax_{A} \log P(B = b | A) \\ &= a^{\ast}_{\text{MLE}} \end{align}\]

Therefore, we can conclude that maximum likelihood estimation is a special case of maximum a posteriori estimation in which the prior probability is a uniform distribution.
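This special-case relationship can be sketched numerically. The example assumes a coin-flip likelihood and a Beta prior on the heads-probability. Both are illustrative choices rather than part of the derivation above, and the helper names are hypothetical.

```python
import math

def log_likelihood(p, heads, tails):
    """log P(B = b | A = p) for a specific sequence of coin flips."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

def log_beta_prior(p, alpha, beta):
    """Unnormalized log density of Beta(alpha, beta).

    The normalizing constant does not depend on p, so it cannot
    change the argmax and is dropped.
    """
    return (alpha - 1.0) * math.log(p) + (beta - 1.0) * math.log(1.0 - p)

def argmax_on_grid(objective, steps=1000):
    """Grid search over (0, 1); the endpoints are excluded so the logs are defined."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=objective)

heads, tails = 7, 3

# MLE: maximize the log-likelihood alone.
a_mle = argmax_on_grid(lambda p: log_likelihood(p, heads, tails))

# MAP with a uniform prior, Beta(1, 1): the log-prior term is constant.
a_map_uniform = argmax_on_grid(
    lambda p: log_likelihood(p, heads, tails) + log_beta_prior(p, 1.0, 1.0)
)

# MAP with an informative prior, Beta(5, 5), which favors values near 0.5.
a_map_informative = argmax_on_grid(
    lambda p: log_likelihood(p, heads, tails) + log_beta_prior(p, 5.0, 5.0)
)

print(a_mle, a_map_uniform, a_map_informative)
```

With the uniform prior the two estimates coincide, while the informative prior pulls the MAP estimate from the MLE toward 0.5, close to the closed-form value $11/18$ for this likelihood and prior.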

### Which One to Use

Whether to use maximum likelihood estimation or maximum a posteriori estimation depends on the use case. If we know the probability distributions for both the likelihood $P(B | A)$ and the prior $P(A)$, we can use maximum a posteriori estimation.

However, in many practical optimization problems, we do not know the distribution of the prior probability $P(A)$. In that case, maximum a posteriori estimation cannot be applied directly, and we fall back to maximum likelihood estimation.

For example, suppose we are going to find the optimal parameters for a model. In the model, we have parameter variables $\theta$ and data variables $X$. Given some training data $\{x_1, x_2, \cdots, x_N \}$, we want to find the most likely parameter $\theta^{\ast}$ of the model given the training data. Mathematically, it is essentially maximum a posteriori estimation and it is expressed as

\[\begin{align} \theta^{\ast} &= \argmax_{\theta} P(\theta | X_1 = x_1, \cdots, X_N = x_N) \\ &= \argmax_{\theta} \log P(\theta | X_1 = x_1, \cdots, X_N = x_N) \\ &= \argmax_{\theta} \log \bigg( P(\theta) \prod_{i=1}^{N} P(X = x_i | \theta ) \bigg) \\ &= \argmax_{\theta} \Bigg( \bigg( \sum_{i=1}^{N} \log P(X = x_i | \theta ) \bigg) + \log P(\theta) \Bigg) \end{align}\]

The third line applies Bayes’ theorem, drops the marginal probability of the data because it is constant with respect to $\theta$, and assumes the training samples are conditionally independent given $\theta$. Note that the prior $P(\theta)$ enters only once, not once per sample.

As discussed previously, in many models, especially conventional machine learning and deep learning models, we usually do not know the distribution $P(\theta)$, so we cannot do maximum a posteriori estimation exactly. However, we can still do maximum likelihood estimation by assuming $P(\theta)$ is a uniform distribution. Then

\[\begin{align} \theta^{\ast} &= \argmax_{\theta} \sum_{i=1}^{N} \log P(X = x_i | \theta ) \\ &= \argmax_{\theta} \prod_{i=1}^{N} P(X = x_i | \theta ) \end{align}\]

This is why we often see maximum likelihood estimation, rather than maximum a posteriori estimation, in conventional non-probabilistic machine learning and deep learning models.
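To make the connection to a typical training loop concrete, here is a minimal sketch that fits a single parameter by gradient descent on the negative log-likelihood. It assumes a unit-variance Gaussian model $P(X = x_i | \theta)$ with mean $\theta$. The model, the data values, and the learning rate are all assumptions made for illustration.

```python
def negative_log_likelihood_grad(theta, data):
    """Gradient of -sum_i log N(x_i | theta, 1) with respect to theta.

    For a unit-variance Gaussian, -log N(x | theta, 1) equals
    0.5 * (x - theta) ** 2 plus a constant, so the gradient is theta - x.
    """
    return sum(theta - x for x in data)

# Made-up training data, for illustration only.
data = [2.1, 1.9, 2.4, 1.6, 2.0]

theta = 0.0            # initial parameter value
learning_rate = 0.05   # hypothetical step size that is stable for N = 5

# Minimizing the negative log-likelihood is the same as maximizing
# the sum of log P(X = x_i | theta) over the training data.
for _ in range(200):
    theta -= learning_rate * negative_log_likelihood_grad(theta, data)

sample_mean = sum(data) / len(data)
print(theta, sample_mean)
```

Under this assumed model the maximum likelihood estimate is the sample mean, and the loss being minimized is a squared error up to a constant. No prior over $\theta$ appears anywhere in the loop.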