Maximum Likelihood Estimation VS Maximum A Posteriori Estimation

Introduction

In non-probabilistic machine learning, maximum likelihood estimation (MLE) is one of the most common methods for optimizing a model. In probabilistic machine learning, we often see maximum a posteriori estimation (MAP) rather than maximum likelihood estimation for optimizing a model.

In this blog post, I would like to discuss the connections between the MLE and MAP methods.

Bayes’ Theorem

Bayes’ theorem is stated mathematically as the following equation.

$$
\underbrace{P(A|B)}_{\text{posterior}} = \frac{\underbrace{P(B|A)}_{\text{likelihood}} \, \underbrace{P(A)}_{\text{prior}}}{\underbrace{P(B)}_{\text{marginal}}}
$$

where P(A|B) is the posterior probability, P(B|A) is the likelihood probability, P(A) and P(B) are prior probabilities, and P(B) is also often referred to as the marginal probability, as it is a marginalization of the joint probability P(A,B) over the variable A.

If A is a continuous variable,

$$
P(B) = \int_{A} P(A, B) \, dA = \int_{A} P(B|A) P(A) \, dA
$$

If A is a discrete variable,

$$
P(B) = \sum_{A} P(A, B) = \sum_{A} P(B|A) P(A)
$$

Notice that P(B) is a constant with respect to the variable A, so we could safely say P(A|B) is proportional to P(B|A)P(A) with respect to the variable A. Mathematically,

$$
P(A|B) \propto P(B|A) P(A)
$$
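
As a quick numerical illustration (the numbers below are hypothetical and not from this post), the following Python sketch computes the marginal P(B) and the posterior P(A|B) for a discrete variable A with three possible values:

```python
import numpy as np

# Hypothetical discrete example: A takes three possible values.
prior = np.array([0.5, 0.3, 0.2])       # P(A)
likelihood = np.array([0.9, 0.4, 0.1])  # P(B = b | A) for each value of A

# Marginal probability: P(B = b) = sum_A P(B = b | A) P(A)
marginal = np.sum(likelihood * prior)

# Posterior via Bayes' theorem: P(A | B = b) = P(B = b | A) P(A) / P(B = b)
posterior = likelihood * prior / marginal

print(marginal)   # 0.59
print(posterior)  # [0.7627..., 0.2033..., 0.0339...], sums to 1
```

Notice that the posterior is just the elementwise product of the likelihood and the prior rescaled by a constant, which is exactly the proportionality statement above.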

Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation, as its name suggests, maximizes the likelihood probability P(B|A) in Bayes' theorem with respect to the variable A, given that the variable B is observed.

Mathematically, maximum likelihood estimation could be expressed as

$$
a_{\text{MLE}} = \underset{A}{\arg\max} \, P(B=b|A)
$$

It is equivalent to optimizing in the log domain, because the logarithm is monotonically increasing, P(B=b|A) ≥ 0, and we assume P(B=b|A) ≠ 0.

$$
a_{\text{MLE}} = \underset{A}{\arg\max} \, \log P(B=b|A)
$$
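
To make this concrete, here is a minimal sketch (my own illustration, not code from the post) that estimates the head probability of a coin by maximizing the log-likelihood over a grid of candidate parameter values; the parameter plays the role of A and the observed flips play the role of B = b:

```python
import numpy as np

# Hypothetical observations: 10 coin flips with 7 heads (1) and 3 tails (0).
flips = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
heads, tails = flips.sum(), len(flips) - flips.sum()

# Candidate parameter values, excluding 0 and 1 so the logarithm is defined.
candidates = np.linspace(0.01, 0.99, 99)

# Log-likelihood log P(B = b | A) of the observed flips for each candidate.
log_likelihood = heads * np.log(candidates) + tails * np.log(1.0 - candidates)

a_mle = candidates[np.argmax(log_likelihood)]
print(a_mle)  # approximately 0.7, the sample frequency of heads
```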

Maximum A Posteriori Estimation (MAP)

Maximum a posteriori estimation, as its name suggests, maximizes the posterior probability P(A|B) in Bayes' theorem with respect to the variable A, given that the variable B is observed.

Mathematically, maximum a posteriori estimation could be expressed as

$$
a_{\text{MAP}} = \underset{A}{\arg\max} \, P(A|B=b)
$$

It is equivalent to optimizing in the log domain, because the logarithm is monotonically increasing, P(A|B=b) ≥ 0, and we assume P(A|B=b) ≠ 0.

$$
a_{\text{MAP}} = \underset{A}{\arg\max} \, \log P(A|B=b)
$$
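
Continuing the hypothetical coin-flip sketch from the MLE section, a MAP estimate only adds the log prior log P(A) to the objective. The Beta(2, 5) prior below is an arbitrary assumption made for illustration; it encodes a belief that the coin leans towards tails:

```python
import numpy as np
from scipy.stats import beta

# Hypothetical observations: 10 coin flips with 7 heads (1) and 3 tails (0).
flips = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
heads, tails = flips.sum(), len(flips) - flips.sum()

candidates = np.linspace(0.01, 0.99, 99)

# log P(B = b | A): the same log-likelihood as in the MLE sketch.
log_likelihood = heads * np.log(candidates) + tails * np.log(1.0 - candidates)

# log P(A): a Beta(2, 5) prior that favors smaller head probabilities.
log_prior = beta.logpdf(candidates, a=2, b=5)

a_map = candidates[np.argmax(log_likelihood + log_prior)]
print(a_map)  # approximately 0.53, pulled below the MLE of 0.7 by the prior
```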

MLE and MAP Relationship

By applying Bayes’ theorem, we have

$$
P(A|B=b) = \frac{P(B=b|A) P(A)}{P(B=b)} \propto P(B=b|A) P(A)
$$

Therefore, maximum a posteriori estimation could be expanded as

$$
\begin{aligned}
a_{\text{MAP}} &= \underset{A}{\arg\max} \, P(A|B=b) \\
&= \underset{A}{\arg\max} \, \log P(A|B=b) \\
&= \underset{A}{\arg\max} \, \log \frac{P(B=b|A) P(A)}{P(B=b)} \\
&= \underset{A}{\arg\max} \, \big( \log P(B=b|A) + \log P(A) - \log P(B=b) \big) \\
&= \underset{A}{\arg\max} \, \big( \log P(B=b|A) + \log P(A) \big)
\end{aligned}
$$

If the prior probability P(A) is a uniform distribution, i.e., P(A) is a constant, we further have

$$
\begin{aligned}
a_{\text{MAP}} &= \underset{A}{\arg\max} \, P(A|B=b) \\
&= \underset{A}{\arg\max} \, \big( \log P(B=b|A) + \log P(A) \big) \\
&= \underset{A}{\arg\max} \, \log P(B=b|A) \\
&= a_{\text{MLE}}
\end{aligned}
$$

Therefore, we could conclude that maximum likelihood estimation is a special case of maximum a posteriori estimation in which the prior probability is a uniform distribution.
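
The coin-flip sketch shows this numerically: with a uniform prior, i.e., Beta(1, 1), the log prior is zero everywhere, so adding it does not move the argmax and the MAP estimate coincides with the MLE. This is again an illustration with hypothetical numbers:

```python
import numpy as np
from scipy.stats import beta

flips = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
heads, tails = flips.sum(), len(flips) - flips.sum()
candidates = np.linspace(0.01, 0.99, 99)

log_likelihood = heads * np.log(candidates) + tails * np.log(1.0 - candidates)

# A uniform prior on [0, 1] is Beta(1, 1); its log density is 0 everywhere,
# so it is a constant that cannot change where the maximum is attained.
log_uniform_prior = beta.logpdf(candidates, a=1, b=1)

a_map = candidates[np.argmax(log_likelihood + log_uniform_prior)]
a_mle = candidates[np.argmax(log_likelihood)]
print(a_map == a_mle)  # True
```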

Which One to Use

Which one to use in optimization, maximum likelihood estimation or maximum a posteriori estimation, really depends on the use case. If we know the probability distributions of both the likelihood P(B|A) and the prior P(A), we can use maximum a posteriori estimation.

However, in many practical optimization problems, we actually don’t know the distribution for the prior probability P(A). Therefore, applying maximum a posteriori estimation is not possible, and we can only apply maximum likelihood estimation.

For example, suppose we are going to find the optimal parameters for a model. The model has parameter variables θ and data variables X. Given some training data {x1, x2, …, xN}, we want to find the most likely parameter θ of the model given the training data. Mathematically, this is essentially maximum a posteriori estimation, and it could be expressed as

$$
\begin{aligned}
\theta &= \underset{\theta}{\arg\max} \, \prod_{i=1}^{N} P(\theta | X = x_i) \\
&= \underset{\theta}{\arg\max} \, \log \prod_{i=1}^{N} P(\theta | X = x_i) \\
&= \underset{\theta}{\arg\max} \, \sum_{i=1}^{N} \log P(\theta | X = x_i) \\
&= \underset{\theta}{\arg\max} \, \sum_{i=1}^{N} \log \big( P(X = x_i | \theta) P(\theta) \big) \\
&= \underset{\theta}{\arg\max} \, \sum_{i=1}^{N} \big( \log P(X = x_i | \theta) + \log P(\theta) \big) \\
&= \underset{\theta}{\arg\max} \, \left( \left( \sum_{i=1}^{N} \log P(X = x_i | \theta) \right) + N \log P(\theta) \right)
\end{aligned}
$$

As discussed previously, because in many models, especially conventional machine learning and deep learning models, we usually don't know the distribution of P(θ), we cannot do maximum a posteriori estimation exactly. However, we can still do maximum likelihood estimation by assuming P(θ) is a uniform distribution. Then

$$
\theta = \underset{\theta}{\arg\max} \, \sum_{i=1}^{N} \log P(X = x_i | \theta) = \underset{\theta}{\arg\max} \, \prod_{i=1}^{N} P(X = x_i | \theta)
$$
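
Before concluding, here is a concrete sketch of this objective (again my own illustration with hypothetical data, not code from the post). It maximizes the summed log-likelihood of training data under a Gaussian model with unknown mean and unit variance; the grid-search result matches the closed-form maximum likelihood answer, the sample mean:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical training data {x_1, ..., x_N}.
x = np.array([1.2, 0.7, 1.9, 1.4, 0.9, 1.6])

# Candidate parameters theta: the mean of a Gaussian with unit variance.
thetas = np.linspace(-2.0, 4.0, 601)

# Objective: sum_i log P(X = x_i | theta) for each candidate theta.
log_likelihoods = np.array(
    [np.sum(norm.logpdf(x, loc=t, scale=1.0)) for t in thetas]
)

theta_mle = thetas[np.argmax(log_likelihoods)]
print(theta_mle, x.mean())  # both approximately 1.28
```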

This is why we often see maximum likelihood estimation, rather than maximum a posteriori estimation, in conventional non-probabilistic machine learning and deep learning models.

