Lei Mao bio photo

Lei Mao

Machine Learning, Artificial Intelligence, Computer Science.

Twitter Facebook LinkedIn GitHub   G. Scholar E-Mail RSS


In machine learning, people often talked about cross entropy, KL divergence, and maximum likelihood together. These three things sort of have “equivalences” in solving many problems. In this blog post, I am going to derive their relationships for my own future references.


Shannon Entropy

Given a distributions $p$ over a given variable $X$, it is defined as

\[H(p) = \mathbb{E}_p[-\log p]\]

Concretely, for continuous case,

\[\begin{aligned} H(p) &= - \int\limits_{X} p(x) \log p(x) dx \end{aligned}\]

and for discrete case,

\[\begin{aligned} H(p) &= - \sum\limits_{x \in X} p(x) \log p(x) \end{aligned}\]

Cross Entropy

Given two distributions $p$ and $q$ over a given variable $X$, the cross entropy is defined as

\[H(p,q) = \mathbb{E}_p[-\log q]\]

Concretely, for continuous case,

\[\begin{aligned} H(p,q) &= - \int\limits_{X} p(x) \log q(x) dx \end{aligned}\]

and for discrete case,

\[\begin{aligned} H(p,q) &= - \sum\limits_{x \in X} p(x) \log q(x) \end{aligned}\]

Kullback–Leibler Divergence

Kullback–Leibler Divergence (KL Divergence) is also called relative entropy. Given two distributions $p$ and $q$ over a given variable $X$, it is defined as

\[D_{\text{KL}} (p || q) = \mathbb{E}_p[-\log \frac{q}{p}]\]

Concretely, for continuous case,

\[D_{\text{KL}} (p || q) = - \int\limits_{X} p(x) \log \frac{q(x)}{p(x)} dx\]

and for discrete case,

\[D_{\text{KL}} (p || q) = - \sum\limits_{x \in X} p(x) \log \frac{q(x)}{p(x)}\]

Maximum Likelihood Estimation

For unsupervised learning, given a dataset $\{x_1, x_2, \cdots, x_n\}$, we want to train a model with parameters $\theta$ so that the product of the likelihood for all the samples in the dataset is maximized.

We use $q_{\theta}(x_i)$ to denote the predicted likelihood $q(x_i|\theta)$ from the model for sample $x_i$ from the dataset. Concretely, we have the follow objective function

\[\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \argmax_{\theta} \prod_{i=1}^{n} q_{\theta}(x_i)\]

It is equivalent to optimize

\[\argmax_{\theta} \sum_{i=1}^{n} \log q_{\theta}(x_i)\]


\[\argmin_{\theta} - \sum_{i=1}^{n} \log q_{\theta}(x_i)\]

Similarly, for supervised learning, given a dataset $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, we want to optimize

\[\argmax_{\theta} \prod_{i=1}^{n} q_{\theta}(y_i | x_i)\]

It is equivalent to optimize

\[\argmax_{\theta} \sum_{i=1}^{n} \log q_{\theta}(y_i | x_i)\]


\[\argmin_{\theta} - \sum_{i=1}^{n} \log q_{\theta}(y_i | x_i)\]


Maximum Likelihood Estimation and Cross Entropy

In classification problems, we set $p$ as the distribution for the ground truth label for feature $x$, and $q_\theta$ as the distribution for the predicted label for feature $x$ from the model.

The ground truth distribution $p(y|x_i)$ would be a one-hot encoded vector where

\[p(y|x_i) = \begin{cases} 1 & \text{if } y = y_i \\ 0 & \text{otherwise} \end{cases}\]

For sample $(x_i, y_i)$ from the dataset, the cross entropy of the ground truth distribution and the predicted label distribution is

\[\begin{aligned} H_{i}(p,q_{\theta}) &= - \sum\limits_{y \in Y} p(y|x_i) \log q_{\theta}(y|x_i) \\ &= - \log q_{\theta}(y_i|x_i) \\ \end{aligned}\]

We use the sum of the cross entropy for all our samples from the dataset and use it as the loss function to train our model on the dataset

\[\begin{aligned} L &= \sum_{i=1}^{n} H_i(p,q_{\theta}) \\ &= \sum_{i=1}^{n} - \log q_{\theta}(y_i|x_i) \\ &= - \sum_{i=1}^{n} \log q_{\theta}(y_i|x_i) \\ \end{aligned}\]

This loss function is sometimes called a log loss function. Thus the optimization goal is

\[\begin{aligned} \argmin_{\theta} L &= \argmin_{\theta} \sum_{i=1}^{n} H_i(p,q_{\theta}) \\ &= \argmin_{\theta} - \sum_{i=1}^{n} \log q_{\theta}(y_i | x_i) \\ \end{aligned}\]

This is exactly the same as the optimization goal of maximum likelihood estimation. Therefore, we say optimization using log loss in the classification problems is equivalent to do maximum likelihood estimation.

Cross Entropy and KL Divergence

It is not hard to derive the relationship between cross entropy and KL divergence.

\[\begin{aligned} D_{\text{KL}} (p || q) &= \mathbb{E}_p[-\log \frac{q}{p}] \\ &= \mathbb{E}_p[-\log q + \log p] \\ &= \mathbb{E}_p[-\log q] + \mathbb{E}_p[\log p] \\ &= \mathbb{E}_p[-\log q] - \mathbb{E}_p[-\log p] \\ &= H(p,q) - H(p) \\ \end{aligned}\]

Optimization Using Cross Entropy or KL Divergence

From the relationship between cross entropy and KL divergence, we know that

\[H(p,q) = D_{\text{KL}} (p || q) + H(p)\]

We could then rewrite our log loss

\[\begin{aligned} L &= \sum_{i=1}^{n} H_i(p,q_{\theta}) \\ &= \sum_{i=1}^{n} \big[D_{\text{KL},i} (p || q_{\theta}) + H_i(p)\big] \\ \end{aligned}\]

The optimization goal then becomes

\[\begin{aligned} \argmin_{\theta} L &= \argmin_{\theta} \sum_{i=1}^{n} H_i(p,q_{\theta}) \\ &= \argmin_{\theta} \sum_{i=1}^{n} \big[D_{\text{KL},i} (p || q_{\theta}) + H_i(p)\big] \\ \end{aligned}\]

Because $H_i(p)$ is independent of $\theta$

\[\begin{aligned} \argmin_{\theta} L &= \argmin_{\theta} \sum_{i=1}^{n} H_i(p,q_{\theta}) \\ &= \argmin_{\theta} \sum_{i=1}^{n} D_{\text{KL},i} (p || q_{\theta}) \\ \end{aligned}\]

Therefore, in classification problems, optimization using the sum of cross entropy over all the training samples is equivalent to optimization using the sum of KL divergence over all the training samples.

We use cross entropy in practice because it is relatively easy to compute.