### Introduction

In machine learning, people often talk about cross entropy, KL divergence, and maximum likelihood together, because they turn out to be equivalent for many optimization problems. In this blog post, I am going to derive their relationships for my own future reference.

### Definition

#### Shannon Entropy

Given a distribution $p$ over a variable $X$, the Shannon entropy is defined as

$$H(p) = \mathbb{E}_{x \sim p}\big[-\log p(x)\big]$$

Concretely, for the continuous case,

$$H(p) = -\int p(x) \log p(x) \, dx$$

and for the discrete case,

$$H(p) = -\sum_{x} p(x) \log p(x)$$
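As a quick sanity check on the discrete formula, here is a minimal Python sketch (the `entropy` helper name is my own; it uses the natural log, so the result is in nats):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

# A uniform distribution over 4 outcomes has entropy log(4).
print(entropy([0.25, 0.25, 0.25, 0.25]))  # ≈ 1.3863 (= log 4)
```

The `if px > 0` guard follows the usual convention $0 \log 0 = 0$.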

#### Cross Entropy

Given two distributions $p$ and $q$ over a variable $X$, the cross entropy is defined as

$$H(p, q) = \mathbb{E}_{x \sim p}\big[-\log q(x)\big]$$

Concretely, for the continuous case,

$$H(p, q) = -\int p(x) \log q(x) \, dx$$

and for the discrete case,

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
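The discrete cross entropy can be sketched the same way (again a hypothetical helper; probabilities are plain Python lists and the log is natural):

```python
import math

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(cross_entropy(p, q))  # ≈ 1.2040
print(cross_entropy(p, p))  # ≈ 0.6931 (= H(p))
```

Note that $H(p, p) = H(p)$, and by Gibbs' inequality $H(p, q) \geq H(p)$ in general.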

#### Kullback–Leibler Divergence

Kullback–Leibler Divergence (KL Divergence) is also called relative entropy. Given two distributions $p$ and $q$ over a variable $X$, it is defined as

$$D_{\text{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

Concretely, for the continuous case,

$$D_{\text{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

and for the discrete case,

$$D_{\text{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
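A matching sketch for the discrete KL divergence (same conventions and hypothetical helper style as above):

```python
import math

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ≈ 0.5108 (always >= 0)
print(kl_divergence(p, p))  # 0.0 (a distribution against itself)
```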

#### Maximum Likelihood Estimation

For unsupervised learning, given a dataset $\{x_1, x_2, \cdots, x_n\}$, we want to train a model with parameters $\theta$ so that the product of the likelihoods of all the samples in the dataset is maximized.

We use $q_{\theta}(x_i)$ to denote the predicted likelihood $q(x_i|\theta)$ from the model for sample $x_i$ from the dataset. Concretely, we have the following objective function

$$\max_{\theta} \prod_{i=1}^{n} q_{\theta}(x_i)$$

It is equivalent to optimize

$$\max_{\theta} \sum_{i=1}^{n} \log q_{\theta}(x_i)$$

or

$$\min_{\theta} -\sum_{i=1}^{n} \log q_{\theta}(x_i)$$

Similarly, for supervised learning, given a dataset $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, we want to optimize

$$\max_{\theta} \prod_{i=1}^{n} q_{\theta}(y_i \,|\, x_i)$$

It is equivalent to optimize

$$\max_{\theta} \sum_{i=1}^{n} \log q_{\theta}(y_i \,|\, x_i)$$

or

$$\min_{\theta} -\sum_{i=1}^{n} \log q_{\theta}(y_i \,|\, x_i)$$
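To make the equivalence concrete, here is a small sketch that fits a Bernoulli parameter by minimizing the negative log-likelihood over a grid; the data and helper names are hypothetical. The grid-search answer matches the closed-form MLE (the sample mean):

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical coin-flip samples

def nll(theta, data):
    """Negative log-likelihood -sum_i log q_theta(x_i) for a Bernoulli model."""
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in data)

# Grid search over theta in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = min(grid, key=lambda t: nll(t, data))

print(theta_hat)              # 0.75
print(sum(data) / len(data))  # 0.75 (closed-form MLE: sample mean)
```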

### Relationships

#### Maximum Likelihood Estimation and Cross Entropy

In classification problems, we set $p$ as the distribution for the ground truth label for feature $x$, and $q_\theta$ as the distribution for the predicted label for feature $x$ from the model.

The ground truth distribution $p(y|x_i)$ would be a one-hot encoded vector where

$$p(y \,|\, x_i) = \begin{cases} 1 & \text{if } y = y_i \\ 0 & \text{otherwise} \end{cases}$$

For sample $(x_i, y_i)$ from the dataset, the cross entropy of the ground truth distribution and the predicted label distribution is

$$H_i(p, q_{\theta}) = -\sum_{y} p(y|x_i) \log q_{\theta}(y|x_i) = -\log q_{\theta}(y_i|x_i)$$

We sum the cross entropy over all the samples from the dataset and use it as the loss function to train our model

$$L(\theta) = \sum_{i=1}^{n} H_i(p, q_{\theta}) = -\sum_{i=1}^{n} \log q_{\theta}(y_i|x_i)$$

This loss function is sometimes called the log loss. Thus the optimization goal is

$$\min_{\theta} -\sum_{i=1}^{n} \log q_{\theta}(y_i|x_i)$$

This is exactly the same as the optimization goal of maximum likelihood estimation for supervised learning. Therefore, we say that optimizing the log loss in classification problems is equivalent to doing maximum likelihood estimation.
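The collapse of the cross entropy to $-\log q_{\theta}(y_i|x_i)$ for a one-hot label is easy to verify numerically (the predicted probabilities below are made up for illustration):

```python
import math

def log_loss(p_onehot, q):
    """Cross entropy of a one-hot label distribution and predicted probabilities."""
    return -sum(p * math.log(qy) for p, qy in zip(p_onehot, q) if p > 0)

q = [0.7, 0.2, 0.1]  # hypothetical predicted class probabilities
label = 0            # ground-truth class index
p = [1.0, 0.0, 0.0]  # one-hot encoding of the label

# The cross entropy collapses to the negative log-likelihood of the true class.
print(log_loss(p, q))       # ≈ 0.3567
print(-math.log(q[label]))  # ≈ 0.3567 (same value)
```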

#### Cross Entropy and KL Divergence

It is not hard to derive the relationship between cross entropy and KL divergence:

$$D_{\text{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = -\sum_{x} p(x) \log q(x) + \sum_{x} p(x) \log p(x) = H(p, q) - H(p)$$
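As a numerical sanity check of this identity, the three helpers from the definitions above (natural log, discrete distributions as lists) give a difference of essentially zero:

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.2, 0.5, 0.3]
q = [0.1, 0.7, 0.2]
# Verify D_KL(p || q) = H(p, q) - H(p), up to floating-point error.
print(abs(kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))))  # ~0.0
```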

#### Optimization Using Cross Entropy or KL Divergence

From the relationship between cross entropy and KL divergence, we know that

$$H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)$$

We could then rewrite our log loss as

$$L(\theta) = \sum_{i=1}^{n} H_i(p, q_{\theta}) = \sum_{i=1}^{n} \Big[ H_i(p) + D_{\text{KL},i}(p \,\|\, q_{\theta}) \Big]$$

where $H_i(p)$ and $D_{\text{KL},i}(p \,\|\, q_{\theta})$ are computed over the label distributions for sample $x_i$.

The optimization goal then becomes

$$\min_{\theta} \sum_{i=1}^{n} \Big[ H_i(p) + D_{\text{KL},i}(p \,\|\, q_{\theta}) \Big]$$

Because $H_i(p)$ is independent of $\theta$, this is equivalent to

$$\min_{\theta} \sum_{i=1}^{n} D_{\text{KL},i}(p \,\|\, q_{\theta})$$

Therefore, in classification problems, optimization using the sum of cross entropy over all the training samples is equivalent to optimization using the sum of KL divergence over all the training samples.
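In fact, for one-hot ground truth labels $H_i(p) = 0$, so the two losses are not merely equivalent up to a constant but exactly equal, which is easy to check (reusing the hypothetical helpers from above):

```python
import math

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# With a one-hot ground-truth distribution, H(p) = 0, so the
# cross-entropy loss and the KL divergence coincide exactly.
p = [0.0, 1.0, 0.0]  # one-hot label
q = [0.2, 0.5, 0.3]  # hypothetical predictions
print(cross_entropy(p, q))  # ≈ 0.6931
print(kl_divergence(p, q))  # ≈ 0.6931 (same)
```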

In practice we use cross entropy rather than KL divergence because it is simpler to compute: it drops the $H(p)$ term, which is a constant with respect to $\theta$ anyway.