### Introduction

In one of my previous blog posts on cross entropy, KL divergence, and maximum likelihood estimation, I have shown the “equivalence” of these three things in optimization. Cross entropy loss has been widely used in most of the state-of-the-art machine learning classification models, mainly because optimizing it is equivalent to maximum likelihood estimation. However, there could be other losses used for classification problems.

In this blog post, I would like to discuss the log loss used for logistic regression, the cross entropy loss used for multi-class classification, and the sum of log loss used for multi-class classification.

### Prerequisites

The prerequisites of this blog post have been discussed at length in my other blog posts. For completeness, I have copied them here.

#### Leibniz Integral Rule

Leibniz integral rule allows swapping the order of differentiation and integration under certain circumstances.

\[\frac{d}{dx} \int_{a}^{b} f(x,t) dt = \int_{a}^{b} \frac{\partial}{\partial x} f(x,t) dt\]For a quick proof of the Leibniz integral rule, please see one of my blog posts on this topic.

#### Derivatives of Expected Value

Based on the Leibniz integral rule, we can also move a derivative inside or outside an expected value. For instance,

\[\begin{aligned} \frac{\partial}{\partial \theta} \mathbb{E}_{(x,y) \sim P(x,y)}\big[ \mathscr{L}_{\theta}(x,y) \big] &= \frac{\partial}{\partial \theta} \int\limits_{(x,y)} P(x, y) \mathscr{L}_{\theta}(x, y) d(xy) \\ &= \int\limits_{(x,y)} P(x, y) \frac{\partial}{\partial \theta} \mathscr{L}_{\theta}(x, y) d(xy) \\ &= \mathbb{E}_{(x,y) \sim P(x, y)}\big[ \frac{\partial}{\partial \theta} \mathscr{L}_{\theta}(x, y) \big] \end{aligned}\]If $\mathbb{E}_{(x,y) \sim P(x,y)}\big[ \mathscr{L}_{\theta}(x,y) \big]$ is the loss function, computing the derivative of the entire loss with respect to the parameter $\theta$ is equivalent to computing the derivative of the loss contributed by each data point, $\mathscr{L}_{\theta}(x, y)$, with respect to $\theta$ and then taking the expected value, $\mathbb{E}_{(x,y) \sim P(x, y)}\big[ \frac{\partial}{\partial \theta} \mathscr{L}_{\theta}(x, y) \big]$. Assuming all the data points $(x, y)$ are uniformly distributed, this simply amounts to computing the derivative of the loss contributed by each data point with respect to $\theta$ and then taking the average. This is also how modern deep learning frameworks, such as TensorFlow and PyTorch, work.
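The identity above can be checked numerically on a toy problem. The following is a minimal sketch, not how any framework is actually implemented: the squared-error loss, the one-parameter model $\theta x$, and the four data points are all hypothetical choices made only for illustration. It compares a finite-difference derivative of the averaged loss with the average of the analytic per-sample derivatives.

```python
# Toy check: the derivative of the average loss equals the average of the
# per-sample derivatives. Loss, model, and data are hypothetical.

def per_sample_loss(theta, x, y):
    # L_theta(x, y) = (theta * x - y)^2
    return (theta * x - y) ** 2

def per_sample_grad(theta, x, y):
    # Analytic derivative of the per-sample loss with respect to theta.
    return 2.0 * (theta * x - y) * x

def mean_loss(theta, data):
    # Expected loss under a uniform distribution over the data points.
    return sum(per_sample_loss(theta, x, y) for x, y in data) / len(data)

def grad_of_mean(theta, data, eps=1e-6):
    # Central finite difference of the averaged loss.
    return (mean_loss(theta + eps, data) - mean_loss(theta - eps, data)) / (2.0 * eps)

def mean_of_grads(theta, data):
    # Average of the per-sample analytic derivatives.
    return sum(per_sample_grad(theta, x, y) for x, y in data) / len(data)

DATA = [(0.5, 1.0), (-1.2, 0.0), (2.0, 1.0), (0.3, 0.0)]
THETA = 0.7

print(grad_of_mean(THETA, DATA), mean_of_grads(THETA, DATA))
```

The two printed numbers should agree up to finite-difference error.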

### Log Loss for Logistic Regression

Log loss has been used for logistic regression for a long time. Given $(x, y)$, where $x$ is the input and $y$ is the label for $x$, our goal is to develop a model $f_{\theta}$, where $\theta$ are the model parameters, such that $\tilde{y} = f_{\theta}(x)$ is as close to $y$ as possible. For logistic regression, usually $y = 0$ or $1$. $y = 1$ means $x$ belongs to a certain class, whereas $y = 0$ means $x$ does not belong to that class.

Usually, logistic regression will compute the logit $z$ for the input $x$, where $z = g_{\theta}(x)$ followed by computing the sigmoid activation for $z$ to get the estimate of $y$, $\tilde{y}$, in the range of $(0,1)$. Concretely,

\[\begin{align} P(y = 1| x) &= \tilde{y} \\ &= \sigma(z) \\ &= \frac{e^{z}}{e^{z} + 1} \\ &= \frac{1}{1 + e^{-z}} \\ \end{align}\]The loss function used for fitting is called log loss, which, as we will see later, is actually the binary cross entropy loss.

\[\begin{align} \mathscr{L}_{\theta}(x,y) = -y \log(\tilde{y}) - (1-y) \log(1 - \tilde{y}) \\ \end{align}\]We further compute the derivative of this log loss with respect to the logit $z$.

\[\begin{align} \frac{\partial}{\partial \tilde{y}} \mathscr{L}_{\theta}(x,y) &= \frac{\partial}{\partial \tilde{y}} \big[ -y \log(\tilde{y}) - (1-y) \log(1 - \tilde{y}) \big] \\ &= -\frac{y}{\tilde{y}} - \frac{1-y}{1 - \tilde{y}} (-1) \\ &= -\frac{y}{\tilde{y}} + \frac{1-y}{1 - \tilde{y}} \\ &= \frac{\tilde{y} - y}{\tilde{y}(1 - \tilde{y})} \\ \end{align}\] \[\begin{align} \frac{\partial}{\partial z} \tilde{y} &= \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} \\ &= (-1) (1 + e^{-z})^{-2} e^{-z} (-1) \\ &= \frac{1}{1 + e^{-z}} \frac{e^{-z}}{1 + e^{-z}} \\ &= \frac{1}{1 + e^{-z}} \big( 1- \frac{1}{1 + e^{-z}} \big) \\ &= \tilde{y} (1 - \tilde{y}) \\ \end{align}\]Therefore,

\[\begin{align} \frac{\partial}{\partial z} \mathscr{L}_{\theta}(x,y) &= \frac{\partial}{\partial \tilde{y}} \mathscr{L}_{\theta}(x,y) \frac{\partial}{\partial z} \tilde{y} \\ &= \frac{\tilde{y} - y}{\tilde{y}(1 - \tilde{y})} \tilde{y} (1 - \tilde{y}) \\ &= \tilde{y} - y \end{align}\]The interpretation is very simple. If $\tilde{y} > y$, then $\frac{\partial}{\partial z} \mathscr{L}_{\theta}(x,y) > 0$ and gradient descent wants to make $z$ smaller. Once $z$ is smaller, $\tilde{y}$ will be smaller and its deviation from $y$ will become smaller. Similarly, if $\tilde{y} < y$, then $\frac{\partial}{\partial z} \mathscr{L}_{\theta}(x,y) < 0$ and gradient descent wants to make $z$ larger. Once $z$ is larger, $\tilde{y}$ will be larger and its deviation from $y$ will become smaller.
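The result $\frac{\partial}{\partial z} \mathscr{L}_{\theta}(x,y) = \tilde{y} - y$ is easy to sanity-check numerically. The sketch below uses only the Python standard library; the helper names and the test logits are my own illustrative choices. It compares the analytic gradient with a central finite difference of the log loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(z, y):
    # Log loss written as a function of the logit z and the binary label y.
    y_hat = sigmoid(z)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def numerical_grad(z, y, eps=1e-6):
    # Central finite difference of the log loss with respect to the logit.
    return (log_loss(z + eps, y) - log_loss(z - eps, y)) / (2.0 * eps)

# Hypothetical (logit, label) pairs used only for the check.
CASES = [(-2.0, 0), (-2.0, 1), (0.5, 0), (0.5, 1), (3.0, 1)]

for z, y in CASES:
    analytic = sigmoid(z) - y  # the derived formula: y_hat - y
    print(z, y, analytic, numerical_grad(z, y))
```

In each row the analytic and numerical values should match closely, and the sign follows the interpretation above: positive when $\tilde{y} > y$, negative when $\tilde{y} < y$.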

To update the model parameters $\theta$, we would need to compute the derivatives with respect to $\theta$.

\[\begin{align} \frac{\partial}{\partial \theta} \mathscr{L}_{\theta}(x,y) &= \frac{\partial}{\partial z} \mathscr{L}_{\theta}(x,y) \frac{\partial}{\partial \theta} z \\ &= (\tilde{y} - y) \frac{\partial}{\partial \theta} z \end{align}\]

### Cross Entropy Loss for Multi-Class Classification

Cross entropy loss has been widely used for classification problems in deep learning. Given $(x, \mathbf{y})$, where $x$ is the input and $\mathbf{y}$ is the label for $x$, our goal is to develop a model $f_{\theta}$, where $\theta$ are the model parameters, such that $\tilde{\mathbf{y}} = f_{\theta}(x)$ is as close to $\mathbf{y}$ as possible. Here $\mathbf{y}$ is a one-hot vector of size $n$, with $\mathbf{y}_i = 1$ for the correct class $i$, so in particular we want $\tilde{\mathbf{y}}_i$ to be as close to $\mathbf{y}_i$ as possible. Note that the motivation behind using cross entropy loss is maximum likelihood estimation, which might not be obvious at first sight. I have discussed this in my previous blog post.

Usually, multi-class classification will compute the logits $\mathbf{z}$ for the input $x$, where $\mathbf{z} = g_{\theta}(x)$ followed by computing the softmax activation for $\mathbf{z}$ to get the estimate of $\mathbf{y}$, $\tilde{\mathbf{y}}$, where $\sum_{i=1}^{n} \tilde{\mathbf{y}}_i = 1$ and $\tilde{\mathbf{y}}_i \in [0, 1]$ for $i \in [1, n]$. Concretely,

\[\begin{align} P(\mathbf{y}_i = 1| x) &= \tilde{\mathbf{y}}_i \\ &= \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \\ \end{align}\]The loss function used for fitting is the cross entropy loss, which is sometimes called softmax loss when it is paired with the softmax activation.

\[\begin{align} \mathscr{L}_{\theta}(x,\mathbf{y}) = \sum_{i=1}^{n} - \mathbf{y}_i \log(\tilde{\mathbf{y}}_i) \\ \end{align}\]We further compute the derivative of this cross entropy loss with respect to the logits $\mathbf{z}$.

\[\begin{align} \frac{\partial}{\partial \tilde{\mathbf{y}}_i} \mathscr{L}_{\theta}(x,\mathbf{y}) &= \frac{\partial}{\partial \tilde{\mathbf{y}}_i} \sum_{j=1}^{n} - \mathbf{y}_j \log(\tilde{\mathbf{y}}_j) \\ &= -\frac{\mathbf{y}_i}{\tilde{\mathbf{y}}_i} \\ \end{align}\] \[\begin{align} \frac{\partial}{\partial \mathbf{z}_i} \tilde{\mathbf{y}}_i &= \frac{\partial}{\partial \mathbf{z}_i} \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \\ &= \frac{e^{\mathbf{z}_i} \sum_{j=1}^{n} e^{\mathbf{z}_j} - e^{\mathbf{z}_i} e^{\mathbf{z}_i} }{\big(\sum_{j=1}^{n} e^{\mathbf{z}_j}\big)^2} \\ &= \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \frac{\sum_{j=1}^{n} e^{\mathbf{z}_j} - e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \\ &= \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \bigg(1- \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \bigg) \\ &= \tilde{\mathbf{y}}_i (1- \tilde{\mathbf{y}}_i) \\ \end{align}\]For $k \neq i$, \[\begin{align} \frac{\partial}{\partial \mathbf{z}_k} \tilde{\mathbf{y}}_i &= \frac{\partial}{\partial \mathbf{z}_k} \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \\ &= e^{\mathbf{z}_i} (-1) \big( \sum_{j=1}^{n} e^{\mathbf{z}_j} \big)^{-2} e^{\mathbf{z}_k} \\ &= - \frac{e^{\mathbf{z}_i}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \frac{e^{\mathbf{z}_k}}{\sum_{j=1}^{n} e^{\mathbf{z}_j}} \\ &= - \tilde{\mathbf{y}}_i \tilde{\mathbf{y}}_k \\ \end{align}\]Therefore,

\[\begin{align} \frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) &= \sum_{i=1}^{n} \frac{\partial}{\partial \tilde{\mathbf{y}}_i} \mathscr{L}_{\theta}(x,\mathbf{y}) \frac{\partial}{\partial \mathbf{z}_k} \tilde{\mathbf{y}}_i \\ &= \frac{\partial}{\partial \tilde{\mathbf{y}}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) \frac{\partial}{\partial \mathbf{z}_k} \tilde{\mathbf{y}}_k + \sum_{i \neq k}^{} \frac{\partial}{\partial \tilde{\mathbf{y}}_i} \mathscr{L}_{\theta}(x,\mathbf{y}) \frac{\partial}{\partial \mathbf{z}_k} \tilde{\mathbf{y}}_i \\ &= -\frac{\mathbf{y}_k}{\tilde{\mathbf{y}}_k} \tilde{\mathbf{y}}_k (1- \tilde{\mathbf{y}}_k) + \sum_{i \neq k}^{} \big( -\frac{\mathbf{y}_i}{\tilde{\mathbf{y}}_i} \big) \big( - \tilde{\mathbf{y}}_i \tilde{\mathbf{y}}_k \big) \\ &= -\mathbf{y}_k ( 1- \tilde{\mathbf{y}}_k ) + \sum_{i \neq k}^{} \big( -\frac{\mathbf{y}_i}{\tilde{\mathbf{y}}_i} \big) \big( - \tilde{\mathbf{y}}_i \tilde{\mathbf{y}}_k \big) \\ &= \tilde{\mathbf{y}}_k \mathbf{y}_k - \mathbf{y}_k + \sum_{i \neq k}^{} \mathbf{y}_i \tilde{\mathbf{y}}_k \\ &= \tilde{\mathbf{y}}_k \mathbf{y}_k - \mathbf{y}_k + \tilde{\mathbf{y}}_k \sum_{i \neq k}^{} \mathbf{y}_i \\ &= \tilde{\mathbf{y}}_k \mathbf{y}_k - \mathbf{y}_k + \tilde{\mathbf{y}}_k (1 - \mathbf{y}_k) \\ &= \tilde{\mathbf{y}}_k - \mathbf{y}_k \\ \end{align}\]The interpretation is also very simple. If $\tilde{\mathbf{y}}_k > \mathbf{y}_k$, $\frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) > 0$, the gradient descent wants to make $\mathbf{z}_k$ smaller. Once $\mathbf{z}_k$ is smaller, $\tilde{\mathbf{y}}_k$ will be smaller and its deviation from $\mathbf{y}_k$ will become smaller. Similarly, If $\tilde{\mathbf{y}}_k < \mathbf{y}_k$, $\frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) < 0$, the gradient descent wants to make $\mathbf{z}_k$ larger. 
Once $\mathbf{z}_k$ is larger, $\tilde{\mathbf{y}}_k$ will be larger and its deviation from $\mathbf{y}_k$ will become smaller.
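As a sanity check on $\frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) = \tilde{\mathbf{y}}_k - \mathbf{y}_k$, here is a minimal standard-library sketch. The logits and the one-hot label are hypothetical values picked for illustration, and subtracting the maximum logit inside `softmax` is a common stability trick that does not change the result.

```python
import math

def softmax(z):
    m = max(z)  # subtracting the max leaves the softmax output unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    # Cross entropy between the label vector y and softmax(z).
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, softmax(z)))

def numerical_grad(z, y, eps=1e-6):
    # Central finite difference with respect to each logit z_k.
    grads = []
    for k in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[k] += eps
        z_minus[k] -= eps
        grads.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2.0 * eps))
    return grads

Z = [1.0, -0.5, 2.0, 0.3]  # hypothetical logits
Y = [0, 0, 1, 0]           # hypothetical one-hot label

ANALYTIC = [p - y for p, y in zip(softmax(Z), Y)]  # derived formula
print(ANALYTIC)
print(numerical_grad(Z, Y))
```

Only the entry for the labeled class is negative; every other entry equals $\tilde{\mathbf{y}}_k > 0$, and the entries sum to zero because both $\tilde{\mathbf{y}}$ and $\mathbf{y}$ sum to $1$.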

To update the model parameters $\theta$, we would need to compute the derivatives with respect to $\theta$.

\[\begin{align} \frac{\partial}{\partial \theta} \mathscr{L}_{\theta}(x,\mathbf{y}) &= \sum_{k=1}^{n} \frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) \frac{\partial}{\partial \theta} \mathbf{z}_k \\ &= \sum_{k=1}^{n} (\tilde{\mathbf{y}}_k - \mathbf{y}_k) \frac{\partial}{\partial \theta} \mathbf{z}_k \\ \end{align}\]

### Cross Entropy Loss for Multi-Class Classification VS Log Loss for Logistic Regression

If we have $n = 2$ for cross entropy loss and compare it with log loss, we would immediately see that the form of log loss is exactly the same as that of binary cross entropy loss, and that log loss for logistic regression is a special case of cross entropy loss for multi-class classification where $n = 2$ and the logit for the negative class ($y = 0$) is fixed at $0$.
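The special-case relationship can be verified directly. The sketch below, with helper names of my own choosing, checks that a two-class softmax whose negative logit is fixed at $0$ reproduces the sigmoid output and the log loss. It also shows that, with two free logits, the positive-class probability depends only on the difference between the two logits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z_pos, z_neg):
    # Two-class softmax; returns (P(positive), P(negative)).
    m = max(z_pos, z_neg)
    e_pos, e_neg = math.exp(z_pos - m), math.exp(z_neg - m)
    return e_pos / (e_pos + e_neg), e_neg / (e_pos + e_neg)

def log_loss(z, y):
    y_hat = sigmoid(z)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def cross_entropy2(z_pos, z_neg, y):
    # The one-hot label over (positive, negative) is (y, 1 - y).
    p_pos, p_neg = softmax2(z_pos, z_neg)
    return -y * math.log(p_pos) - (1 - y) * math.log(p_neg)

LOGITS = [-3.0, -0.4, 0.0, 1.7]  # hypothetical positive-class logits

for z in LOGITS:
    for y in (0, 1):
        # Negative logit fixed at 0: the two losses should coincide.
        print(z, y, log_loss(z, y), cross_entropy2(z, 0.0, y))

# With two free logits, only their difference matters.
print(softmax2(2.5, 1.0)[0], sigmoid(2.5 - 1.0))
```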

So if we have a binary classification problem, what is the difference between modeling it using log loss for logistic regression and cross entropy loss for binary classification, assuming the learning rate and other hyperparameters are the same? We could see that the gradients with respect to the positive logit $z$ for both models are always the same. However, when it comes to updating the model parameters $\theta$, the gradients for the two models would be different, because one model has only one logit whereas the other model has two logits. Therefore, although the two models are analogous, it is incorrect to say these two models are exactly the same.

One may ask which model is better for a binary classification problem. It is hard to say. But one thing is for sure: the binary cross entropy model overfits relatively more easily. The reasons are as follows:

- The binary cross entropy model has more parameters compared to the logistic regression.
- The binary cross entropy model adjusts the positive and negative logits simultaneously, whereas the logistic regression adjusts only one logit while the other, hidden logit is always $0$. As a result, the difference between the two logits can become much larger in the binary cross entropy model than in the logistic regression model.

To prevent overfitting, we could use label smoothing for cross entropy loss. I have discussed it previously and I am not going to elaborate on it here.

### Sum of Log Loss for Multi-Class Classification

While it might be rare, we could treat a multi-class classification problem as multiple one-vs-all classifications, each of which is a logistic regression. Given $(x, \mathbf{y})$, where $x$ is the input and $\mathbf{y}$ is the label for $x$, our goal is to develop a model $f_{\theta}$, where $\theta$ are the model parameters, such that each component $\tilde{\mathbf{y}}_j$ of $\tilde{\mathbf{y}} = f_{\theta}(x)$ is as close to $\mathbf{y}_j$ as possible, for all $j \in [1, n]$. Note that this is different from cross entropy loss for multi-class classification.

Usually, multi-class classification will compute the logits $\mathbf{z}$ for the input $x$, where $\mathbf{z} = g_{\theta}(x)$, followed by computing the sigmoid activation for each logit of $\mathbf{z}$ to get the estimate of $\mathbf{y}$, $\tilde{\mathbf{y}}$, where notably $\sum_{i=1}^{n} \tilde{\mathbf{y}}_i$ and $\sum_{i=1}^{n} \mathbf{y}_i$ do not have to be equal to $1$, and $\tilde{\mathbf{y}}_i \in [0, 1]$ for $i \in [1, n]$. Concretely,

\[\begin{align} P(\mathbf{y}_i = 1| x) &= \tilde{\mathbf{y}}_i \\ &= \sigma(\mathbf{z}_i) \\ &= \frac{e^{\mathbf{z}_i}}{e^{\mathbf{z}_i} + 1} \\ &= \frac{1}{1 + e^{-\mathbf{z}_i}} \\ \end{align}\]The loss function used for fitting is called the sum of log loss. The classification of each class is treated as independent of the others, rather than mutually exclusive. TensorFlow also has an implementation for this kind of loss, which it calls tf.nn.sigmoid_cross_entropy_with_logits.

\[\begin{align} \mathscr{L}_{\theta}(x,\mathbf{y}) = \sum_{i=1}^{n} \big[ - \mathbf{y}_i \log(\tilde{\mathbf{y}}_i) - (1-\mathbf{y}_i) \log(1 - \tilde{\mathbf{y}}_i) \big] \\ \end{align}\]We further compute the derivative of this sum of log loss with respect to the logits $\mathbf{z}$. Because it is almost exactly the same as the derivative of log loss with respect to the logit $z$, we skip some details.

\[\begin{align} \frac{\partial}{\partial \tilde{\mathbf{y}}_i} \mathscr{L}_{\theta}(x,\mathbf{y}) &= \frac{\partial}{\partial \tilde{\mathbf{y}}_i} \sum_{j=1}^{n} \big[ - \mathbf{y}_j \log(\tilde{\mathbf{y}}_j) - (1-\mathbf{y}_j) \log(1 - \tilde{\mathbf{y}}_j) \big] \\ &= \frac{\tilde{\mathbf{y}}_i - \mathbf{y}_i}{\tilde{\mathbf{y}}_i(1 - \tilde{\mathbf{y}}_i)} \\ \end{align}\] \[\begin{align} \frac{\partial}{\partial \mathbf{z}_i} \tilde{\mathbf{y}}_i &= \frac{\partial}{\partial \mathbf{z}_i} \frac{1}{1 + e^{-\mathbf{z}_i}} \\ &= \tilde{\mathbf{y}}_i(1 - \tilde{\mathbf{y}}_i) \\ \end{align}\]Therefore,

\[\begin{align} \frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) &= \frac{\partial}{\partial \tilde{\mathbf{y}}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) \frac{\partial}{\partial \mathbf{z}_k} \tilde{\mathbf{y}}_k \\ &= \tilde{\mathbf{y}}_k - \mathbf{y}_k \\ \end{align}\]There are no cross terms here because $\tilde{\mathbf{y}}_k$ depends only on $\mathbf{z}_k$. The interpretation is the same as for cross entropy loss for multi-class classification. If $\tilde{\mathbf{y}}_k > \mathbf{y}_k$, then $\frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) > 0$ and gradient descent wants to make $\mathbf{z}_k$ smaller. Once $\mathbf{z}_k$ is smaller, $\tilde{\mathbf{y}}_k$ will be smaller and its deviation from $\mathbf{y}_k$ will become smaller. Similarly, if $\tilde{\mathbf{y}}_k < \mathbf{y}_k$, then $\frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) < 0$ and gradient descent wants to make $\mathbf{z}_k$ larger. Once $\mathbf{z}_k$ is larger, $\tilde{\mathbf{y}}_k$ will be larger and its deviation from $\mathbf{y}_k$ will become smaller.
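The same kind of numerical check works for the sum of log loss. This is a minimal sketch with hypothetical logits and a hypothetical label vector that deliberately contains two positive classes, to emphasize that the labels need not sum to $1$ here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sum_log_loss(z, y):
    # One independent log loss term per class, summed over all classes.
    total = 0.0
    for z_i, y_i in zip(z, y):
        p = sigmoid(z_i)
        total += -y_i * math.log(p) - (1 - y_i) * math.log(1 - p)
    return total

def numerical_grad(z, y, eps=1e-6):
    # Central finite difference with respect to each logit z_k.
    grads = []
    for k in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[k] += eps
        z_minus[k] -= eps
        grads.append((sum_log_loss(z_plus, y) - sum_log_loss(z_minus, y)) / (2.0 * eps))
    return grads

Z = [1.0, -0.5, 2.0, 0.3]  # hypothetical logits
Y = [0, 1, 1, 0]           # hypothetical labels; they do not sum to 1

ANALYTIC = [sigmoid(z_k) - y_k for z_k, y_k in zip(Z, Y)]  # derived formula
print(ANALYTIC)
print(numerical_grad(Z, Y))
```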

To update the model parameters $\theta$, we would need to compute the derivatives with respect to $\theta$.

\[\begin{align} \frac{\partial}{\partial \theta} \mathscr{L}_{\theta}(x,\mathbf{y}) &= \sum_{k=1}^{n} \frac{\partial}{\partial \mathbf{z}_k} \mathscr{L}_{\theta}(x,\mathbf{y}) \frac{\partial}{\partial \theta} \mathbf{z}_k \\ &= \sum_{k=1}^{n} (\tilde{\mathbf{y}}_k - \mathbf{y}_k) \frac{\partial}{\partial \theta} \mathbf{z}_k \\ \end{align}\]The formula is exactly the same as the one used for cross entropy loss for multi-class classification.

### Cross Entropy Loss for Multi-Class Classification VS Sum of Log Loss for Multi-Class Classification

Because we have seen that the gradient formulas of cross entropy loss and sum of log loss are exactly the same, we may wonder whether there is any difference between the two.

The answer is that there is a difference between the two, even if both models are doing single-label multi-class classification, where there is only one label for each input. The clues lie in the values of $\tilde{\mathbf{y}}$.

Assume $\mathbf{y}$ is a one-hot encoded vector, so $\sum_{i=1}^{n} \mathbf{y}_i = 1$. In cross entropy loss, $\sum_{i=1}^{n} \tilde{\mathbf{y}}_i = 1$, whereas in sum of log loss, $\sum_{i=1}^{n} \tilde{\mathbf{y}}_i$ is not constrained to equal $1$.

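We further assume $\mathbf{z}$ is the same for both models. When $\mathbf{z}_i > \log{\frac{1}{n-1}}$ for all $i \in [1, n]$, every $e^{\mathbf{z}_j}$ is larger than $\frac{1}{n-1}$, so $\sum_{j \neq i} e^{\mathbf{z}_j} > 1$ and the softmax denominator $\sum_{j=1}^{n} e^{\mathbf{z}_j}$ is larger than the sigmoid denominator $e^{\mathbf{z}_i} + 1$. It follows that the $\tilde{\mathbf{y}}_i$ in the cross entropy loss model is smaller than the $\tilde{\mathbf{y}}_i$ in the sum of log loss model, for all $i \in [1, n]$.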
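To make the comparison concrete, the sketch below feeds the same hypothetical logits to both activations. The logits are chosen so that every one of them is above $\log{\frac{1}{n-1}}$ with $n = 4$. For a negative class ($\mathbf{y}_i = 0$) the gradient with respect to $\mathbf{z}_i$ is simply $\tilde{\mathbf{y}}_i$, so a larger output also means a larger gradient from that class.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)  # subtracting the max leaves the softmax output unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

Z = [2.0, -0.5, 0.1, -1.0]                 # hypothetical shared logits
THRESHOLD = math.log(1.0 / (len(Z) - 1))   # log(1 / (n - 1))

SOFTMAX_OUT = softmax(Z)
SIGMOID_OUT = [sigmoid(v) for v in Z]

for z_i, s, g in zip(Z, SOFTMAX_OUT, SIGMOID_OUT):
    # Each softmax output should be below the matching sigmoid output.
    print(z_i, z_i > THRESHOLD, s, g)

print(sum(SOFTMAX_OUT), sum(SIGMOID_OUT))
```

The softmax outputs sum to $1$, while the sigmoid outputs for these logits sum to more than $1$.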

This means the gradient update gets more incentive from the sum of log loss model than from the cross entropy loss model. What does this further mean? For a negative class, $\mathbf{y}_i = 0$ and the gradient with respect to $\mathbf{z}_i$ is just $\tilde{\mathbf{y}}_i$, which is larger in the sum of log loss model. When the input has only one label, as we already assumed by taking $\mathbf{y}$ to be a one-hot encoded vector, and $n$ is large, the sum of log loss model receives more incentive from the negative classes than the cross entropy loss model does, which will weaken the learning of the positive class.

Therefore, if all the assumptions above are true, we should use cross entropy loss for single-label multi-class classification, instead of sum of log loss.

### Multiple Labels for Multi-Class Classification

What if there are multiple labels for the multi-class classification? In the sum of log loss model, we could prepare labels $\mathbf{y}_i = 1$, where $i$ belongs to the label classes. In the cross entropy loss model, due to the limitation of the cross entropy model, an intuitive approach is to, assuming there are $k$ labels for one input, make the labels $\mathbf{y}_i = \frac{1}{k}$, where $i$ belongs to the label classes, so that $\sum_{i=1}^{n}\mathbf{y}_i = 1$ is unchanged.

This time, the cross entropy loss model is inferior to the sum of log loss model, because the model will not learn to predict confidently, especially when the number of labels $k$ is large. This is mainly a restriction of the softmax activation function. In the sum of log loss model, the incentive to learn a positive class does not change, as if it were still learning a single-label classification problem.

OK, how about this. Given an input that has $k$ labels, instead of using one data point, we prepare $k$ data points whose inputs are exactly the same and whose labels are one-hot labels representing each of the $k$ labels. This is actually equivalent to having one data point whose labels are $\mathbf{y}_i = \frac{1}{k}$, where $i$ belongs to the label classes. I would leave it to the reader to find out why this is the case. The hint is to use the expected value of the derivatives.

Therefore, we should use sum of log loss for multi-label multi-class classification, instead of cross entropy loss.
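The claimed equivalence can at least be checked numerically, although this is not a proof. In the minimal sketch below, the logits and the two label classes are hypothetical. It compares the loss and the logit gradient averaged over $k$ one-hot copies of the same input against those of a single data point with labels $\frac{1}{k}$, using the gradient $\tilde{\mathbf{y}} - \mathbf{y}$ derived earlier.

```python
import math

def softmax(z):
    m = max(z)  # subtracting the max leaves the softmax output unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, softmax(z)))

def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]

Z = [0.2, 1.5, -0.3, 0.8]   # hypothetical logits for one input
LABELS = [1, 3]             # hypothetical label classes (zero-based)
N, K = len(Z), len(LABELS)
PROBS = softmax(Z)

# Option A: K copies of the input, one one-hot label each, averaged.
LOSS_COPIES = sum(cross_entropy(Z, one_hot(i, N)) for i in LABELS) / K
GRAD_COPIES = [
    sum(PROBS[c] - one_hot(i, N)[c] for i in LABELS) / K for c in range(N)
]

# Option B: a single data point with labels 1/K on the label classes.
SOFT = [1.0 / K if c in LABELS else 0.0 for c in range(N)]
LOSS_SOFT = cross_entropy(Z, SOFT)
GRAD_SOFT = [p - y for p, y in zip(PROBS, SOFT)]

print(LOSS_COPIES, LOSS_SOFT)
print(GRAD_COPIES)
print(GRAD_SOFT)
```

Both the losses and the gradients should agree, which reflects the fact that cross entropy is linear in the label vector $\mathbf{y}$.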