### Introduction

In machine learning classification tasks, if you train directly on an imbalanced training set, the overall accuracy might be good, but the accuracy for some minority classes might be poor because they are overlooked during training. For example, suppose that in a binary classification task you have 99 negative examples and 1 positive example in the training set. Your model will likely learn to classify the 99 negative examples correctly with extremely high confidence, while the single positive example receives little attention during training and is therefore likely to be classified incorrectly or, at best, correctly with low confidence. If people only care about the positive examples, then even with 99% overall accuracy and 100% accuracy on the negative examples, failing on the positive example can hardly be tolerated. A common way to overcome this problem is to resample the dataset so that it is balanced: if some classes have fewer samples, you oversample them, for example by duplicating them, until the classes are balanced.

In object detection tasks, the class imbalance problem is even more significant. Given an image, an object detection algorithm usually has to propose a large number of regions in which potential objects might sit. In the R-CNN and Fast R-CNN algorithms, the number of proposed regions is intentionally limited to several thousand. In Faster R-CNN models and other models with CNN region proposal mechanisms, the number of proposed regions can be as high as several hundred thousand. Of course, most of the proposed regions are negative examples with no object inside, so the class imbalance problem definitely has to be addressed in object detection. In R-CNN and Fast R-CNN, because the model is not end-to-end and consists of several distinct models, the class imbalance problem can be solved by sampling more minority-class samples or removing majority-class samples. However, to the best of my knowledge, sampling to balance the classes is not easy to achieve in end-to-end models.

Mathematically, sampling is equivalent to adding weights to samples. For example, in some televised art competitions, the professional judges and the whole audience vote to decide which contestant wins. Because the audience is far larger than the panel of professional judges, we should assign more “weight” to the votes from the professional judges to reflect their professional judgment of the art. It could be that one vote from a professional judge is equivalent to one thousand votes from ordinary audience members. Facebook AI Research tried to solve the class imbalance problem in a similar way.
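The equivalence between duplicating samples and weighting them can be checked on the 99-to-1 example from above. This is a minimal sketch with made-up predicted probabilities; the helper `log_loss` and the toy numbers are assumptions introduced only for illustration.

```python
import math

def log_loss(y, p):
    """Binary cross entropy for one sample with label y and predicted P(y=1) = p."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

# Hypothetical toy data: (label, predicted probability of the positive class).
negatives = [(0, 0.1)] * 99
positive = (1, 0.3)

# Option 1: duplicate the single positive sample 99 times.
oversampled = negatives + [positive] * 99
loss_oversampled = sum(log_loss(y, p) for y, p in oversampled)

# Option 2: keep one copy of the positive sample but weight its loss by 99.
loss_weighted = sum(log_loss(y, p) for y, p in negatives) + 99 * log_loss(*positive)

# Both totals agree up to floating-point rounding.
print(loss_oversampled, loss_weighted)
```

Because the total loss is a sum over samples, repeating a sample $k$ times contributes exactly the same amount as counting it once with weight $k$.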

### Theory

#### Cross Entropy Loss

Normally, we use the sigmoid function for binary classification and the Softmax function for multi-class classification to calculate the probability of a sample belonging to a certain class. The loss function, regardless of whether the task is binary or multi-class classification, is usually the cross entropy loss.

The mathematical expression for the discrete version of cross entropy is:

\[H(p,q) = -\sum_{i=1}^{n}p_i\log{q_i}\]

where $n$ is the number of discretized distribution bins, $p_i$ is the probability of being in bin $i$ under distribution $p$, and $q_i$ is the probability of being in bin $i$ under distribution $q$. It should also be noted that cross entropy is not symmetric, i.e., in general $H(p,q) \neq H(q,p)$.

When cross entropy is used as the loss function in classification problems, $p_i$ has to be the ground truth probability of the sample belonging to class $i$, and $q_i$ has to be the probability inferred by the neural network. The order matters: all the $p_i$ except one are zero, while all the $q_i$ are non-zero due to the nature of the sigmoid or Softmax function, so $\log(q_i)$ is always well defined, whereas swapping the roles would require $\log(0)$.
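As a small illustration, the sketch below computes the cross entropy between a one-hot ground truth and a Softmax output. The logits are hypothetical values, and `softmax` and `cross_entropy` are helper names introduced here, not functions from any particular library.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); terms with p_i == 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.0, 1.0, 0.0]            # one-hot ground truth: the sample belongs to class 1
q = softmax([1.0, 2.0, 0.5])   # hypothetical network logits

loss = cross_entropy(p, q)
# For a one-hot target, the loss reduces to -log of the true-class probability.
print(loss, -math.log(q[1]))
```

Only the term for the true class survives, which is why the loss depends solely on the probability the network assigns to the correct class.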

It is also not hard to see that the log loss we normally use in binary classification is a special case of the cross entropy loss where the number of classes is $n=2$:

\[L = -\big[y\log(p) + (1-y)\log(1-p)\big]\]

where $y \in \{0, 1\}$ is the ground truth label and $p$ is the predicted probability that $y = 1$. The above expression is equivalent to:

\[L = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1-p) & \text{otherwise} \end{cases}\]

More concisely, $L = -\log(p_t)$ where

\[p_t = \begin{cases} p & \text{if } y = 1 \\ 1-p & \text{otherwise} \end{cases}\]

#### Focal Loss

Facebook AI Research added a weighting term in front of the cross entropy loss in the paper “Focal Loss for Dense Object Detection” and called the resulting loss “focal loss”.

Formally, the focal loss is expressed as follows:

\[L = -\alpha_t(1-p_t)^\gamma \log(p_t)\]

where $\gamma$ is a preset non-negative scalar (with $\gamma = 0$ the modulating factor disappears) and

\[\alpha_t = \begin{cases} \alpha & \text{if } y = 1 \\ 1-\alpha & \text{otherwise} \end{cases}\]

where $\alpha$ is a preset value between 0 and 1 that balances the positive and negative samples. Weighting classes with such an $\alpha$ is by itself one of the most common ways to balance classes.

The weight $\alpha_t(1-p_t)^\gamma$ that multiplies the cross entropy loss depends on the value of $p_t$. When $p_t$ is larger (the prediction is more confidently correct), the weight is smaller; when $p_t$ is smaller, the weight is larger.
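The formula translates almost directly into code. The following is a minimal per-sample sketch, not the reference implementation from the paper; it assumes the natural logarithm, and the default values `alpha=0.25` and `gamma=2.0` are simply the ones used later in this post.

```python
import math

def cross_entropy(y, p):
    """Binary cross entropy -log(p_t) for label y and predicted P(y=1) = p."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(y, p, alpha=0.25, gamma=2.0):
    """Focal loss -alpha_t * (1 - p_t)**gamma * log(p_t) for one sample."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# The better a positive sample is classified, the more its loss is discounted.
for p in (0.1, 0.5, 0.9, 0.99):
    ce = cross_entropy(1, p)
    fl = focal_loss(1, p)
    print(f"p_t={p:.2f}  cross entropy={ce:.4f}  focal={fl:.6f}  ratio={fl / ce:.6f}")
```

The printed ratio is exactly the weight $\alpha_t(1-p_t)^\gamma$, which shrinks quickly as $p_t$ approaches 1.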

This reminds me of boosting algorithms, where previously misclassified examples receive larger weights, although the context is different.

### Practice

#### Focal Loss in Object Detection

In end-to-end object detection, the proposed regions contain far more negative samples than positive samples. Suppose the optimizer has started to classify the negative samples correctly and the few positive samples incorrectly. With focal loss, the loss from the positive samples then dominates the total loss, so training remains effective and the optimizer keeps working to classify the positive samples correctly. With ordinary cross entropy loss, although the loss from a single correctly classified negative sample is small, the sheer number of negative samples means that their loss is still likely to dominate the total loss. The optimizer will thus keep trying to classify the negative samples even better while “ignoring” the positive samples.

#### Concrete Example

Let us set $\alpha = 0.25$ and $\gamma = 2$. Thus $\alpha_t = 0.25$ for positive samples and $\alpha_t = 0.75$ for negative samples. Note that it would seem to make more sense to use $\alpha = 0.75$, since the positive samples are usually the minority. However, the calculations below show that $\alpha$ sometimes does not affect the loss significantly. In practice, we may tune $\alpha$ to get better model accuracy.

Suppose we classify a negative sample correctly with probability $p_t = 0.99$ (doing an awesome job). Because $\alpha_t = 1 - \alpha = 0.75$, the weight is $\alpha_t(1-p_t)^\gamma = 0.000075$. The cross entropy term $-\log_{10}(p_t) = 0.0043648054$ is already a small number (base-10 logarithms are used throughout this example; a different base only rescales every loss by the same constant), and discounting it further makes it even less important in training.

Suppose we classify a positive sample with probability $p_t = 0.01$ for its ground truth class (doing an awful job). Because $\alpha_t = \alpha = 0.25$, the weight is $\alpha_t(1-p_t)^\gamma = 0.245025$. Although $-\log_{10}(p_t) = 2$ is still discounted, it is far less affected than the loss of the correctly classified example.

Let us look at an extreme example. Suppose we have 1000000 examples with $p_t = 0.99$ and 10 examples with $p_t = 0.01$. The 1000000 examples with $p_t = 0.99$ happen to all be negative examples, and the 10 examples with $p_t = 0.01$ happen to all be positive examples.

With ordinary cross entropy loss, the loss from the negative examples is $1000000 \times 0.0043648054 \approx 4365$ and the loss from the positive examples is $10 \times 2 = 20$. The share of the total loss contributed by the positive examples is $20 / (4365 + 20) \approx 0.0046$, which is almost negligible.

With focal loss, the loss from the negative examples is $1000000 \times 0.0043648054 \times 0.000075 = 0.3274$ and the loss from the positive examples is $10 \times 2 \times 0.245025 = 4.9005$. The share of the total loss contributed by the positive examples is $4.9005 / (4.9005 + 0.3274) = 0.9374$! It is dominating the total loss now!

This extreme example demonstrates that, with focal loss, the minority class samples are much less likely to be ignored during training.
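The arithmetic of this example can be reproduced with a few lines of code. The sketch below uses base-10 logarithms so that the per-sample losses match the numbers quoted above; the variable names are chosen here for illustration.

```python
import math

alpha, gamma = 0.25, 2.0
n_neg, n_pos = 1_000_000, 10
p_t_neg, p_t_pos = 0.99, 0.01

# Per-sample cross entropy, base-10 logarithm as in the worked example.
ce_neg = -math.log10(p_t_neg)
ce_pos = -math.log10(p_t_pos)

# Focal weights alpha_t * (1 - p_t)**gamma.
w_neg = (1 - alpha) * (1 - p_t_neg) ** gamma   # alpha_t = 1 - alpha for negatives
w_pos = alpha * (1 - p_t_pos) ** gamma         # alpha_t = alpha for positives

# Share of the total loss that comes from the positive examples.
ce_share = n_pos * ce_pos / (n_pos * ce_pos + n_neg * ce_neg)
fl_share = n_pos * w_pos * ce_pos / (n_pos * w_pos * ce_pos + n_neg * w_neg * ce_neg)

print(f"cross entropy: positives contribute {ce_share:.4f} of the total loss")
print(f"focal loss:    positives contribute {fl_share:.4f} of the total loss")
```

Switching `math.log10` to `math.log` changes the absolute losses but leaves both shares unchanged, since the base of the logarithm only rescales every term by the same constant.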

#### Focal Loss Trick

In practice, focal loss does not work well without an additional trick. In conventional neural network optimization, the weights are initialized from distributions chosen so that the output of each layer is roughly zero-centered with a controlled variance (Xavier initialization, for example). As a result, the value fed to the final sigmoid function is also centered around 0, so the initial predicted probabilities of all the proposed regions, positive or negative, follow some distribution centered at 0.5. With $p \approx 0.5$ for every sample, neither class is discounted much more than the other, and the huge number of negative samples dominates the total loss in the initial training stage. Therefore, initially, and maybe for a very long period of time, the training will not go very well for the positive samples.

The trick that Facebook AI Research used is to initialize the bias term of the last layer to a non-zero value such that the $p_t$ of positive samples is small and the $p_t$ of negative samples is large. Concretely, they set the bias term $b = -\log((1-\pi)/\pi)$, where $\log$ is the natural logarithm. Here $\pi$ is simply a variable denoting the desired initial probability of the positive class, not the mathematical constant. In their case, they set $\pi = 0.01$, so $b = -\ln(99) \approx -4.6$.
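The bias value can be checked numerically. This is a minimal sketch that assumes the logarithm is the natural one, so that it inverts the sigmoid function (which is defined with $e$); `sigmoid` and `prior_bias` are helper names introduced here.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def prior_bias(pi):
    """Bias b such that sigmoid(b) == pi, i.e. b = -ln((1 - pi) / pi)."""
    return -math.log((1.0 - pi) / pi)

pi = 0.01
b = prior_bias(pi)

# With the natural logarithm, feeding b back through the sigmoid recovers pi.
print(f"b = {b:.4f}, sigmoid(b) = {sigmoid(b):.4f}")
```

The formula is just the inverse of the sigmoid (the logit) evaluated at $\pi$, which is why the network outputs $p \approx \pi$ whenever $wx$ is negligible.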

What is the value of $wx$ in the last layer? $w$ was initialized according to $W \sim \mathcal{N}(\mu_{w}, \sigma_{w}^2)$, where $\mu_{w} = 0$ and $\sigma_{w}^2 = 0.01^2 = 10^{-4}$, which is a very small number. Assume the input $X$ has mean $\mu_{x}$ and variance $\sigma_{x}^2$ and is independent of $W$. The product $WX$ is not Gaussian in general, but its mean $\mu_{wx}$ and variance $\sigma_{wx}^2$ follow from the standard formulas for the product of two independent random variables:

\[\begin{align} \mu_{wx} &= \mu_{w}\mu_{x} \\ &= 0 \end{align}\] \[\begin{align} \sigma_{wx}^2 &= \sigma_{w}^2 \sigma_{x}^2 + \sigma_{w}^2 \mu_{x}^2 + \sigma_{x}^2 \mu_{w}^2 \\ &= 10^{-4} (\sigma_{x}^2 + \mu_{x}^2) \end{align}\]

The values of $\mu_{x}$ and $\sigma_{x}^2$ depend on the neural network architecture. But as long as $\sigma_{x}^2 + \mu_{x}^2$ is not orders of magnitude larger than 1, which is usually true, $\mu_{wx} = 0$ and $\sigma_{wx}^2$ is on the order of $10^{-4}$. This means the value of $wx$ is close to 0 in most cases and is negligible compared to $b$.

Therefore, according to sigmoid function,

\[p = S(wx + b) \approx S(b) = \frac{1}{1+e^{-b}} = \pi\]

where $S$ is the sigmoid function. $p = 0.01$ means that in the first forward propagations the positive samples are all extremely incorrect and thus receive high weights, while the negative samples are all extremely correct and thus receive low weights. Therefore, the positive samples receive “attention” in the early training stage and the whole training process is likely to go smoothly.
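A quick simulation illustrates how close the initial predictions stay to $\pi$. This is only a sketch: the input statistics `mu_x` and `sigma_x` are assumed values, since the real ones depend on the network, and a single product $wx$ is used as in the discussion above.

```python
import math
import random

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
pi = 0.01
b = -math.log((1.0 - pi) / pi)  # natural logarithm, so sigmoid(b) == pi

# Assumed input statistics; the real values depend on the network.
mu_x, sigma_x = 0.5, 1.0
sigma_w = 0.01

probs = []
for _ in range(10_000):
    w = random.gauss(0.0, sigma_w)
    x = random.gauss(mu_x, sigma_x)
    probs.append(sigmoid(w * x + b))

mean_p = sum(probs) / len(probs)
print(f"min={min(probs):.4f} mean={mean_p:.4f} max={max(probs):.4f}")

# Modulating factor (1 - p_t)**gamma at initialization, with gamma = 2.
gamma = 2.0
weight_pos = (1.0 - mean_p) ** gamma   # positives: p_t = p
weight_neg = mean_p ** gamma           # negatives: p_t = 1 - p, so 1 - p_t = p
print(f"positive weight={weight_pos:.4f} negative weight={weight_neg:.6f}")
```

Under these assumptions every initial prediction stays close to 0.01, so a positive sample starts with a modulating factor near 1 while a negative sample starts with one near $10^{-4}$.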

### Conclusions

Focal loss is very useful for training on imbalanced datasets, especially in object detection tasks. However, I was surprised that such an intuitive loss function was proposed so late. I am also not sure whether my memory of seeing similar loss functions in older literature is correct. Anyway, I hope this makes focal loss, its motivation, intuition, and math crystal clear.