# Label Smoothing

## Introduction

In machine learning and deep learning, we often apply regularization techniques such as L1, L2, and dropout to prevent our model from overfitting. In classification problems, the model sometimes learns to predict the training examples extremely confidently, which is bad for generalization.

In this blog post, I am going to talk about label smoothing, a regularization technique for classification problems that prevents the model from predicting the training examples too confidently.

## Method

In a classification problem with $K$ candidate labels $\{1,2,\cdots,K\}$, for an example $(x_i, y_i)$ from the training dataset, we have a ground truth distribution $p$ over labels, $p(y|x_i)$, with $\sum_{y=1}^{K} p(y|x_i) = 1$. We have a model with parameters $\theta$ that predicts the label distribution $q_{\theta}(y|x_i)$, with, of course, $\sum_{y=1}^{K} q_{\theta}(y|x_i) = 1$.

As I described in “Cross Entropy, KL Divergence, and Maximum Likelihood Estimation”, the cross entropy for this particular example is

\begin{aligned} H_{i}(p,q_{\theta}) &= - \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \\ \end{aligned}

If we have $n$ examples in the training dataset, our loss function would be

\begin{aligned} L &= \sum_{i=1}^{n} H_i(p,q_{\theta}) \\ &= - \sum_{i=1}^{n} \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \\ \end{aligned}

### One-Hot Encoding Labels

Usually this $p(y|x_i)$ would be a one-hot encoded vector where

$$p(y|x_i) = \begin{cases} 1 & \text{if } y = y_i \\ 0 & \text{otherwise} \end{cases}$$

With this, we could further reduce the loss function to

\begin{aligned} L &= \sum_{i=1}^{n} H_i(p,q_{\theta}) \\ &= - \sum_{i=1}^{n} \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \\ &= - \sum_{i=1}^{n} p(y_i|x_i) \log q_{\theta}(y_i|x_i) \\ &= - \sum_{i=1}^{n} \log q_{\theta}(y_i|x_i) \\ \end{aligned}
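The collapse of the double sum to a negative log-likelihood can be checked numerically. Below is a minimal NumPy sketch (the function name and the toy values are my own, purely for illustration): with one-hot targets, the loss is just the negative log of the predicted probability of each true class.

```python
import numpy as np

def cross_entropy_one_hot(q, y):
    """Cross-entropy loss when p(y|x_i) is one-hot: L = -sum_i log q(y_i|x_i).

    q: (n, K) array of predicted probabilities, each row summing to 1.
    y: (n,) array of integer ground truth labels.
    """
    n = q.shape[0]
    # The double sum over i and y collapses to picking out, for each
    # example, the predicted probability of its true class.
    return -np.sum(np.log(q[np.arange(n), y]))

q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
loss = cross_entropy_one_hot(q, y)  # -(log 0.7 + log 0.8)
```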

Minimizing this loss function is equivalent to performing maximum likelihood estimation over the training dataset (see my proof here).

During optimization, it is possible to drive $L$ almost to zero, provided no inputs in the dataset have conflicting labels. Conflicting labels means that, say, two examples in the dataset have exactly the same features but different ground truth labels.

This is because $q_{\theta}(y_i|x_i)$ is usually computed with the softmax function.

$$q_{\theta}(y_i|x_i) = \frac{\exp(z_{y_i})}{\sum_{j=1}^{K}\exp(z_j)}$$

where $z_j$ is the logit for candidate class $j$.
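A quick NumPy sketch of the softmax (a standard numerically stable formulation, not specific to any library) illustrates why confidence saturates: as the logit of one class grows relative to the others, its probability approaches 1.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)  # guard against overflow in exp
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# As the true class's logit grows, its probability approaches 1.
print(softmax(np.array([5.0, 0.0, 0.0])))   # first class already dominant
print(softmax(np.array([50.0, 0.0, 0.0])))  # confidence saturates near 1
```

Subtracting the maximum logit before exponentiating does not change the result (the factor cancels in the ratio) but keeps `exp` from overflowing for large logits.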

The consequence of using one-hot encoded labels is that $\exp(z_{y_i})$ is driven to be extremely large while the other $\exp(z_j)$, $j \neq y_i$, are driven to be extremely small. Given a “non-conflicting” dataset, the model will classify every training example correctly with confidence of almost 1. This is a clear symptom of overfitting, and an overfitted model does not generalize well.

How, then, do we make sure that during training the model does not become too confident about the labels it predicts for the training data? With a non-conflicting training dataset and one-hot encoded labels, overfitting seems inevitable. Label smoothing was introduced as a regularization technique to address this.

### Label Smoothing

Instead of using a one-hot encoded vector, we introduce a noise distribution $u(y|x)$. Our new ground truth label for example $(x_i, y_i)$ becomes

\begin{aligned} p^{\prime}(y|x_i) &= (1-\varepsilon) p(y|x_i) + \varepsilon u(y|x_i) \\ &= \begin{cases} 1 - \varepsilon + \varepsilon u(y|x_i) & \text{if } y = y_i \\ \varepsilon u(y|x_i) & \text{otherwise} \end{cases} \end{aligned}

where $\varepsilon \in [0, 1]$ is a weight factor; note that $\sum_{y=1}^{K} p^{\prime}(y|x_i) = 1$.
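Constructing these smoothed targets is a one-liner in practice. The sketch below (function name mine, purely illustrative) uses the common concrete choice of a uniform $u(y|x) = 1/K$, which is discussed at the end of this post:

```python
import numpy as np

def smooth_labels(y, K, eps):
    """Build p'(y|x_i) = (1 - eps) * one_hot(y_i) + eps * u(y|x_i),
    here with the common choice of a uniform u(y|x) = 1/K."""
    p = np.full((len(y), K), eps / K)      # eps * u mass spread over all classes
    p[np.arange(len(y)), y] += 1.0 - eps   # remaining (1 - eps) on the true class
    return p

targets = smooth_labels(np.array([0, 2]), K=3, eps=0.1)
# each row sums to 1; the true class gets 1 - eps + eps/K
```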

We use this new ground truth label in place of the one-hot encoded ground truth label in our loss function.

\begin{aligned} L^{\prime} &= - \sum_{i=1}^{n} \sum_{y=1}^{K} p^{\prime}(y|x_i) \log q_{\theta}(y|x_i) \\ &= - \sum_{i=1}^{n} \sum_{y=1}^{K} \big[ (1-\varepsilon) p(y|x_i) + \varepsilon u(y|x_i) \big] \log q_{\theta}(y|x_i) \\ \end{aligned}

We further elaborate on this loss function.

\begin{aligned} L^{\prime} &= \sum_{i=1}^{n} \bigg\{ (1-\varepsilon) \Big[ - \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i) \Big] + \varepsilon \Big[ - \sum_{y=1}^{K} u(y|x_i) \log q_{\theta}(y|x_i) \Big] \bigg\} \\ &= \sum_{i=1}^{n} \Big[ (1-\varepsilon) H_i(p,q_{\theta}) + \varepsilon H_i(u,q_{\theta}) \Big] \\ \end{aligned}

We can see that for each example in the training dataset, the loss contribution is a mixture of the cross entropy between the one-hot encoded distribution and the predicted distribution, $H_i(p,q_{\theta})$, and the cross entropy between the noise distribution and the predicted distribution, $H_i(u,q_{\theta})$. During training, if the model learns to predict the true label very confidently, $H_i(p,q_{\theta})$ will go close to zero, but $H_i(u,q_{\theta})$ will increase dramatically. Therefore, with label smoothing, we have effectively introduced a regularizer $H_i(u,q_{\theta})$ that prevents the model from predicting too confidently.

In practice, $u(y|x)$ is taken to be a uniform distribution that does not depend on the data. That is,

$$u(y|x) = \frac{1}{K}$$

## Conclusions

Label smoothing is a regularization technique for classification problems that prevents the model from predicting the training labels too confidently and thus generalizing poorly.