Lei Mao

Introduction

In machine learning and deep learning, we often use regularization techniques, such as L1, L2, and dropout, to prevent our models from overfitting. In classification problems, the model sometimes learns to predict the training examples extremely confidently, which is bad for generalization.


In this blog post, I am going to talk about label smoothing, a regularization technique for classification problems that prevents the model from predicting the training examples too confidently.

Method

In a classification problem with $K$ candidate labels $\{1,2,\cdots,K\}$, for an example $i$, $(x_i, y_i)$, from the training dataset, we have the ground truth distribution $p$ over labels, $p(y|x_i)$, with $\sum_{y=1}^{K} p(y|x_i) = 1$. We have a model with parameters $\theta$ that predicts the label distribution $q_{\theta}(y|x_i)$, and of course $\sum_{y=1}^{K} q_{\theta}(y|x_i) = 1$.


As I described in “Cross Entropy, KL Divergence, and Maximum Likelihood Estimation”, the cross entropy for this particular example is

$$H_i(p, q_{\theta}) = -\sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i)$$

If we have $n$ examples in the training dataset, our loss function would be

$$L = \sum_{i=1}^{n} H_i(p, q_{\theta}) = -\sum_{i=1}^{n} \sum_{y=1}^{K} p(y|x_i) \log q_{\theta}(y|x_i)$$

One-Hot Encoding Labels

Usually this $p(y|x_i)$ would be a one-hot encoded vector, where

$$p(y|x_i) = \begin{cases} 1 & \text{if } y = y_i \\ 0 & \text{otherwise} \end{cases}$$

With this, we could further reduce the loss function to

$$L = -\sum_{i=1}^{n} \log q_{\theta}(y_i|x_i)$$

Minimizing this loss function is equivalent to performing maximum likelihood estimation over the training dataset (see my proof here).
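As a quick numerical check (a sketch of mine, not from the derivation above), the cross entropy with one-hot labels reduces to the negative log-likelihood of the true classes:

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_y p(y) * log q(y) for one example."""
    return -np.sum(p * np.log(q))

# Two examples, K = 3 classes: one-hot ground truth and model predictions.
p = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

loss = sum(cross_entropy(p_i, q_i) for p_i, q_i in zip(p, q))
nll = -(np.log(0.7) + np.log(0.8))  # negative log-likelihood of the true labels
assert np.isclose(loss, nll)
```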


During optimization, it is possible to minimize $L$ to almost zero, provided that no inputs in the dataset have conflicting labels. Conflicting labels means that, say, two examples in the dataset have exactly the same features, but their ground truth labels are different.


This is because $q_{\theta}(y_i|x_i)$ is usually computed from the softmax function,

$$q_{\theta}(y_i|x_i) = \frac{\exp(z_{y_i})}{\sum_{j=1}^{K} \exp(z_j)}$$

where $z_j$ is the logit for candidate class $j$.


The consequence of using one-hot encoded labels is that optimization drives $\exp(z_{y_i})$ to be extremely large while the other $\exp(z_j)$, $j \neq y_i$, become extremely small. Given a “non-conflicting” dataset, the model will classify every training example correctly with confidence of almost 1. This is certainly a signature of overfitting, and an overfitted model does not generalize well.
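To see this numerically (a small sketch with made-up logits), pushing the true-class logit far above the others drives the softmax output for that class toward 1:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# As training pushes the true-class logit up, confidence approaches 1.
mild = softmax(np.array([2.0, 1.0, 0.0]))
extreme = softmax(np.array([20.0, 1.0, 0.0]))
print(mild[0])     # moderately confident
print(extreme[0])  # almost exactly 1
```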


How, then, do we make sure that during training the model does not become too confident about the labels it predicts for the training data? With a non-conflicting training dataset and one-hot encoded labels, overfitting seems inevitable. Label smoothing was introduced as a regularization technique to address this.

Label Smoothing

Instead of using a one-hot encoded vector, we introduce a noise distribution $u(y|x)$. Our new ground truth label for data $(x_i, y_i)$ would be

$$p^{\prime}(y|x_i) = (1-\varepsilon) p(y|x_i) + \varepsilon u(y|x_i)$$

where $\varepsilon$ is a weight factor, and note that $\sum_{y=1}^{K} p^{\prime}(y|x_i) = 1$.


We use this new ground truth label in place of the one-hot encoded ground truth label in our loss function:

$$L^{\prime} = -\sum_{i=1}^{n} \sum_{y=1}^{K} p^{\prime}(y|x_i) \log q_{\theta}(y|x_i)$$

We further elaborate on this loss function:

$$L^{\prime} = \sum_{i=1}^{n} \left[ (1-\varepsilon) H_i(p, q_{\theta}) + \varepsilon H_i(u, q_{\theta}) \right]$$

We can see that for each example in the training dataset, the loss contribution is a mixture of the cross entropy between the one-hot encoded distribution and the predicted distribution, $H_i(p,q_{\theta})$, and the cross entropy between the noise distribution and the predicted distribution, $H_i(u,q_{\theta})$. During training, if the model learns to predict the distribution confidently, $H_i(p,q_{\theta})$ will approach zero, but $H_i(u,q_{\theta})$ will increase dramatically. Therefore, with label smoothing, we have actually introduced a regularizer $H_i(u,q_{\theta})$ to prevent the model from predicting too confidently.
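This decomposition can be verified numerically; here is a quick sketch of mine with made-up distributions:

```python
import numpy as np

def H(p, q):
    """Cross entropy between distributions p and q."""
    return -np.sum(p * np.log(q))

K, eps = 4, 0.1
p = np.array([0.0, 1.0, 0.0, 0.0])      # one-hot ground truth
u = np.full(K, 1.0 / K)                 # uniform noise distribution
q = np.array([0.05, 0.85, 0.05, 0.05])  # model's predicted distribution

p_smooth = (1 - eps) * p + eps * u
# H(p', q) = (1 - eps) * H(p, q) + eps * H(u, q), by linearity of H in p
assert np.isclose(H(p_smooth, q), (1 - eps) * H(p, q) + eps * H(u, q))
```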


In practice, $u(y|x)$ is a uniform distribution that does not depend on the data. That is to say,

$$u(y|x_i) = \frac{1}{K}$$

and the smoothed ground truth label becomes

$$p^{\prime}(y|x_i) = (1-\varepsilon) p(y|x_i) + \frac{\varepsilon}{K}$$
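Putting it together, uniform label smoothing is a one-line transformation of the one-hot label; here is a minimal NumPy sketch (the function name is my own, not from any particular library):

```python
import numpy as np

def smooth_labels(one_hot, eps):
    """p'(y|x) = (1 - eps) * p(y|x) + eps / K, with u uniform over K classes."""
    K = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / K

one_hot = np.array([0.0, 1.0, 0.0, 0.0])  # K = 4, true class is 1
smoothed = smooth_labels(one_hot, eps=0.1)
print(smoothed)  # [0.025 0.925 0.025 0.025]
assert np.isclose(smoothed.sum(), 1.0)  # still a valid distribution
```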

Conclusions

Label smoothing is a regularization technique for classification problems to prevent the model from predicting the labels too confidently during training and generalizing poorly.
