### Introduction

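In a neural language model with parameters $\theta$, we usually use the softmax function to predict the conditional distribution of the next word $w$ given the context $h$. The predicted conditional distribution is defined as

$$
P_{\theta}^{h}(w) = \frac{\exp(s_{\theta}(w,h))}{Z_\theta^h} = \frac{u_{\theta}(w,h)}{Z_\theta^h}
$$

where $s_{\theta}(w,h)$ is usually called the score or logit for word $w$ in the model, $u_{\theta}(w,h) = \exp(s_{\theta}(w,h))$, and $Z_\theta^h$ is the normalizer given context $h$, which does not depend on the word $w$. This is because the distribution must integrate to one:

$$
\int_{w} P_{\theta}^{h}(w) \, dw = \int_{w} \frac{u_{\theta}(w,h)}{Z_\theta^h} \, dw = \frac{1}{Z_\theta^h} \int_{w} u_{\theta}(w,h) \, dw = 1
$$

Therefore,

$$
Z_\theta^h = \int_{w} u_{\theta}(w,h) \, dw
$$

Because words are discrete, we actually have

$$
Z_\theta^h = \sum_{w^{\prime}} u_{\theta}(w^{\prime},h)
$$

where $w^{\prime} \in W$ and $W$ is the set of all words in the vocabulary.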
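As a minimal sketch of the softmax definition, assuming the scores for one context are given as a plain Python list (the function name `softmax_with_log_normalizer` and the example scores are hypothetical), the normalizer requires a sum over every word in the vocabulary:

```python
import math

def softmax_with_log_normalizer(scores):
    """Return (probabilities, log Z) for the scores s(w, h) of one context h."""
    # Shifting by the max score avoids overflow in exp(); the shift cancels in u / Z.
    shift = max(scores)
    shifted_u = [math.exp(s - shift) for s in scores]
    shifted_z = sum(shifted_u)  # this sum touches every word in the vocabulary
    probs = [u / shifted_z for u in shifted_u]
    log_z = shift + math.log(shifted_z)
    return probs, log_z

# Hypothetical scores for a three-word vocabulary.
example_scores = [2.0, 1.0, 0.1]
example_probs, example_log_z = softmax_with_log_normalizer(example_scores)
```

The cost of the sum grows linearly with the vocabulary size, and it has to be repeated for every context.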

Suppose we have examples for a certain context from the dataset, $w \sim P_d^h(w)$, where $h$ is the context and $w$ is the target word. The neural language model is trained to maximize the expected value of the conditional probability $P_{\theta}^{h}(w)$. Note that in our notation, $P_{\theta}^{h}(w^{\prime})$ is the conditional distribution over all words, and $P_{\theta}^{h}(w)$ is the conditional probability of the target word $w$.

Because $P_{\theta}^{h}(w)$ contains $Z_\theta^h$, this raises a problem. When the vocabulary is very large, computing the exact $Z_\theta^h$ is extremely expensive, since it sums over every word. Therefore, training neural language models this way is slow.

In this blog post, I will talk about the traditional way of optimizing neural language models, and how to accelerate the optimization using noise contrastive estimation (NCE). I will also clear up some misunderstandings about noise contrastive estimation, because I found that almost all the blog posts on this topic, and the concepts behind some code implementations, are incorrect.

### Prerequisites

In addition to the knowledge of basic calculus, probability, and statistics, we will use some properties heavily in this blog post.

#### Leibniz Integral Rule

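The Leibniz integral rule allows swapping the order of differentiation and integration under certain circumstances. For constant limits of integration $a$ and $b$,

$$
\frac{d}{dx} \int_{a}^{b} f(x,t) \, dt = \int_{a}^{b} \frac{\partial}{\partial x} f(x,t) \, dt
$$

For a quick proof of the Leibniz integral rule, please see one of my blog posts on this topic.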

#### Derivatives of Expected Value

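Based on the Leibniz integral rule, we can also move the derivative inside or outside the expected value, as long as the sampling distribution does not depend on the variable we differentiate with respect to. For instance,

$$
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] = \mathbb{E}_{w \sim P_d^h(w)}\bigg[\frac{\partial}{\partial \theta} \log P_{\theta}^{h}(w)\bigg]
$$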

#### Central Limit Theorem

If we have a random variable $X$, the expected value of $X$ under a certain distribution $P(X)$ is the distribution mean $\mu$. That is to say,

$$
\mathbb{E}_{X \sim P(X)}[X] = \mu
$$

To estimate the distribution mean, we often draw $n$ independent samples $X_1, X_2, \cdots, X_n$ from the distribution and calculate the sample mean $\overline{X}_n$.

$$
\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i
$$

The variance of the sample mean is the distribution variance $\sigma^2$ divided by $n$.

$$
\mathrm{Var}\big(\overline{X}_n\big) = \frac{\sigma^2}{n}
$$

This means that the larger the sample size, the more likely the sample mean is to be close to the distribution mean.

That is fundamentally why, in machine learning, we want the batch size to be as large as possible when estimating the true gradient.
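The shrinking variance of the sample mean can be checked empirically. The sketch below assumes a standard normal distribution ($\sigma^2 = 1$) and a fixed random seed; the function name `variance_of_sample_mean` and the trial count are hypothetical choices for illustration.

```python
import random
import statistics

def variance_of_sample_mean(n, trials=2000, seed=0):
    """Empirical variance of the mean of n standard normal draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

var_n10 = variance_of_sample_mean(10)    # theory: sigma^2 / n = 0.1
var_n100 = variance_of_sample_mean(100)  # theory: sigma^2 / n = 0.01
```

Increasing the sample size tenfold should cut the variance of the estimate roughly tenfold, which is the same effect a larger batch has on a gradient estimate.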

#### Logistic Regression, Sigmoid Function, and Log Loss

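In a binary classification model $h$ with parameters $\theta$, we usually use the sigmoid function $\sigma$. For a sample $x$ with label $y \in \{0, 1\}$, the probability of the sample being classified as the positive class is

$$
P(y = 1 | x) = \sigma(h_{\theta}(x)) = \frac{1}{1 + \exp(-h_{\theta}(x))}
$$

The objective function for such a logistic regression problem is sometimes called log loss.

$$
L(\theta) = -\Big[ y \log \sigma(h_{\theta}(x)) + (1 - y) \log \big(1 - \sigma(h_{\theta}(x))\big) \Big]
$$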
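A minimal sketch of the sigmoid function and the log loss for a single sample follows. It assumes the model output is a scalar logit and the label is 0 or 1; the function names `sigmoid` and `log_loss` are illustrative, not taken from any particular library.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def log_loss(logit, label):
    """Negative log-likelihood of a binary label (0 or 1) given a logit."""
    # -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x),
    # where softplus(x) = log(1 + exp(x)) is computed in a stable form.
    x = -logit if label == 1 else logit
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Working with the logit directly, rather than taking the log of a computed probability, avoids `log(0)` when the sigmoid saturates.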

### Maximum Likelihood Estimation

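Given context $h$, we have a target word $w \sim P_d^h(w)$. In a trainable language model with parameters $\theta$, its log conditional probability is

$$
\log P_{\theta}^{h}(w) = \log \frac{u_{\theta}(w,h)}{Z_\theta^h} = s_{\theta}(w,h) - \log Z_\theta^h
$$

We want to maximize the expected value of the log conditional probability of predicting the target word, $\mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big]$.

To maximize $\mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big]$, we compute its derivative with respect to $\theta$.

$$
\begin{aligned}
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] &= \mathbb{E}_{w \sim P_d^h(w)}\bigg[\frac{\partial}{\partial \theta} \log P_{\theta}^{h}(w)\bigg] \\
&= \mathbb{E}_{w \sim P_d^h(w)}\bigg[\frac{\partial}{\partial \theta} s_{\theta}(w,h)\bigg] - \frac{\partial}{\partial \theta} \log Z_\theta^h
\end{aligned}
$$

Let’s see what $\frac{\partial}{\partial \theta} \log Z_\theta^h$ actually is.

$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log Z_\theta^h &= \frac{1}{Z_\theta^h} \frac{\partial}{\partial \theta} Z_\theta^h \\
&= \frac{1}{Z_\theta^h} \sum_{w^{\prime}} \frac{\partial}{\partial \theta} \exp(s_{\theta}(w^{\prime},h)) \\
&= \sum_{w^{\prime}} \frac{\exp(s_{\theta}(w^{\prime},h))}{Z_\theta^h} \frac{\partial}{\partial \theta} s_{\theta}(w^{\prime},h) \\
&= \sum_{w^{\prime}} P_{\theta}^{h}(w^{\prime}) \frac{\partial}{\partial \theta} s_{\theta}(w^{\prime},h) \\
&= \mathbb{E}_{w^{\prime} \sim P_{\theta}^{h}(w^{\prime})}\bigg[\frac{\partial}{\partial \theta} s_{\theta}(w^{\prime},h)\bigg]
\end{aligned}
$$

Therefore,

$$
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] = \mathbb{E}_{w \sim P_d^h(w)}\bigg[\frac{\partial}{\partial \theta} s_{\theta}(w,h)\bigg] - \mathbb{E}_{w \sim P_{\theta}^{h}(w)}\bigg[\frac{\partial}{\partial \theta} s_{\theta}(w,h)\bigg]
$$

Alternatively, we could write

$$
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] = \sum_{w} \big( P_d^h(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} s_{\theta}(w,h)
$$

These are the maximum likelihood estimation (MLE) gradient expressions for models using the full softmax, which requires computing the exact $Z_\theta^h$.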
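The MLE gradient can be sanity-checked numerically. The sketch below makes a simplifying assumption that is not part of the derivation: the scores themselves are the parameters, so the derivative of each score with respect to itself is 1 and the gradient for word $w$ reduces to $P_d^h(w) - P_{\theta}^{h}(w)$. All function names and example values are hypothetical.

```python
import math

def softmax(scores):
    shift = max(scores)  # shift for numerical stability
    u = [math.exp(s - shift) for s in scores]
    z = sum(u)
    return [ui / z for ui in u]

def expected_log_likelihood(scores, p_data):
    """E_{w ~ P_d}[log P_theta(w)] for a full-softmax model."""
    return sum(pd * math.log(p) for pd, p in zip(p_data, softmax(scores)))

def mle_gradient(scores, p_data):
    """Analytic gradient with respect to the scores: P_d(w) - P_theta(w)."""
    return [pd - p for pd, p in zip(p_data, softmax(scores))]

def numerical_gradient(scores, p_data, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(scores)):
        up, down = list(scores), list(scores)
        up[i] += eps
        down[i] -= eps
        diff = expected_log_likelihood(up, p_data) - expected_log_likelihood(down, p_data)
        grads.append(diff / (2 * eps))
    return grads

mle_scores = [0.5, -1.0, 2.0]  # hypothetical scores
mle_p_data = [0.2, 0.3, 0.5]   # hypothetical data distribution
analytic = mle_gradient(mle_scores, mle_p_data)
numeric = numerical_gradient(mle_scores, mle_p_data)
```

The gradient vanishes exactly when the model distribution matches the data distribution, and every evaluation needs the full softmax.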

### Noise Contrastive Estimation

Noise contrastive estimation reduces the complexity of optimization by replacing the multi-class classification problem with a binary classification problem and by sampling from a noise distribution.

Concretely, we introduce a noise distribution $P_n(w)$. This noise distribution could be context-dependent or context-independent. For simplicity, we assume $P_n(w)$ is context-independent.

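We also assume that samples from the noise distribution $P_n(w)$ are $k$ times more frequent than samples from the dataset distribution $P_d^h(w)$. We denote $D=1$ when the sample was drawn from the dataset distribution, and $D=0$ when the sample was drawn from the noise distribution, so that $P(D=1) = \frac{1}{k+1}$ and $P(D=0) = \frac{k}{k+1}$. By Bayes' rule, we have

$$
P^{h}(D=1 | w) = \frac{\frac{1}{k+1} P_d^h(w)}{\frac{1}{k+1} P_d^h(w) + \frac{k}{k+1} P_n(w)} = \frac{P_d^h(w)}{P_d^h(w) + k P_n(w)}
$$

Similarly,

$$
P^{h}(D=0 | w) = \frac{k P_n(w)}{P_d^h(w) + k P_n(w)}
$$

We want to develop a model with parameters $\theta$ such that, given context $h$, its predicted probability from softmax $P_{\theta}^h(w)$ approximates $P_d^h(w)$ in the dataset. Therefore, we have

$$
P^{h}(D=1 | w, \theta) = \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)}, \qquad P^{h}(D=0 | w, \theta) = \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)}
$$

Because

$$
P_{\theta}^{h}(w) = \frac{u_{\theta}(w,h)}{Z_\theta^h}
$$

evaluating these posteriors with the full softmax would still require the normalizer $Z_\theta^h$.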

In noise contrastive estimation, we do not compute the expensive normalizer $Z_{\theta}^h$. Instead, we introduce a learned parameter $c^h$ for context $h$, and $c^h$ carries the information needed for normalization. Concretely,

$$
P_{\theta}^h(w) = P_{\theta^0}^h(w) \exp(c^h)
$$

where $P_{\theta^0}^h(w)$ is the unnormalized score for word $w$ in the model given context $h$, $\exp(c^h)$ is the learned normalizing factor, and $c^h$ does not depend on $\theta^0$. We denote $\theta = \{\theta^0, c^h\}$.

$P_{\theta^0}^h(w)$ is defined as follows.

$$
P_{\theta^0}^h(w) = u_{\theta^0}(w,h) = \exp(s_{\theta^0}(w,h))
$$

Note that each different context $h$ needs its own $c^h$. When the number of different contexts is large, this adds a large number of parameters to the model.

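We define $d$ as the source label of the word $w$. Our training objective is to maximize the expected log-likelihood of the source labels under the mixture of the dataset and noise distributions, i.e., $\mathbb{E}_{w \sim P^h(w)}\big[\log P^{h}(d | w, \theta)\big]$, where $P^h(w) = \frac{1}{k+1} P_d^h(w) + \frac{k}{k+1} P_n(w)$.

We elaborate on this expected value.

$$
\mathbb{E}_{w \sim P^h(w)}\big[\log P^{h}(d | w, \theta)\big] = \frac{1}{k+1} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P^{h}(d | w, \theta)\big] + \frac{k}{k+1} \mathbb{E}_{w \sim P_n(w)}\big[\log P^{h}(d | w, \theta)\big]
$$

Because for all $w$ from $P_d^h(w)$, their source label is $d=1$, and for all $w$ from $P_n(w)$, their source label is $d=0$, we further have

$$
\mathbb{E}_{w \sim P^h(w)}\big[\log P^{h}(d | w, \theta)\big] = \frac{1}{k+1} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P^{h}(D=1 | w, \theta)\big] + \frac{k}{k+1} \mathbb{E}_{w \sim P_n(w)}\big[\log P^{h}(D=0 | w, \theta)\big]
$$

We define the objective function $J^{h}(\theta)$ for the context $h$ by dropping the constant factor $\frac{1}{k+1}$, which does not change the maximizer.

$$
\begin{aligned}
J^{h}(\theta) &= \mathbb{E}_{w \sim P_d^h(w)}\big[\log P^{h}(D=1 | w, \theta)\big] + k \, \mathbb{E}_{w \sim P_n(w)}\big[\log P^{h}(D=0 | w, \theta)\big] \\
&= \mathbb{E}_{w \sim P_d^h(w)}\bigg[\log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)}\bigg] + k \, \mathbb{E}_{w \sim P_n(w)}\bigg[\log \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)}\bigg]
\end{aligned}
$$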

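To compute the derivatives of $J^{h}(\theta)$ with respect to $\theta$, we have

$$
\frac{\partial}{\partial \theta} J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\bigg[\frac{\partial}{\partial \theta} \log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)}\bigg] + k \, \mathbb{E}_{w \sim P_n(w)}\bigg[\frac{\partial}{\partial \theta} \log \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)}\bigg]
$$

We will compute the derivatives for both of the terms.

$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} &= \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) - \frac{\partial}{\partial \theta} \log \big(P_{\theta}^h(w) + k P_n(w)\big) \\
&= \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) - \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&= \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} &= - \frac{\partial}{\partial \theta} \log \big(P_{\theta}^h(w) + k P_n(w)\big) \\
&= - \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w)
\end{aligned}
$$

Therefore,

$$
\frac{\partial}{\partial \theta} J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\bigg[\frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w)\bigg] - k \, \mathbb{E}_{w \sim P_n(w)}\bigg[\frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w)\bigg]
$$

We use the definition of expected values.

$$
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \sum_{w} P_d^h(w) \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) - k \sum_{w} P_n(w) \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&= \sum_{w} \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w)
\end{aligned}
$$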

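Remember we have defined

$$
P_{\theta}^h(w) = P_{\theta^0}^h(w) \exp(c^h)
$$

and

$$
P_{\theta^0}^h(w) = u_{\theta^0}(w,h) = \exp(s_{\theta^0}(w,h))
$$

We could further have

$$
\log P_{\theta}^h(w) = s_{\theta^0}(w,h) + c^h
$$

We want to compute the derivative with respect to $\theta^0$. Because $c^h$ does not depend on $\theta^0$, $\frac{\partial}{\partial \theta^0} \log P_{\theta}^h(w) = \frac{\partial}{\partial \theta^0} s_{\theta^0}(w,h)$, and

$$
\frac{\partial}{\partial \theta^0} J^{h}(\theta) = \sum_{w} \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta^0} s_{\theta^0}(w,h)
$$

When $k \rightarrow \infty$, the factor $\frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)}$ approaches 1 for every word with $P_n(w) > 0$, so

$$
\frac{\partial}{\partial \theta^0} J^{h}(\theta) \rightarrow \sum_{w} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta^0} s_{\theta^0}(w,h)
$$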

This derivative looks familiar, doesn’t it?

Yes! This is exactly the MLE gradient! This means that as $k$, the ratio of noise samples to observations from the dataset, increases, the NCE gradient approaches the MLE gradient. This is the most fundamental reason why NCE works in machine learning.
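The convergence of the NCE gradient to the MLE gradient can be illustrated numerically. The sketch below rests on several assumptions made only for illustration: the scores themselves are the parameters, $\exp(c^h) = 1$, and the scores are chosen so that $\exp(s)$ already sums to one. Under those assumptions the NCE gradient for word $w$ is $\frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} \big(P_d^h(w) - P_{\theta}^h(w)\big)$ and the MLE gradient is $P_d^h(w) - P_{\theta}^h(w)$. All names and numbers are hypothetical.

```python
import math

def nce_gradient(scores, p_data, p_noise, k):
    """Expected NCE gradient with respect to each score, with exp(c^h) = 1."""
    grads = []
    for s, pd, pn in zip(scores, p_data, p_noise):
        p_model = math.exp(s)
        weight = k * pn / (p_model + k * pn)  # approaches 1 as k grows
        grads.append(weight * (pd - p_model))
    return grads

def mle_gradient_normalized(scores, p_data):
    """MLE gradient, valid here because exp(scores) sums to one by construction."""
    return [pd - math.exp(s) for s, pd in zip(scores, p_data)]

def gradient_gap(k):
    """Largest absolute difference between the NCE and MLE gradients."""
    p_model = [0.5, 0.3, 0.2]  # hypothetical, already normalized
    p_data = [0.7, 0.2, 0.1]   # hypothetical data distribution
    p_noise = [0.2, 0.3, 0.5]  # hypothetical noise distribution
    scores = [math.log(p) for p in p_model]
    nce = nce_gradient(scores, p_data, p_noise, k)
    mle = mle_gradient_normalized(scores, p_data)
    return max(abs(a - b) for a, b in zip(nce, mle))

gaps = {k: gradient_gap(k) for k in (1, 10, 1000)}
```

The gap shrinks monotonically as `k` grows. Note that `k` enters only through the weights, independently of how many noise samples are later used to estimate the expectation.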

### Noise Contrastive Estimation in Practice

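Remember we have defined the objective function $J^{h}(\theta)$ for the context $h$.

$$
J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\bigg[\log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)}\bigg] + k \, \mathbb{E}_{w \sim P_n(w)}\bigg[\log \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)}\bigg]
$$

In practice, to estimate the objective function $J^{h}(\theta)$, we have $m$ target words $w_i$ corresponding to context $h$ sampled from the dataset, and $n$ words $w_j$ sampled from the noise distribution.

$$
\widehat{J^{h}}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log \frac{P_{\theta}^h(w_i)}{P_{\theta}^h(w_i) + k P_n(w_i)} + \frac{k}{n} \sum_{j=1}^{n} \log \frac{k P_n(w_j)}{P_{\theta}^h(w_j) + k P_n(w_j)}
$$

In the original paper “A Fast and Simple Algorithm for Training Neural Probabilistic Language Models”, the authors suggested using one word $w_0$ from the dataset and $k$ words $w_1, w_2, \cdots, w_k$ from the noise distribution, i.e., $m = 1$ and $n = k$.

$$
\widehat{J^{h}}(\theta) = \log \frac{P_{\theta}^h(w_0)}{P_{\theta}^h(w_0) + k P_n(w_0)} + \sum_{j=1}^{k} \log \frac{k P_n(w_j)}{P_{\theta}^h(w_j) + k P_n(w_j)}
$$

Unfortunately, almost everyone has interpreted this to mean that $m$ has to be 1 and $n$ has to be $k$. This is a huge misunderstanding, and I will elaborate on it in the next section.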

As I mentioned previously, if the number of different contexts is large, the number of learned normalizer parameters $c^h$ will also be large. This is not favorable for complex problems. The authors found that when setting $\exp(c^h) = 1$, the model still works very well. This is probably because the model learns a “self-normalized” distribution. So in practice, we set $\exp(c^h) = 1$, i.e., there are no learned parameters for the normalizer, and $P_{\theta}^h(w) = P_{\theta^0}^h(w) = u_{\theta^0}(w,h) = \exp(s_{\theta^0}(w,h))$. Note that because there is no $c^h$ in the model anymore, $\theta = \{\theta^0\}$.

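This looks great. But we could take a step back and further transform the objective function $J^{h}(\theta)$ into something we are familiar with.

Because

$$
\frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} = \frac{1}{1 + \frac{k P_n(w)}{P_{\theta}^h(w)}} = \frac{1}{1 + \exp\big(-(\log P_{\theta}^h(w) - \log k P_n(w))\big)}
$$

We notice that this is the sigmoid function.

$$
\frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} = \sigma\big(\log P_{\theta}^h(w) - \log k P_n(w)\big)
$$

With the setting $\exp(c^h) = 1$, we further have

$$
\frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + k P_n(w)} = \sigma\big(\Delta s_{\theta^0}(w,h)\big), \qquad \frac{k P_n(w)}{P_{\theta}^h(w) + k P_n(w)} = 1 - \sigma\big(\Delta s_{\theta^0}(w,h)\big)
$$

where $\Delta s_{\theta^0}(w,h) = s_{\theta^0}(w,h) - \log kP_n(w)$.

$J^{h}(\theta)$ and ${\widehat{J^{h}}}(\theta)$ now become

$$
J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\Big[\log \sigma\big(\Delta s_{\theta^0}(w,h)\big)\Big] + k \, \mathbb{E}_{w \sim P_n(w)}\Big[\log \big(1 - \sigma\big(\Delta s_{\theta^0}(w,h)\big)\big)\Big]
$$

$$
\widehat{J^{h}}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log \sigma\big(\Delta s_{\theta^0}(w_i,h)\big) + \frac{k}{n} \sum_{j=1}^{n} \log \big(1 - \sigma\big(\Delta s_{\theta^0}(w_j,h)\big)\big)
$$

Now the two $\log$ terms in ${\widehat{J^{h}}}(\theta)$ can be calculated using log loss!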

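It should be noted that the above objective functions are specific to context $h$. The global objective function is defined as follows, using a concept similar to a “batch” in machine learning.

$$
J(\theta) = \sum_{h} P(h) J^{h}(\theta)
$$

In practice, it is estimated by averaging ${\widehat{J^{h}}}(\theta)$ over the contexts $h$ in a batch drawn from the dataset.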

To summarize how to do NCE in practice: for each context and word pair $(h, w)$ in the dataset, we have $m = 1$, and we sample $n$ words from the noise distribution. We denote the target word as $w_0$ and set its label to 1, and we denote the words from the noise distribution as { $w_1, w_2, \cdots, w_n$ } and set their labels to 0. For each of the $1+n$ words, we compute $\Delta s_{\theta^0}(w,h)$, which is the logit of the word minus the log of the expected count of that word among $k$ samples from the noise distribution, $\log kP_n(w)$. Then we compute the log loss from each $\Delta s_{\theta^0}(w,h)$ and its label. Finally, we put the pieces together using the recipe we derived above, weighting the noise terms by $k/n$.

To update the model parameters, simply do back propagation.
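The recipe can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: $\exp(c^h) = 1$, the scores and noise probabilities are given as numbers rather than produced by a real model, and the names `log_sigmoid` and `nce_objective` are hypothetical. The noise-to-data ratio `k` is deliberately a separate argument from the number of noise samples.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def nce_objective(data_words, noise_words, k):
    """Estimate J^h for one context h with exp(c^h) = 1.

    data_words:  list of (score, noise_prob) for the m words from the dataset.
    noise_words: list of (score, noise_prob) for the n words from the noise.
    k:           noise-to-data ratio, kept separate from n = len(noise_words).
    """
    m, n = len(data_words), len(noise_words)
    # Label 1: log sigmoid(delta_s), with delta_s = s - log(k * P_n(w)).
    data_term = sum(log_sigmoid(s - math.log(k * pn)) for s, pn in data_words) / m
    # Label 0: log(1 - sigmoid(delta_s)) = log sigmoid(-delta_s).
    noise_term = sum(log_sigmoid(math.log(k * pn) - s) for s, pn in noise_words) / n
    return data_term + k * noise_term

# Hypothetical example: one data word, two noise words, ratio k = 5.
example_data = [(math.log(0.5), 0.2)]
example_noise = [(math.log(0.3), 0.3), (math.log(0.2), 0.5)]
example_objective = nce_objective(example_data, example_noise, k=5)
```

The loss to minimize is the negative of this value. In a real model the scores would come from the network, and an automatic differentiation framework would supply the gradients.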

### Misunderstandings of Noise Contrastive Estimation

To the best of my knowledge, all the blog posts on NCE I have seen, and even the implementation of the NCE loss in TensorFlow, fail to distinguish the noise-to-data ratio $k$ from the number of samples $n$ drawn from the noise distribution. In the TensorFlow implementation, I can tell that the author implicitly treats $n$ and $k$ as the same thing. It simply adds the binary softmax loss (log loss) for the sample from the dataset to the binary softmax losses for the $n$ samples from the noise distribution and uses the sum as the NCE loss, without even providing a variable for the noise-to-data ratio $k$. This still works well when $n$ is large, because then $k$ is also large. Strictly speaking, however, we should distinguish these two parameters. We use a large $k$ because we want NCE optimization to behave like MLE optimization. We use a large $n$ because we want a better estimate of the loss contribution from the noise distribution.

### Final Remarks

Actually, it is funny that nowadays people can make a machine learning model work well even when they do not fully understand the theory behind it.

### References

- A Fast and Simple Algorithm for Training Neural Probabilistic Language Models
- Learning word embeddings efficiently with noise-contrastive estimation