Noise Contrastive Estimation
Introduction
In a neural language model with parameters $\theta$, to predict the conditional distribution of the next word $w$ given context $h$, we usually use the softmax function. The conditional distribution of the predicted word is defined as
$$
\begin{aligned}
P_{\theta}^{h}(w) &= \frac{\exp(s_{\theta}(w,h))}{Z_\theta^h} \\
&= \frac{u_{\theta}(w,h)}{Z_\theta^h}
\end{aligned}
$$
where $s_{\theta}(w,h)$ is usually called the score or logit for word $w$ in the model, $u_{\theta}(w,h) = \exp(s_{\theta}(w,h))$, and $Z_\theta^h$ is the normalizer given context $h$, which does not depend on word $w$. This is because
$$
\begin{aligned}
\int\limits_{w^{\prime}} P_{\theta}^{h}(w^{\prime}) dw^{\prime} &= \int\limits_{w^{\prime}} \frac{u_{\theta}(w^{\prime},h)}{Z_\theta^h} dw^{\prime} \\
&= \frac{\int\limits_{w^{\prime}} u_{\theta}(w^{\prime},h) dw^{\prime}}{Z_\theta^h} \\
&= 1
\end{aligned}
$$
Therefore,
$$
\begin{aligned}
Z_\theta^h &= \int\limits_{w^{\prime}} u_{\theta}(w^{\prime},h) dw^{\prime} \\
&= \int\limits_{w^{\prime}} \exp(s_{\theta}(w^{\prime},h)) dw^{\prime}
\end{aligned}
$$
Because words are discrete, we actually have
$$
\begin{aligned}
Z_\theta^h &= \sum\limits_{w^{\prime}} u_{\theta}(w^{\prime},h) \\
&= \sum\limits_{w^{\prime}} \exp(s_{\theta}(w^{\prime},h))
\end{aligned}
$$
where $w^{\prime} \in W$ and $W$ is the set of all words in the vocabulary.
Suppose that for a certain context we have examples from the dataset, $w \sim P_d^h(w)$, where $h$ is the context and $w$ is the target word. The neural language model tries to maximize the expected value of the conditional probability $P_{\theta}^{h}(w)$. Note that in our notation, $P_{\theta}^{h}(w^{\prime})$ is the conditional distribution over all words, and $P_{\theta}^{h}(w)$ is the conditional probability of the target word $w$.
Because $P_{\theta}^{h}(w)$ contains $Z_\theta^h$, this raises a problem. When the vocabulary is very large, computing the exact $Z_\theta^h$ is extremely expensive. Therefore, training neural language models in this way is slow.
In this blog post, I will talk about the traditional way of optimizing neural language models, and how to accelerate the optimization using noise contrastive estimation (NCE). I will also address common misunderstandings of noise contrastive estimation, as I found that almost all the blog posts and code implementations on this topic get the concept wrong.
Prerequisites
In addition to basic calculus, probability, and statistics, we will rely heavily on a few properties in this blog post.
Leibniz Integral Rule
The Leibniz integral rule allows swapping the order of differentiation and integration under certain conditions.
$$
\frac{d}{dx} \int_{a}^{b} f(x,t) dt = \int_{a}^{b} \frac{\partial}{\partial x} f(x,t) dt
$$
For a quick proof of the Leibniz integral rule, please check one of my blog posts on this topic.
Derivatives of Expected Value
Based on the Leibniz integral rule, we can also move the derivative inside or outside the expected value, provided the distribution does not depend on the variable of differentiation. For instance,
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P(w)}\big[ f_{\theta}(w) \big] &= \frac{\partial}{\partial \theta} \int\limits_{w} P(w) f_{\theta}(w) dw \\
&= \int\limits_{w} P(w) \frac{\partial}{\partial \theta} f_{\theta}(w) dw \\
&= \mathbb{E}_{w \sim P(w)}\big[ \frac{\partial}{\partial \theta} f_{\theta}(w) \big]
\end{aligned}
$$
Law of Large Numbers
For a random variable $X$ with distribution $P(X)$, the expected value of $X$ is the distribution mean $\mu$. That is to say,
$$
\mathbb{E}[X] = \mu
$$
To estimate the distribution mean, we often randomly draw $n$ samples $X_1, X_2, \cdots, X_n$ from the distribution and calculate the sample mean $\overline{X}_n$. The sample mean is an unbiased estimator of $\mu$:
$$
\begin{aligned}
\mathbb{E}[\overline{X}_n] &= \mathbb{E} \big[ \frac{\sum_{i=1}^{n} X_i}{n} \big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E} \big[ X_i \big] \\
&= \frac{1}{n} n\mu \\
&= \mu
\end{aligned}
$$
The variance of the sample mean is the distribution variance $\sigma^2$ divided by $n$.
$$
\mathbb{V}[\overline{X}_n] = \frac{\sigma^2}{n}
$$
This means that the larger the sample size, the more likely the sample mean is to be close to the distribution mean.
That is fundamentally why in machine learning we want the batch size to be as large as possible: a larger batch gives a lower-variance estimate of the true gradient.
Logistic Regression, Sigmoid Function, and Log Loss
In a binary classification model $h_{\theta}$ with parameters $\theta$, we usually use the sigmoid function $\sigma$. The probability of a sample being classified as the positive class is
$$
\begin{aligned}
P(y=1|x) &= \sigma(h_{\theta}(x)) \\
&= \frac{1}{1 + \exp(-h_{\theta}(x))} \\
\end{aligned}
$$
The objective function for such a logistic regression problem is the expected log likelihood; its negative is often called the log loss.
$$
J(\theta) = \mathbb{E}_{(x,y) \sim P_d(x, y)}\big[ y \log(\sigma(h_{\theta}(x))) + (1-y) \log(1- \sigma(h_{\theta}(x))) \big]
$$
Maximum Likelihood Estimation
Given context $h$, we have a target word $w \sim P_d^h(w)$. In a trainable language model containing parameters $\theta$, its log conditional probability is
$$
\begin{aligned}
\log P_{\theta}^{h}(w) &= \log \frac{u_{\theta}(w,h)}{Z_\theta^h} \\
&= \log u_{\theta}(w,h) - \log Z_\theta^h \\
&= s_{\theta}(w,h) - \log Z_\theta^h
\end{aligned}
$$
We want to maximize the expected value of the log conditional probability of the target word, $\mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big]$.
To maximize $\mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big]$, we compute its derivative with respect to $\theta$.
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] &= \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[s_{\theta}(w,h) - \log Z_\theta^h\big] \\
&= \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[s_{\theta}(w,h)\big] - \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log Z_\theta^h\big] \\
&= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big] - \frac{\partial}{\partial \theta} \log Z_\theta^h \\
\end{aligned}
$$
Note that in the last step above, $\log Z_\theta^h$ does not depend on $w$, so its expectation under $P_d^h(w)$ is simply itself. Let’s see what $\frac{\partial}{\partial \theta} \log Z_\theta^h$ actually is.
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log Z_\theta^h &= \frac{1}{Z_\theta^h} \frac{\partial}{\partial \theta} Z_\theta^h \\
&= \frac{1}{Z_\theta^h} \frac{\partial}{\partial \theta} \int\limits_{w} \exp(s_{\theta}(w,h)) dw \\
&= \frac{1}{Z_\theta^h} \int\limits_{w} \frac{\partial}{\partial \theta} \exp(s_{\theta}(w,h)) dw \\
&= \frac{1}{Z_\theta^h} \int\limits_{w} \exp(s_{\theta}(w,h)) \frac{\partial}{\partial \theta} s_{\theta}(w,h) dw \\
&= \int\limits_{w} \frac{\exp(s_{\theta}(w,h))}{Z_\theta^h} \frac{\partial}{\partial \theta} s_{\theta}(w,h) dw \\
&= \int\limits_{w} P_{\theta}^h (w) \frac{\partial}{\partial \theta} s_{\theta}(w,h) dw \\
&= \mathbb{E}_{w \sim P_{\theta}^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big]
\end{aligned}
$$
Therefore,
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] &= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big] - \mathbb{E}_{w \sim P_{\theta}^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big] \\
\end{aligned}
$$
Alternatively, we could write
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log P_{\theta}^{h}(w)\big] &= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big] - \mathbb{E}_{w \sim P_{\theta}^h(w)}\big[ \frac{\partial}{\partial \theta} s_{\theta}(w,h) \big] \\
&= \sum\limits_{w} P_{d}^{h}(w) \frac{\partial}{\partial \theta} s_{\theta}(w,h) - \sum\limits_{w} P_{\theta}^{h}(w) \frac{\partial}{\partial \theta} s_{\theta}(w,h) \\
&= \sum\limits_{w} \big( P_{d}^{h}(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} s_{\theta}(w,h) \\
&= \sum\limits_{w} \big( P_{d}^{h}(w) - P_{\theta}^{h}(w) \big) \frac{\partial}{\partial \theta} \log u_{\theta}(w,h)
\end{aligned}
$$
These are the maximum likelihood estimation (MLE) gradient expressions for models using the full softmax, which requires computing the exact $Z_\theta^h$.
Noise Contrastive Estimation
Noise contrastive estimation reduces the complexity of optimization by replacing the multi-class classification problem with a binary classification problem, using samples from a noise distribution.
Concretely, we introduce a noise distribution $P_n(w)$. This noise distribution could be context-dependent or context-independent. For simplicity, we assume $P_n(w)$ is context-independent.
We also make samples from the noise distribution $P_n(w)$ $k$ times more frequent than samples from the dataset distribution $P_d^h(w)$. We denote $D=1$ when the sample was drawn from the dataset distribution, and $D=0$ when the sample was drawn from the noise distribution. We have
$$
\begin{aligned}
P^h(D=1,w) = \frac{1}{k+1} P_d^h(w) \\
P^h(D=0,w) = \frac{k}{k+1} P_n(w)
\end{aligned}
$$
$$
\begin{aligned}
p^h(w) &= P^h(D=1,w) + P^h(D=0,w) \\
&= \frac{1}{k+1}P_d^h(w) + \frac{k}{k+1}P_n(w) \\
&= \frac{1}{k+1}\big[ P_d^h(w) + kP_n(w) \big]
\end{aligned}
$$
The posterior probability that a sample $w$ came from the dataset distribution is
$$
\begin{aligned}
p^h(D=1|w) &= \frac{P^h(D=1,w)}{p^h(w)} \\
&= \frac{P_d^h(w)}{P_d^h(w) + kP_n(w)} \\
\end{aligned}
$$
Similarly,
$$
\begin{aligned}
p^h(D=0|w) &= \frac{P^h(D=0,w)}{p^h(w)} \\
&= \frac{kP_n(w)}{P_d^h(w) + kP_n(w)} \\
\end{aligned}
$$
We want to train a model with parameters $\theta$ such that, given context $h$, its predicted softmax probability $P_{\theta}^h(w)$ approximates the dataset distribution $P_d^h(w)$. Therefore, we have
$$
\begin{aligned}
p^h(D=1|w, \theta) = \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \\
p^h(D=0|w, \theta) = \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \\
\end{aligned}
$$
Because
$$
\begin{aligned}
P_{\theta}^h(w) &= \frac{u_{\theta}(w,h)}{Z_\theta^h}\\
&= \frac{\exp(s_{\theta}(w,h))}{Z_\theta^h}
\end{aligned}
$$
In noise contrastive estimation, we do not compute the expensive normalizer $Z_{\theta}^h$. Instead, we introduce a learnable parameter $c^h$ for each context $h$, and $c^h$ provides the information for computing the normalizer. Concretely,
$$
P_{\theta}^h(w) = P_{\theta^0}^h(w) \exp(c^h)
$$
where $P_{\theta^0}^h(w)$ is the unnormalized score for word $w$ in the model given context $h$, $\exp(c^h)$ is the normalizer, and $c^h$ does not depend on $\theta^0$. We denote $\theta = \{\theta^0, c^h\}$.
$P_{\theta^0}^h(w)$ is defined as follows.
$$
\begin{aligned}
P_{\theta^0}^h(w) &= \exp(s_{\theta^0}(w,h)) \\
&= u_{\theta^0}(w,h)
\end{aligned}
$$
Note that each distinct context $h$ needs its own $c^h$. When the number of distinct contexts is large, the model will contain a large number of such parameters.
We define $d$ as the source label of the target word $w$: $d = 1$ if $w$ was drawn from the dataset distribution and $d = 0$ if it was drawn from the noise distribution. Our training objective is to maximize the expected log posterior probability of the correct source label under the mixture distribution, i.e., $\mathbb{E}_{w \sim P^h(w)}\big[\log P^{h}(d | w, \theta)\big]$.
We elaborate on this expected value.
$$
\begin{aligned}
\mathbb{E}_{w \sim P^h(w)}\big[\log P^{h}(d | w, \theta)\big] &= \int\limits_{w} p^h(w) \log P^{h}(d | w, \theta) dw \\
&= \int\limits_{w} \frac{1}{k+1}\big[ P_d^h(w) + kP_n(w) \big] \log P^{h}(d | w, \theta) dw \\
&= \frac{1}{k+1} \big[ \int\limits_{w} P_d^h(w) \log P^{h}(d | w, \theta) dw + k \int\limits_{w} P_n(w) \log P^{h}(d | w, \theta) dw \big] \\
&= \frac{1}{k+1} \Big\{ \mathbb{E}_{w \sim P_d^h(w)}\big[\log P^{h}(d | w, \theta)\big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log P^{h}(d | w, \theta)\big] \Big\} \\
\end{aligned}
$$
Because all $w$ from $P_d^h(w)$ have source label $d=1$, and all $w$ from $P_n(w)$ have source label $d=0$, we further have
$$
\begin{aligned}
\mathbb{E}_{w \sim P^h(w)}\big[\log P^{h}(d | w, \theta)\big] &= \frac{1}{k+1} \Big\{ \mathbb{E}_{w \sim P_d^h(w)}\big[\log P^{h}(d | w, \theta)\big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log P^{h}(d | w, \theta)\big] \Big\} \\
&= \frac{1}{k+1} \Big\{ \mathbb{E}_{w \sim P_d^h(w)}\big[\log P^{h}(D=1 | w, \theta)\big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log P^{h}(D=0 | w, \theta)\big] \Big\} \\
&= \frac{1}{k+1} \Big\{ \mathbb{E}_{w \sim P_d^h(w)}\big[\log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big] \Big\} \\
\end{aligned}
$$
We define the objective function $J^{h}(\theta)$ for the context $h$.
$$
J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\big[\log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big]
$$
To compute the derivatives of $J^{h}(\theta)$ with respect to $\theta$, we have
$$
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_d^h(w)}\big[\log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \big] + k \frac{\partial}{\partial \theta} \mathbb{E}_{w \sim P_n(w)}\big[\log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big] \\
&= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{\partial}{\partial \theta} \log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \big] + k \mathbb{E}_{w \sim P_n(w)}\big[ \frac{\partial}{\partial \theta} \log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big]
\end{aligned}
$$
We will compute the derivatives for both of the terms.
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} &= - \frac{\partial}{\partial \theta} \log \big(1 + \frac{kP_n(w)}{P_{\theta}^h(w)} \big) \\
&= - \frac{1}{1 + \frac{kP_n(w)}{P_{\theta}^h(w)}} \big[ 0 - \frac{kP_n(w)}{[ P_{\theta}^h(w) ]^2 } \frac{\partial}{\partial \theta} P_{\theta}^h(w) \big] \\
&= \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{kP_n(w)}{[ P_{\theta}^h(w) ]^2} \frac{\partial}{\partial \theta} P_{\theta}^h(w) \\
&= \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{1}{P_{\theta}^h(w)} \frac{\partial}{\partial \theta} P_{\theta}^h(w) \\
&= \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} &= - \frac{\partial}{\partial \theta} \log \big(1 + \frac{P_{\theta}^h(w)}{kP_n(w)} \big) \\
&= - \frac{1}{1 + \frac{P_{\theta}^h(w)}{kP_n(w)}} \big[ 0 + \frac{1}{kP_n(w)} \frac{\partial}{\partial \theta} P_{\theta}^h(w) \big] \\
&= - \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{1}{kP_n(w)} \frac{\partial}{\partial \theta} P_{\theta}^h(w) \\
&= - \frac{1}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} P_{\theta}^h(w) \\
&= - \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{1}{P_{\theta}^h(w)} \frac{\partial}{\partial \theta} P_{\theta}^h(w) \\
&= - \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
\end{aligned}
$$
Therefore,
$$
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{\partial}{\partial \theta} \log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \big] + k \mathbb{E}_{w \sim P_n(w)}\big[ \frac{\partial}{\partial \theta} \log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big] \\
&= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \big] - k \mathbb{E}_{w \sim P_n(w)}\big[ \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \big] \\
\end{aligned}
$$
We expand the expected values using their definitions.
$$
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \mathbb{E}_{w \sim P_d^h(w)}\big[ \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \big] - k \mathbb{E}_{w \sim P_n(w)}\big[ \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \big] \\
&= \sum \limits_{w} P_d^h(w) \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) - k\sum \limits_{w} P_n(w) \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
&= \sum \limits_{w} \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \log P_{\theta}^h(w) \\
\end{aligned}
$$
Remember we have defined
$$
P_{\theta}^h(w) = P_{\theta^0}^h(w) \exp(c^h)
$$
and
$$
P_{\theta^0}^h(w) = u_{\theta^0}(w,h)
$$
We could further have
$$
\begin{aligned}
\frac{\partial}{\partial \theta} J^{h}(\theta) &= \sum \limits_{w} \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta} \big[ \log P_{\theta^0}^h(w) + c^h \big] \\
\end{aligned}
$$
We want to compute the derivative with respect to $\theta^0$.
$$
\begin{aligned}
\frac{\partial}{\partial \theta^0} J^{h}(\theta) &= \sum \limits_{w} \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta^0} \big[ \log P_{\theta^0}^h(w) + c^h \big] \\
&= \sum \limits_{w} \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta^0} \log P_{\theta^0}^h(w)\\
\end{aligned}
$$
When $k \rightarrow \infty$,
$$
\begin{aligned}
\frac{\partial}{\partial \theta^0} J^{h}(\theta) &\rightarrow \sum \limits_{w} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta^0} \log P_{\theta^0}^h(w) \\
&= \sum \limits_{w} \big( P_d^h(w) - P_{\theta}^h(w) \big) \frac{\partial}{\partial \theta^0} \log u_{\theta^0}(w,h)
\end{aligned}
$$
This derivative looks familiar, doesn’t it?
Assuming $\exp(c^h)$ does learn the normalizer, the NCE gradient approaches the MLE gradient exactly as the noise-to-data ratio $k$ increases.
However, $\exp(c^h)$ is not guaranteed to learn the normalizer, and $P_{\theta}^h(w) = P_{\theta^0}^h(w) \exp(c^h)$ is usually not a probability distribution after training. But because the objective function $J^{h}(\theta)$ encourages the model to produce high scores $P_{\theta}^h(w)$ for words from the dataset distribution and low scores $P_{\theta}^h(w)$ for words from the noise distribution, the model will still work well.
Noise Contrastive Estimation In Practice
Remember we have defined the objective function $J^{h}(\theta)$ for the context $h$.
$$
J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\big[\log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)} \big]
$$
In practice, to estimate the objective function $J^{h}(\theta)$, we have $m$ target words $w_i$ corresponding to context $h$ sampled from the dataset, and $n$ words $w_j$ sampled from the noise distribution.
$$
{\widehat{J^{h}}}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \log \frac{P_{\theta}^h(w_i)}{P_{\theta}^h(w_i) + kP_n(w_i)} + \frac{k}{n} \sum_{j=1}^{n} \log \frac{kP_n(w_j)}{P_{\theta}^h(w_j) + kP_n(w_j)}
$$
In the original paper “A Fast and Simple Algorithm for Training Neural Probabilistic Language Models”, the authors suggested using one word $w_0$ from the dataset and $k$ words $w_1, w_2, \cdots, w_k$ from the noise distribution, i.e., $m = 1$ and $n = k$.
$$
{\widehat{J^{h}}}(\theta) = \log \frac{P_{\theta}^h(w_0)}{P_{\theta}^h(w_0) + kP_n(w_0)} + \sum_{j=1}^{k} \log \frac{kP_n(w_j)}{P_{\theta}^h(w_j) + kP_n(w_j)}
$$
Unfortunately, almost everyone misinterprets this as meaning that $m$ has to be $1$ and $n$ has to be $k$. This is a significant misunderstanding, and I will elaborate on it in the next section.
As I mentioned previously, if the number of distinct contexts is large, the number of learned normalizer parameters $c^h$ will also be large. This is not favorable for complex problems. The authors found that by setting $\exp(c^h) = 1$, the model still works very well, probably because the model learns a “self-normalized” distribution. So in practice, we set $\exp(c^h) = 1$, i.e., there are no learned parameters for the normalizer, and $P_{\theta}^h(w)$ = $P_{\theta^0}^h(w)$ = $u_{\theta^0}(w,h)$ = $\exp(s_{\theta^0}(w,h))$. Note that because there is no $c^h$ anymore, in the model, $\theta = \{\theta^0\}$.
$$
\begin{aligned}
{\widehat{J^{h}}}(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \log \frac{P_{\theta}^h(w_i)}{P_{\theta}^h(w_i) + kP_n(w_i)} + \frac{k}{n} \sum_{j=1}^{n} \log \frac{kP_n(w_j)}{P_{\theta}^h(w_j) + kP_n(w_j)} \\
&= \frac{1}{m} \sum_{i=1}^{m} \log \frac{\exp(s_{\theta^0}(w_i,h))}{\exp(s_{\theta^0}(w_i,h)) + kP_n(w_i)} + \frac{k}{n} \sum_{j=1}^{n} \log \frac{kP_n(w_j)}{\exp(s_{\theta^0}(w_j,h)) + kP_n(w_j)}
\end{aligned}
$$
This looks great. But we could take a step back and further transform the objective function $J^{h}(\theta)$ into something we are familiar with.
Because
$$
\begin{aligned}
p^h(D=1|w, \theta) &= \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \\
&= \frac{1}{1 + \frac{kP_n(w)}{P_{\theta}^h(w)}} \\
&= \frac{1}{1 + \exp(\log \frac{kP_n(w)}{P_{\theta}^h(w)} )} \\
&= \frac{1}{1 + \exp(\log kP_n(w) - \log P_{\theta}^h(w) )} \\
&= \frac{1}{1 + \exp(- (\log P_{\theta}^h(w) - \log kP_n(w)))} \\
\end{aligned}
$$
We notice that this is the sigmoid function.
$$
p^h(D=1|w, \theta) = \sigma(\log P_{\theta}^h(w) - \log kP_n(w))
$$
With the settings of $\exp(c^h) = 1$, we further have
$$
\begin{aligned}
p^h(D=1|w, \theta) &= \sigma(\log P_{\theta}^h(w) - \log kP_n(w)) \\
&= \sigma(\log \exp(s_{\theta^0}(w,h)) - \log kP_n(w)) \\
&= \sigma(s_{\theta^0}(w,h) - \log kP_n(w)) \\
&= \sigma(\Delta s_{\theta^0}(w,h))
\end{aligned}
$$
$$
\begin{aligned}
p^h(D=0|w, \theta) &= 1 - \sigma(\Delta s_{\theta^0}(w,h))
\end{aligned}
$$
where $\Delta s_{\theta^0}(w,h) = s_{\theta^0}(w,h) - \log kP_n(w)$.
$J^{h}(\theta)$ and ${\widehat{J^{h}}}(\theta)$ now become
$$
J^{h}(\theta) = \mathbb{E}_{w \sim P_d^h(w)}\big[\log \sigma(\Delta s_{\theta^0}(w,h)) \big] + k \mathbb{E}_{w \sim P_n(w)}\big[\log (1 - \sigma(\Delta s_{\theta^0}(w,h))) \big]
$$
$$
\begin{aligned}
{\widehat{J^{h}}}(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \log \sigma(\Delta s_{\theta^0}(w_i,h)) + \frac{k}{n} \sum_{j=1}^{n} \log (1 - \sigma(\Delta s_{\theta^0}(w_j,h)))
\end{aligned}
$$
Now the two $\log$ terms in ${\widehat{J^{h}}}(\theta)$ can be calculated using the log loss!
It should be noted that the above objective functions are specific to context $h$. The global objective function is defined as follows, using a concept similar to a “batch” in machine learning.
$$
\begin{aligned}
J(\theta) = \sum\limits_{h} P(h) J^{h}(\theta)\\
{\widehat{J}}(\theta) = \frac{1}{b} \sum_{i=1}^{b} {\widehat{J^{h_i}}}(\theta)
\end{aligned}
$$
To summarize how to do NCE in practice: for each context and word pair $(h, w)$ in the dataset (so $m = 1$ in this case), we sample $n$ words from the noise distribution. We denote the target word as $w_0$ and set its label to 1, and denote the words from noise as { $w_1, w_2, \cdots, w_n$ } and set their labels to 0. We compute $\Delta s_{\theta^0}(w,h)$ for each of the $1+n$ words, which is the logit of the word minus $\log kP_n(w)$, the log of the expected count of that word among $k$ samples from the noise distribution. Then we compute the log loss using the computed $\Delta s_{\theta^0}(w,h)$ and the corresponding labels. Finally, we put the pieces together using the recipe we derived above.
$$
\begin{aligned}
{\widehat{J^{h}}}(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \log \sigma(\Delta s_{\theta^0}(w_i,h)) + \frac{k}{n} \sum_{j=1}^{n} \log (1 - \sigma(\Delta s_{\theta^0}(w_j,h)))
\end{aligned}
$$
To update the model parameters, simply run backpropagation.
Misunderstandings of Noise Contrastive Estimation
To the best of my knowledge, all the blog posts on NCE I have seen, and even the implementation of the NCE loss in TensorFlow, fail to distinguish the noise-to-data ratio $k$ from the number of samples $n$ drawn from the noise distribution. From the TensorFlow implementation, I can tell the author implicitly treats $n$ and $k$ as the same thing: they simply add the binary log loss for the data sample and the binary log losses for the $n$ noise samples, and serve the sum as the NCE loss, without even providing a variable for the noise-to-data ratio $k$. It still works well when $n$ is large, because then $k$ is also large. More rigorously, though, we should distinguish these two parameters. We use a large $k$ because we want NCE optimization to behave like MLE optimization. We use a large $n$ because we want a better estimate of the loss contribution from the noise distribution.
FAQs
Does Noise Contrastive Estimation Accelerate Model Inference?
It does not. During inference, the target word distribution is unknown. To predict the target word, we would need to iterate over all the words in the vocabulary and ask whether each word is from the dataset distribution or the noise distribution.
To sample the target word, heuristically, we can compute the scores for all the words in the vocabulary, normalize them into a probability distribution, and then sample from it. But note that this is an ad hoc probability distribution that never gets learned directly during the training process.
Why We Can’t Learn a Normalizer for Maximum Likelihood Estimation?
Noise contrastive estimation turns a multi-class classification problem into multiple binary classification problems.
In fact, noise contrastive estimation does not learn the normalizer. Instead, it just learns a scale factor on the scores used for computing the binary classification probability. The scale factor is not critical, and its effect becomes negligible when the noise-to-data ratio $k$ becomes large. That’s why one can even set the scale factor $\exp(c^h)$ to 1 during the training process. Considering the binary classification log loss below, there is no need to learn the normalizer.
$$
\begin{aligned}
\log \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)}
&= \log \frac{P_{\theta^0}^h(w) \exp(c^h)}{P_{\theta^0}^h(w) \exp(c^h) + kP_n(w)} \\
&= \log \frac{P_{\theta^0}^h(w)}{P_{\theta^0}^h(w) + \frac{k}{\exp(c^h)} P_n(w)}
\end{aligned}
$$
$$
\begin{aligned}
\log \frac{kP_n(w)}{P_{\theta}^h(w) + kP_n(w)}
&= \log \left( 1 - \frac{P_{\theta}^h(w)}{P_{\theta}^h(w) + kP_n(w)} \right) \\
&= \log \left( 1 - \frac{P_{\theta^0}^h(w)}{P_{\theta^0}^h(w) + \frac{k}{\exp(c^h)} P_n(w)} \right) \\
\end{aligned}
$$
The normalizer in maximum likelihood estimation is more complicated to compute, and we cannot get away with computing the score of only one target word. At least, I am not aware of any method to simplify the computation of the normalizer in maximum likelihood estimation during neural network training.