Maximum Likelihood Estimation of N-Gram Model Parameters
Introduction
A language model is a probability distribution over sequences of words, namely:
$$p(w_1, w_2, w_3, …, w_n)$$
According to the chain rule,
$$
p(w_1, w_2, w_3, …, w_n) = p(w_1)p(w_2|w_1)p(w_3|w_2,w_1)…p(w_n|w_{n-1},w_{n-2},…,w_1)
$$
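For example, for the three-word sentence "the cat sat", the chain rule gives:
$$
p(\text{the}, \text{cat}, \text{sat}) = p(\text{the})\, p(\text{cat} \mid \text{the})\, p(\text{sat} \mid \text{cat}, \text{the})
$$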
However, the parameters of this language model are $p(w_1)$, $p(w_2|w_1)$, …, $p(w_n|w_{n-1},…,w_1)$, which are usually too computationally expensive to estimate, especially for conditional probabilities with many conditioning words, even on a small dataset.
To approximate $p(w_1, w_2, w_3, …, w_n)$, we can use an N-Gram model, which conditions each word on only the preceding $N-1$ words:
N-Gram Model
$$
p(w_1, w_2, w_3, …, w_n) \approx p(w_1)p(w_2|w_1)p(w_3|w_2,w_1)…p(w_n|w_{n-1},w_{n-2},…,w_{n-N+1})
$$
In particular, the unigram, bigram, and trigram models are the ones most commonly used in language modeling.
Unigram Model
$$
p(w_1, w_2, w_3, …, w_n) \approx p(w_1)p(w_2)p(w_3)…p(w_n)
$$
Bigram Model
$$
p(w_1, w_2, w_3, …, w_n) \approx p(w_1)p(w_2|w_1)p(w_3|w_2)…p(w_n|w_{n-1})
$$
Trigram Model
$$
p(w_1, w_2, w_3, …, w_n) \approx p(w_1)p(w_2|w_1)p(w_3|w_2,w_1)…p(w_n|w_{n-1},w_{n-2})
$$
With the N-Gram model approximations, calculating $p(w_n|w_{n-1},w_{n-2},…,w_{n-N+1})$ is usually not too computationally expensive.
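As a minimal sketch of how such an approximation is used in practice, the Python snippet below scores a short sentence under a bigram model; the sentence and the probability tables are made-up toy values, not estimates from any real corpus.

```python
# Score a sentence under a bigram approximation:
# p(w_1, ..., w_n) ~= p(w_1) * prod_i p(w_i | w_{i-1})
# The probability tables below are made-up toy values for illustration.

unigram_probs = {"the": 0.4, "cat": 0.3, "sat": 0.3}
bigram_probs = {
    ("the", "cat"): 0.5,
    ("cat", "sat"): 0.4,
}

def bigram_sentence_prob(sentence):
    """Return the bigram-approximated probability of a list of words."""
    prob = unigram_probs[sentence[0]]          # p(w_1)
    for prev, curr in zip(sentence, sentence[1:]):
        prob *= bigram_probs[(prev, curr)]     # p(w_i | w_{i-1})
    return prob

print(bigram_sentence_prob(["the", "cat", "sat"]))  # 0.4 * 0.5 * 0.4 = 0.08
```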
Maximum Likelihood Estimation of N-Gram Model Parameters
To estimate $p(w_n|w_{n-1},w_{n-2},…,w_{n-N+1})$, an intuitive approach is maximum likelihood estimation (MLE).
Maximum likelihood estimation chooses the model parameters such that the probability of the observed data is maximized.
In our case, the parameters are $p(w_n|w_{n-1},w_{n-2},…,w_{n-N+1})$, and the probability we maximize is $p(w_1)p(w_2|w_1)p(w_3|w_2,w_1)…p(w_n|w_{n-1},w_{n-2},…,w_{n-N+1})$.
In practice, we simply count the occurrences of word patterns to obtain the maximum likelihood estimate of $p(w_n|w_{n-1},w_{n-2},…,w_{n-N+1})$.
Unigram Model
$$p(w_i) = \frac{c(w_i)}{\sum_{w}^{} c(w)}$$
Bigram Model
$$p(w_i|w_{i-1}) = \frac{c(w_{i-1},w_i)}{\sum_{w}^{} c(w_{i-1},w)}$$
Trigram Model
$$p(w_i|w_{i-1},w_{i-2}) = \frac{c(w_{i-2},w_{i-1},w_i)}{\sum_{w}^{} c(w_{i-2},w_{i-1},w)}$$
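To make the counting recipe concrete, here is a minimal sketch that computes the unigram and bigram maximum likelihood estimates from raw counts (trigrams work the same way with one more context word); the two-sentence corpus is made up purely for illustration.

```python
from collections import Counter

# Toy corpus, made up purely for illustration.
corpus = [
    ["i", "like", "green", "eggs"],
    ["i", "like", "ham"],
]

# Count unigrams and bigrams in the corpus.
unigram_counts = Counter(w for sentence in corpus for w in sentence)
bigram_counts = Counter(
    (prev, curr)
    for sentence in corpus
    for prev, curr in zip(sentence, sentence[1:])
)

# Unigram MLE: p(w_i) = c(w_i) / sum_w c(w)
total_words = sum(unigram_counts.values())
unigram_mle = {w: c / total_words for w, c in unigram_counts.items()}

# Bigram MLE: p(w_i | w_{i-1}) = c(w_{i-1}, w_i) / sum_w c(w_{i-1}, w)
context_counts = Counter()
for (prev, _), c in bigram_counts.items():
    context_counts[prev] += c
bigram_mle = {
    (prev, curr): c / context_counts[prev]
    for (prev, curr), c in bigram_counts.items()
}

print(unigram_mle["i"])             # 2/7
print(bigram_mle[("like", "ham")])  # 1/2
```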
Now the question becomes: why are these formulas the maximum likelihood estimates? Most books and online tutorials only give these formulas without showing a formal mathematical proof.
Here I am going to rigorously show that these are indeed the maximum likelihood estimates.
Mathematical Derivation of Maximum Likelihood Estimation of N-Gram Model Parameters
Unigram Model
Let us warm up with the unigram model.
We have a vocabulary of $n$ unique words, $w_1, w_2, …, w_n$.
For any given sequence of words $\mathbf{w}$ of length $N$ (for example, $\mathbf{w} = (w_1, w_2, w_1, w_5, w_7, w_2)$), under the unigram model we have
$$
\begin{aligned}
p(\mathbf{w})
& = p(w_1)^{c(w_1)} p(w_2)^{c(w_2)} p(w_3)^{c(w_3)}…p(w_n)^{c(w_n)}\\
& = \prod_{i=1}^{n}p(w_i)^{c(w_i)}
\end{aligned}
$$
where $c(w_i)$ is the count of word $w_i$ in the sequence.
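For the example sequence $\mathbf{w} = (w_1, w_2, w_1, w_5, w_7, w_2)$ above, the counts are $c(w_1) = 2$, $c(w_2) = 2$, $c(w_5) = 1$, $c(w_7) = 1$, so its unigram likelihood is
$$
p(\mathbf{w}) = p(w_1)^{2}\, p(w_2)^{2}\, p(w_5)\, p(w_7)
$$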
Taking the log of $p(\mathbf{w})$, we have:
$$
\begin{aligned}
\log{p(\mathbf{w})}
& = c(w_1)\log{p(w_1)} + c(w_2)\log{p(w_2)} + c(w_3)\log{p(w_3)} + … + c(w_n)\log{p(w_n)}\\
& = \sum_{i=1}^{n}c(w_i)\log{p(w_i)}
\end{aligned}
$$
To maximize $p(\mathbf{w})$, we can equivalently solve the following optimization problem:
Maximize $\log{p(\mathbf{w})}$, subject to $\sum_{i = 1}^{n} p(w_i) = 1$.
We introduce the auxiliary optimization function (the Lagrangian) using a Lagrange multiplier $\lambda$ for the constraint $\sum_{i=1}^{n}p(w_i)-1 = 0$:
$$
\mathcal{L} = \sum_{i=1}^{n}c(w_i)\log{p(w_i)} + \lambda(\sum_{i=1}^{n}p(w_i)-1)
$$
For any $p(w_j)$, we take the derivative of $\mathcal{L}$ with respect to $p(w_j)$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial p(w_j)} = \frac{c(w_j)}{p(w_j)} + \lambda = 0$$
$$p(w_j) = -\frac{c(w_j)}{\lambda}$$
Because $\sum_{i=1}^{n}p(w_i) = 1$, we have:
$$\sum_{i=1}^{n}p(w_i) = \sum_{i=1}^{n} -\frac{c(w_i)}{\lambda} = \frac{\sum_{i=1}^{n} c(w_i)}{-\lambda} = 1$$
$$\lambda = - \sum_{i=1}^{n} c(w_i)$$
Substituting $\lambda$ back into $p(w_j) = -c(w_j)/{\lambda}$, we obtain
$$p(w_j) = \frac{c(w_j)}{\sum_{i=1}^{n} c(w_i)}$$
This concludes the proof.
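As a quick numerical sanity check (a sketch, not part of the proof), the snippet below compares the log-likelihood of the count-ratio estimate against many randomly sampled distributions for a made-up set of counts; the count-ratio estimate is never beaten.

```python
import math
import random

# Made-up unigram counts c(w_i) for a three-word vocabulary.
counts = [5, 3, 2]
total = sum(counts)

def log_likelihood(probs):
    """sum_i c(w_i) * log p(w_i)"""
    return sum(c * math.log(p) for c, p in zip(counts, probs))

# The claimed MLE: p(w_i) = c(w_i) / sum_j c(w_j).
mle = [c / total for c in counts]
best = log_likelihood(mle)

random.seed(0)
for _ in range(10_000):
    # Sample a random distribution over the same vocabulary.
    raw = [random.random() + 1e-9 for _ in counts]
    probs = [x / sum(raw) for x in raw]
    assert log_likelihood(probs) <= best + 1e-12

print("Count ratios beat every sampled distribution:", mle)
```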
Bigram Model
Now let us move on to the bigram model to see what is different.
As before, we have a vocabulary of $n$ unique words, $w_1, w_2, …, w_n$.
For the conditional probabilities $p(w_j|w_i)$, there are $n \times n$ possibilities.
For any given sequence of words $\mathbf{w}$ of length $N$ (for example, $\mathbf{w} = (w_1, w_2, w_1, w_5, w_7, w_2)$), under the bigram model we have
$$
\begin{aligned}
p(\mathbf{w})
& = \prod_{i=1}^{n} p(w_i)^{s(w_i)} \prod_{i=1}^{n} \prod_{j=1}^{n} p(w_j|w_i)^{c(w_i, w_j)}
\end{aligned}
$$
where $c(w_i, w_j)$ is the count of the bigram $(w_i, w_j)$ in the sequence and
$$
s(w_i) = \begin{cases}
1, & \text{if $w_i$ is the first word}\\
0, & \text{otherwise}
\end{cases}
$$
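For the example sequence $\mathbf{w} = (w_1, w_2, w_1, w_5, w_7, w_2)$, only $s(w_1) = 1$, and the non-zero bigram counts are $c(w_1, w_2) = c(w_2, w_1) = c(w_1, w_5) = c(w_5, w_7) = c(w_7, w_2) = 1$, so
$$
p(\mathbf{w}) = p(w_1)\, p(w_2|w_1)\, p(w_1|w_2)\, p(w_5|w_1)\, p(w_7|w_5)\, p(w_2|w_7)
$$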
Taking the log of $p(\mathbf{w})$, we have:
$$
\begin{aligned}
\log{p(\mathbf{w})}
& = \sum_{i=1}^{n}s(w_i)\log{p(w_i)} + \sum_{i=1}^{n}\sum_{j=1}^{n}c(w_i,w_j)\log{p(w_j|w_i)}
\end{aligned}
$$
To maximize $p(\mathbf{w})$, we can equivalently solve the following optimization problem:
Maximize $\log{p(\mathbf{w})}$, subject to $\forall i \in \{1, \dotsc, n\}$, $\sum_{j = 1}^{n} p(w_j|w_i) = 1$.
We introduce the auxiliary optimization function (the Lagrangian) using one Lagrange multiplier $\lambda_i$ per constraint $\sum_{j = 1}^{n} p(w_j|w_i)-1 = 0$:
$$
\mathcal{L} = \sum_{i=1}^{n}s(w_i)\log{p(w_i)} + \sum_{i=1}^{n}\sum_{j=1}^{n}c(w_i,w_j)\log{p(w_j|w_i)} + \sum_{i=1}^{n} \lambda_i \bigg( \big(\sum_{j = 1}^{n} p(w_j|w_i) \big) - 1 \bigg)
$$
For any $p(w_k|w_i)$, we take the derivative of $\mathcal{L}$ with respect to $p(w_k|w_i)$ and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial p(w_k|w_i)} = \frac{c(w_i, w_k)}{p(w_k|w_i)} + \lambda_i = 0$$
$$p(w_k|w_i) = -\frac{c(w_i, w_k)}{\lambda_i}$$
Because $\sum_{j = 1}^{n} p(w_j|w_i) = 1$, we have:
$$\sum_{j = 1}^{n} p(w_j|w_i) = \sum_{j = 1}^{n} -\frac{c(w_i, w_j)}{\lambda_i} = \frac{\sum_{j = 1}^{n} c(w_i, w_j)}{-\lambda_i} = 1$$
$$\lambda_i = -\sum_{j = 1}^{n} c(w_i, w_j)$$
Substituting $\lambda_i$ back into $p(w_k|w_i) = -c(w_i, w_k)/\lambda_i$, we obtain
$$p(w_k|w_i) = \frac{c(w_i, w_k)}{\sum_{j = 1}^{n} c(w_i, w_j)}$$
This concludes the proof.
N-Gram Model
Without loss of generality, the maximum likelihood estimates of the general N-Gram model parameters can be derived in exactly the same way.
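As a practical companion to this generalization, here is a minimal sketch of an MLE estimator for arbitrary $N$; the corpus, the function name ngram_mle, and the omission of smoothing and sentence-boundary markers are all simplifying assumptions for illustration.

```python
from collections import Counter

def ngram_mle(corpus, n):
    """MLE of p(w_i | w_{i-n+1}, ..., w_{i-1}) from raw n-gram counts.

    corpus: a list of sentences, each a list of word strings.
    Returns a dict mapping (context tuple, word) -> probability.
    A sketch only: no smoothing and no sentence-boundary markers.
    """
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        for i in range(n - 1, len(sentence)):
            context = tuple(sentence[i - n + 1 : i])
            ngram_counts[(context, sentence[i])] += 1
            context_counts[context] += 1
    return {
        (context, word): count / context_counts[context]
        for (context, word), count in ngram_counts.items()
    }

# Example with a made-up corpus and n = 3 (trigram model).
corpus = [["i", "like", "green", "eggs"], ["i", "like", "ham"]]
print(ngram_mle(corpus, 3)[(("i", "like"), "ham")])  # 1/2 = 0.5
```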
Conclusion
Mathematics is important for (statistical) machine learning.