### Introduction

A language model is a probability distribution over sequences of words, namely:

According to the chain rule,

However, the parameters for this language model are $p(w_1)$, $p(w_2|w_1)$, …, $p(w_n|w_{n-1},…,w_1)$, which are usually too computationally expensive to calculate especially for the conditional probability with many conditioning words, even with a small dataset.

To approximate $p(w_1, w_2, w_3, …, w_n)$, we could use N-Gram models to approximate the language model, namely:

#### N-Gram Model

In particular, we usually use unigram model, bigram model and trigram model in language modelings.

#### Unigram Model

#### Bigram Model

#### Trigram Model

With the N-Gram model approximations, calculating $p(w_n|w_{n-1},w_{n-2},…,w_{n-N})$ is usually not too computationally expensive.

### Maximum Likelihood Estimation of N-Gram Model Parameters

To estimate $p(w_n|w_{n-1},w_{n-2},…,w_{n-N})$, an intuitive way is to do maximum likelihood estimation (MLE).

Maximum likelihood estimation estimates the model parameters such that the probability is maximized.

In our case, the parameters are $p(w_n|w_{n-1},w_{n-2},…,w_{n-N})$, and the probability we maximizes is $p(w_1)p(w_2|w_1)p(w_3|w_2,w_1)…p(w_n|w_{n-1},w_{n-2},…,w_{n-N})$

In practice, we simply count the occurrence of word patterns to calculate the maximum likelihood estimation of $p(w_n|w_{n-1},w_{n-2},…,w_{n-N})$.

#### Unigram Model

#### Bigram Model

#### Trigram Model

Now the question becomes why these formulas are the maximum likelihood estimations. Most of the books and online tutorials only gives these formulas without showing the formal mathematical proof.

Here I am going to rigorously show that these are actually the formulas of maximum likelihood estimation.

### Mathematical Derivation of Maximum Likelihood Estimation of N-Gram Model Parameters

#### Unigram Model

Let us warm up with unigram model.

We have a collection of unique words, $w_1, w_2, …, w_n$.

For any given sequence of words $\mathbf{w}$ of length $N$ ($\mathbf{w} = (w_1, w_2, w_1, w_5, w_7, w_2)$ for example), we have

where $c(w_i)$ is the count of word $w_i$ in the sentence.

We take the log of $p(\mathbf{w})$, we then have:

To maximize $p(\mathbf{w})$, equivalently we have the following optimization problem:

Maximize $\log{p(\mathbf{w})}$, subject to $\forall i \in [1 \dotsc N]$, $\sum_{i = 1}^{n} p(w_i) = 1$.

Equivalently, we introduce auxilliary optimization function using Lagrange multiplier ($\sum_{i=1}^{n}p(w_i)-1 = 0$):

For any $p(w_j)$, we take the derivatives of $\mathcal{L}$ respective to $p(w_j)$:

Because $\sum_{i=1}^{n}p(w_i) = 1$, we have:

Because $p(w_j) = -c(w_j)/{\lambda}$, therefore

This concludes the proof.

#### Bigram Model

Now let us move on to bigram model to see what is different.

We have a collection of unique words, $w_1, w_2, …, w_n$.

For the conditional probabilities, we have $n \times n$ possibilities.

For any given sequence of words $\mathbf{w}$ of length $N$ ($\mathbf{w} = (w_1, w_2, w_1, w_5, w_7, w_2)$ for example), we have

where $c(w_i, w_j)$ is the count of word sequence $w_i, w_j$ in the sentence and

We take the log of $p(\mathbf{w})$, we then have:

To maximize $p(\mathbf{w})$, equivalently we have the following optimization problem:

Maximize $\log{p(\mathbf{w})}$, subject to $\forall i \in [1 \dotsc N]$, $\sum_{j = 1}^{n} p(w_j|w_i) = 1$.

Equivalently, we introduce auxilliary optimization function using Lagrange multiplier ($\sum_{j = 1}^{n} p(w_j|w_i)-1 = 0$):

For any $p(w_k|w_i)$, we take the derivatives of $\mathcal{L}$ respective to $p(w_k|w_i)$:

Because $\sum_{j = 1}^{n} p(w_j|w_i) = 1$, we have:

Because $p(w_k|w_i) = -c(w_i, w_k)/\lambda_i$, therefore

This concludes the proof.

#### N-Gram Model

Without generality, the maximum likelihood estimation of n-gram model parameters could also be proven in the same way.

### Conclusion

Mathematics is important for (statistical) machine learning.