### Introduction

BLEU is a standard algorithm for evaluating the machine translations against the human translations. At first I thought it should be very straightforward to use. However, it turns out that there are a lot of caveats.

In this blog post, I am going to show the BLEU algorithm in detail and talk about the caveats.

### English Translation Example

We will use the following examples to illustrate how to compute the BLEU scores.

#### Example 1

Chinese: 猫坐在垫子上

Reference 1: the cat is on the mat

Reference 2: there is a cat on the mat

Candidate: the cat the cat on the mat

#### Example 2

Chinese: 猫坐在垫子上

Reference 1: the cat is on the mat

Reference 2: there is a cat on the mat

Candidate: the the the the the the the the

### Precision

We count each of the ngram in the candidate sentence whether it has shown in any of the reference sentences, gather the total counts for each of the unique ngram, sum up the total counts for each of the unique ngram, and divided by the number of ngrams in the candidate sentence.

#### Example 1

We first compute the unigram precision for example 1. All the unigrams in the candidate sentences have shown in the reference sentences.

Unigram | Shown? |
---|---|

the | 1 |

cat | 1 |

the | 1 |

cat | 1 |

on | 1 |

the | 1 |

mat | 1 |

We then merge the unigram counts.

Unique Unigram | Count |
---|---|

the | 3 |

cat | 2 |

on | 1 |

mat | 1 |

The total number of counts for the unique unigrams in the candidate sentence is 7, and the total number of unigrams in the candidate sentence is 7. The unigram precision is 7/7 = 1.0 for example 1.

We then try to compute the bigram precision for example 1.

Bigram | Shown? |
---|---|

the cat | 1 |

cat the | 0 |

the cat | 1 |

cat on | 1 |

on the | 1 |

the mat | 1 |

We then merge the bigram counts.

Unique Bigram | Count |
---|---|

the cat | 2 |

cat the | 0 |

cat on | 1 |

on the | 1 |

the mat | 1 |

The total number of counts for the unique bigrams in the candidate sentence is 5, and the total number of bigrams in the candidate sentence is 6. The bigram precision is 5/6 = 0.833 for example 1.

#### Example 2

We first compute the unigram precision for example 2. All the unigrams in the candidate sentences have shown in the reference sentences.

Unigram | Shown? |
---|---|

the | 1 |

the | 1 |

the | 1 |

the | 1 |

the | 1 |

the | 1 |

the | 1 |

the | 1 |

We then merge the unigram counts.

Unique Unigram | Count |
---|---|

the | 8 |

The total number of counts for the unique unigrams in the candidate sentence is 8, and the total number of unigrams in the candidate sentence is 8. The unigram precision is 8/8 = 1.0 for example 2.

We then try to compute the bigram precision for example 2.

Bigram | Shown? |
---|---|

the the | 0 |

the the | 0 |

the the | 0 |

the the | 0 |

the the | 0 |

the the | 0 |

the the | 0 |

We then merge the bigram counts.

Unique Bigram | Count |
---|---|

the the | 0 |

The total number of counts for the unique bigrams in the candidate sentence is 0, and the total number of bigrams in the candidate sentence is 7. The bigram precision is 0/7 = 0 for example 2.

#### Drawbacks

We can see from example 1 and 2 that unigram precision is very easy to be over-confident about the quality of the machine translation. To overcome this, clipped count and modified precision were proposed.

### Modified Precision

For each unique ngram, we count its maximum frequency in each of the reference sentences. The minimum of this special count and the original count is called the clipped the count. That is to say, the clipped count is no greater than the original count. We then use this clipped count, in place of the original count, for computing the modified precision.

#### Example 1

Unique Unigram | Count | Clipped Count |
---|---|---|

the | 3 | 2 |

cat | 2 | 1 |

on | 1 | 1 |

mat | 1 | 1 |

The total number of clipped counts for the unique unigrams in the candidate sentence is 5, and the total number of unigrams in the candidate sentence is 7. The unigram modified precision is 5/7 = 0.714 for example 1.

Unique Bigram | Count | Clipped Count |
---|---|---|

the cat | 2 | 1 |

cat the | 0 | 0 |

cat on | 1 | 1 |

on the | 1 | 1 |

the mat | 1 | 1 |

The total number of clipped counts for the unique bigrams in the candidate sentence is 4, and the total number of unigrams in the candidate sentence is 6. The bigram modified precision is 4/6 = 0.667 for example 1.

#### Example 2

Unique Unigram | Count | Clipped Count |
---|---|---|

the | 8 | 2 |

The total number of clipped counts for the unique bigrams in the candidate sentence is 0, and the total number of unigrams in the candidate sentence is 8. The unigram modified precision is 2/8 = 0.25 for example 2.

Unique Bigram | Count | Clipped Count |
---|---|---|

the the | 0 | 0 |

The total number of clipped counts for the unique bigrams in the candidate sentence is 0, and the total number of bigrams in the candidate sentence is 7. The bigram precision is 0/7 = 0 for example 2.

#### Advantages

Compared to precision, we found that modified precision is a better metric, at least for unigrams.

### BLEU

#### Algorithm

BLEU is computed using a couple of ngram modified precisions. Specifically,

\[\text{BLEU} = \text{BP} \cdot \exp \bigg( \sum_{n=1}^{N} w_n \log p_n \bigg)\]where $p_n$ is the modified precision for $n$gram, the base of $\log$ is the natural base $e$, $w_n$ is weight between 0 and 1 for $\log p_n$ and $\sum_{n=1}^{N} w_n = 1$, and BP is the brevity penalty to penalize short machine translations.

\[\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp \big(1-\frac{r}{c}\big) & \text{if } c \leq r \end{cases}\]where $c$ is the number of unigrams (length) in all the candidate sentences, and $r$ is the best match lengths for each candidate sentence in the corpus. Here the best match length is the closest reference sentence length to the candidate sentences. For example, if there are three references with lengths 12, 14, and 17 words and the candidate translation is a terse 13 words, ideally the best match length could be either 12 or 14, but we arbitrary choose the shorter one which is 12.

Usually, the BLEU is evaluated on corpus where there are many candidate sentences translated from different source texts and each of them has several reference sentences. Then $c$ is the total number of unigrams (length) in all the candidate sentences, and $r$ is the sum of the best match lengths for each candidate sentence in the corpus.

It is not hard to find that BLEU is always a value between 0 and 1. It is because BP, $w_n$, and $p_n$ are always between 0 and 1, and

\[\begin{align} \exp \bigg( \sum_{n=1}^{N} w_n \log p_n \bigg) &= \prod_{n=1}^{N} \exp \big( w_n \log p_n \big) \\ &= \prod_{n=1}^{N} \Big[ \exp \big( \log p_n \big) \Big]^{w_n} \\ &= \prod_{n=1}^{N} {p_n}^{w_n} \\ &\in [0,1] \end{align}\]Usually, BLEU uses $N = 4$ and $w_n = \frac{1}{N}$.

#### Example 1

We have computed the modified precision for some of the ngrams. It is not hard to compute the others. Concretely, we have

\[p_1 = \frac{5}{7}\\ p_2 = \frac{4}{6}\\ p_3 = \frac{2}{5}\\ p_4 = \frac{1}{4}\] \[w_1 = w_2 = w_3 = w_4 = \frac{1}{4}\]Because the corpus only has one translation set and thus $c = 7$ and $r = 7$

\[\text{BP} = 1\]We plugin these values to the BLEU equation, the BLEU is

\[\text{BLEU} = 0.467\]We further compare the BLEU to the BLEU computed using NLTK.

```
>>> import nltk
>>> reference_1 = "the cat is on the mat".split()
>>> reference_2 = "there is a cat on the mat".split()
>>> candidate = "the cat the cat on the mat".split()
>>> bleu = nltk.translate.bleu_score.sentence_bleu(references=[reference_1, reference_2], hypothesis=candidate, weights=(0.25,0.25,0.25,0.25))
>>> print(bleu)
0.4671379777282001
```

The value of `bleu`

is 0.467 which is exactly matching to the BLEU we computed manually.

#### Example 2

Similarly,

\[p_1 = \frac{2}{8}\\ p_2 = \frac{0}{7}\\ p_3 = \frac{0}{6}\\ p_4 = \frac{0}{5}\] \[w_1 = w_2 = w_3 = w_4 = \frac{1}{4}\]Because the corpus only has one translation set and thus $c = 8$ and $r = 7$

\[\text{BP} = 1\]When we plugin these values to the BLEU equation, actually we would need to compute $\log 0$ which is not mathematically defined. We use a small number $10^{-100}$ instead of $0$ for $p_2$, $p_3$ and $p_4$. The BLEU is

\[\text{BLEU} = 0\]We further also compare the BLEU to the BLEU computed using NLTK.

```
>>> import nltk
>>> reference_1 = "the cat is on the mat".split()
>>> reference_2 = "there is a cat on the mat".split()
>>> candidate = "the the the the the the the the".split()
>>> bleu = nltk.translate.bleu_score.sentence_bleu(references=[reference_1, reference_2], hypothesis=candidate, weights=(0.25,0.25,0.25,0.25))
>>> print(bleu)
1.2882297539194154e-231
```

The value of `bleu`

is 0 which is exactly matching to the BLEU we computed manually.

Note that in the above two examples, due to the candidate sentence is long and we only have one translation in the corpus, thus $\text{BP} = 1$. In practice, there could be scenarios where $\text{BP} < 1$.

### Caveats

In some scenarios, BLEU does not score the translation very well, especially for those short translations with few reference sentences. For example,

Chinese: 你准备好了吗？

Reference 1: are you ready ?

Candidate: you are ready ?

```
>>> import nltk
>>> reference_1 = "are you ready ?".split()
>>> candidate = "you are ready ?".split()
>>> bleu = nltk.translate.bleu_score.sentence_bleu(references=[reference_1], hypothesis=candidate, weights=[0.25,0.25,0.25,0.25])
>>> print(bleu)
1.133422688662942e-154
```

This is actually a very good machine translation to me. However, the BLEU score is 0, which means that the machine translation is totally wrong.

In NLTK, you are allowed to provide smoothing functions. For example,

```
>>> import nltk
>>> reference_1 = "are you ready ?".split()
>>> candidate = "you are ready ?".split()
>>> bleu = nltk.translate.bleu_score.sentence_bleu(references=[reference_1], hypothesis=candidate, weights=[0.25,0.25,0.25,0.25], smoothing_function=nltk.translate.bleu_score.SmoothingFunction().method7)
>>> print(bleu)
0.4002926439114545
```

This time, the value of `bleu`

is 0.4, which is magically higher than the vanilla one we computed without using smoothing functions.

However, one should be always cautious about the smoothing function used in BLEU computation. At least we have to make sure that the BLEU scores we are comparing against are using no smoothing function or the exact same smoothing function.