### Introduction

The concept of entropy has been widely used in machine learning and deep learning. In this blog post, I will first talk about the concept of entropy in information theory and physics, then I will talk about how to use perplexity to measure the quality of language modeling in natural language processing.

### Shannon Entropy

#### Definition

Shannon entropy is defined as

\[H(p) = \mathbb{E}[I(X)] = \mathbb{E}[\log_b p]\]where $b$ is the base of the logarithm used.

For discrete case, suppose we have $n$ discrete states,

\[H(p) = - \sum_{i=1}^{n} p(x_i) \log_b p(x_i)\]where $p(x_i)$ is the probability of the system being in the state $i$, and $\sum_{i=1}^{n} p(x_i) = 1$.

#### Meanings in Information Theory

On Wikipedia, the definition of Shannon entropy is the average rate at which information is produced by a stochastic source of data. This sounds a little bit confusing to beginners. Let’s see how to understand it first.

Let’s assume one system could only be in four states, A, B, C, and D of equal probabilities. That is to say, $P(A) = P(B) = P(C) = P(D) = \frac{1}{4}$. When we have a sequence of the states from this system, say, ABCDDCBA, how do we use the minimum amount of the memory to store this sequence? This is equivalent to ask what is the encoding method that is mostly memory saving? Of course, once you encode the sequence and store the sequence on memory, you need to have some way to decode the sequence from memory and recover the information. We assume our memory only stores binary values 0 and 1, it is very intuitive to think of that if we encode A = 00, B = 01, C = 10, D = 11, the encoded sequence for ABCDDCBA in binary format will be 0001101111100100. We also have ways to decode this binary format to the original information, we read the binary sequence 2 bits a time, we have 00|01|10|11|11|10|01|00, and we check the encoding table, we will get the original state sequence ABCDDCBA unambiguously. In this system, for an infinitely long sequence of the random states, on average, each state will require two bits to store on memory. In fact, Shannon entropy told us that 2 bits is the minimum to encode on average for each state in the sequence.

\[\begin{aligned} H(p) &= - \sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \\ &= - 4 \frac{1}{4} \log_2 \frac{1}{4} \\ &= 2\\ \end{aligned}\]Note that because we are using binary coding, the base of logarithm is 2. If you are decimal coding, the base will be 10.

You may use the following way to encode the same system, say A = 000, B = 001, C = 010, D = 011. The encoded sequence could also be decoded unambiguously. But on average, for each state in the infinite random state sequence, the number of bits used to store each state is 3, which is larger than 2. You will have no solution to use 1 bit on average for each state to store the infinite random state sequence.

Let’s checkout another system. Let’s assume the new system could only be in three states, A, B, C. $P(A) = \frac{1}{2}$ and $P(B) = P(C) =\frac{1}{4}$. Using Shannon entropy equation, we got the Shannon entropy for this system,

\[\begin{aligned} H(p) &= - \sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \\ &= - ( \frac{1}{2} \log_2 \frac{1}{2} + 2 \frac{1}{4} \log_2 \frac{1}{4}) \\ &= 1.5\\ \end{aligned}\]This means that on average, we could use 1.5 bits for each state to store an infinitely long sequence of the random states. Then let’s see how can we do this.

We encode the states, A = 1, B = 00, C = 01. For an infinite long random state sequence, it is not hard to see that we do need $1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 2 \times \frac{1}{4} = 1.5$ bits on average for each state. Can we recover the original state sequence from the encoded sequence? Yes, we can. The decoding strategy is actually not too complicated. We read the encoded binary sequence from the beginning, when the binary bit we read is 1, we take the bit and decode it (to A), when the binary bit we read is 0, we take two bits at one time and decode it. In this way, we could unambiguously recover the original message.

OK, now someone jumps out, claiming that he could use less than 1.5 bits on average for each state in an infinitely long random state sequence for the system. His encoding method is, A = 1, B = 0, C = 01, we only need $1 \times \frac{1}{2} + 1 \times \frac{1}{4} + 2 \times \frac{1}{4} = 1.25$! Well, sounds great, but can you decode the encoded binary sequence back to the original state sequence? No, you can’t. You would have no idea whether you should read one bit or two bits when you see the encoded binary sequence.

#### Meanings in Physics

In Physics, entropy is a measurement of chaos in the system. Suppose we have a system containing $N$ states. To calculate the Shannon entropy of the system, we have the following equation. For simplicity, we use base $N$ for the logarithm.

\[H(p) = - \sum_{i=1}^{N} p(x_i) \log_N p(x_i)\]It is actually not hard to find that the $S \in (0, 1]$ (using Lagrange multipliers).

If the probability of the system being in the state is random, i.e. $p(x_i) = \frac{1}{N}$, we have

\[H(p) = - \sum_{i=1}^{N} \frac{1}{N} \log_N \frac{1}{N} = 1\]If the probability of the system being in one state $k$ is very high, say $P_k \rightarrow 1$, we have $H(p) \rightarrow 0$.

This means that when a system is very chaotic and nondeterministic, the Shannon entropy is very high. When a system is very ordered and deterministic, the Shannon entropy is very low.

### Cross Entropy

#### Definition

For any two distributions $p$ and $q$, cross entropy is defined as

\[H(p, q) = \mathbb{E}_p [\log_b q]\]where b is the base of the logarithm used.

For discrete case, suppose we have $n$ discrete states,

\[H(p, q) = - \sum_{i=1}^{n} p(x_i) \log_b q(x_i)\]#### Meanings in Information Theory

On Wikipedia, it is said the cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $q$, rather than the true distribution $p$. This sounds bluffing. What does it really mean then? Let’s check a concrete example.

Let’s assume the system $\alpha$ could only be in three states, A, B, C. We have distribution $p$ for the three states $P(A) = \frac{1}{2}$ and $P(B) = P(C) =\frac{1}{4}$. As we have calculated in the previous section, the Shannon entropy of this system $H(p) = 1.5$ and the coding for the optimal coding for the three states are A = 1, B = 00, C = 01. We also have another distribution $q$ for the three states $P(B) = \frac{1}{2}$ and $P(A) = P(C) =\frac{1}{4}$, if there is a system $\beta$ with the exact same three states using the distribution $q$, the optimal coding for the three states will be A = 00, B = 1, C = 01, and of course the Shannon entropy for this system will still be $H(q) = 1.5$. Then if we have an infinite random state sequence from system $\alpha$, each random state follows distribution $p$, and we use the optimized coding method from system $\beta$ which uses distribution $q$, how many bits on average is required for each state? To calculate this expected value, it is $2 \times \frac{1}{2} + 1 \times \frac{1}{4} + 2 \times \frac{1}{4} = 1.75$. It is actually the cross entropy $H(p, q)$.

\[\begin{aligned} H(p, q) &= - \sum_{i=1}^{n} p(x_i) \log_b q(x_i) \\ &= - (\frac{1}{2} \log_2 \frac{1}{4} + \frac{1}{4} \log_2 \frac{1}{2} + \frac{1}{4} \log_2 \frac{1}{4}) \\ &= 1.75 \\ \end{aligned}\]The value of cross entropy will always be greater than $H(p)$. If $q = p$, $H(p, q)$ is minimized and $H(p, q) = H(p) = 1.5$.

### Perplexity

#### Definition

Perplexity is defined as

\[PP(p) = b^{H(p)} = b^{\mathbb{E}[\log_b p]}\]where b is the base of the logarithm used.

For discrete case, suppose we have $n$ discrete states,

\[PP(p) = b^{H(p)} = b^{- \sum_{i=1}^{n} p(x_i) \log_b p(x_i)}\]So it has the exact same information to Shannon entropy, and I don’t even know why people invented this.

#### Perplexity in Language Modeling

Sometimes people will be confused about employing perplexity to measure how well a language model is. It is using almost exact the same concepts that we have talked above. In the above systems, the distribution of the states are already known, and we could calculate the Shannon entropy or perplexity for the real system without any doubt. In language modeling, we are actually building a system with state distribution $q$ which tries to mimic the real-world system with the state distribution $p$ as much as possible. However, in practice we do not know the exact $p$, we only know $\tilde{p}$ which is a sampled distribution from the real-world system. So instead of using the vanilla perplexity defined above, we use a modified perplexity with cross entropy $H({\tilde{p}}, q)$.

\[PP'({\tilde{p}},q) = b^{H({\tilde{p}}, q)} = b^{\mathbb{E}_{\tilde{p}}[\log_b q]}\]For discrete case, suppose we have $n$ discrete states $x_1, x_2, \cdots, x_n$,

\[PP'({\tilde{p}},q) = b^{- \sum_{i=1}^{n} \tilde{p}(x_i) \log_b q(x_i)}\]Because language modeling is a discrete case, we further assume that we have $N$ sentence samples $s_1, s_2, \cdots, s_N$ in the dataset, these $N$ sentence samples represent $n$ discrete states, some of them could be the same

\[PP'({\tilde{p}},q) = b^{- \sum_{i=1}^{n} \tilde{p}(x_i) \log_b q(x_i)} = b^{- \sum_{i=1}^{N} \tilde{p}(s_i) \log_b q(s_i)}\]Because sentence $s_i$ was uniformly sampled from the dataset, $\tilde{p}(s_i) = \frac{1}{N}$. Therefore, the modified perplexity could be further expressed as

\[PP'({\tilde{p}},q) = b^{- \frac{1}{N} \sum_{i=1}^{N} \log_b q(s_i)}\]Sometimes, we define the average log probability of the sentences is

\[\log_b \overline{q}(s) = \frac{1}{N} \sum_{i=1}^{N} \log_b q(s_i)\]Then we have perplexity for sentences

\[PP'(S) = b^{- \log_b \overline{q}(s)}\]Let’s see a concrete example. Without generality, we assume the size of the word corpus is $M$. If the language model is extremely stupid, and it guesses word after word randomly, this means that

\[p(w_i|w_1,w_2,\cdots,w_{i-1}) = \frac{1}{M}\]For any sentence from the language (true sentence distribution) which consists of $t$ words, its probability $q(s)$ calculated from the stupid language model will always be $\frac{1}{M^t}$. Without generality, we assume there are $N$ different sentence combinations and each sentence consists exactly $T$ words. The probability of each sentence $q(s_i)$ from the stupid language model will be exact $\frac{1}{M^T}$. Note that under this assumption, $N = M^T$. If we calculate the perplexity of the language system using the definition of perplexity,

\[\begin{aligned} PP'(S) &= b^{- \frac{1}{M^T} \sum_{i=1}^{N} \log_b q(s_i)} \\ &= b^{- \frac{1}{M^T} \sum_{i=1}^{M^T} \log_b \frac{1}{M^T}} \\ &= M^T \\ \end{aligned}\]If the number of words in the sentences $T = 10$ and the size of word corpus $M = 10^6$, $PP’(S) = 10^{60}$!

However, if the language model has actually learned something, this means that the probability of each sentence $q(s_i)$ will not all be $\frac{1}{C^T}$. For example, for real sentences such “I like eating apples”, it should have a higher probability, while for “fake sentences” such as “zoo airplane drink dogs”, it should lower probability in principle close to 0. This will cause the perplexity of the “smarter” system lower than the perplexity of the stupid system.

So we can see that learning is actually an entropy decreasing process, and we could use fewer bits on average to code the sentences in the language. This is why we use perplexity to evaluate language modeling.

In practice, to evaluate a trained language model, we have an evaluation set of sentences or passages written by real people speaking that language. For each sentence $s_i$ from the evaluation set, we are able to calculate its probability. Usually, we will be able to calculate $p(w_i|w_1,w_2,\cdots,w_{i-1})$, and we will have

\[\begin{aligned} q(s_i) &= q(w_1, w_2, \cdots, w_t) \\ &= \prod_{j=1}^{t_i} p(w_j|w_1,w_2,\cdots,w_{j-1}) \\ \end{aligned}\]If we use $b=2$, and suppose $\log_b \overline{q}(s) = -190$, the language model perplexity will $PP’(S) = 2^{190}$ per sentence. This means that we will need $2^{190}$ bits to code a sentence on average which is almost impossible.

Sometimes we will also normalize the perplexity from sentence to words. We assume the number of words in sentence $x_i$ is $t_i$, the perplexity of the model regarding the word is defined as follows:

\[\begin{aligned} PP'(W) &= b^{- \frac{1}{N} \sum_{i=1}^{N} \frac{1}{t_i} \log_b q(s_i)} \\ &= b^{- \frac{1}{N} \sum_{i=1}^{N} \frac{1}{t_i} \log_b \big[ \prod_{j=1}^{t_i} p(w_j|w_1,w_2,\cdots,w_{j-1})\big]}\\ \end{aligned}\]When $t_i = T$ for all $1 \leq i \leq N$, which usually happens during training, we could further simplify the expression of perplexity.

\[\begin{aligned} PP'(W) &= b^{- \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{T} \big[ \log_b p(w_j|w_1,w_2,\cdots,w_{j-1})\big]}\\ \end{aligned}\]Usually, a model perplexity of $2^{7.95} = 247$ per word is not bad. This means that we will need 247 bits to code a word on average.

### Final Remarks

Perplexity, or equivalently cross entropy, could be used directly as the optimization goal in training for language modeling.