Optimal Codeword Length
Introduction
In lossless data compression, we want to encode data using as few bits as possible while still being able to decode it unambiguously. Each source symbol is mapped to a codeword, and the number of bits in a codeword is its codeword length. An encoded message is a sequence of codewords. The expected codeword length is the average number of bits used per symbol, and the optimal codeword lengths are the ones that minimize this expectation.
Shannon entropy has strong implications for data compression. In this blog post, I would like to discuss how Shannon entropy relates to the optimal codeword length, without going through rigorous mathematical derivations and proofs.
Information Content
Given a random variable $X$ with a probability mass function $P(X)$, the information content of $X$ is defined as
$$
\begin{align}
I(X)
&= \log_{b} \frac{1}{P(X)} \\
&= - \log_{b} P(X) \\
\end{align}
$$
where $b$ is the base of the logarithm. If $b = 2$, the unit of information content is bit.
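As a quick numerical illustration, the information content of an outcome can be evaluated directly from its probability. Below is a minimal Python sketch; the example probabilities are arbitrary.

```python
import math


def information_content(p: float, b: float = 2) -> float:
    """Information content -log_b(p) of an outcome with probability p."""
    return -math.log(p, b)


# A rarer outcome carries more information than a more common one.
print(information_content(0.5))    # 1.0 bit
print(information_content(0.125))  # 3.0 bits
```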
Shannon Entropy
Given a random variable $X$ with a probability mass function $P(X)$, the Shannon entropy of $X$ is defined as
$$
\begin{align}
H(X)
&= \mathbb{E}[I(X)] \\
&= \mathbb{E}[- \log_b{P(X)}] \\
&= - \sum_{x \in \mathcal{X}} P(x) \log_b{P(x)} \\
\end{align}
$$
where $b$ is the base of the logarithm. If $b = 2$, the unit of Shannon entropy is bit.
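The entropy of a finite distribution can likewise be computed directly from its probability mass function. Below is a small Python sketch; the two example distributions, a fair coin and a biased coin, are chosen purely for illustration.

```python
import math


def shannon_entropy(pmf: dict, b: float = 2) -> float:
    """Shannon entropy -sum_x P(x) log_b P(x), skipping zero-probability outcomes."""
    return -sum(p * math.log(p, b) for p in pmf.values() if p > 0)


print(shannon_entropy({"H": 0.5, "T": 0.5}))  # 1.0 bit (fair coin)
print(shannon_entropy({"H": 0.9, "T": 0.1}))  # ~0.47 bits (biased coin)
```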
Optimal Codeword Length
The information content describes the optimal codeword length for each outcome of a random variable, that is, the codeword lengths that minimize the expected codeword length. Because an encoded message is a sequence of codewords, information content is directly useful for data compression.
Suppose the random variable $X$ can take one of four symbols $A$, $B$, $C$, and $D$, each with the same non-zero probability. Using binary encoding, there are infinitely many ways to encode the four symbols so that the message can still be decoded unambiguously. However, we cannot use fewer than 2 bits per symbol, because with fewer than 2 bits we could not distinguish four equally likely symbols. Therefore, the optimal codeword length for each of the four symbols is 2 bits. In fact, if there are $2^n$ equally likely symbols with non-zero probability, the optimal codeword length for each symbol is $n$ bits.
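For instance, a fixed-length 2-bit code for the four symbols might look like the following sketch; the particular symbol-to-codeword table is just one of many valid choices.

```python
# One possible fixed-length 2-bit code for four symbols.
encode_table = {"A": "00", "B": "01", "C": "10", "D": "11"}
decode_table = {v: k for k, v in encode_table.items()}

message = "ABDC"
encoded = "".join(encode_table[s] for s in message)
# Fixed-length codewords decode by cutting the bit string every 2 bits.
decoded = "".join(decode_table[encoded[i:i + 2]] for i in range(0, len(encoded), 2))

print(encoded)  # 00011110
print(decoded)  # ABDC
```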
But what if the number of symbols with non-zero probability is not a power of 2? Suppose it lies between $2^{n-1}$ and $2^n$. Even though we could still use $n$ bits to encode each symbol, our intuition tells us that we could use fewer than $n$ bits for some of the symbols, in a way that minimizes the expected codeword length.
Suppose the random variable $X$ can take one of three symbols $A$, $B$, and $C$. It is easy to see that we can use one bit to encode one of the three symbols, two bits to encode each of the other two, and still decode unambiguously. For example, the symbols $A$, $B$, $C$ can be encoded as the codewords $0$, $10$, and $11$, and a message consisting of these codewords can still be decoded unambiguously. The question is, how should the assignment be made so that the expected codeword length is minimized? It depends on the probabilities of the symbols.
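Because no codeword in $\{0, 10, 11\}$ is a prefix of another, a bit string built from these codewords can be decoded greedily from left to right. The sketch below assumes the $A \to 0$, $B \to 10$, $C \to 11$ assignment from the example.

```python
code = {"A": "0", "B": "10", "C": "11"}  # prefix code from the example


def decode(bits: str, code: dict) -> str:
    """Greedy left-to-right decoding; works because no codeword is a prefix of another."""
    inverse = {v: k for k, v in code.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            symbols.append(inverse[buffer])
            buffer = ""
    return "".join(symbols)


encoded = "".join(code[s] for s in "ABCA")
print(encoded)                # 010110
print(decode(encoded, code))  # ABCA
```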
If the probabilities of the three symbols are exactly the same, i.e.,
$$
\begin{align}
P(X)
&=
\begin{cases}
\frac{1}{3} & \text{$X = A$} \\
\frac{1}{3} & \text{$X = B$} \\
\frac{1}{3} & \text{$X = C$} \\
\end{cases} \\
\end{align}
$$
then it does not matter which symbol is encoded with one bit and which symbols are encoded with two bits. The optimal expected codeword length can be computed as
$$
\begin{align}
\mathbb{E}[L(C(X))]
&= \sum_{x \in \{A, B, C\}} P(C(x)) L(C(x)) \\
&= \sum_{x \in \{A, B, C\}} P(x) L(C(x)) \\
&= \sum_{x \in \{A, B, C\}} \frac{1}{3} L(C(x)) \\
&= \frac{1}{3} \sum_{x \in \{A, B, C\}} L(C(x)) \\
&= \frac{1}{3} (1 + 2 + 2) \\
&= \frac{5}{3} \\
\end{align}
$$
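This arithmetic can be checked directly; the short sketch below uses exact fractions and the codeword lengths 1, 2, 2 of the code $0$, $10$, $11$.

```python
from fractions import Fraction

pmf = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}
lengths = {"A": 1, "B": 2, "C": 2}  # lengths of the codewords 0, 10, 11

expected_length = sum(pmf[x] * lengths[x] for x in pmf)
print(expected_length)  # 5/3
```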
If instead the probabilities of the three symbols are not all the same, for instance,
$$
\begin{align}
P(X)
&=
\begin{cases}
\frac{1}{2} & \text{$X = A$} \\
\frac{1}{4} & \text{$X = B$} \\
\frac{1}{4} & \text{$X = C$} \\
\end{cases} \\
\end{align}
$$
then the expected codeword length can be computed as
$$
\begin{align}
\mathbb{E}[L(C(X))]
&= \sum_{x \in \{A, B, C\}} P(C(x)) L(C(x)) \\
&= \sum_{x \in \{A, B, C\}} P(x) L(C(x)) \\
&= P(A) L(C(A)) + P(B) L(C(B)) + P(C) L(C(C)) \\
&= \frac{1}{2} L(C(A)) + \frac{1}{4} L(C(B)) + \frac{1}{4} L(C(C)) \\
\end{align}
$$
To minimize the expected codeword length in this case, symbols with higher probability should be assigned shorter codewords and symbols with lower probability should be assigned longer codewords. Here, we can encode the symbols $A$, $B$, $C$ as $0$, $10$, and $11$, respectively. The optimal expected codeword length can then be computed as
$$
\begin{align}
\mathbb{E}[L(C(X))]
&= \frac{1}{2} L(C(A)) + \frac{1}{4} L(C(B)) + \frac{1}{4} L(C(C)) \\
&= \frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{4} \times 2 \\
&= \frac{3}{2} \\
\end{align}
$$
It happens that the optimal codeword lengths for $A$, $B$, $C$ are exactly the information contents $-\log_2{P(A)} = 1$, $-\log_2{P(B)} = 2$, and $-\log_2{P(C)} = 2$, respectively. The optimal expected codeword length can therefore be written as
$$
\begin{align}
\mathbb{E}[L(C(X))]
&= - P(A) \log_2{P(A)} - P(B) \log_2{P(B)} - P(C) \log_2{P(C)} \\
&= \sum_{x \in \{A, B, C\}} - P(x) \log_2{P(x)} \\
&= \mathbb{E}[- \log_2{P(X)}] \\
&= \mathbb{E}[I(X)] \\
&= H(X) \\
\end{align}
$$
That is, the optimal expected codeword length equals the expected information content of $X$, which is just the Shannon entropy $H(X)$.
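The sketch below verifies both computations for this distribution: the expected length of the code $0$, $10$, $11$ and the Shannon entropy coincide at $3/2$ bits.

```python
import math
from fractions import Fraction

pmf = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 4)}
lengths = {"A": 1, "B": 2, "C": 2}  # lengths of the codewords 0, 10, 11

expected_length = sum(pmf[x] * lengths[x] for x in pmf)
entropy = -sum(float(p) * math.log2(float(p)) for p in pmf.values())

print(expected_length)  # 3/2
print(entropy)          # 1.5
```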
Of course, when the probabilities of the symbols $A$, $B$, $C$ are not of the form $2^{-k}$ for integer $k$, the Shannon entropy no longer gives the optimal expected codeword length exactly. However, Shannon’s source coding theorem for symbol codes shows that the Shannon entropy is a lower bound on the optimal expected codeword length, and that the optimal expected codeword length exceeds the Shannon entropy by less than 1 bit.
Formally, Shannon’s source coding theorem for symbol codes states that for a source with Shannon entropy $H(X)$, the minimum expected codeword length $\mathbb{E}[L(C(X))]$ over all uniquely decodable symbol codes with a coding alphabet of $a$ symbols satisfies
$$
\begin{align}
\frac{H(X)}{\log_2{a}} \leq \mathbb{E}[L(C(X))] < \frac{H(X)}{\log_2{a}} + 1 \\
\end{align}
$$
where $H(X) = \mathbb{E}[- \log_2{P(X)}]$ is the Shannon entropy of $X$ measured in bits (logarithm base 2) and $a$ is the number of symbols in the coding alphabet.
For binary encoding, $a = 2$, and the bound becomes
$$
\begin{align}
H(X) \leq \mathbb{E}[L(C(X))] < H(X) + 1 \\
\end{align}
$$
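As a sanity check, the earlier uniform three-symbol example is exactly a case where the probabilities are not of the form $2^{-k}$: the entropy is $\log_2{3} \approx 1.585$ bits and the optimal expected codeword length is $5/3 \approx 1.667$ bits, so the bound holds. The sketch below assumes that the code $0$, $10$, $11$ is optimal for this distribution, which Huffman coding confirms.

```python
import math

pmf = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}  # uniform three-symbol source
lengths = {"A": 1, "B": 2, "C": 2}          # lengths of the codewords 0, 10, 11

entropy = -sum(p * math.log2(p) for p in pmf.values())
expected_length = sum(pmf[x] * lengths[x] for x in pmf)

print(entropy)                                   # ~1.585 bits
print(expected_length)                           # ~1.667 bits
print(entropy <= expected_length < entropy + 1)  # True
```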
The remaining question is: how do we construct a code that achieves the minimum expected codeword length? The answer is Huffman coding.
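To close, here is a minimal Huffman coding sketch in Python rather than a production implementation. It uses `heapq` with a running counter as a tiebreaker so that entries with equal weights never compare the dictionaries they carry, and for the $1/2$, $1/4$, $1/4$ example it reproduces the codeword lengths 1, 2, 2.

```python
import heapq
from itertools import count


def huffman_code(pmf: dict) -> dict:
    """Build a binary Huffman code for a pmf with at least two symbols."""
    tiebreak = count()
    # Each heap entry: (weight, tiebreaker, {symbol: partial codeword}).
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Merge the two lightest subtrees: prepend 0 to one side, 1 to the other.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]


pmf = {"A": 0.5, "B": 0.25, "C": 0.25}
code = huffman_code(pmf)
print(code)  # {'A': '0', 'B': '10', 'C': '11'}
print(sum(pmf[s] * len(c) for s, c in code.items()))  # 1.5, matching H(X)
```

The tree is kept implicit in the partial codewords carried by each heap entry, which keeps the sketch short; a fuller implementation would build an explicit tree and handle edge cases such as a single-symbol source.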