Lei Mao

Introduction

The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge is the one with the largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information). These prior data serve as the constraints on the probability distribution.


By the second law of thermodynamics (the principle of increase of entropy), isolated systems spontaneously evolve towards thermodynamic equilibrium, the state with maximum entropy. Maximum entropy distributions are therefore the most natural distributions under given constraints. In this blog post, I would like to discuss entropy maximization and a couple of maximum entropy distributions.

Prerequisites

Gaussian Integral

\[\begin{align} \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi} \\ \end{align}\]

I will skip the proof here, since the proof from Wikipedia is not that difficult to understand.

Useful Integrals

\[\begin{align} \int_{-\infty}^{\infty} x e^{-x^2} dx &= -\frac{1}{2} \int_{-\infty}^{\infty} e^{-x^2} d(-x^2) \\ &= -\frac{1}{2} e^{-x^2} \big\rvert_{-\infty}^{\infty}\\ &= 0 \\ \end{align}\] \[\begin{align} \int_{-\infty}^{\infty} x^2 e^{-x^2} dx &= -\frac{1}{2} \int_{-\infty}^{\infty} x d (e^{-x^2}) \\ &= -\frac{1}{2} \Big( x e^{-x^2} \big\rvert_{-\infty}^{\infty} - \int_{-\infty}^{\infty} e^{-x^2} dx \Big) \\ &= -\frac{1}{2} \Big( 0 - \sqrt{\pi} \Big) \\ &= \frac{\sqrt{\pi}}{2} \\ \end{align}\]

Notice that here we used integration by parts.
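The three integrals above are easy to sanity-check numerically. The following is a minimal sketch using a composite midpoint rule; the helper name `integrate`, the truncated range $[-10, 10]$, and the number of subintervals are illustrative choices of mine, not part of the derivation.

```python
import math

def integrate(f, lo=-10.0, hi=10.0, n=20_000):
    """Composite midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

# The integrands decay so fast that truncating to [-10, 10] loses nothing visible.
gauss = integrate(lambda x: math.exp(-x * x))
first = integrate(lambda x: x * math.exp(-x * x))
second = integrate(lambda x: x * x * math.exp(-x * x))

print(gauss, math.sqrt(math.pi))       # Gaussian integral vs. sqrt(pi)
print(first)                           # odd integrand, should be close to 0
print(second, math.sqrt(math.pi) / 2)  # second moment vs. sqrt(pi) / 2
```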

Entropy Maximization

Discrete Probability Distribution

Suppose $P$ is a discrete probability distribution. The entropy is defined as

\[\begin{align} H(P) &= - \sum_{x \in X}^{} P(x) \log P(x) \\ \end{align}\]

We further have some constraints on $P$:

  • $P(x) \geq 0$
  • $\sum_{x \in X}^{} P(x) = 1$
  • $\sum_{x \in X}^{} P(x) r_i(x) = \alpha_i$ for $1 \leq i \leq m$

The first two constraints hold trivially because $P$ is a probability distribution. The third constraint is optional and expresses a constraint on the entire system. Notice that there could be more than one such constraint if $m > 1$.


We would like to maximize the entropy.

\[\max_{P} H(P) = \max_{P} \Big( - \sum_{x \in X}^{} P(x) \log P(x) \Big)\]

Let’s try to solve this optimization problem. We use Lagrange multipliers for the constraints.

\[\begin{align} L(P, \lambda_0, \lambda_1, \cdots, \lambda_m) &= - \sum_{x \in X}^{} P(x) \log P(x) + \lambda_0 \Big(\sum_{x \in X}^{} P(x) - 1 \Big) + \sum_{i=1}^{m} \lambda_i \Big( \sum_{x \in X}^{} P(x) r_i(x) - \alpha_i \Big) \\ \end{align}\]

We take the derivative of $L(P, \lambda_0, \lambda_1, \cdots, \lambda_m)$ with respect to $P(x)$ and the derivative should be $0$.

\[\begin{align} \frac{\partial}{\partial P(x)} L(P, \lambda_0, \lambda_1, \cdots, \lambda_m) &= - \log P(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x) \\ &= 0 \\ \end{align}\]

Therefore,

\[\begin{align} P(x) &= e^{\big(\sum_{i=1}^{m} \lambda_i r_i(x)\big) + \lambda_0 - 1 } \\ &= \frac{ e^{\sum_{i=1}^{m} \lambda_i r_i(x)} }{e^{1 - \lambda_0}} \\ \end{align}\]

Because $\sum_{x \in X}^{} P(x) = 1$,

\[\begin{align} \sum_{x \in X}^{} P(x) &= \sum_{x \in X}^{} e^{\big(\sum_{i=1}^{m} \lambda_i r_i(x)\big) + \lambda_0 - 1 } \\ &= e^{\lambda_0 - 1} \sum_{x \in X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)} \\ &= 1 \\ \end{align}\]

Therefore,

\[e^{1 - \lambda_0} = \sum_{x \in X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)}\]

With this, we could rewrite $P(x)$ as

\[\begin{align} P(x) &= \frac{ e^{\sum_{i=1}^{m} \lambda_i r_i(x)} }{ \sum_{x \in X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)} } \\ \end{align}\]
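To make the normalized form concrete, here is a small sketch for a single constraint ($m = 1$, $r_1(x) = x$) on a six-sided die. The target mean of $4.5$ is a hypothetical value chosen only for illustration, and the bisection solver is one simple way to find $\lambda_1$; it relies on the fact that the mean of this distribution increases monotonically with $\lambda_1$.

```python
import math

def maxent_pmf(support, lam):
    """P(x) proportional to exp(lam * x), normalized over the support."""
    weights = [math.exp(lam * x) for x in support]
    z = sum(weights)  # the normalizing sum in the denominator
    return [w / z for w in weights]

def mean(support, pmf):
    return sum(x * p for x, p in zip(support, pmf))

def solve_lambda(support, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on lam so that the mean of the pmf matches the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(support, maxent_pmf(support, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
lam = solve_lambda(faces, 4.5)  # hypothetical constraint: E[X] = 4.5
pmf = maxent_pmf(faces, lam)

print(lam)
print(pmf)
print(mean(faces, pmf))
```

With a target mean of $3.5$ the solver returns a multiplier of essentially zero, which recovers the unconstrained uniform case.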

Continuous Probability Distribution

Similarly, suppose $P$ is a continuous probability distribution. The entropy is defined as

\[\begin{align} H(P) &= - \int_{X}^{} P(x) \log P(x) dx \\ \end{align}\]

We have the following constraints:

  • $P(x) \geq 0$
  • $\int_{X}^{} P(x) dx = 1$
  • $\int_{X}^{} P(x) r_i(x) dx = \alpha_i$ for $1 \leq i \leq m$

Similarly, to maximize the entropy, we maximize the Lagrangian for the continuous case.

\[\begin{align} L(P, \lambda_0, \lambda_1, \cdots, \lambda_m) &= - \int_{X}^{} P(x) \log P(x) dx + \lambda_0 \Big(\int_{X}^{} P(x) dx - 1 \Big) + \sum_{i=1}^{m} \lambda_i \Big( \int_{X}^{} P(x) r_i(x) dx - \alpha_i \Big) \\ \end{align}\]

We take the derivative of $L(P, \lambda_0, \lambda_1, \cdots, \lambda_m)$ with respect to $P(x)$ and the derivative should be $0$. We will also use the calculus of variations to compute the derivative, which is slightly more complicated. Without going into all the details, we have the following derivatives.

\[\begin{align} \frac{\partial}{\partial P(x)} L(P, \lambda_0, \lambda_1, \cdots, \lambda_m) &= - \frac{\partial}{\partial P(x)} \int_{X}^{} P(x) \log P(x) dx + \lambda_0 \frac{\partial}{\partial P(x)} \Big(\int_{X}^{} P(x) dx - 1 \Big) + \sum_{i=1}^{m} \lambda_i \frac{\partial}{\partial P(x)} \Big( \int_{X}^{} P(x) r_i(x) dx - \alpha_i \Big) \\ &= - \int_{X}^{} \frac{\partial}{\partial P(x)} \big( P(x) \log P(x) \big) dx + \lambda_0 \frac{\partial}{\partial P(x)} \Big(\int_{X}^{} P(x) dx - 1 \Big) + \sum_{i=1}^{m} \lambda_i \frac{\partial}{\partial P(x)} \Big( \int_{X}^{} P(x) r_i(x) dx - \alpha_i \Big) \\ &= - \log P(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x) \\ &= 0 \\ \end{align}\]

Therefore,

\[\begin{align} P(x) &= e^{\big(\sum_{i=1}^{m} \lambda_i r_i(x)\big) + \lambda_0 - 1 } \\ &= \frac{ e^{\sum_{i=1}^{m} \lambda_i r_i(x)} }{e^{1 - \lambda_0}} \\ \end{align}\]

Because $\int_{X}^{} P(x) dx = 1$,

\[\begin{align} \int_{X}^{} P(x) dx &= \int_{X}^{} e^{\big(\sum_{i=1}^{m} \lambda_i r_i(x)\big) + \lambda_0 - 1 } dx \\ &= e^{\lambda_0 - 1} \int_{X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)} dx \\ &= 1 \\ \end{align}\]

Therefore,

\[e^{1 - \lambda_0} = \int_{X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)} dx\]

With this, we could rewrite $P(x)$ as

\[\begin{align} P(x) &= \frac{ e^{\sum_{i=1}^{m} \lambda_i r_i(x)} }{ \int_{X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)} dx } \\ \end{align}\]

Maximum Entropy Distribution Examples

Roll Dice

A conventional die has 6 faces, so $X = \{ 1, 2, 3, 4, 5, 6 \}$. Because we don’t have additional constraints,

\[\lambda_1 = \lambda_2 = \cdots = \lambda_m = 0\]

So, the maximum entropy probability distribution of getting each face of the die is

\[\begin{align} P(x) &= \frac{ e^{\sum_{i=1}^{m} \lambda_i r_i(x)} }{ \sum_{x \in X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)} } \\ &= \frac{ e^{0} }{ \sum_{x \in X}^{} e^{0} } \\ &= \frac{ 1 }{ 6 } \\ \end{align}\]
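As a quick empirical check of this result, the sketch below compares the entropy of the uniform die with the entropies of many randomly generated distributions over the same six faces. The sampling scheme (normalizing six uniform random numbers) and the fixed seed are arbitrary choices for illustration; this is a sanity check, not a proof.

```python
import math
import random

def entropy(pmf):
    """Shannon entropy in nats; terms with P(x) = 0 contribute 0."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

uniform = [1 / 6] * 6
h_uniform = entropy(uniform)

random.seed(0)
h_best_random = 0.0
for _ in range(10_000):
    raw = [random.random() for _ in range(6)]
    total = sum(raw)
    h_best_random = max(h_best_random, entropy([r / total for r in raw]))

print(h_uniform, math.log(6))  # entropy of the fair die equals log 6
print(h_best_random)           # best random candidate stays below log 6
```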

Uniform Distribution

The only constraint we put on the distribution is $X = [a, b]$. Because we don’t have additional constraints,

\[\lambda_1 = \lambda_2 = \cdots = \lambda_m = 0\]

So the maximum entropy probability distribution is the uniform distribution.

\[\begin{align} P(x) &= \frac{ e^{\sum_{i=1}^{m} \lambda_i r_i(x)} }{ \int_{X}^{} e^{\sum_{i=1}^{m} \lambda_i r_i(x)} dx } \\ &= \frac{ e^{0} }{ \int_{a}^{b} e^{0} dx } \\ &= \frac{ 1 }{ b - a } \\ \end{align}\]

Gaussian Distribution

We could also derive the Gaussian distribution using entropy maximization. The constraints for the maximum entropy distribution are

  • $X = (-\infty, \infty)$
  • $\mathbb{E}[X] = \int_{-\infty}^{\infty} x P(x) dx = \mu$
  • $\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mu^2 = \int_{-\infty}^{\infty} x^2 P(x) dx - \mu^2 = \sigma^2$

which translates to

  • $m = 2$
  • $r_1(x) = x$, $\alpha_1 = \mu$
  • $r_2(x) = x^2$, $\alpha_2 = \mu^2 + \sigma^2$

so the maximum entropy distribution takes the form

\[\begin{align} P(x) &= e^{\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2} \\ \end{align}\]

Because

\[\begin{align} \int_{-\infty}^{\infty} P(x) dx &= 1 \\ \end{align}\]

We have

\[\begin{align} \int_{-\infty}^{\infty} P(x) dx &= \int_{-\infty}^{\infty} e^{\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2} dx \\ &= \int_{-\infty}^{\infty} \exp \big( \lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2 \big) dx \\ &= \int_{-\infty}^{\infty} \exp \bigg( \lambda_2 \Big[ \big(x + \frac{\lambda_1}{2\lambda_2}\big)^2 - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2^2} \Big] \bigg) dx \\ &= \int_{-\infty}^{\infty} \exp \bigg( \lambda_2 \big(x + \frac{\lambda_1}{2\lambda_2}\big)^2 - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) dx \\ &= \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} \exp \bigg( \lambda_2 \big(x + \frac{\lambda_1}{2\lambda_2}\big)^2 \bigg) dx \\ &= \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} \exp \bigg( -(-\lambda_2) \big(x + \frac{\lambda_1}{2\lambda_2}\big)^2 \bigg) dx \\ \end{align}\]

Here we assume $\lambda_2 < 0$, so that the integral converges. We further have

\[\begin{align} \int_{-\infty}^{\infty} P(x) dx &= \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} \exp \bigg( -(-\lambda_2) \big(x + \frac{\lambda_1}{2\lambda_2}\big)^2 \bigg) dx \\ &= \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) dx \\ &= \frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ \end{align}\]

To make it clearer, we set

\[y = \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big)\]

So using the Gaussian integral, we further have

\[\begin{align} \int_{-\infty}^{\infty} P(x) dx &= \frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &= \frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} e^{-y^2} d y \\ &= \frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \sqrt{\pi} \\ &= 1 \\ \end{align}\]

Therefore, we have our first equation from the constraints.

\[\frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) = \frac{1}{\sqrt{\pi}}\]

Because

\[\mathbb{E}[X] = \int_{-\infty}^{\infty} x P(x) dx = \mu\]

Similarly, we have

\[\begin{align} \int_{-\infty}^{\infty} x P(x) dx &= \int_{-\infty}^{\infty} x e^{\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2} dx \\ &= \frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} x \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{ -\lambda_2 }} \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) - \frac{\lambda_1\sqrt{ -\lambda_2 }}{2\lambda_2} \Big) \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &= \frac{1}{\sqrt{\pi}} \frac{1}{\sqrt{ -\lambda_2 }} \int_{-\infty}^{\infty} \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) - \frac{\lambda_1\sqrt{ -\lambda_2 }}{2\lambda_2} \Big) \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &= \frac{1}{\sqrt{\pi}} \frac{1}{\sqrt{ -\lambda_2 }} \int_{-\infty}^{\infty} \Big( y - \frac{\lambda_1\sqrt{ -\lambda_2 }}{2\lambda_2} \Big) e^{-y^2} d y \\ &= \frac{1}{\sqrt{\pi}} \frac{1}{\sqrt{ -\lambda_2 }} \Bigg( \bigg( \int_{-\infty}^{\infty} y e^{-y^2} d y \bigg) - \bigg( \frac{\lambda_1\sqrt{ -\lambda_2 }}{2\lambda_2} \int_{-\infty}^{\infty} e^{-y^2} d y \bigg) \Bigg) \\ &= \frac{1}{\sqrt{\pi}} \frac{1}{\sqrt{ -\lambda_2 }} \Bigg( 0 - \frac{\lambda_1\sqrt{ -\lambda_2 }}{2\lambda_2} \sqrt{\pi}\Bigg) \\ &= -\frac{\lambda_1}{2\lambda_2} \\ &= \mu \\ \end{align}\]

Therefore, we have our second equation from the constraints.

\[-\frac{\lambda_1}{2\lambda_2} = \mu\]

Because

\[\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mu^2 = \int_{-\infty}^{\infty} x^2 P(x) dx - \mu^2 = \sigma^2\]

Similarly, we have

\[\begin{align} \int_{-\infty}^{\infty} x^2 P(x) dx &= \int_{-\infty}^{\infty} x^2 e^{\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2} dx \\ &= \frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) \int_{-\infty}^{\infty} x^2 \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} \bigg( \frac{1}{-\lambda_2 } \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 - \frac{\lambda_1}{\lambda_2}x - \frac{\lambda_1^2}{4\lambda_2^2} \bigg) \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &= \frac{1}{\sqrt{\pi}} \Bigg( \frac{1}{-\lambda_2 } \int_{-\infty}^{\infty} \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &- \frac{\lambda_1}{\lambda_2} \int_{-\infty}^{\infty} x \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \\ &- \frac{\lambda_1^2}{4\lambda_2^2} \int_{-\infty}^{\infty} \exp \bigg( - \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big)^2 \bigg) d \Big( \sqrt{ -\lambda_2 }\big(x + \frac{\lambda_1}{2\lambda_2}\big) \Big) \Bigg)\\ &= \frac{1}{\sqrt{\pi}} \Bigg( \frac{1}{-\lambda_2 } \int_{-\infty}^{\infty} y^2 e^{-y^2} dy - \frac{\lambda_1}{\lambda_2} \int_{-\infty}^{\infty} x e^{-y^2} dy - \frac{\lambda_1^2}{4\lambda_2^2} \int_{-\infty}^{\infty} e^{-y^2} dy \Bigg) \\ &= \frac{1}{\sqrt{\pi}} \Bigg( \frac{1}{-\lambda_2 } \frac{\sqrt{\pi}}{2} - 
\frac{\lambda_1}{\lambda_2} \Big( -\frac{\lambda_1}{2\lambda_2} \sqrt{\pi} \Big) - \frac{\lambda_1^2}{4\lambda_2^2} \sqrt{\pi} \Bigg) \\ &= -\frac{1}{2 \lambda_2} + \frac{\lambda_1^2}{4\lambda_2^2}\\ &= \mu^2 + \sigma^2 \\ \end{align}\]

Therefore, we have our third equation from the constraints.

\[-\frac{1}{2 \lambda_2} + \frac{\lambda_1^2}{4\lambda_2^2} = \mu^2 + \sigma^2\]

Taken together, to derive the values for $\lambda_0$, $\lambda_1$, and $\lambda_2$, we have to solve

\[\begin{gather} \frac{1}{\sqrt{ -\lambda_2 }} \exp \bigg( - \frac{\lambda_1^2 - 4 \lambda_0 \lambda_2 + 4 \lambda_2}{4\lambda_2} \bigg) = \frac{1}{\sqrt{\pi}} \\ -\frac{\lambda_1}{2\lambda_2} = \mu \\ -\frac{1}{2 \lambda_2} + \frac{\lambda_1^2}{4\lambda_2^2} = \mu^2 + \sigma^2 \\ \end{gather}\]

This gets us

\[\begin{gather} \lambda_0 = \log \Big( \frac{1}{\sqrt{2\pi} \sigma} \Big) - \frac{\mu^2}{2 \sigma^2} + 1 \\ \lambda_1 = \frac{\mu}{\sigma^2} \\ \lambda_2 = -\frac{1}{2\sigma^2} \\ \end{gather}\]

Do not forget to check that $\lambda_2 < 0$, since we made this assumption at the beginning of our derivation. Here $\lambda_2 = -\frac{1}{2\sigma^2}$ is indeed negative for any $\sigma > 0$.
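The multipliers can also be checked numerically against the three constraints themselves (normalization, mean, and second moment). The sketch below does this with a midpoint rule; the values of `mu` and `sigma`, the integration range of twelve standard deviations, and the helper `integrate` are assumptions made only for illustration.

```python
import math

mu, sigma = 1.5, 0.7  # arbitrary illustrative values

lam0 = math.log(1 / (math.sqrt(2 * math.pi) * sigma)) - mu**2 / (2 * sigma**2) + 1
lam1 = mu / sigma**2
lam2 = -1 / (2 * sigma**2)

def p(x):
    """Maximum entropy density exp(lam0 - 1 + lam1 * x + lam2 * x^2)."""
    return math.exp(lam0 - 1 + lam1 * x + lam2 * x * x)

def integrate(f, lo, hi, n=40_000):
    """Composite midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

lo, hi = mu - 12 * sigma, mu + 12 * sigma  # tails beyond this are negligible
total = integrate(p, lo, hi)
first_moment = integrate(lambda x: x * p(x), lo, hi)
second_moment = integrate(lambda x: x * x * p(x), lo, hi)

print(lam2 < 0)                            # the sign assumption holds
print(total)                               # should be close to 1
print(first_moment, mu)                    # should match mu
print(second_moment, mu**2 + sigma**2)     # should match mu^2 + sigma^2
```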


We plug these values back into $P(x)$.

\[\begin{align} P(x) &= e^{\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2} \\ &= \exp \bigg( \log \Big( \frac{1}{\sqrt{2\pi} \sigma} \Big) - \frac{\mu^2}{2 \sigma^2} + \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 \bigg) \\ &= \frac{1}{\sqrt{2\pi} \sigma} \exp \bigg( - \frac{1}{2 \sigma^2} \Big( \mu^2 - 2\mu x + x^2 \Big) \bigg) \\ &= \frac{1}{\sqrt{2\pi} \sigma} e^{ - \frac{1}{2} \big( \frac{x - \mu}{\sigma} \big)^2 } \\ \end{align}\]

Strikingly, $P(x)$ is the Gaussian distribution!
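One way to see this result in action is to compare the differential entropy of the Gaussian with that of other zero-mean densities sharing the same variance. The sketch below uses a Laplace and a uniform density as comparison cases; these two alternatives, the truncated integration range, and the helper names are my own illustrative choices.

```python
import math

sigma = 1.0  # common standard deviation for all three densities

def gaussian(x):
    return math.exp(-x * x / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def laplace(x):
    b = sigma / math.sqrt(2)  # scale chosen so that the variance is sigma^2
    return math.exp(-abs(x) / b) / (2 * b)

def uniform(x):
    half_width = sigma * math.sqrt(3)  # variance of U(-w, w) is w^2 / 3
    return 1 / (2 * half_width) if abs(x) <= half_width else 0.0

def differential_entropy(p, lo=-20.0, hi=20.0, n=200_000):
    """Midpoint-rule estimate of -integral of p(x) log p(x) dx, in nats."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        px = p(lo + (i + 0.5) * h)
        if px > 0:  # points where the density vanishes contribute nothing
            total -= px * math.log(px) * h
    return total

h_gauss = differential_entropy(gaussian)
h_laplace = differential_entropy(laplace)
h_uniform = differential_entropy(uniform)

print(h_gauss, 0.5 * math.log(2 * math.pi * math.e * sigma**2))
print(h_laplace)
print(h_uniform)
```

Among the three, the Gaussian estimate comes out largest, as the maximum entropy argument predicts for a fixed mean and variance.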

Boltzmann Distribution

The entropy we have talked about so far is Shannon entropy. In thermodynamics, the entropy is defined as

\[\begin{align} S(P) &= - k_B \sum_{x \in X}^{} P(x) \log P(x) \\ \end{align}\]

where $k_B$ is a physical constant known as Boltzmann’s constant.


This entropy is called the Gibbs entropy, which differs from the Shannon entropy only by the factor of Boltzmann’s constant $k_B$. We could still derive the Boltzmann distribution using entropy maximization. The constraints for the maximum entropy distribution are

  • $X = \{1, 2, \cdots, n\}$
  • $m = 1$
  • $\sum_{x \in X}^{} P(x) \varepsilon(x) = U$

The $U$ in the constraints is the internal energy of the system, and $\varepsilon(x)$ is the energy of state $x$, which is quantized in thermodynamics.


Given all the derivations in this article, it is not hard to find that for Gibbs entropy, the maximum entropy probability distribution is

\[\begin{align} P(x) &= e^{\big(\frac{1}{k_B} \sum_{i=1}^{m} \lambda_i r_i(x)\big) + \frac{\lambda_0}{k_B} - 1 } \\ &= \frac{ e^{ \frac{1}{k_B} \sum_{i=1}^{m} \lambda_i r_i(x)} }{e^{1 - \frac{\lambda_0}{k_B} }} \\ \end{align}\]

where

\[e^{1 - \frac{\lambda_0}{k_B} } = \sum_{x \in X}^{} e^{ \frac{1}{k_B} \sum_{i=1}^{m} \lambda_i r_i(x)}\]

Given the constraints,

\[\begin{align} P(x) &= e^{\frac{1}{k_B} \lambda_1 \varepsilon(x) + \frac{\lambda_0}{k_B} - 1 } \\ &= \frac{ e^{ \frac{1}{k_B} \lambda_1 \varepsilon(x)} }{e^{1 - \frac{\lambda_0}{k_B} }} \\ \end{align}\]

where

\[e^{1 - \frac{\lambda_0}{k_B} } = \sum_{x \in X}^{} e^{ \frac{1}{k_B} \lambda_1 \varepsilon(x)}\]

Let’s find out what the values of $\lambda_0$ and $\lambda_1$ are.

\[\begin{align} S(P) &= - k_B \sum_{x \in X}^{} P(x) \log P(x) \\ &= - k_B \sum_{x \in X}^{} P(x) \log \big( e^{\frac{1}{k_B} \lambda_1 \varepsilon(x) + \frac{\lambda_0}{k_B} - 1 } \big) \\ &= - k_B \sum_{x \in X}^{} P(x) \big( \frac{1}{k_B} \lambda_1 \varepsilon(x) + \frac{\lambda_0}{k_B} - 1 \big) \\ &= - k_B \sum_{x \in X}^{} P(x) \Big( \big( \frac{1}{k_B} \lambda_1 \varepsilon(x) \big) + \big( \frac{\lambda_0}{k_B} - 1 \big) \Big) \\ &= - k_B \bigg( \frac{1}{k_B} \lambda_1 \sum_{x \in X}^{} P(x) \varepsilon(x) + \big( \frac{\lambda_0}{k_B} - 1 \big) \sum_{x \in X}^{} P(x) \bigg) \\ &= - k_B \bigg( \frac{1}{k_B} \lambda_1 U + \big( \frac{\lambda_0}{k_B} - 1 \big) \bigg) \\ &= - \lambda_1 U - \lambda_0 + k_B \\ \end{align}\]

According to the definition of internal energy and the first law of thermodynamics, we have the following thermodynamic identity.

\[dU = T dS - p dV\]

where $U$ is the internal energy, $T$ is the temperature, $S$ is the entropy, $p$ is the pressure, and $V$ is the volume of the system.


In our case, we assume there will be no volume change, therefore

\[\frac{\partial S}{\partial U} = \frac{1}{T}\]

We immediately find in our case,

\[\frac{\partial S(P)}{\partial U} = - \lambda_1 = \frac{1}{T}\]

Therefore, $\lambda_1 = -\frac{1}{T}$. We have the maximum entropy distribution.

\[\begin{align} P(x) &= \frac{ e^{ -\frac{ \varepsilon(x) }{k_B T} } }{ \sum_{x \in X} e^{ -\frac{ \varepsilon(x) }{k_B T} } } \\ \end{align}\]

This is exactly the Boltzmann distribution.
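The relation $\frac{\partial S}{\partial U} = \frac{1}{T}$ used in this derivation can be checked numerically on a toy system. In the sketch below, the four energy levels are hypothetical, and I work in units where $k_B = 1$; the derivative is estimated by a central finite difference in temperature.

```python
import math

K_B = 1.0  # units where Boltzmann's constant is 1 (an assumption for illustration)
energies = [0.0, 1.0, 2.0, 5.0]  # hypothetical energy levels

def boltzmann(temperature):
    """Boltzmann probabilities exp(-e / (k_B T)) / Z for each energy level."""
    weights = [math.exp(-e / (K_B * temperature)) for e in energies]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

def internal_energy(temperature):
    return sum(p * e for p, e in zip(boltzmann(temperature), energies))

def gibbs_entropy(temperature):
    return -K_B * sum(p * math.log(p) for p in boltzmann(temperature))

T = 2.0
dT = 1e-5
dS = gibbs_entropy(T + dT) - gibbs_entropy(T - dT)
dU = internal_energy(T + dT) - internal_energy(T - dT)
slope = dS / dU  # finite-difference estimate of dS/dU at temperature T

print(boltzmann(T))
print(slope, 1 / T)
```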

Conclusions

Maximum entropy distributions are everywhere. They reflect the nature of a system under certain constraints. A collection of maximum entropy distributions and their constraints can be found on Wikipedia.