 ### Lei Mao

Machine Learning, Artificial Intelligence. On the Move.

# Introduction to Dirichlet Distribution

### Introduction

Dirichlet distribution, also called multivariate beta distribution, is widely used in text mining techniques, such as Dirichlet process and latent Dirichlet allocation. To have a better understanding of these text mining techniques, we have to first understand the Dirichlet distribution thoroughly. To understand the Dirichlet distribution from scratch, we also need to understand binomial distribution, multinomial distribution, gamma function, beta distribution, and their relationships.

In this tutorial, we are going through the fundamentals of binomial distribution, multinomial distribution, gamma function, beta distribution, and Dirichlet distribution, laying the foundations to Dirichlet process and latent Dirichlet allocation.

### Binomial Distribution

Binomial distribution, parameterized by $n$ and $p$, is the discrete probability distribution of the number of successes $x$ in a sequence of $n$ Bernoulli trials with success probability $p$. Formally, we denote $P(x;n,p) \sim B(n,p)$.

$$P(x;n,p) = \binom{n}{x} p^{x} (1-p)^{n-x}$$

where $x \in \mathbb{Z}$ and $0 \leq x \leq n$.

It is very easy to understand this formula. We select $x$ balls from $n$ balls as successes; the remaining $n-x$ balls are failures. There are $\binom{n}{x}$ combinations to select the successful balls, and the probability of each combination is $p^{x} (1-p)^{n-x}$.
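As an illustrative sketch (not part of the original post), the formula can be checked numerically with Python's standard library; the function name `binomial_pmf` is our own.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(x; n, p) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The pmf sums to 1 over all possible numbers of successes x = 0, ..., n.
total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
```

For example, `binomial_pmf(2, 4, 0.5)` gives $\binom{4}{2} \cdot 0.5^2 \cdot 0.5^2 = 0.375$.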

### Multinomial Distribution

Multinomial distribution is simply a generalized, higher-dimensional version of binomial distribution. The variable, instead of being a single scalar value as in binomial distribution, is a vector in multinomial distribution.

In multinomial distribution, we are no longer doing Bernoulli trials. Instead, each trial has $k$ possible consequences, with success probabilities $\boldsymbol{p} = \{p_1, p_2, \cdots, p_k\}$ for the possible consequences. Multinomial distribution is the discrete probability distribution of the numbers of successes $\boldsymbol{x} = \{x_1, x_2, \cdots, x_k\}$ for each of the possible consequences in a sequence of $n$ such trials. Formally, we denote $P(\boldsymbol{x};n,\boldsymbol{p}) \sim \text{Mult}(n,\boldsymbol{p})$.

$$P(\boldsymbol{x};n,\boldsymbol{p}) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} {p_i}^{x_i}$$

where for $1 \leq i \leq k$, $x_i \in \mathbb{Z}$, $0 \leq x_i \leq n$, $\sum_{i=1}^{k}x_i = n$, and $\sum_{i=1}^{k}p_i = 1$.

It is also not hard to understand this formula. We select $x_1$ balls from $n$ balls for trials with consequence $1$, $x_2$ balls from the remaining $n-x_1$ balls for trials with consequence $2$, etc., and finally $x_k$ balls from the remaining $n - \sum_{j=1}^{k-1}x_j = x_k$ balls for trials with consequence $k$. There are $\prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i} = \frac{n!}{\prod_{i=1}^{k} x_i!}$ ways to select the balls, and the probability of each combination is $\prod_{i=1}^{k} {p_i}^{x_i}$.
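As a sketch of this formula in code (our own example, not from the original post), the pmf follows directly from the counting argument above; with $k=2$ it reduces to the binomial pmf.

```python
from math import factorial, prod

def multinomial_pmf(x: list[int], p: list[float]) -> float:
    """P(x; n, p) = n! / (x_1! * ... * x_k!) * prod_i p_i^x_i, with n = sum(x)."""
    n = sum(x)
    coefficient = factorial(n) // prod(factorial(xi) for xi in x)
    return coefficient * prod(pi**xi for pi, xi in zip(p, x))

# With k = 2 the multinomial pmf reduces to the binomial pmf:
# Mult((2, 2); 4, (0.5, 0.5)) = B(2; 4, 0.5) = 0.375.
value = multinomial_pmf([2, 2], [0.5, 0.5])
```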

### Gamma Function

We will talk about gamma function, instead of gamma distribution, because gamma distribution is not directly needed for the Dirichlet distribution. Gamma function is well defined for any complex number $x$ whose real component $\Re(x)$ is greater than 0. The definition of gamma function is

$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t} \, dt$$

where $\Re(x) > 0$. In this tutorial, all the numbers we are using are real.

Gamma function has a special property, which will be used for deriving the properties of beta distribution and Dirichlet distribution:

$$\Gamma(x+1) = x\Gamma(x)$$

The proof follows from the definition of gamma function and integration by parts.

$$
\begin{aligned}
\Gamma(x+1) &= \int_{0}^{\infty} t^{x} e^{-t} \, dt \\
&= \left[ -t^{x} e^{-t} \right]_{0}^{\infty} + x \int_{0}^{\infty} t^{x-1} e^{-t} \, dt \\
&= 0 + x\Gamma(x) \\
&= x\Gamma(x)
\end{aligned}
$$

This concludes the proof.

There are some special values for gamma function, such as

$$\Gamma(1) = 1, \quad \Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$$

It might not be trivial to find $\Gamma(\frac{1}{2}) = \sqrt{\pi}$. To show this, we have to use the properties of Gaussian distribution.

The probability density of a Gaussian distribution integrates to 1 over $(-\infty, \infty)$.

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = 1$$

When $\mu = 0$ and $\sigma^2 = \frac{1}{2}$,

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}} e^{-x^2} \, dx = 1$$

This Gaussian distribution is symmetric about $x=0$. So we have

$$\int_{0}^{\infty} \frac{1}{\sqrt{\pi}} e^{-x^2} \, dx = \frac{1}{2}$$

Therefore,

$$\int_{0}^{\infty} e^{-x^2} \, dx = \frac{\sqrt{\pi}}{2}$$

We use this integral for $\Gamma(\frac{1}{2})$. With the change of variable $t = x^2$, $dt = 2x \, dx$,

$$\Gamma\left(\frac{1}{2}\right) = \int_{0}^{\infty} t^{-\frac{1}{2}} e^{-t} \, dt = \int_{0}^{\infty} x^{-1} e^{-x^2} \cdot 2x \, dx = 2 \int_{0}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}$$

Because $\Gamma(1) = 1$ and $\Gamma(x+1) = x\Gamma(x)$, when $x$ is a positive integer, gamma function is exactly the shifted factorial function, $\Gamma(x) = (x-1)!$. In general, gamma function could be considered a continuous interpolation of the factorial function.
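These properties can be observed directly with Python's built-in `math.gamma` (an illustrative sketch, not part of the original post).

```python
from math import factorial, gamma, pi, sqrt

# Gamma(n) = (n - 1)! for positive integers n.
pairs = [(gamma(n), factorial(n - 1)) for n in range(1, 7)]

# Gamma(1/2) = sqrt(pi), and the recurrence Gamma(x + 1) = x * Gamma(x)
# also holds for non-integer arguments.
half = gamma(0.5)
recurrence_gap = gamma(3.5) - 2.5 * gamma(2.5)
```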

### Jacobian

The Jacobian, denoted by $J$, is the determinant of Jacobian matrix, which appears when changing the variables in multiple integrals.

For a continuous 1-to-1 transformation $\Phi$ mapping a region $D$ in the space $(x_1, x_2, \cdots, x_k)$ to a region $D^{\ast}$ in the space $(y_1, y_2, \cdots, y_k)$, the original variables can be expressed as functions of the new ones,

$$x_i = x_i(y_1, y_2, \cdots, y_k) \quad \text{for } 1 \leq i \leq k$$

The Jacobian matrix, denoted by $\frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)}$, is defined as

$$
\frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)} =
\begin{bmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_k} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_k}{\partial y_1} & \frac{\partial x_k}{\partial y_2} & \cdots & \frac{\partial x_k}{\partial y_k}
\end{bmatrix}
$$

The Jacobian is defined as the determinant of the Jacobian matrix.

$$J = \det \frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)}$$

We then have the following transformation for the multiple integrals.

$$\int_{D} f(x_1, \cdots, x_k) \, dx_1 \cdots dx_k = \int_{D^{\ast}} f\big(x_1(y_1, \cdots, y_k), \cdots, x_k(y_1, \cdots, y_k)\big) \, |J| \, dy_1 \cdots dy_k$$
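As a quick sketch of this change-of-variables formula (our own example, not from the original post), consider polar coordinates $x = r\cos\theta$, $y = r\sin\theta$, for which $|J| = r$; integrating $f(x,y)=1$ over the unit disk should give $\pi$.

```python
import math

# Polar coordinates: x = r*cos(t), y = r*sin(t), with Jacobian |J| = r.
# The area of the unit disk is the integral of |J| dr dt over
# r in [0, 1], t in [0, 2*pi]; the angular part contributes a factor 2*pi.
steps = 1000
hr = 1.0 / steps  # radial step for the midpoint rule
area = sum((i + 0.5) * hr for i in range(steps)) * hr * 2 * math.pi
```

The midpoint rule is exact for the linear integrand $r$, so `area` matches $\pi$ up to floating-point error.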

### Beta Distribution

Beta distribution is a family of continuous probability distributions well defined on the interval $[0, 1]$, parameterized by two positive shape parameters, $\alpha$ and $\beta$, that control the shape of the distribution. Formally, we denote $P(p;\alpha,\beta) \sim \text{Beta}(\alpha,\beta)$.

$$P(p;\alpha,\beta) = \frac{1}{B(\alpha,\beta)} p^{\alpha-1} (1-p)^{\beta-1}$$

where $B(\alpha,\beta)$ is a constant normalizer, and

$$B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

It is less well known why $B(\alpha,\beta)$ can be expressed in this way, so we derive it in this tutorial.

Because the probability density integrates to 1,

$$\int_{0}^{1} \frac{1}{B(\alpha,\beta)} p^{\alpha-1} (1-p)^{\beta-1} \, dp = 1$$

We have

$$B(\alpha,\beta) = \int_{0}^{1} p^{\alpha-1} (1-p)^{\beta-1} \, dp$$

We then check what $\Gamma(\alpha)\Gamma(\beta)$ is.

$$\Gamma(\alpha)\Gamma(\beta) = \int_{0}^{\infty} u^{\alpha-1} e^{-u} \, du \int_{0}^{\infty} v^{\beta-1} e^{-v} \, dv = \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1} v^{\beta-1} e^{-(u+v)} \, du \, dv$$

We set $x=\frac{u}{u+v}$ and $y=u+v$, where $x \in [0,1]$ and $y \in [0,\infty)$, so the inverse mapping from $xy$ space to $uv$ space is

$$u = xy, \quad v = (1-x)y$$

The Jacobian matrix is

$$
\frac{\partial(u, v)}{\partial(x, y)} =
\begin{bmatrix}
\frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\
\frac{\partial v}{\partial x} & \frac{\partial v}{\partial y}
\end{bmatrix} =
\begin{bmatrix}
y & x \\
-y & 1-x
\end{bmatrix}
$$

The Jacobian is

$$J = y(1-x) - (-y)x = y$$

By applying the transformation for multiple integrals,

$$
\begin{aligned}
\Gamma(\alpha)\Gamma(\beta) &= \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1} v^{\beta-1} e^{-(u+v)} \, du \, dv \\
&= \int_{0}^{1} \int_{0}^{\infty} (xy)^{\alpha-1} \big((1-x)y\big)^{\beta-1} e^{-y} \, |J| \, dy \, dx \\
&= \int_{0}^{1} x^{\alpha-1} (1-x)^{\beta-1} \, dx \int_{0}^{\infty} y^{\alpha+\beta-1} e^{-y} \, dy \\
&= B(\alpha,\beta) \, \Gamma(\alpha+\beta)
\end{aligned}
$$

Therefore, this concludes the proof for

$$B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
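As a numerical sanity check of this identity (our own sketch, not from the original post; the function name is our own), we can compare a midpoint-rule estimate of $\int_0^1 p^{\alpha-1}(1-p)^{\beta-1}\,dp$ against $\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.

```python
from math import gamma

def beta_normalizer_numeric(a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule estimate of the integral of p^(a-1) * (1-p)^(b-1) over (0, 1)."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
                   for i in range(steps))

a, b = 2.5, 4.0
numeric = beta_normalizer_numeric(a, b)
closed_form = gamma(a) * gamma(b) / gamma(a + b)
```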

In addition, beta distribution is the conjugate prior for binomial distribution. We have prior $P(p;\alpha,\beta) \sim \text{Beta}(\alpha, \beta)$, and likelihood $P(x|p;n) \sim B(n,p)$. The posterior is then

$$P(p|x;n,\alpha,\beta) \propto P(x|p;n) \, P(p;\alpha,\beta) \propto p^{x} (1-p)^{n-x} \, p^{\alpha-1} (1-p)^{\beta-1} = p^{x+\alpha-1} (1-p)^{n-x+\beta-1}$$

Therefore, $P(p|x;n,\alpha,\beta) \sim \text{Beta}(x + \alpha, n - x + \beta)$.
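The conjugate update is just parameter arithmetic, which we can sketch in a few lines (our own illustration; the function name is our own).

```python
def beta_posterior(alpha: float, beta: float, x: int, n: int) -> tuple[float, float]:
    """Beta posterior parameters after observing x successes in n Bernoulli trials."""
    return alpha + x, beta + (n - x)

# Starting from a Beta(2, 2) prior, observing 7 successes in 10 trials
# yields a Beta(9, 5) posterior.
posterior = beta_posterior(2.0, 2.0, x=7, n=10)
```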

### Dirichlet Distribution

Analogous to the relationship of multinomial distribution to binomial distribution, Dirichlet distribution is the multivariate version of beta distribution. Dirichlet distribution is a family of continuous probability distributions over a discrete probability distribution for $k$ categories $\boldsymbol{p} = \{p_1, p_2, \cdots, p_k\}$, where $0 \leq p_i \leq 1$ for $i \in [1,k]$ and $\sum_{i=1}^{k} p_i = 1$, parameterized by $k$ parameters $\boldsymbol{\alpha} = \{\alpha_1, \alpha_2, \cdots, \alpha_k\}$. Formally, we denote $P(\boldsymbol{p};\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{\alpha})$.

$$P(\boldsymbol{p};\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{k} {p_i}^{\alpha_i - 1}$$

where $B(\boldsymbol{\alpha})$ is a constant normalizer, and

$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}$$

Not surprisingly, when $k=2$, Dirichlet distribution becomes beta distribution.

Similar to the normalizer in beta distribution, we will show that $B(\boldsymbol{\alpha})$ can be expressed in this way.

Because the probability density integrates to 1 over the simplex,

$$\int \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{k} {p_i}^{\alpha_i - 1} \, d\boldsymbol{p} = 1$$

We have

$$B(\boldsymbol{\alpha}) = \int \prod_{i=1}^{k} {p_i}^{\alpha_i - 1} \, d\boldsymbol{p}$$

We then check what $\prod_{i=1}^{k}\Gamma(\alpha_i)$ is.

$$\prod_{i=1}^{k}\Gamma(\alpha_i) = \int_{0}^{\infty} \cdots \int_{0}^{\infty} \prod_{i=1}^{k} {x_i}^{\alpha_i - 1} e^{-\sum_{i=1}^{k} x_i} \, dx_1 \cdots dx_k$$

We set

$$y_i = \frac{x_i}{\sum_{j=1}^{k} x_j} \quad \text{for } 1 \leq i \leq k-1, \qquad z_k = \sum_{j=1}^{k} x_j$$

where $y_i \in [0,1]$ for $1 \leq i \leq k-1$ and $z_k \in [0,\infty)$, so the inverse mapping from $(y_1, y_2, \cdots, y_{k-1}, z_k)$ space to $(x_1, x_2, \cdots, x_k)$ space is

$$x_i = y_i z_k \quad \text{for } 1 \leq i \leq k-1, \qquad x_k = \left(1 - \sum_{j=1}^{k-1} y_j\right) z_k$$

The Jacobian matrix is

$$
\frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_{k-1}, z_k)} =
\begin{bmatrix}
z_k & 0 & \cdots & 0 & y_1 \\
0 & z_k & \cdots & 0 & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & z_k & y_{k-1} \\
-z_k & -z_k & \cdots & -z_k & 1 - \sum_{j=1}^{k-1} y_j
\end{bmatrix}
$$

The Jacobian is computed via Gaussian elimination by adding each of the rows from row $1$ to row $k-1$ to row $k$, which turns the last row into $(0, 0, \cdots, 0, 1)$, so

$$J = {z_k}^{k-1}$$

By applying the transformation for multiple integrals,

$$
\begin{aligned}
\prod_{i=1}^{k}\Gamma(\alpha_i) &= \int_{0}^{\infty} \cdots \int_{0}^{\infty} \prod_{i=1}^{k} {x_i}^{\alpha_i - 1} e^{-\sum_{i=1}^{k} x_i} \, dx_1 \cdots dx_k \\
&= \int \prod_{i=1}^{k-1} (y_i z_k)^{\alpha_i - 1} \left( \left(1 - \sum_{j=1}^{k-1} y_j\right) z_k \right)^{\alpha_k - 1} e^{-z_k} \, {z_k}^{k-1} \, dy_1 \cdots dy_{k-1} \, dz_k \\
&= \int \prod_{i=1}^{k-1} {y_i}^{\alpha_i - 1} \left(1 - \sum_{j=1}^{k-1} y_j\right)^{\alpha_k - 1} dy_1 \cdots dy_{k-1} \int_{0}^{\infty} {z_k}^{\sum_{i=1}^{k} \alpha_i - 1} e^{-z_k} \, dz_k \\
&= B(\boldsymbol{\alpha}) \, \Gamma\left(\sum_{i=1}^{k} \alpha_i\right)
\end{aligned}
$$

Therefore, this concludes the proof for

$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}$$
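As a numerical sanity check of this normalizer (our own sketch, not from the original post; the function name is our own), for $k=3$ we can estimate the integral of $\prod_i p_i^{\alpha_i - 1}$ over the simplex on a grid and compare it with $\frac{\prod_i \Gamma(\alpha_i)}{\Gamma(\sum_i \alpha_i)}$.

```python
from math import gamma, prod

def dirichlet_normalizer_numeric(alpha, steps=400):
    """Grid estimate of the integral of p1^(a1-1) * p2^(a2-1) * p3^(a3-1)
    over the 2-simplex p1 + p2 + p3 = 1 (k = 3 only)."""
    a1, a2, a3 = alpha
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p1 = (i + 0.5) * h
        for j in range(steps):
            p2 = (j + 0.5) * h
            p3 = 1.0 - p1 - p2
            if p3 > 0.0:  # keep only grid points inside the simplex
                total += p1 ** (a1 - 1) * p2 ** (a2 - 1) * p3 ** (a3 - 1)
    return total * h * h

alpha = (2.0, 3.0, 4.0)
numeric = dirichlet_normalizer_numeric(alpha)
closed_form = prod(gamma(a) for a in alpha) / gamma(sum(alpha))
```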

In addition, Dirichlet distribution is the conjugate prior for multinomial distribution. We have prior $P(\boldsymbol{p};\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{\alpha})$, and likelihood $P(\boldsymbol{x}|\boldsymbol{p};n) \sim \text{Mult}(n,\boldsymbol{p})$. The posterior is then

$$P(\boldsymbol{p}|\boldsymbol{x};n,\boldsymbol{\alpha}) \propto P(\boldsymbol{x}|\boldsymbol{p};n) \, P(\boldsymbol{p};\boldsymbol{\alpha}) \propto \prod_{i=1}^{k} {p_i}^{x_i} \prod_{i=1}^{k} {p_i}^{\alpha_i - 1} = \prod_{i=1}^{k} {p_i}^{x_i + \alpha_i - 1}$$

Therefore, $P(\boldsymbol{p}|\boldsymbol{x};n,\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{x} + \boldsymbol{\alpha})$.
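As with the beta-binomial case, the conjugate update is just per-category parameter arithmetic (our own illustration; the function name is our own).

```python
def dirichlet_posterior(alpha: list[float], x: list[int]) -> list[float]:
    """Dirichlet posterior parameters after observing category counts x."""
    return [a + xi for a, xi in zip(alpha, x)]

# Starting from a symmetric Dir(1, 1, 1) prior, observing counts (3, 5, 2)
# in n = 10 trials yields a Dir(4, 6, 3) posterior.
posterior = dirichlet_posterior([1.0, 1.0, 1.0], [3, 5, 2])
```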