Introduction to Dirichlet Distribution

Introduction

Dirichlet distribution, also called multivariate beta distribution, is widely used in text mining techniques, such as Dirichlet process and latent Dirichlet allocation. To have a better understanding of these text mining techniques, we have to first understand Dirichlet distribution thoroughly. To understand the Dirichlet distribution from scratch, we would also need to understand binomial distribution, multinomial distribution, gamma function, beta distribution, and their relationships.

In this tutorial, we are going through the fundamentals of binomial distribution, multinomial distribution, gamma function, beta distribution, and Dirichlet distribution, laying the foundations to Dirichlet process and latent Dirichlet allocation.

Binomial Distribution

Binomial distribution, parameterized by n and p, is the discrete probability distribution of the number of successes x in a sequence of n Bernoulli trials with success probability of p. Formally, we denote P(x;n,p)B(n,p).

P(x;n,p)=(nx)px(1p)nx=n!x!(nx)!px(1p)nx

where xZ and 0xn.

It is very easy to understand the formula. We select x balls from n balls for success, the rest nx balls are considered as failures. There are (nx) combinations to select the successful balls. The probability for each combination is px(1p)nx.

Multinomial Distribution

Multinomial distribution is simply a generalized high dimensional version of binomial distribution. The variable, instead of being a single scalar value in binomial distribution, is a multivariable vector in multinomial distribution.

In multinomial distribution, we are not doing Bernoulli trials any more. Instead, each trial has k possible consequences, with success probabilities of p={p1,p2,,pk} for each possible consequence. Multinomial distribution is the the discrete probability distribution of the number of successes x={x1,x2,,xk} for each of the possible consequence in a sequence of n such trials. Formally, we denote P(x;n,p)Mult(n,p).

P(x;n,p)=(nx1)(nx1x2)(nj=1i1xjxi)(xkxk)p1x1p2x2pkxk=i=1k(nj=1i1xjxi)i=1kpixi=i=1k(nj=1i1xj)!xi!(nj=1ixj)!i=1kpixi=n!x1!(nx1)!nx1!x2!(nx1x2)!xk!xk!0!i=1kpixi=n!i=1kxi!i=1kpixi

where for 1ik, xiZ, 0xin, i=1kxi=n, and i=1kpi=1.

It is also not hard to understand the formula. We select x1 balls from n balls for trials with consequence 1, x2 balls from the remaining nx1 balls for trials with consequence 2, etc., and xk balls from the remaining xk balls for trials with consequence k. There are i=1k(nj=1i1xjxi) ways to select balls. The probability for each combination is i=1kpixi.

Gamma Function

We will talk about gamma function, instead of gamma distribution, because gamma distribution does not need to be directly related to Dirichlet distribution. Gamma function is well defined for any complex number x whose real component (x) is greater than 0. The definition of gamma function is

Γ(x)=0sx1esds

where (x)>0. In this tutorial, all the numbers we are using are non-complex.

Gamma function has a special property, which will be used for deriving the properties of beta distribution and Dirichlet distribution.

Γ(x+1)=xΓ(x)

The proof is presented as follows using the definition of gamma function and integral by parts.

Γ(x+1)=0sxesds=[sx(es)]|00(xsx1)(es)ds=(00)+x0sx1esds=xΓ(x)

This concludes the proof.

There are some special values for gamma function.

Γ(1)=1Γ(12)=π

It might not be trivial to find Γ(12)=π. To show this, we have to use the properties of Gaussian distribution.

The probability density of a Gaussian distribution is well defined from to .

φ(x;μ,σ2)=12πσ2e(xμ)22σ2

When μ=0 and σ2=12,

φ(x)=1πex2

This Gaussian distribution is symmetric about x=0. So we have

0φ(x)dx=01πex2dx=1π0ex2dx=12

Therefor,

0ex2dx=π2

We use this integral for Γ(12).

Γ(12)=0s12esds=20esds12=20ex2dx=2π2=π

Because Γ(1)=1 and Γ(x+1)=xΓ(x), when x is positive integer, gamma function is exactly the factorial function. In general, gamma function could be considered as the continuous interpolated factorial function.

Jacobian

The Jacobian, denoted by J, is the determinant of Jacobian matrix, which appears when changing the variables in multiple integrals.

For a continuous 1-to-1 transformation Φ mapping for a region D from space (x1,x2,,xk) to a region D from space (y1,y2,,yk),

x1=Φ1(y1,y2,,yk)x2=Φ2(y1,y2,,yk)xk=Φk(y1,y2,,yk)

The Jacobian matrix, denoted by (x1,x2,,xk)(y1,y2,,yk), is defined as

(x1,x2,,xk)(y1,y2,,yk)=[x1y1x1y2x1ykx2y1x2y2x2ykxky1xky2xkyk]

The Jacobian of the Jacobian matrix is defined as the determinant of the Jacobian matrix.

J=|(x1,x2,,xk)(y1,y2,,yk)|

We then have the following transformation for the multiple integrals.

Df(x1,x2,,xk)dx1dx2dxk=Df(Φ(y1,y2,,yk))Jdy1dy2dyk

Beta Distribution

Beta distribution is a family of continuous probability distributions well defined on the interval [0,1] parametrized by two positive shape parameters, denoted by α and β, that controls the shape of the distribution. Formally, we denote P(p;α,β)Beta(α,β).

P(p;α,β)=1B(α,β)pα1(1p)β1

where B(α,β) is some constant normalizer, and

B(α,β)=Γ(α)Γ(β)Γ(α+β)

It is less well known how B(α,β) could be expressed in this way. So we would derive this in this tutorial.

Because

01P(p;α,β)dp=011B(α,β)pα1(1p)β1dp=1B(α,β)01pα1(1p)β1dp=1

We have

B(α,β)=01pα1(1p)β1dp

We then check what Γ(α)Γ(β) is.

Γ(α)Γ(β)=0uα1eudu0vβ1evdv=00uα1euvβ1evdudv=00uα1vβ1e(u+v)dudv

We set x=uu+v and y=u+v where x[0,1] and y[0,], so the mapping from uv space to xy space is

u=xyv=(1x)y

The Jacobian matrix is

(u,v)(x,y)=[uxuyvxvy]=[yxy1x]

The Jacobian is

J=|(u,v)(x,y)|=y(1x)x(y)=y

By applying the transoformation for multiple integrals,

Γ(α)Γ(β)=00uα1vβ1e(u+v)dudv=y=0x=01(xy)α1[(1x)y]β1eyydxdy=y=0x=01xα1(1x)β1yα+β1eydxdy=0yα+β1eydy01xα1(1x)β1dx=Γ(α+β)01xα1(1x)β1dx=Γ(α+β)B(α,β)

Therefore, this concludes the proof for

B(α,β)=Γ(α)Γ(β)Γ(α+β)

In addition, beta distribution is the conjugate prior for binomial distribution. We have prior P(p;α,β)Beta(α,β), and likelihood P(x|p;n)B(n,p).

P(p|x;n,α,β)P(x|p;n)P(p;α,β)=(nx)px(1p)nx1B(α,β)pα1(1p)β1=(nx)B(α,β)px+α1(1p)nx+β1px+α1(1p)nx+β1

Therefore, P(p|x;n,α,β)Beta(x+α,nx+β).

Dirichlet Distribution

Analogous to multinomial distribution to binomial distribution, Dirichlet is the multinomial version for the beta distribution. Dirichlet distribution is a family of continuous probability distribution for a discrete probability distribution for k categories p={p1,p2,,pk}, where 0pi1 for i[1,k] and i=1kpi=1, denoted by k parameters α={α1,α2,,αk}. Formally, we denote P(p;α)Dir(α).

P(p;α)=1B(α)i=1kpiαi1

where B(α) is some constant normalizer, and

B(α)=i=1kΓ(αi)Γ(i=1kαi)

Not surprisingly, when k=2, Dirichlet distribution becomes beta distribution.

Similar to the normalizer in beta distribution, we would show B(α) could be expressed in this way.

Because

P(p;α)dp=1B(α)i=1kpiαi1dp=pk=01pk1=01p1=011B(α)i=1kpiαi1dp1dp2dpk1dpk=pk=01pk1=01p1=011B(α)(i=1k1piαi1)(1i=1k1pi)αk1dp1dp2dpk1dpk=pk=01dpkpk1=01p1=011B(α)(i=1k1piαi1)(1i=1k1pi)αk1dp1dp2dpk1=pk1=01p1=011B(α)(i=1k1piαi1)(1i=1k1pi)αk1dp1dp2dpk1=1B(α)pk1=01p1=01(i=1k1piαi1)(1i=1k1pi)αk1dp1dp2dpk1=1

We have

B(α)=pk1=01p1=01(i=1k1piαi1)(1i=1k1pi)αk1dp1dp2dpk1

We then check what i=1kΓ(αi) is.

i=1kΓ(αi)=i=1k0xiαi1exidxi=xk=0xk1=0x1=0(i=1k(xiαi1exi))dx1dx2dxk=xk=0xk1=0x1=0ei=1kxii=1kxiαi1dx1dx2dxk

We set

zk=i=1kxiy1=x1zkyk1=xk1zk

where yi[0,1] for 0ik1 and zk[0,], so the mapping from (x1,x2,,xk) space to (y1,y2,,yk1,zk) space is

x1=y1zkx2=y2zkxk1=yk1zkxk=(1i=1k1yi)zk

The Jacobian matrix is

(x1,x2,,xk1,xk)(y1,y2,,yk1,zk)=[x1y1x1y2x1yk1x1zkx2y1x2y2x2yk1x2zkxk1y1xk1y2xk1yk1xk1zkxky1xky2xkyk1xkzk]=[zk00y10zk0y200zkyk1zkzkzk(1i=1k1yi)]

The Jacobian is computed via Gaussian elimination by adding each of the row from row 1 to k1 to row k.

J=|(x1,x2,,xk1,xk)(y1,y2,,yk1,zk)|=|zk00y10zk0y200zkyk1zkzkzk(1i=1k1yi)|=|zk00y10zk0y200zkyk10001|=zkk1

By applying the transformation for multiple integrals,

i=1kΓ(αi)=xk=0xk1=0x1=0ei=1kxii=1kxiαi1dx1dx2dxk1dxk=zk=0yk1=01y1=01ezk(i=1k1(yizk)αi1)(zk(1i=1k1yi))αk1zkk1dy1dy2dyk1dzk=zk=0yk1=01y1=01ezkzk(i=1kαi)k(i=1k1yiαi1)(1i=1k1yi)αk1zkk1dy1dy2dyk1dzk=zk=0yk1=01y1=01ezkzk(i=1kαi)1(i=1k1yiαi1)(1i=1k1yi)αk1dy1dy2dyk1dzk=zk=0ezkzk(i=1kαi)1dzkyk1=01y1=01(i=1k1yiαi1)(1i=1k1yi)αk1dy1dy2dyk1=Γ(i=ikαi)yk1=01y1=01(i=1k1yiαi1)(1i=1k1yi)αk1dy1dy2dyk1=Γ(i=ikαi)B(α)

Therefore, this concludes the proof for

B(α)=i=1kΓ(αi)Γ(i=1kαi)

In addition, Dirichlet distribution is the conjugate prior for multinomial distribution. We have prior P(p;α)Dir(α), and likelihood P(x|p;n)Mult(n,p).

P(p|x;n,α)P(x|p;n)P(p;α)=i=1k(nj=1i1xjxi)i=1kpixi1B(α)i=1kpiαi1=i=1k(nj=1i1xjxi)B(α)i=1kpixi+αi1i=1kpixi+αi1

Therefore, P(p|x;n,α)Dir(x+α).

References

Author

Lei Mao

Posted on

09-10-2019

Updated on

09-10-2019

Licensed under


Comments