Introduction to Dirichlet Distribution
Introduction
The Dirichlet distribution, also called the multivariate beta distribution, is widely used in text mining techniques such as the Dirichlet process and latent Dirichlet allocation. To understand these techniques well, we first have to understand the Dirichlet distribution thoroughly. To understand the Dirichlet distribution from scratch, we also need to understand the binomial distribution, the multinomial distribution, the gamma function, the beta distribution, and their relationships.
In this tutorial, we will go through the fundamentals of the binomial distribution, the multinomial distribution, the gamma function, the beta distribution, and the Dirichlet distribution, laying the foundations for the Dirichlet process and latent Dirichlet allocation.
Binomial Distribution
The binomial distribution, parameterized by $n$ and $p$, is the discrete probability distribution of the number of successes $x$ in a sequence of $n$ Bernoulli trials, each with success probability $p$. Formally, we denote $P(x;n,p) \sim B(n,p)$.
$$
\begin{align}
P(x;n,p) &= \binom{n}{x} p^{x} (1-p)^{n-x} \\
&= \frac{n!}{x!(n-x)!} p^{x} (1-p)^{n-x}
\end{align}
$$
where $x \in \mathbb{Z}$ and $0 \leq x \leq n$.
The formula is easy to understand. We select $x$ balls from $n$ balls to be successes, and the remaining $n-x$ balls are considered failures. There are $\binom{n}{x}$ ways to select the successful balls, and the probability of each such selection is $p^{x} (1-p)^{n-x}$.
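As a quick sanity check of this formula, here is a minimal sketch, assuming SciPy is available, that compares a direct evaluation against a library implementation:

```python
# A minimal numerical check of the binomial PMF, assuming SciPy is available.
from math import comb

from scipy.stats import binom

n, p, x = 10, 0.3, 4

# PMF from the formula derived above.
pmf_manual = comb(n, x) * p**x * (1 - p) ** (n - x)

# PMF from SciPy's binomial distribution.
pmf_scipy = binom.pmf(x, n, p)

print(pmf_manual, pmf_scipy)  # both ~0.2001
```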
Multinomial Distribution
The multinomial distribution is a high-dimensional generalization of the binomial distribution. The variable, instead of being a single scalar value as in the binomial distribution, is a vector.
In the multinomial distribution, we are no longer doing Bernoulli trials. Instead, each trial has $k$ possible outcomes, with probabilities $\boldsymbol{p} = \{p_1, p_2, \cdots, p_k\}$ for the respective outcomes. The multinomial distribution is the discrete probability distribution of the numbers of occurrences $\boldsymbol{x} = \{x_1, x_2, \cdots, x_k\}$ of each possible outcome in a sequence of $n$ such trials. Formally, we denote $P(\boldsymbol{x};n,\boldsymbol{p}) \sim \text{Mult}(n,\boldsymbol{p})$.
$$
\begin{align}
P(\boldsymbol{x};n, \boldsymbol{p}) &= \binom{n}{x_1} \binom{n - x_1}{x_2} \cdots \binom{n - \sum_{j=1}^{i-1}x_j}{x_i} \cdots \binom{x_k}{x_k} {p_1}^{x_1} {p_2}^{x_2} \cdots {p_k}^{x_k}\\
&= \prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i} \prod_{i=1}^{k} {p_i}^{x_i}\\
&= \prod_{i=1}^{k} \frac{(n - \sum_{j=1}^{i-1}x_j)!}{x_i!(n - \sum_{j=1}^{i}x_j)!} \prod_{i=1}^{k} {p_i}^{x_i}\\
&= \frac{n!}{x_1!(n-{x_1})!} \frac{(n-x_1)!}{x_2!(n-x_1-x_2)!}\cdots\frac{x_k!}{x_k!0!} \prod_{i=1}^{k} {p_i}^{x_i}\\
&= \frac{n!}{\prod_{i=1}^{k} {x_i}!} \prod_{i=1}^{k} {p_i}^{x_i}
\end{align}
$$
where for $1 \leq i \leq k$, $x_i \in \mathbb{Z}$, $0 \leq x_i \leq n$, $\sum_{i=1}^{k}x_i = n$, and $\sum_{i=1}^{k}p_i = 1$.
The formula is also not hard to understand. We select $x_1$ balls from $n$ balls for trials with outcome $1$, $x_2$ balls from the remaining $n-x_1$ balls for trials with outcome $2$, and so on, until finally we select $x_k$ balls from the remaining $x_k$ balls for trials with outcome $k$. There are $\prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i}$ ways to select the balls, and the probability of each such selection is $\prod_{i=1}^{k} {p_i}^{x_i}$.
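As with the binomial case, the closed form can be checked numerically; a minimal sketch, assuming SciPy is available:

```python
# A minimal numerical check of the multinomial PMF, assuming SciPy is available.
from math import factorial, prod

from scipy.stats import multinomial

n = 10
p = [0.2, 0.3, 0.5]
x = [2, 3, 5]

# PMF from the closed form derived above: n! / prod(x_i!) * prod(p_i^x_i).
pmf_manual = factorial(n) / prod(factorial(xi) for xi in x) * prod(
    pi**xi for pi, xi in zip(p, x)
)

pmf_scipy = multinomial.pmf(x, n, p)

print(pmf_manual, pmf_scipy)  # both ~0.0851
```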
Gamma Function
We will talk about the gamma function instead of the gamma distribution, because the gamma distribution is not directly needed for understanding the Dirichlet distribution. The gamma function is well defined for any complex number $x$ whose real part $\Re(x)$ is greater than 0. The definition of the gamma function is
$$
\begin{align}
\Gamma(x) = \int_{0}^{\infty} {s^{x-1} e^{-s} ds}
\end{align}
$$
where $\Re(x) > 0$. In this tutorial, all the numbers we use are real.
The gamma function has a special property, which will be used to derive the properties of the beta distribution and the Dirichlet distribution.
$$
\begin{align}
\Gamma(x+1) = x\Gamma(x)
\end{align}
$$
The proof follows from the definition of the gamma function and integration by parts.
$$
\begin{align*}
\Gamma(x+1) &= \int_{0}^{\infty} {s^{x} e^{-s} ds} \\
&= \big[s^{x} (-e^{-s})\big] \big|_{0}^{\infty} - \int_{0}^{\infty} {(x s^{x-1}) (-e^{-s}) ds} \\
&= (0 - 0) + x \int_{0}^{\infty} {s^{x-1} e^{-s} ds} \\
&= x \Gamma(x)
\end{align*}
$$
This concludes the proof.
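The recurrence is easy to verify numerically; a minimal sketch using Python's standard library:

```python
# A quick numerical check of the recurrence Gamma(x + 1) = x * Gamma(x),
# using Python's standard-library gamma function.
from math import gamma, isclose

for x in [0.5, 1.7, 3.0, 10.2]:
    assert isclose(gamma(x + 1), x * gamma(x))
print("recurrence holds for the sampled points")
```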
There are some special values of the gamma function.
$$
\begin{align}
\Gamma(1) & =1\\
\Gamma(\frac{1}{2}) &= \sqrt{\pi}
\end{align}
$$
It might not be obvious that $\Gamma(\frac{1}{2}) = \sqrt{\pi}$. To show this, we use the properties of the Gaussian distribution.
The probability density of a Gaussian distribution is well defined on $(-\infty, \infty)$.
$$
\begin{align}
\varphi(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{align}
$$
When $\mu = 0$ and $\sigma^2 = \frac{1}{2}$,
$$
\begin{align}
\varphi(x) = \frac{1}{\sqrt{\pi}} e^{-{x^2}}
\end{align}
$$
This Gaussian distribution is symmetric about $x=0$, so we have
$$
\begin{align*}
\int_{0}^{\infty} \varphi(x) dx &= \int_{0}^{\infty} \frac{1}{\sqrt{\pi}} e^{-{x^2}} dx \\
&= \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} e^{-{x^2}} dx \\
&= \frac{1}{2}
\end{align*}
$$
Therefore,
$$
\begin{align}
\int_{0}^{\infty} e^{-{x^2}} dx = \frac{\sqrt{\pi}}{2}
\end{align}
$$
We use this integral to evaluate $\Gamma(\frac{1}{2})$.
$$
\begin{align*}
\Gamma(\frac{1}{2}) &= \int_{0}^{\infty} {s^{-\frac{1}{2}} e^{-s} ds} \\
&= 2 \int_{0}^{\infty} { e^{-s} d{s^{\frac{1}{2}}}} \\
&= 2 \int_{0}^{\infty} e^{-x^2} d{x} \\
&= 2 \frac{\sqrt{\pi}}{2} \\
&= \sqrt{\pi}
\end{align*}
$$
Because $\Gamma(1) = 1$ and $\Gamma(x+1) = x\Gamma(x)$, for a positive integer $n$ we have $\Gamma(n) = (n-1)!$, so the gamma function extends the factorial function. In general, the gamma function can be considered a continuous interpolation of the factorial function.
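A minimal sketch, using Python's standard library, that checks the special values and the factorial correspondence:

```python
# Gamma interpolates the factorial: Gamma(n) = (n - 1)! for positive integers,
# and Gamma(1/2) = sqrt(pi).
from math import factorial, gamma, isclose, pi, sqrt

assert gamma(1) == 1
assert isclose(gamma(0.5), sqrt(pi))
for n in range(1, 10):
    assert isclose(gamma(n), factorial(n - 1))
print("gamma matches the shifted factorial")
```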
Jacobian
The Jacobian, denoted by $J$, is the determinant of the Jacobian matrix, which appears when changing variables in multiple integrals.
For a continuous one-to-one transformation $\Phi$ mapping a region $D^{\ast}$ in the space $(y_1, y_2, \cdots, y_k)$ to a region $D$ in the space $(x_1, x_2, \cdots, x_k)$,
$$
\begin{gather*}
x_1 = \Phi_1(y_1, y_2, \cdots, y_k) \\
x_2 = \Phi_2(y_1, y_2, \cdots, y_k) \\
\vdots \\
x_k = \Phi_k(y_1, y_2, \cdots, y_k) \\
\end{gather*}
$$
The Jacobian matrix, denoted by $\frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)}$, is defined as
$$
\begin{align}
\frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)} = \begin{bmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \dots & \frac{\partial x_1}{\partial y_k} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \dots & \frac{\partial x_2}{\partial y_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_k}{\partial y_1} & \frac{\partial x_k}{\partial y_2} & \dots & \frac{\partial x_k}{\partial y_k} \\
\end{bmatrix}
\end{align}
$$
The Jacobian is defined as the determinant of the Jacobian matrix.
$$
\begin{align}
J = \left| \frac{\partial(x_1, x_2, \cdots, x_k)}{\partial(y_1, y_2, \cdots, y_k)} \right|
\end{align}
$$
We then have the following change-of-variables formula for multiple integrals. Strictly speaking, the absolute value $|J|$ appears in the formula; in all the transformations used in this tutorial, $J$ is nonnegative.
$$
\begin{align}
\idotsint\limits_{D} f(x_1, x_2, \cdots, x_k) d{x_1}d{x_2}{\cdots}d{x_k}= \idotsint\limits_{D^{\ast}} f\left(\Phi\left(y_1, y_2, \cdots, y_k\right)\right) J d{y_1}d{y_2}{\cdots}d{y_k}
\end{align}
$$
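As a concrete instance of this machinery, the following sketch, assuming SymPy is available, symbolically computes the Jacobian of the transformation $u = xy$, $v = (1-x)y$ that will reappear in the beta-distribution derivation below; the determinant simplifies to $y$:

```python
# Symbolic Jacobian for the transformation u = x*y, v = (1 - x)*y,
# which is used in the beta-distribution derivation below; J = y.
import sympy as sp

x, y = sp.symbols("x y", positive=True)
u = x * y
v = (1 - x) * y

jacobian_matrix = sp.Matrix([u, v]).jacobian([x, y])
J = sp.simplify(jacobian_matrix.det())
print(J)  # y
```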
Beta Distribution
The beta distribution is a family of continuous probability distributions defined on the interval $[0, 1]$ and parameterized by two positive shape parameters, denoted by $\alpha$ and $\beta$, that control the shape of the distribution. Formally, we denote $P(p;\alpha,\beta) \sim \text{Beta}(\alpha,\beta)$.
$$
\begin{align}
P(p;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}
\end{align}
$$
where $B(\alpha,\beta)$ is a normalizing constant, and
$$
\begin{align}
B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
\end{align}
$$
It is less well known why $B(\alpha,\beta)$ can be expressed this way, so we derive it in this tutorial.
Because
$$
\begin{align*}
\int_{0}^{1} P(p;\alpha,\beta) dp &= \int_{0}^{1} \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1} dp \\
&= \frac{1}{B(\alpha,\beta)} \int_{0}^{1} p^{\alpha-1}(1-p)^{\beta-1} dp \\
&= 1
\end{align*}
$$
We have
$$
\begin{align}
B(\alpha,\beta) = \int_{0}^{1} p^{\alpha-1}(1-p)^{\beta-1} dp
\end{align}
$$
We then check what $\Gamma(\alpha)\Gamma(\beta)$ is.
$$
\begin{align}
\Gamma(\alpha)\Gamma(\beta) &= \int_{0}^{\infty}u^{\alpha-1}e^{-u}du \int_{0}^{\infty}v^{\beta-1}e^{-v}dv \nonumber\\
&= \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1}e^{-u} v^{\beta-1}e^{-v} du dv \nonumber\\
&= \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1} v^{\beta-1}e^{-(u+v)} du dv
\end{align}
$$
We set $x=\frac{u}{u+v}$ and $y=u+v$, where $x \in [0,1]$ and $y \in [0,\infty)$, so the mapping from the $xy$ space back to the $uv$ space is
$$
\begin{align}
u &= xy \\
v &= (1-x)y
\end{align}
$$
The Jacobian matrix is
$$
\begin{align}
\frac{\partial(u,v)}{\partial(x,y)} &=
\begin{bmatrix}
\frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\
\frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \\
\end{bmatrix} \nonumber\\
&=
\begin{bmatrix}
y & x \\
-y & 1-x \\
\end{bmatrix}
\end{align}
$$
The Jacobian is
$$
\begin{align}
J &= \left| \frac{\partial(u,v)}{\partial(x,y)} \right| \nonumber\\
&= y(1-x) - x(-y) \nonumber\\
&= y
\end{align}
$$
By applying the transformation for multiple integrals,
$$
\begin{align}
\Gamma(\alpha)\Gamma(\beta) &= \int_{0}^{\infty} \int_{0}^{\infty} u^{\alpha-1} v^{\beta-1}e^{-(u+v)} du dv \nonumber\\
&= \int_{y=0}^{\infty} \int_{x=0}^{1} (xy)^{\alpha-1} \left[\left(1-x\right)y\right]^{\beta-1}e^{-y} y dx dy \nonumber\\
&= \int_{y=0}^{\infty} \int_{x=0}^{1} x^{\alpha-1} (1-x)^{\beta-1} y^{\alpha + \beta -1} e^{-y} dx dy \nonumber\\
&= \int_{0}^{\infty} y^{\alpha + \beta -1} e^{-y} dy\int_{0}^{1} x^{\alpha-1} (1-x)^{\beta-1} dx \nonumber\\
&= \Gamma(\alpha+\beta) \int_{0}^{1} x^{\alpha-1} (1-x)^{\beta-1} dx \nonumber\\
&= \Gamma(\alpha+\beta) B(\alpha,\beta)
\end{align}
$$
This concludes the proof that
$$
\begin{align*}
B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
\end{align*}
$$
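The identity can also be verified numerically by evaluating the defining integral with quadrature; a minimal sketch, assuming SciPy is available:

```python
# Numerically verify B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
# by evaluating the defining integral with quadrature. Assumes SciPy is available.
from math import gamma, isclose

from scipy.integrate import quad

alpha, beta = 2.5, 4.0

integral, _ = quad(lambda p: p ** (alpha - 1) * (1 - p) ** (beta - 1), 0, 1)
closed_form = gamma(alpha) * gamma(beta) / gamma(alpha + beta)

assert isclose(integral, closed_form, rel_tol=1e-9)
print(integral, closed_form)
```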
In addition, the beta distribution is the conjugate prior for the binomial distribution. We have the prior $P(p;\alpha,\beta) \sim \text{Beta}(\alpha, \beta)$ and the likelihood $P(x|p;n) \sim B(n,p)$.
$$
\begin{align}
P(p|x;n,\alpha,\beta) &\propto P(x|p;n) P(p;\alpha,\beta) \nonumber\\
&= \binom{n}{x} p^{x} (1-p)^{n-x} \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1} \nonumber\\
&= \frac{\binom{n}{x}}{B(\alpha,\beta)}p^{x + \alpha-1}(1-p)^{n - x + \beta-1} \nonumber\\
&\propto p^{x + \alpha-1}(1-p)^{n - x + \beta-1}
\end{align}
$$
Therefore, $P(p|x;n,\alpha,\beta) \sim \text{Beta}(x + \alpha, n - x + \beta)$.
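A minimal sketch of this conjugate update, assuming SciPy is available; the prior parameters and observed counts here are arbitrary illustrative values:

```python
# A sketch of the Beta-binomial conjugate update: observing x successes out of
# n Bernoulli trials turns a Beta(alpha, beta) prior into a
# Beta(x + alpha, n - x + beta) posterior. Assumes SciPy is available.
from scipy.stats import beta as beta_dist

alpha, beta = 2.0, 2.0  # prior pseudo-counts (illustrative values)
n, x = 20, 14           # observed trials and successes (illustrative values)

posterior = beta_dist(x + alpha, n - x + beta)
print(posterior.mean())  # (x + alpha) / (n + alpha + beta) = 16/24 ~ 0.667
```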
Dirichlet Distribution
Just as the multinomial distribution generalizes the binomial distribution, the Dirichlet distribution is the multivariate generalization of the beta distribution. The Dirichlet distribution is a family of continuous probability distributions over discrete probability distributions on $k$ categories $\boldsymbol{p} = \{p_1, p_2, \cdots, p_k\}$, where $0 \leq p_i \leq 1$ for $i \in [1,k]$ and $\sum_{i=1}^{k} p_i = 1$, parameterized by $k$ parameters $\boldsymbol{\alpha} = \{\alpha_1, \alpha_2, \cdots, \alpha_k\}$. Formally, we denote $P(\boldsymbol{p};\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{\alpha})$.
$$
\begin{align}
P(\boldsymbol{p};\boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{\alpha_i-1}
\end{align}
$$
where $B(\boldsymbol{\alpha})$ is a normalizing constant, and
$$
\begin{align}
B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k}\Gamma(\alpha_i)}{\Gamma \left(\sum_{i=1}^{k}\alpha_i \right)}
\end{align}
$$
Not surprisingly, when $k=2$, the Dirichlet distribution reduces to the beta distribution.
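This reduction can be checked numerically; a minimal sketch, assuming SciPy is available, compares the Dirichlet density at $(p, 1-p)$ against the beta density at $p$:

```python
# When k = 2, the Dirichlet density at (p, 1 - p) matches the Beta density at p.
# A minimal check, assuming SciPy is available.
from math import isclose

from scipy.stats import beta, dirichlet

a1, a2, p = 3.0, 5.0, 0.3

pdf_dirichlet = dirichlet.pdf([p, 1 - p], [a1, a2])
pdf_beta = beta.pdf(p, a1, a2)

assert isclose(pdf_dirichlet, pdf_beta)
print(pdf_dirichlet, pdf_beta)
```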
Similar to the normalizer of the beta distribution, we will show why $B(\boldsymbol{\alpha})$ can be expressed this way. Note that $p_k = 1 - \sum_{i=1}^{k-1} p_i$ is determined by the other components, so the integrals over $p_1, \cdots, p_{k-1}$ below are implicitly restricted to the simplex where $\sum_{i=1}^{k-1} p_i \leq 1$.
Because
$$
\begin{align*}
\int P(\boldsymbol{p};\boldsymbol{\alpha}) d\boldsymbol{p} &= \int \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{\alpha_i-1} d\boldsymbol{p} \\
&= \int_{p_{k}=0}^{1} \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{\alpha_i-1} d{p_1} d{p_2} \cdots d{p_{k-1}} d{p_{k}}\\
&= \int_{p_{k}=0}^{1} \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} \frac{1}{B(\boldsymbol{\alpha})}\left(\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1}\right)\left(1-\sum_{i=1}^{k-1} p_i\right)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}} d{p_{k}}\\
&= \int_{p_{k}=0}^{1} d{p_{k}} \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} \frac{1}{B(\boldsymbol{\alpha})}\left(\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1}\right)\left(1-\sum_{i=1}^{k-1} p_i\right)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}} \\
&= \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} \frac{1}{B(\boldsymbol{\alpha})}\left(\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1}\right)\left(1-\sum_{i=1}^{k-1} p_i\right)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}} \\
&= \frac{1}{B(\boldsymbol{\alpha})} \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} \left(\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1}\right)\left(1-\sum_{i=1}^{k-1} p_i\right)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}} \\
&= 1
\end{align*}
$$
We have
$$
\begin{align}
B(\boldsymbol{\alpha}) = \int_{p_{k-1}=0}^{1} \cdots \int_{p_1=0}^{1} \left(\prod_{i=1}^{k-1} {p_i}^{\alpha_i-1}\right)\left(1-\sum_{i=1}^{k-1} p_i\right)^{\alpha_k-1} d{p_1} d{p_2} \cdots d{p_{k-1}}
\end{align}
$$
We then check what $\prod_{i=1}^{k}\Gamma(\alpha_i)$ is.
$$
\begin{align}
\prod_{i=1}^{k}\Gamma(\alpha_i) &= \prod_{i=1}^{k} \int_{0}^{\infty}{x_i}^{\alpha_i-1}e^{-x_i}d{x_i} \nonumber\\
&= \int_{x_{k}=0}^{\infty}\int_{x_{k-1}=0}^{\infty} \cdots \int_{x_1=0}^{\infty} \left( \prod_{i=1}^{k} \left({x_i}^{\alpha_i-1}e^{-x_i}\right)\right) d{x_1} d{x_2} \cdots d{x_k} \nonumber\\
&= \int_{x_{k}=0}^{\infty}\int_{x_{k-1}=0}^{\infty} \cdots \int_{x_1=0}^{\infty} e^{-\sum_{i=1}^{k}x_i} \prod_{i=1}^{k} {x_i}^{\alpha_i-1} d{x_1} d{x_2} \cdots d{x_k}
\end{align}
$$
We set
$$
\begin{gather*}
z_k = \sum_{i=1}^{k} x_i \\
y_1 = \frac{x_1}{z_k} \\
\vdots \\
y_{k-1} = \frac{x_{k-1}}{z_k}
\end{gather*}
$$
where $y_i \in [0,1]$ for $1 \leq i \leq k-1$ and $z_k \in [0,\infty)$, so the mapping from the $(y_1, y_2, \cdots, y_{k-1}, z_k)$ space back to the $(x_1, x_2, \cdots, x_k)$ space is
$$
\begin{gather*}
x_1 = y_1 z_k \\
x_2 = y_2 z_k \\
\vdots \\
x_{k-1} = y_{k-1} z_k \\
x_{k} = \left(1- \sum_{i=1}^{k-1} y_{i}\right) z_k \\
\end{gather*}
$$
The Jacobian matrix is
$$
\begin{align}
\frac{\partial(x_1,x_2,\cdots,x_{k-1},x_k)}{\partial(y_1,y_2,\cdots,y_{k-1},z_k)} &=
\begin{bmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \dots & \frac{\partial x_1}{\partial y_{k-1}} & \frac{\partial x_1}{\partial z_k} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \dots & \frac{\partial x_2}{\partial y_{k-1}} & \frac{\partial x_2}{\partial z_k} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{\partial x_{k-1}}{\partial y_1} & \frac{\partial x_{k-1}}{\partial y_2} & \dots & \frac{\partial x_{k-1}}{\partial y_{k-1}} & \frac{\partial x_{k-1}}{\partial z_k} \\
\frac{\partial x_k}{\partial y_1} & \frac{\partial x_k}{\partial y_2} & \dots & \frac{\partial x_{k}}{\partial y_{k-1}} & \frac{\partial x_k}{\partial z_k} \\
\end{bmatrix} \nonumber\\
&=
\begin{bmatrix}
z_k & 0 & \dots & 0 & y_1 \\
0 & z_k & \dots & 0 & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & z_k & y_{k-1} \\
-z_k & -z_k & \dots & -z_k & (1- \sum_{i=1}^{k-1} y_{i}) \\
\end{bmatrix}
\end{align}
$$
The Jacobian is computed via Gaussian elimination, adding each of rows $1$ through $k-1$ to row $k$.
$$
\begin{align}
J &= \left| \frac{\partial(x_1,x_2,\cdots,x_{k-1},x_k)}{\partial(y_1,y_2,\cdots,y_{k-1},z_k)} \right| \nonumber\\
&=
\begin{vmatrix}
z_k & 0 & \dots & 0 & y_1 \\
0 & z_k & \dots & 0 & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & z_k & y_{k-1} \\
-z_k & -z_k & \dots & -z_k & (1- \sum_{i=1}^{k-1} y_{i}) \\
\end{vmatrix} \nonumber\\
&=
\begin{vmatrix}
z_k & 0 & \dots & 0 & y_1 \\
0 & z_k & \dots & 0 & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \dots & z_k & y_{k-1} \\
0 & 0 & \dots & 0 & 1 \\
\end{vmatrix} \\
&= {z_k}^{k-1}
\end{align}
$$
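For a concrete instance, the determinant can be verified symbolically for $k = 3$; a sketch assuming SymPy is available:

```python
# Symbolic check of the Jacobian determinant for k = 3:
# x1 = y1*z, x2 = y2*z, x3 = (1 - y1 - y2)*z, so J should be z**2.
import sympy as sp

y1, y2, z = sp.symbols("y1 y2 z", positive=True)
x = sp.Matrix([y1 * z, y2 * z, (1 - y1 - y2) * z])

J = sp.simplify(x.jacobian([y1, y2, z]).det())
print(J)  # z**2
```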
By applying the transformation for multiple integrals,
$$
\begin{align}
& \prod_{i=1}^{k}\Gamma(\alpha_i) \\
&= \int_{x_{k}=0}^{\infty}\int_{x_{k-1}=0}^{\infty} \cdots \int_{x_1=0}^{\infty} e^{-\sum_{i=1}^{k}x_i} \prod_{i=1}^{k} {x_i}^{\alpha_i-1} d{x_1} d{x_2} \cdots d{x_{k-1}}d{x_k} \nonumber\\
&= \int_{z_{k}=0}^{\infty}\int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} e^{-z_k} \left(\prod_{i=1}^{k-1} {\left(y_i z_k\right)}^{\alpha_i-1}\right)\left(z_k\left(1- \sum_{i=1}^{k-1} y_{i}\right)\right)^{\alpha_k - 1} {z_k}^{k-1} d{y_1} d{y_2} \cdots d{y_{k-1}}d{z_k} \nonumber \\
&= \int_{z_{k}=0}^{\infty}\int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} e^{-z_k} {z_k}^{\left(\sum_{i=1}^{k} \alpha_i \right) - k} \left(\prod_{i=1}^{k-1} {y_i}^{\alpha_i-1}\right)\left(1- \sum_{i=1}^{k-1} y_{i}\right)^{\alpha_k - 1} {z_k}^{k-1} d{y_1} d{y_2} \cdots d{y_{k-1}}d{z_k} \nonumber \\
&= \int_{z_{k}=0}^{\infty}\int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} e^{-z_k} {z_k}^{\left(\sum_{i=1}^{k} \alpha_i \right) - 1} \left(\prod_{i=1}^{k-1} {y_i}^{\alpha_i-1}\right)\left(1- \sum_{i=1}^{k-1} y_{i}\right)^{\alpha_k - 1} d{y_1} d{y_2} \cdots d{y_{k-1}}d{z_k} \nonumber \\
&= \int_{z_{k}=0}^{\infty} e^{-z_k} {z_k}^{\left(\sum_{i=1}^{k} \alpha_i \right) - 1} d{z_k} \int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} \left(\prod_{i=1}^{k-1} {y_i}^{\alpha_i-1}\right)\left(1- \sum_{i=1}^{k-1} y_{i}\right)^{\alpha_k - 1} d{y_1} d{y_2} \cdots d{y_{k-1}} \nonumber \\
&= \Gamma \left(\sum_{i=1}^{k} \alpha_i \right) \int_{y_{k-1}=0}^{1} \cdots \int_{y_1=0}^{1} \left(\prod_{i=1}^{k-1} {y_i}^{\alpha_i-1}\right)\left(1- \sum_{i=1}^{k-1} y_{i}\right)^{\alpha_k - 1} d{y_1} d{y_2} \cdots d{y_{k-1}} \\
&= \Gamma \left(\sum_{i=1}^{k} \alpha_i \right) B(\boldsymbol{\alpha})
\end{align}
$$
This concludes the proof that
$$
\begin{align*}
B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k}\Gamma(\alpha_i)}{\Gamma \left(\sum_{i=1}^{k}\alpha_i \right)}
\end{align*}
$$
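The identity can be verified numerically for $k = 3$ by integrating over the simplex $p_1 + p_2 \leq 1$; a minimal sketch, assuming SciPy is available:

```python
# Numerically verify B(alpha) = prod(Gamma(alpha_i)) / Gamma(sum(alpha_i))
# for k = 3 by integrating over the simplex p1 + p2 <= 1. Assumes SciPy.
from math import gamma, isclose

from scipy.integrate import dblquad

a1, a2, a3 = 2.0, 3.0, 4.0

integral, _ = dblquad(
    lambda p2, p1: p1 ** (a1 - 1) * p2 ** (a2 - 1) * (1 - p1 - p2) ** (a3 - 1),
    0, 1,                  # p1 from 0 to 1
    0, lambda p1: 1 - p1,  # p2 from 0 to 1 - p1
)
closed_form = gamma(a1) * gamma(a2) * gamma(a3) / gamma(a1 + a2 + a3)

assert isclose(integral, closed_form, rel_tol=1e-7)
print(integral, closed_form)
```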
In addition, the Dirichlet distribution is the conjugate prior for the multinomial distribution. We have the prior $P(\boldsymbol{p};\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{\alpha})$ and the likelihood $P(\boldsymbol{x}|\boldsymbol{p};n) \sim \text{Mult}(n,\boldsymbol{p})$.
$$
\begin{align}
P(\boldsymbol{p}|\boldsymbol{x};n,\boldsymbol{\alpha}) &\propto P(\boldsymbol{x}|\boldsymbol{p};n) P(\boldsymbol{p};\boldsymbol{\alpha}) \nonumber\\
&= \prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i} \prod_{i=1}^{k} {p_i}^{x_i} \frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{\alpha_i-1} \nonumber\\
&= \frac{\prod_{i=1}^{k} \binom{n - \sum_{j=1}^{i-1}x_j}{x_i}}{B(\boldsymbol{\alpha})}\prod_{i=1}^{k} {p_i}^{x_i + \alpha_i-1} \nonumber\\
&\propto \prod_{i=1}^{k} {p_i}^{x_i + \alpha_i-1}
\end{align}
$$
Therefore, $P(\boldsymbol{p}|\boldsymbol{x};n,\boldsymbol{\alpha}) \sim \text{Dir}(\boldsymbol{x} + \boldsymbol{\alpha})$.
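A minimal sketch of this conjugate update, assuming NumPy and SciPy are available; the prior parameters and observed counts are arbitrary illustrative values:

```python
# A sketch of the Dirichlet-multinomial conjugate update: observing counts x
# turns a Dir(alpha) prior into a Dir(x + alpha) posterior. Assumes NumPy/SciPy.
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 1.0, 1.0])  # symmetric prior over 3 categories (illustrative)
x = np.array([5, 2, 3])            # observed counts, n = 10 (illustrative)

posterior = dirichlet(x + alpha)
print(posterior.mean())  # (x + alpha) / sum(x + alpha) = [6, 3, 4] / 13
```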