### Introduction

The statistical inference for non-trivial models, such as multivariate Bayesian models, usually requires doing sampling from joint distributions. The motivations of doing sampling have been discussed in my article “Motivations for Sampling in Statistical Inference”. Such sampling could be done via Markov chain Monte Carlo (MCMC) algorithms. Gibbs sampling is one of the common MCMC algorithms which generates a Markov chain of samples, each of which is correlated with nearby samples.

In this blog post, I would like to discuss the Gibbs sampling algorithm and give an example of applying Gibbs sampling for statistical inference.

### Algorithm

The goal of Gibbs sampling algorithm is to sample from joint distribution $P(X_1, X_2, \cdots, X_D)$.

The algorithm is simple in its form. The algorithm guarantees that the stationary distribution of the samples generated is the joint distribution $P(X_1, X_2, \cdots, X_D)$. Since the Gibbs sampling algorithm is a special case for the Metropolis-Hastings algorithm. The reader might check the proof derived in my article for Metropolis-Hastings algorithm.

\begin{algorithm} \caption{Gibbs Sampling Algorithm} \begin{algorithmic} \Procedure{GibbsSampling}{$n, b$} \State Initialize $x^{(0)} = \{x_1^{0}, x_2^{0}, \cdots, x_D^{0}\} \sim q(x)$ \For {iteration $i = 1, 2, \cdots, n + b$} \State $x_1^{(i)} \sim P(X_1 | X_2 = x_2^{(i-1)}, X_3 = x_3^{(i-1)}, X_4 = x_4^{(i-1)}, \cdots, X_D = x_D^{(i-1)})$ \State $x_2^{(i)} \sim P(X_2 | X_1 = x_1^{(i)}, X_3 = x_3^{(i-1)}, X_4 = x_4^{(i-1)}, \cdots, X_D = x_D^{(i-1)})$ \State $x_3^{(i)} \sim P(X_3 | X_1 = x_1^{(i)}, X_2 = x_2^{(i)}, X_4 = x_4^{(i-1)}, \cdots, X_D = x_D^{(i-1)})$ \State $\vdots$ \State $x_D^{(i)} \sim P(X_D | X_1 = x_1^{(i)}, X_2 = x_2^{(i)}, X_3 = x_3^{(i-1)}, \cdots, X_{D-1} = x_{D-1}^{(i-1)})$ \EndFor \Return $\{x^{(b)}, x^{(b+1)}, x^{(b+2)}, \cdots, x^{(n + b)}\}$ \EndProcedure \end{algorithmic} \end{algorithm}

Note that $q$ can be any distribution, $n$ is the number of samples we want to collect from the Gibbs sampling algorithm, $b$ is the number of burn-in samples generated before we formally collect samples. This is because the samples generated from early iterations are not from the target joint distribution and should be discarded.

### Example

#### Change-Point Model

Here we take the change-point model example from the Computational Cognition Cheat Sheets as an example. Suppose we observe a sequence of counts $\{x_1, x_2, \cdots, x_N\}$ where the average of the counts from step $1$ to $n$ has some value, and the average of the counts from step $n+1$ to $N$ has some different value. We are interested in knowing, given the sequence, statistically, what the most likely $n$ is.

We could model the observation using the following stochastic process.

\[\begin{align} n &\sim \text{Uniform}(1,2,\cdots,N) \\ \lambda_i &\sim \text{Gamma}(\alpha, \beta) \\ x_i &\sim \begin{cases} \text{Poisson}(\lambda_1) \quad 1 \leq i \leq n \\ \text{Poisson}(\lambda_2) \quad n+1 \leq i \leq N \\ \end{cases} \\ \end{align}\]Note that Poisson distribution is used for modeling the generation of counts and $\lambda_1, \lambda_2$ are modeled as the averages for the sequence of counts $x_{1:n}$ and $x_{n+1:N}$ because of the special property of Poisson distribution. The modeling of the prior distribution $P(n), P(\lambda_1), P(\lambda_2)$ are usually not very important. However, we would like to reasonably model it using some friendly distributions so that the derivation of posterior conditional distributions is not too difficult. $N$ is determined by the number of observations, $\alpha$ and $\beta$ are manually chosen constants.

Also keep in mind that that $n, \lambda_1, \lambda_2$ are latent variables that cannot be directly observed, and we want to derive $P(n | x_{1:N})$ and $\mathbb{E}[n]$.

#### Marginal Distribution

According to our model, it is straightforward to derive the posterior joint distribution $P(n, \lambda_1, \lambda_2 | x_{1:N})$. Our first idea is to compute the marginal distribution $P(n | x_{1:N})$ from the the posterior joint distribution $P(n, \lambda_1, \lambda_2 | x_{1:N})$ using the property of total probability. Concretely,

\[\begin{align} P(n | x_{1:N}) = \int_{\lambda_1} \int_{\lambda_2} P(n, \lambda_1, \lambda_2 | x_{1:N}) d \lambda_1 d \lambda_2 \end{align}\]It is not difficult to derive $P(n, \lambda_1, \lambda_2 | x_{1:N})$.

\[\begin{align} P(n, \lambda_1, \lambda_2 | x_{1:N}) &= \frac{P(n, \lambda_1, \lambda_2, x_{1:N})}{P(x_{1:N})} \\ &= \frac{P(n) P(\lambda_1) P(\lambda_2) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2)}{ \int_n \int_{\lambda_1} \int_{\lambda_2} P(n) P(\lambda_1) P(\lambda_2) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) dn d\lambda_1 d\lambda_2 } \\ \end{align}\] \[\begin{align} \log P(n, \lambda_1, \lambda_2 | x_{1:N}) &= \log P(n) + \log P(\lambda_1) + \log P(\lambda_2) + P(x_{1:n} | \lambda_1) + \log P(x_{n+1:N} | \lambda_2) \\ \end{align}\]We have to compute each term according to the distributions.

\[\begin{align} \text{Gamma}(\lambda; \alpha, \beta) &= \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha-1} \exp{(-\beta \lambda)} \\ &= \exp \bigg( \log \Big( \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha-1} \Big) -\beta \lambda \bigg) \\ &= \exp \bigg( \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log \lambda -\beta \lambda \bigg) \\ &= \exp \bigg( Z(\alpha, \beta) + (\alpha - 1) \log \lambda -\beta \lambda \bigg) \\ &\propto \exp \bigg( (\alpha - 1) \log \lambda -\beta \lambda \bigg) \end{align}\] \[\begin{align} \text{Poisson}(x; \lambda) &= \frac{ \lambda^{x} e^{-\lambda} }{x!} \\ &= \exp \bigg( \log \Big( \frac{ \lambda^{x} }{x!} \Big) -\lambda \bigg) \\ &= \exp \bigg( x \log \lambda - \log \big( x! \big) -\lambda \bigg) \\ &\propto \exp \bigg( x \log \lambda - \log \big( x! \big) \bigg) \\ \end{align}\] \[\begin{align} \log P(n) &= \log \Big( \frac{1}{N} \Big) \\ &= - \log N \\ \end{align}\] \[\begin{align} \log P(\lambda_1; \alpha, \beta) &= \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log \lambda_1 -\beta \lambda_1 \\ &= Z_1(\alpha, \beta) + (\alpha - 1) \log \lambda_1 -\beta \lambda_1 \\ &=^{+} (\alpha - 1) \log \lambda_1 -\beta \lambda_1 \\ \end{align}\] \[\begin{align} \log P(\lambda_2; \alpha, \beta) &= \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log \lambda_2 -\beta \lambda_2 \\ &= Z_2(\alpha, \beta) + (\alpha - 1) \log \lambda_2 -\beta \lambda_2 \\ &=^{+} (\alpha - 1) \log \lambda_2 -\beta \lambda_2 \\ \end{align}\] \[\begin{align} \log P(x_{1:n} | \lambda_1) &= \log \bigg( \prod_{i=1}^{n} P(x_i | \lambda_1) \bigg) \\ &= \sum_{i=1}^{n} \log P(x_i | \lambda_1) \\ &= \sum_{i=1}^{n} \Big( x_i \log \lambda_1 - \log \big( x_i! \big) -\lambda_1 \Big) \\ \end{align}\] \[\begin{align} \log P(x_{n+1:N} | \lambda_2) &= \log \bigg( \prod_{i=n+1}^{N} P(x_i | \lambda_2) \bigg) \\ &= \sum_{i=n+1}^{N} \log P(x_i | \lambda_2) \\ &= \sum_{i=n+1}^{N} \Big( x_i \log \lambda_2 - \log \big( x_i! \big) -\lambda_2 \Big) \\ \end{align}\]Therefore,

\[\begin{align} \log P(n, \lambda_1, \lambda_2 | x_{1:N}) &= \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log \lambda_1 -\beta \lambda_1 \\ &\quad + \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log \lambda_2 -\beta \lambda_2 \\ &\quad + \sum_{i=1}^{n} \Big( x_i \log \lambda_1 - \log \big( x_i! \big) -\lambda_1 \Big) \\ &\quad + \sum_{i=n+1}^{N} \Big( x_i \log \lambda_2 - \log \big( x_i! \big) -\lambda_2 \Big) \\ &\quad - \log N \\ \end{align}\]In fact, once we have derived $P(n, \lambda_1, \lambda_2 | x_{1:N})$, although it is still feasible to compute the integrals to further derive $P(n | x_{1:N})$ by integrals, the derivation becomes very complicated. Not to mention for more complex problems, the number of latent variables will be much larger.

Since deriving the close form distribution is difficult, let’s see if we can approximate the distribution by Gibbs sampling.

#### Gibbs Sampling

To do Gibbs sampling for this particular problem, we need three posterior conditional distributions, including $P(\lambda_1 | n, \lambda_2, x_{1:N})$, $P(\lambda_2 | n, \lambda_1, x_{1:N})$ and $P(n | \lambda_1, \lambda_2, x_{1:N})$.

Because $\lambda_1$ and $\lambda_2$ are independent from each other,

\[\begin{align} P(\lambda_1 | n, \lambda_2, x_{1:N}) &= P(\lambda_1 | n, \lambda_2, x_{1:n}, x_{n+1:N}) \\ &= P(\lambda_1 | n, x_{1:n}) \\ &= P(\lambda_1 | x_{1:n}) \\ &= \frac{P(\lambda_1, x_{1:n})}{P(x_{1:n})} \\ &= \frac{P(\lambda_1)P(x_{1:n} | \lambda_1)}{ \int_{}^{} P(\lambda_1)P(x_{1:n} | \lambda_1) d \lambda_1 } \\ &= \frac{ P(\lambda_1)P(x_{1:n} | \lambda_1)}{ Z(x_{1:n}) } \\ \end{align}\]where $Z(x_{1:n})$ is a normalization constant that is not dependent on $\lambda_1$.

Similarly,

\[\begin{align} P(\lambda_2 | n, \lambda_1, x_{1:N}) &= \frac{ P(\lambda_2)P(x_{n+1:N} | \lambda_2)}{ Z(x_{n+1:N}) } \\ \end{align}\]where $Z(x_{n+1:N})$ is a normalization constant that is not dependent on $\lambda_2$.

\[\begin{align} P(n | \lambda_1, \lambda_2, x_{1:N}) &= \frac{ P(n, \lambda_1, \lambda_2, x_{1:N}) } { P(\lambda_1, \lambda_2, x_{1:N}) } \\ &= \frac{ P(n) P(\lambda_1) P(\lambda_2) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) } { \int P(n) P(\lambda_1) P(\lambda_2) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) dn } \\ &= \frac{ P(n) P(\lambda_1) P(\lambda_2) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) } { P(\lambda_1) P(\lambda_2) \int P(n) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) dn } \\ &= \frac{ P(n) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) } { \int P(n) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) dn } \\ &= \frac{ P(n) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) } { Z(\lambda_1, \lambda_2, x_{1:N}) } \\ \end{align}\]where $Z(\lambda_1, \lambda_2, x_{1:N})$ is a normalization constant that is not dependent on $n$.

\[\begin{align} \log P(\lambda_1 | n, \lambda_2, x_{1:N}) &= \log \frac{ P(\lambda_1)P(x_{1:n} | \lambda_1)}{ Z(x_{1:n}) } \\ &= \log P(\lambda_1) + \log P(x_{1:n} | \lambda_1) - \log Z(x_{1:n}) \\ &= Z_1(\alpha, \beta) + (\alpha - 1) \log \lambda_1 -\beta \lambda_1 + \sum_{i=1}^{n} \Big( x_i \log \lambda_1 - \log \big( x_i! \big) -\lambda_1 \Big) - \log Z(x_{1:n}) \\ &= (\alpha - 1) \log \lambda_1 - \beta \lambda_1 + \Big( \sum_{i=1}^{n} x_i \Big) \log \lambda_1 - n \lambda_1 + Z(n, \lambda_2, x_{1:N}) \\ &= \bigg( \alpha - 1 + \sum_{i=1}^{n} x_i \bigg) \log \lambda_1 - (n + \beta ) \lambda_1 + Z(n, \lambda_2, x_{1:N}) \\ &=^+ \bigg( \alpha - 1 + \sum_{i=1}^{n} x_i \bigg) \log \lambda_1 - (n + \beta ) \lambda_1 \\ &= \log \text{Gamma}\bigg( \lambda_1; \alpha - 1 + \sum_{i=1}^{n} x_i, n + \beta \bigg) \\ \end{align}\]Similarly,

\[\begin{align} \log P(\lambda_2 | n, \lambda_1, x_{1:N}) &= \log \text{Gamma}\bigg( \lambda_2; \alpha - 1 + \sum_{i=n+1}^{N} x_i, N - n + \beta \bigg) \\ \end{align}\]This means that the posterior conditional distributions $P(\lambda_1 | n, \lambda_2, x_{1:N})$ and $P(\lambda_2 | n, \lambda_1, x_{1:N})$ are both gamma distributions.

This can also be understood from the perspective of conjugate priors. Gamma and Poisson distributions both belong to the exponential family and all members of the exponential family have conjugate priors. Although I don’t know the closed form of the likelihood distribution $P(x_{1:n} | \lambda_1)$, we could derive the posterior distribution step by step. The prior distribution $P(\lambda_1)$ follows gamma distribution, the likelihood distribution $P(x_1 | \lambda_1)$ follows Poisson distribution, then the posterior distribution $P(\lambda_1 | x_1)$ must follow gamma distribution. Now, $P(\lambda_1 | x_1)$ becomes the prior distribution, the likelihood distribution $P(x_2 | \lambda_1, x_1) = P(x_2 | \lambda_1)$ follows Poisson distribution, then the posterior distribution $P(\lambda_1 | x_1, x_2)$ must follow gamma distribution. We keep iterating this step and it is no surprise that $P(\lambda_1 | x_{1:n}) = P(\lambda_1 | n, \lambda_2, x_{1:N}) $ follows gamma distribution.

\[\begin{align} \log P(n | \lambda_1, \lambda_2, x_{1:N}) &= \log \frac{ P(n) P(x_{1:n} | \lambda_1) P(x_{n+1:N} | \lambda_2) } { Z(\lambda_1, \lambda_2, x_{1:N}) } \\ &= \log P(n) + \log P(x_{1:n} | \lambda_1) + P(x_{n+1:N} | \lambda_2) - \log Z(\lambda_1, \lambda_2, x_{1:N}) \\ &= - \log N + \sum_{i=1}^{n} \Big( x_i \log \lambda_1 - \log \big( x_i! \big) -\lambda_1 \Big) \\ &\quad + \sum_{i=n+1}^{N} \Big( x_i \log \lambda_2 - \log \big( x_i! \big) -\lambda_2 \Big) - \log Z(\lambda_1, \lambda_2, x_{1:N}) \\ &=^{+} \sum_{i=1}^{n} \Big( x_i \log \lambda_1 - \log \big( x_i! \big) -\lambda_1 \Big) + \sum_{i=n+1}^{N} \Big( x_i \log \lambda_2 - \log \big( x_i! \big) -\lambda_2 \Big) \\ &= \bigg( \Big( \sum_{i=1}^{n} x_i \Big) \log \lambda_1 - \Big( \sum_{i=1}^{n} \log \big( x_i! \big) \Big) - n\lambda_1 \bigg) \\ &\quad + \bigg( \Big( \sum_{i=n+1}^{N} x_i \Big) \log \lambda_2 - \Big( \sum_{i=n+1}^{N} \log \big( x_i! \big) \Big) - (N - n)\lambda_2 \bigg) \\ &= \Big( \sum_{i=1}^{n} x_i \Big) \log \lambda_1 + \Big( \sum_{i=n+1}^{N} x_i \Big) \log \lambda_2 - \Big( \sum_{i=1}^{N} \log \big( x_i! \big) \Big) \\ &\quad - n\lambda_1 - (N - n)\lambda_2 \\ &=^{+} \Big( \sum_{i=1}^{n} x_i \Big) \log \lambda_1 + \Big( \sum_{i=n+1}^{N} x_i \Big) \log \lambda_2 - n\lambda_1 - (N - n)\lambda_2 \\ \end{align}\]Therefore, $P(n | \lambda_1, \lambda_2, x_{1:N}) $ must be computed using the following form.

\[\begin{align} P(n | \lambda_1, \lambda_2, x_{1:N}) &= \frac{ \Big( \sum_{i=1}^{n} x_i \Big) \log \lambda_1 + \Big( \sum_{i=n+1}^{N} x_i \Big) \log \lambda_2 - n\lambda_1 - (N - n)\lambda_2 }{ \sum_{j=1}^{N} \bigg( \Big( \sum_{i=1}^{j} x_i \Big) \log \lambda_1 + \Big( \sum_{i=j+1}^{N} x_i \Big) \log \lambda_2 - j\lambda_1 - (N - j)\lambda_2 \bigg) } \end{align}\]Although it is not favorable, this time we have to compute the denominator of the probability distribution, sometimes being called as evidence, which is usually intractable with respect to the number of variables (see an example in variational inference). However, because we only have one discrete variable and the asymptotic complexity for computing the enumerator is $O(N)$, computing the denominator is still tractable with an asymptotic complexity of $O(N^2)$.

Sampling from the discrete distribution $P(n | \lambda_1, \lambda_2, x_{1:N})$ could be easily done by turning it to the multinomial distribution $\text{Mult}(k, P(n | \lambda_1, \lambda_2, x_{1:N}))$ and using the multinomial sampling functions from the existing libraries. Since the Gibbs sampling algorithm only samples one sample from the posterior distribution, the sampling size parameter $k$ for the multinomial distribution is just $1$. Alternatively, since we only want one sample, we could just directly sample from the discrete distribution $P(n | \lambda_1, \lambda_2, x_{1:N})$.

Taking one step back. In fact, the correct and a much simpler way to compute $\log P(\lambda_1 | n, \lambda_2, x_{1:N})$, $\log P(\lambda_2 | n, \lambda_1, x_{1:N})$, and $\log P(n | \lambda_1, \lambda_2, x_{1:N})$ is just to pick the related terms from $\log P(n, \lambda_1, \lambda_2 | x_{1:N})$.

Let’s see why by looking at $\log P(\lambda_1 | n, \lambda_2, x_{1:N})$ as an example.

\[\begin{align} \log P(\lambda_1 | n, \lambda_2, x_{1:N}) &= \log \frac{P(n, \lambda_1, \lambda_2 | x_{1:N})}{P(n, \lambda_2 | x_{1:N})} \\ &= \log \frac{P(n, \lambda_1, \lambda_2 | x_{1:N})}{\int_{\lambda_1} P(n, \lambda_1, \lambda_2 | x_{1:N}) d\lambda_1} \\ &= \log P(n, \lambda_1, \lambda_2 | x_{1:N}) - \log \int_{\lambda_1} P(n, \lambda_1, \lambda_2 | x_{1:N}) d\lambda_1 \\ &= \log P(n, \lambda_1, \lambda_2 | x_{1:N}) - Z(n, \lambda_2, x_{1:N}) \\ \end{align}\]Note that $P(\lambda_1 | n, \lambda_2, x_{1:N})$ is a probability distribution with respect to $\lambda_1$. Because $Z(n, \lambda_2, x_{1:N})$ does contain $\lambda_1$, all the $\lambda_1$ related terms from $\log P(n, \lambda_1, \lambda_2 | x_{1:N})$ must retain after subtraction.

\[\begin{align} \log P(\lambda_1 | n, \lambda_2, x_{1:N}) &= \log P(n, \lambda_1, \lambda_2 | x_{1:N}) - Z(n, \lambda_2, x_{1:N}) \\ &= \bigg( \alpha - 1 + \sum_{i=1}^{n} x_i \bigg) \log \lambda_1 - (n + \beta ) \lambda_1 \\ &\quad + Z^{\prime}(n, \lambda_2, x_{1:N}) + Z(n, \lambda_2, x_{1:N}) \\ &=^+ \bigg( \alpha - 1 + \sum_{i=1}^{n} x_i \bigg) \log \lambda_1 - (n + \beta ) \lambda_1 \\ &= \log \text{Gamma}\bigg( \lambda_1; \alpha - 1 + \sum_{i=1}^{n} x_i, n + \beta \bigg) \\ \end{align}\]So this means that if we know the exact formula of the joint distribution, **computing the conditional distributions does not require applying the brain-twisting Bayes’ theorem**. We just have to cherry-pick the related terms with respect to the variable in the conditional distributions as the enumerator of the probability function. This is very useful because sometimes we were only given a joint distribution and there is no additional information to apply Bayes’ theorem.

So far so good. We have derived the three posterior conditional distributions. The next step is to write a computer program to draw samples using the Gibbs sampling algorithm.

The Gibbs sampling Python implementation for the change-point model is revised from the Computational Cognition Cheat Sheets by brainlessly upgrading code from Python 2 to Python 3 and adjusting the figure settings. The code could download here.

We could see that the approximate distribution $P(n | x_{1:N})$ and $\mathbb{E} [n]$ are quite good. The reader could also play with the $\alpha$ and $\beta$ for the prior Gamma distribution. The sampling and approximate distribution are quite robust to these parameters.

### Conclusion

Gibbs sampling is very useful in practice. Given a joint distribution $P(X_1, X_2, X_3, \cdots, X_D)$, we need to find out all the conditional distributions, including $P(X_1 | X_2, X_3, \cdots, X_D)$, $P(X_2 | X_1, X_3, \cdots, X_D)$, $P(X_3 | X_1, X_2, \cdots, X_D)$, etc., which is usually trivial to do. Then we could sample using the Gibbs sampling algorithm.

### Caveats

Some people was wondering what the difference between $P(X_i | X_1 = x_1, \cdots, X_{i-1} = x_{i-1}, X_{i+1} = x_{i+1}, \cdots, X_D = x_D)$ and $P(X_1 = x_1, \cdots, X_{i-1} = x_{i-1}, X_i, X_{i+1} = x_{i+1}, \cdots, X_D = x_D)$ is.

Given a probability distribution $P(X_1, X_2, X_3, \cdots, X_D)$, in order to do Gibbs sampling, we have to derive the probability distribution $P(X_i | X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_D)$ for every $i \in [1,D]$. Some people thought it is tedious. Why can’t we just plugin the values $\{ x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_D \}$ to $P(X_1, X_2, X_3, \cdots, X_D)$ and directly sample $X_i$ from $P(X_1 = x_1, \cdots, X_{i-1} = x_{i-1}, X_i, X_{i+1} = x_{i+1}, \cdots, X_D = x_D)$.

The answer is that $P(X_1 = x_1, \cdots, X_{i-1} = x_{i-1}, X_i, X_{i+1} = x_{i+1}, \cdots, X_D = x_D)$ is not a probability distribution and it does not sum to $1$. We cannot sample from a non-distribution. Knowing the answer and looking back, we might find the question silly. But this is often confusing, especially for the beginners.