Motivations for Sampling in Statistical Inference
Introduction
Sampling is widely used to model distributions and estimate variables. For a common example, we randomly sample a large number of families to model the family income distribution and estimate the mean family income. Family income can be measured directly, so estimating its mean from the samples is not very interesting. What is more interesting is that a model of the family income distribution usually involves latent variables, or variables that depend on latent variables, which can hardly be observed directly from the samples, and we want to determine the most likely values, i.e., the expected values, of these variables. This process is sometimes called statistical inference.
The result of statistical inference could be further used for estimating some additional variables. Unfortunately, determining the expected values of these variables during statistical inference is difficult if the model is non-trivial.
In this blog post, I would like to discuss why determining the expected values of these variables is difficult and how to approximate them by sampling.
Statistical Inference
Expected Values
Suppose we have a multivariate joint distribution $P(X_1, X_2, \cdots, X_n)$ whose mathematical form is known. We will write $X = \{X_1, X_2, \cdots, X_n\}$ in this article for brevity. Sometimes, we would like to compute the expected value $\mathbb{E}_{P(X)} \big[f(X_1, X_2, \cdots, X_n)\big]$ for some function $f$. For example, we would like to compute the expected value of the mean of $X_1, X_2, \cdots, X_n$, i.e., $\mathbb{E}_{P(X)} \big[ \frac{1}{n} \sum_{i=1}^{n} X_i \big]$.
Using the linearity of expectation,
$$
\begin{align}
\mathbb{E}_{P(X)} \bigg[ \frac{1}{n} \sum_{i=1}^{n} X_i \bigg]
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{P(X)} \big[ X_i \big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \int P( X_i = x ) \, x \, dx \\
\end{align}
$$
By the law of total probability (marginalization),
$$
\begin{align}
P(X_i ) &= \int_{X_1} \int_{X_2} \cdots \int_{X_{i-1}} \int_{X_{i+1}} \cdots \int_{X_n} \\
& \quad P(X_1, X_2, \cdots, X_{i-1}, X_i, X_{i+1}, \cdots, X_n) \\
& \quad d X_1 d X_2 \cdots d X_{i-1} d X_{i+1} \cdots d X_n \\
&= \int_{X \setminus \{X_i\}} P(X_1, X_2, \cdots, X_{i-1}, X_i, X_{i+1}, \cdots, X_n) \prod_{j \neq i} d X_j
\end{align}
$$
Usually the form of $P(X_1, X_2, \cdots, X_n)$ is non-trivial, and this integral quickly becomes intractable to derive as the number of variables $n$ grows. The reader might check such an example in variational inference.
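As a rough illustration of the cost, consider a toy discrete joint distribution over $n$ variables with $k$ states each, stored as a dense NumPy table; the table alone has $k^n$ entries, so brute-force marginalization quickly becomes infeasible as $n$ grows. The shapes and values below are arbitrary choices for illustration.

```python
import numpy as np

# A toy discrete joint distribution over n variables, each with k states,
# stored as a dense table of shape (k, k, ..., k). The table alone has k^n
# entries, which is why brute-force marginalization does not scale.
n, k = 4, 3
rng = np.random.default_rng(0)
joint = rng.random((k,) * n)
joint /= joint.sum()  # normalize so the entries sum to 1

# Marginal P(X_i): sum the joint over every other axis.
i = 1
marginal = joint.sum(axis=tuple(j for j in range(n) if j != i))
print(marginal, marginal.sum())  # a length-k vector that sums to 1
```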
Now we are stuck. We cannot compute the expected value of the mean of $X_1, X_2, \cdots, X_n$, even though the joint distribution is known, because we cannot derive the probability distribution $P(X_i)$ for each single variable $X_i$.
Approximate By Sampling
Although we cannot derive the probability distribution $P(X_i)$, and therefore the expected values of the functions of variables that we are interested in, if we can draw multivariate samples from the distribution $P(X_1, X_2, \cdots, X_n)$, we can empirically determine the distribution $P(X_i)$ and its statistics, such as the expected value, from the samples.
Going back to the example of computing the expected value of the mean of the variables, if we can draw $N$ samples, $x^{(1)}, x^{(2)}, \cdots, x^{(N)}$ where $x^{(j)} = \{x_1^{(j)}, x_2^{(j)}, \cdots, x_n^{(j)}\}$, from $P(X_1, X_2, \cdots, X_n)$, then by the law of large numbers, the expected value of the mean of the variables can be estimated as the average, over the samples, of the mean of the variables in each sample. Concretely,
$$
\begin{align}
\mathbb{E}_{P(X)} \bigg[ \frac{1}{n} \sum_{i=1}^{n} X_i \bigg]
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{P(X)} \big[ X_i \big] \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} \\
&= \frac{1}{nN} \sum_{i=1}^{n} \sum_{j=1}^{N} x_i^{(j)} \\
\end{align}
$$
More generally, if we are interested in the expected value of a function $f(X)$ under the (posterior) distribution $P(X)$, given $N$ samples $x^{(1)}, x^{(2)}, \cdots, x^{(N)}$ drawn from $P(X)$, we could estimate the expected value as
$$
\begin{align}
\mathbb{E}_{P(X)} \big[ f(X) \big] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})
\end{align}
$$
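As a minimal sketch of this estimator, suppose the joint distribution happens to be a multivariate Gaussian that we can sample from directly (a stand-in for a more complicated joint); the particular mean, covariance, and function `f` below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in joint distribution P(X_1, ..., X_n): a 3-dimensional Gaussian.
mean = np.array([1.0, 2.0, 3.0])
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 2.0, 0.3],
                [0.2, 0.3, 1.5]])

N = 100000
samples = rng.multivariate_normal(mean, cov, size=N)  # shape (N, n)

# Monte Carlo estimate of E[ (1/n) * sum_i X_i ]: average over both axes,
# i.e., (1 / (n * N)) * sum_i sum_j x_i^(j).
print(samples.mean())  # close to (1 + 2 + 3) / 3 = 2

# The same recipe works for any f(X): average f over the samples.
f = lambda x: np.sum(x ** 2, axis=-1)  # an arbitrary f for illustration
print(f(samples).mean())  # approximates E[f(X)]
```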
Sampling Methods
Now the most important question becomes how to generate samples from a complex multivariate joint distribution. Without special methods, we could not even easily write a program to sample from a “simple” multivariate Gaussian distribution. Sampling methods such as Gibbs sampling and the Metropolis-Hastings algorithm perform a sequence of draws from conditional univariate distributions that approximates sampling from the multivariate joint distribution. I will write additional articles on these methods in the future.
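To give a flavor of such methods, below is a minimal Metropolis-Hastings sketch; it only assumes that the joint density can be evaluated up to a normalizing constant. The 2D target and the Gaussian random-walk proposal are arbitrary choices for illustration, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(0)

def unnormalized_density(x):
    # An arbitrary 2D target, known only up to a normalizing constant.
    return np.exp(-0.5 * (x[0] ** 2 + (x[1] - x[0] ** 2) ** 2))

def metropolis_hastings(num_samples, step_size=0.5):
    samples = np.zeros((num_samples, 2))
    x = np.zeros(2)  # arbitrary starting point
    for t in range(num_samples):
        # Propose a move with a symmetric Gaussian random walk.
        proposal = x + step_size * rng.standard_normal(2)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < unnormalized_density(proposal) / unnormalized_density(x):
            x = proposal
        samples[t] = x
    return samples

samples = metropolis_hastings(50000)
print(samples.mean(axis=0))  # Monte Carlo estimates of E[X_1], E[X_2]
```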
Conclusion
Sampling is critical for statistical inference, especially when working with a multivariate joint posterior distribution over latent variables. The samples could be used to estimate variables and to approximate joint or marginal distributions.
Quiz
How to sample from a univariate distribution using an existing uniform sampler?
A simple approach is to derive the cumulative distribution function (CDF) $F_X(x) = P(X \leq x) = y \in [0, 1]$ of the univariate distribution $P(X)$, if the CDF exists, together with its inverse $F_X^{-1}(Y)$. Then, with the uniform sampler $Y \sim U[ 0, 1 ]$, $F_X^{-1}(Y)$ has the same distribution as $P(X)$.
Proof
$$
\begin{align}
F_Y(y) &= P(Y \leq y) \\
&= P(F_X(X) \leq y) \\
&= P(X \leq F_X^{-1}(y)) \\
&= F_X(F_X^{-1}(y)) \\
&= y
\end{align}
$$
$$
\begin{align}
P(Y = y) &= \frac{d F_Y}{d y}(y) \\
&= 1 \\
\end{align}
$$
Therefore, for any distribution $P(X)$ with CDF $F_X$, $Y = F_X(X) \sim U[0, 1]$. Since the inverse function $F_X^{-1}$ maps the variable $Y$ back to $X$, $F_X^{-1}(Y)$ has the same distribution as $X$.
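As a quick sketch of this inverse-CDF recipe, take the exponential distribution with rate $\lambda$: its CDF is $F_X(x) = 1 - e^{-\lambda x}$, so $F_X^{-1}(y) = -\ln(1 - y) / \lambda$, and pushing uniform samples through this inverse reproduces the exponential distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0  # rate of the exponential distribution

# Inverse transform sampling: push uniform samples through the inverse CDF.
u = rng.random(100000)      # Y ~ Uniform[0, 1]
x = -np.log(1.0 - u) / lam  # X = F^{-1}(Y) ~ Exponential(lam)

print(x.mean())                                   # close to 1 / lam = 0.5
print(rng.exponential(1.0 / lam, 100000).mean())  # reference sampler
```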
Why is sampling not very difficult during inference for some modern deep models?
Many models, such as image semantic segmentation models, maximize $P(Y_1, Y_2, \cdots, Y_n | X_1, X_2, \cdots, X_m) = \prod_{i=1}^{n} P(Y_i | X_1, X_2, \cdots, X_m)$ during training, and during inference sampling from each of the univariate distributions $P(Y_i | X_1, X_2, \cdots, X_m)$ is relatively simple. Some other models, such as recurrent neural networks for language modeling, maximize $P(X_1, X_2, \cdots, X_n) = P(X_1) P(X_2 | X_1) \cdots P(X_n | X_1, X_2, \cdots, X_{n-1})$ during training, and during inference sampling from each of the univariate distributions $P(X_i | X_1, X_2, \cdots, X_{i-1})$ is relatively simple.
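As a toy sketch of the autoregressive case, assume a hypothetical `next_token_probs(prefix)` that plays the role of the trained model and returns $P(X_i | X_1, X_2, \cdots, X_{i-1})$ over a small vocabulary; sampling a sequence then reduces to a chain of simple categorical draws.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5

def next_token_probs(prefix):
    # Hypothetical stand-in for a trained model (e.g., an RNN) that maps a
    # prefix to P(X_i | X_1, ..., X_{i-1}) over the vocabulary.
    logits = np.cos(np.arange(vocab_size) + len(prefix))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def sample_sequence(length):
    # Each step is a simple draw from a univariate categorical distribution.
    prefix = []
    for _ in range(length):
        probs = next_token_probs(prefix)
        prefix.append(rng.choice(vocab_size, p=probs))
    return prefix

print(sample_sequence(10))
```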