Lei Mao

Introduction

In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given random-sample-based statistic. Knowing the sampling distribution is what allows us to quantify the uncertainty of an estimate. For example, given a population whose mean $\mu$ is unknown but whose standard deviation $\sigma$ is known, and a sample of size $n$ from the population, what is the $95\%$ confidence interval for the population mean $\mu$?


This question can be addressed if the sample size $n$ is sufficiently large. According to the central limit theorem, as the sample size $n$ grows, the sampling distribution of the sample mean $\overline{X}$ approaches the normal distribution $\mathcal{N}(\mu, \frac{\sigma^2}{n})$, regardless of the shape of the population distribution, provided its variance is finite. With this sampling distribution, since $\overline{X} - \mu \sim \mathcal{N}(0, \frac{\sigma^2}{n})$ approximately, it is simple to compute the $95\%$ confidence interval for the population mean $\mu$.
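As a worked version of that computation, assuming the normal approximation holds and using $1.96$ as the approximate $97.5$th percentile of the standard normal distribution (a constant introduced here, not in the text above), the interval follows from

\[\begin{gather} P\left( -1.96 \frac{\sigma}{\sqrt{n}} \leq \overline{X} - \mu \leq 1.96 \frac{\sigma}{\sqrt{n}} \right) \approx 0.95 \\ \mu \in \left[ \overline{X} - 1.96 \frac{\sigma}{\sqrt{n}}, \overline{X} + 1.96 \frac{\sigma}{\sqrt{n}} \right] \end{gather}\]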


In practice, we usually do not know whether the sample size $n$ is sufficiently large. In that case we cannot rely on the central limit theorem to derive the sampling distribution of the sample mean, so the sampling distribution remains unknown. If we happen to know the population distribution, it is possible to derive or simulate the sampling distributions. The sampling distributions for some well-known population distributions, such as the normal distribution, can be found on Wikipedia. However, in practice, we often do not know what the population distribution is either. How, then, should we obtain the sampling distribution when we have so little information? Bootstrap is a statistical method that builds a bootstrap distribution from the original sample by random sampling with replacement. The bootstrap distribution approximates the sampling distribution and can be used to derive useful statistical conclusions.


In this blog post, I would like to discuss Bootstrap methods with some concrete examples.

Bootstrap Methods

Nonparametric (Resampling) Bootstrap

The vanilla bootstrap method, also called the nonparametric bootstrap, is very simple. Conceptually, it treats the original sample as the population, and we draw new bootstrap samples from it. Because we are drawing from the original sample, we can in principle draw as many new bootstrap samples as we want. The statistics computed from each of the new bootstrap samples form a distribution. This distribution is called the “bootstrap distribution”, and it approximates the sampling distribution. Note that the original sample stands in for a population that is usually much larger than the sample itself, so when we draw from the original sample, we have to sample with replacement.


For example, suppose we have a sample of size $5$, $\{8, 1, 0, 2, 5\}$, and we want to get some idea of the sampling distribution of the sample mean. We draw $n$ bootstrap samples of size $5$ with replacement from the original sample (from here on, $n$ denotes the number of bootstrap samples rather than the sample size), such as

\[\begin{gather} X_1 = \{1, 0, 2, 5, 1\} \\ X_2 = \{8, 8, 1, 0, 1\} \\ X_3 = \{8, 2, 1, 5, 0\} \\ X_4 = \{1, 2, 1, 5, 1\} \\ \vdots \\ X_n = \{1, 0, 0, 8, 5\} \\ \end{gather}\]

and compute the sample statistic, the sample mean in this case, for each of the bootstrap samples. So we have

\[\begin{gather} \overline{X}_1 = 1.8 \\ \overline{X}_2 = 3.6 \\ \overline{X}_3 = 3.2 \\ \overline{X}_4 = 2.0 \\ \vdots \\ \overline{X}_n = 2.8 \\ \end{gather}\]

The sample statistics can be plotted as a histogram to represent the bootstrap distribution, which in turn approximates the sampling distribution.


With the bootstrap distribution, we could estimate the $95\%$ confidence interval for the population mean by dropping $2.5\%$ of the area from each tail of the bootstrap distribution and keeping the range that remains.


Note that the number of bootstrap samples $n$ is usually very large so that the bootstrap distribution is representative.
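The procedure above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the helper names `bootstrap_means` and `percentile_interval`, the choice of $10000$ bootstrap samples, and the fixed random seed are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_means(sample, num_resamples, rng):
    """Return the mean of each bootstrap resample drawn with replacement."""
    size = len(sample)
    return [
        statistics.fmean(rng.choices(sample, k=size))
        for _ in range(num_resamples)
    ]


def percentile_interval(values, confidence=0.95):
    """Drop (1 - confidence) / 2 of the values from each tail."""
    ordered = sorted(values)
    tail = (1.0 - confidence) / 2.0
    lower_index = int(tail * len(ordered))
    upper_index = int((1.0 - tail) * len(ordered)) - 1
    return ordered[lower_index], ordered[upper_index]


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
sample = [8, 1, 0, 2, 5]
means = bootstrap_means(sample, num_resamples=10_000, rng=rng)
lower, upper = percentile_interval(means)
print(f"mean of bootstrap means: {statistics.fmean(means):.2f}")
print(f"95% percentile interval: [{lower:.2f}, {upper:.2f}]")
```

With a sample this small the interval is wide, which is itself a useful reminder of how little five observations can tell us.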

Semiparametric Bootstrap

The semiparametric bootstrap is similar to the nonparametric bootstrap. The difference is that the semiparametric bootstrap assumes the population also contains items that are similar, but not identical, to the observed ones. It therefore samples from a smoothed version of the sample histogram, which amounts to adding noise to the bootstrap samples so that they are unlikely to be identical to the original sample. The smoothing parameters, however, have to be set empirically.


Taking the sample of size $5$, $\{8, 1, 0, 2, 5\}$, as an example, we could plot the sample as a histogram, interpolate it to obtain a finer histogram, and draw bootstrap samples from the finer histogram. Obviously, this might not be a good idea in this case because our sample size is just too small.
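One common way to realize the "add noise" idea is to resample with replacement and then jitter each draw with Gaussian noise. The sketch below takes that route instead of interpolating a histogram, so it is a swapped-in technique rather than the exact procedure described above. The helper name `smoothed_bootstrap_sample` and the noise scale `bandwidth = 1.0` are assumptions picked by hand for illustration.

```python
import random
import statistics


def smoothed_bootstrap_sample(sample, bandwidth, rng):
    """Resample with replacement, then jitter each draw with Gaussian noise."""
    draws = rng.choices(sample, k=len(sample))
    return [x + rng.gauss(0.0, bandwidth) for x in draws]


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
sample = [8, 1, 0, 2, 5]
bandwidth = 1.0  # assumed smoothing parameter, chosen by hand

means = [
    statistics.fmean(smoothed_bootstrap_sample(sample, bandwidth, rng))
    for _ in range(10_000)
]
print(f"mean of smoothed bootstrap means: {statistics.fmean(means):.2f}")
print(f"std of smoothed bootstrap means:  {statistics.stdev(means):.2f}")
```

A larger `bandwidth` spreads the bootstrap distribution further, which is why the smoothing parameter has to be chosen with care.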

Parametric Bootstrap

Parametric bootstrapping assumes that the samples come from a known family of distributions with unknown parameters. Even with a small sample, we can estimate the parameters of that distribution. We then draw bootstrap samples from the distribution with the estimated parameters.


Take the sample of size $5$, $\{8, 1, 0, 2, 5\}$, as an example. If we believe the samples come from a uniform distribution $U[a, b]$, we estimate $a = 0$ and $b = 8$ from the sample minimum and maximum. Then we draw bootstrap samples from the uniform distribution $U[0, 8]$. Obviously, this might not be a good idea in this case because our sample size is just too small.
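The uniform example can be sketched as follows, again using only the Python standard library. The helper names `fit_uniform` and `parametric_bootstrap_means`, the number of bootstrap samples, and the seed are assumptions made for this illustration.

```python
import random
import statistics


def fit_uniform(sample):
    """Estimate (a, b) of U[a, b] by the sample minimum and maximum."""
    return min(sample), max(sample)


def parametric_bootstrap_means(sample, num_resamples, rng):
    """Draw resamples from the fitted U[a, b] and return their means."""
    a, b = fit_uniform(sample)
    size = len(sample)
    return [
        statistics.fmean([rng.uniform(a, b) for _ in range(size)])
        for _ in range(num_resamples)
    ]


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
sample = [8, 1, 0, 2, 5]
means = parametric_bootstrap_means(sample, num_resamples=10_000, rng=rng)
print(f"fitted parameters: a, b = {fit_uniform(sample)}")
print(f"mean of parametric bootstrap means: {statistics.fmean(means):.2f}")
```

Note that the bootstrap means now center on the midpoint of the fitted distribution, $(a + b) / 2 = 4$, rather than on the original sample mean $3.2$, so the result depends heavily on whether the assumed distribution is right.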

Bootstrap Methods in Statistical Inference and Machine Learning

If there is no evaluation set or dataset split, we simply draw $n$ bootstrap samples as usual from the original sample and perform statistical inference or machine learning on each of them, resulting in $n$ models. For statistical inference, we obtain a distribution for each inferred statistic. For machine learning, we obtain an ensemble model consisting of $n$ models trained on the bootstrap samples. This machine learning approach is called bagging or bootstrap aggregating. When the base model is a decision tree and each split additionally considers only a random subset of the features, the ensemble is called a random forest.
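To make both uses concrete, here is a minimal sketch that fits one least-squares line per bootstrap sample. The spread of the fitted slopes plays the role of the inferred distribution, and the averaged prediction plays the role of the bagged ensemble. The data is synthetic and made up for this example, and the helper names `fit_line`, `bagged_fit`, and `bagged_predict` are hypothetical.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: every x is identical
        return 0.0, y_mean
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def bagged_fit(points, num_models, rng):
    """Fit one model per bootstrap resample of the training points."""
    return [
        fit_line(rng.choices(points, k=len(points)))
        for _ in range(num_models)
    ]


def bagged_predict(models, x):
    """Average the predictions of all models in the ensemble."""
    return statistics.fmean([slope * x + intercept for slope, intercept in models])


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
# Synthetic data, made up for illustration: y = 2x + 1 plus Gaussian noise.
points = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)) for x in range(20)]
models = bagged_fit(points, num_models=200, rng=rng)
slopes = [slope for slope, _ in models]
print(f"bagged prediction at x = 10: {bagged_predict(models, 10.0):.2f}")
print(f"slope spread across models:  {statistics.stdev(slopes):.3f}")
```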


Sometimes we have to split the sample into multiple parts, such as a training set and a test set. The training set is used for training the model or for statistical inference, and the test set is used for evaluating the result. To apply bootstrap sampling in this scenario, we randomly split the original sample into a training set and a test set, draw bootstrap samples from the training set and the test set separately, and use the bootstrap training set and the bootstrap test set for training and evaluation, respectively. This completes one round of bootstrap. We then repeat the same process $n$ times, with a fresh random split each time. This method, however, seems to be out of favor: scikit-learn deprecated, and later removed, its Bootstrap cross-validation iterator.
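One round of this split-then-resample procedure might look like the sketch below. It is an illustration of the description above rather than a reproduction of scikit-learn's former implementation. The helper name `split_then_bootstrap`, the test fraction of $0.3$, and the placeholder data are assumptions.

```python
import random


def split_then_bootstrap(sample, test_fraction, rng):
    """One round: random split, then resample each part with replacement."""
    shuffled = list(sample)
    rng.shuffle(shuffled)
    num_test = max(1, int(round(test_fraction * len(shuffled))))
    test, train = shuffled[:num_test], shuffled[num_test:]
    boot_train = rng.choices(train, k=len(train))
    boot_test = rng.choices(test, k=len(test))
    return boot_train, boot_test


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
sample = list(range(10))  # placeholder data points, made up for illustration
for round_index in range(3):
    boot_train, boot_test = split_then_bootstrap(sample, 0.3, rng)
    print(round_index, sorted(boot_train), sorted(boot_test))
```

Because the split happens before resampling, no original data point can appear in both the bootstrap training set and the bootstrap test set of the same round.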


Instead of using bootstrap for training and evaluation, we usually use $K$-fold cross validation.
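For comparison, a $K$-fold split can be sketched as follows. Each data point lands in exactly one test fold, and no point is duplicated within a round, which is the main contrast with bootstrap resampling. The helper name `k_fold_indices` and the sizes used are assumptions for this example.

```python
import random


def k_fold_indices(num_items, num_folds, rng):
    """Yield (train_indices, test_indices) for each of the K folds."""
    indices = list(range(num_items))
    rng.shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
for train, test in k_fold_indices(num_items=10, num_folds=5, rng=rng):
    print(sorted(train), sorted(test))
```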

Conclusions

Bootstrap methods essentially simulate new samples from limited information. However, personally I am not a fan of bootstrap because I have not seen theories addressing how well the bootstrap distribution approximates the sampling distribution and under what scenarios bootstrap should or should not be used. Unless the sample size is very small and I have no better options, I would not consider using bootstrap methods.

References