
Introduction

Standard deep learning learns only a single model that best fits the training data, rather than the model ensembles that Bayesian learning learns. When it comes to prediction, given an input, the model could only predict one output value. Therefore, standard deep learning models for regression and classification do not capture model uncertainty. In classification, predictive probabilities obtained at the end of the pipeline, such as the softmax output, are often erroneously interpreted as model confidence.


A standard deep learning model only learns $y = f(x)$, whereas a Bayesian model learns $p(\mathbf{y}|x)$, where $\mathbf{y}$ is a univariate or multivariate random variable. Statistically, the value $y$ predicted by the standard deep learning model can be viewed as a sample from the distribution $p(\mathbf{y}|x)$ that the Bayesian model predicts, i.e., $y \sim p(\mathbf{y}|x)$. In terms of regression, $y$ is usually a scalar value, and we could evaluate the prediction uncertainty or confidence using metrics such as variance or standard deviation. In terms of classification, $y$ is usually an array that sums to $1.0$, and we could evaluate the prediction uncertainty or confidence using metrics such as Shannon entropy. The Shannon entropy for the standard deep learning classification model is $H(y)$, whereas the Shannon entropy for the Bayesian classification model is usually computed as $H(\mathbb{E}(\mathbf{y}))$.
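
As a minimal sketch of the two classification uncertainty metrics, the snippet below computes $H(y)$ for a single deterministic prediction and $H(\mathbb{E}(\mathbf{y}))$ for a set of sampled predictive distributions. The probability arrays are made up for illustration and do not come from any particular model.

```python
import numpy as np

def shannon_entropy(p, base=2):
    # H(p) = -sum_i p_i * log_b(p_i)
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p)) / np.log(base)

# Standard deep learning model: a single softmax output y for an input x.
y = np.array([0.95, 0.03, 0.02])
print(shannon_entropy(y))  # H(y)

# Bayesian model: multiple sampled predictive distributions y ~ p(y|x),
# e.g., from Monte Carlo dropout or an ensemble (made-up numbers here).
y_samples = np.array([
    [0.95, 0.03, 0.02],
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
])
y_mean = y_samples.mean(axis=0)  # E(y)
print(shannon_entropy(y_mean))   # H(E(y))
```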


In this blog post, I would like to systematically discuss the deterministic overconfidence issues of uncertainty quantification in standard deep learning.

Prerequisites

Multivariate Jensen’s Inequality

In my previous blog post on “Multivariate Jensen’s Inequality”, we proved Jensen’s inequality for the multivariate case.


If $\mathbf{X} \in \mathbb{R}^n$ is a multivariate random variable and $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a convex function, then

\[f(\mathbb{E}[\mathbf{X}]) \leq \mathbb{E}[f(\mathbf{X})]\]

Similarly, if $f$ is a concave function, then

\[f(\mathbb{E}[\mathbf{X}]) \geq \mathbb{E}[f(\mathbf{X})]\]
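
As a quick numerical sanity check (not part of the proof), we could verify the concave case of the inequality on random samples with an arbitrary concave function, here $f(\mathbf{x}) = -\lVert \mathbf{x} \rVert^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of a multivariate random variable X in R^3.
x_samples = rng.normal(loc=1.0, scale=2.0, size=(100000, 3))

# A concave function f: R^3 -> R, here f(x) = -||x||^2.
def f(x):
    return -np.sum(x * x, axis=-1)

lhs = f(x_samples.mean(axis=0))  # f(E[X])
rhs = f(x_samples).mean()        # E[f(X)]
print(lhs >= rhs)                # True for a concave f
```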

Shannon Entropy

The discrete case of Shannon entropy is defined as follows.

\[H(p) = - \sum_{i=1}^{n} p(x_i) \log_b p(x_i)\]

where $n$ is the number of discrete states, $p(x_i)$ is the probability of the system being in state $i$, and $\sum_{i=1}^{n} p(x_i) = 1$.


Shannon entropy could be used for measuring the uncertainty of a system, such as a machine learning classifier model.


For example, suppose a binary classifier predicts an input $x$ to be class $y^{+}$ or $y^{-}$ with probabilities $p = (0.999, 0.001)$. The Shannon entropy $H(p) \approx 0$, meaning that there is almost no uncertainty in the prediction and the system is almost 100% sure about the predicted class. On the other hand, when $p = (0.500, 0.500)$ and $b = 2$, the Shannon entropy $H(p) = 1.0$, which is the largest possible entropy for two classes when using $b = 2$. This means that the uncertainty is the largest and the system is completely unsure about the prediction.
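
A tiny base-2 computation, using the definition above, reproduces the two cases.

```python
import numpy as np

def shannon_entropy(p, base=2):
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p)) / np.log(base)

print(shannon_entropy([0.999, 0.001]))  # ~0.011, almost no uncertainty
print(shannon_entropy([0.500, 0.500]))  # 1.0, maximum entropy for two classes
```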


Shannon entropy is strictly concave with respect to $p$. Here we provide a quick proof.


Proof


We define the space of valid probability vectors $p$ as

\[P = \{(p_1, p_2, \cdots, p_n): 0 < p_i < 1, \sum_{i=1}^{n} p_i = 1 \}\]

Given $p \in P$ and a real vector $q = (q_1, q_2, \cdots, q_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^{n} q_i = 0$ and $q \neq \mathbf{0}$, there must exist a small interval $\lambda \in [u, v]$, with $u < 0 < v$, in which $p + \lambda q \in P$. Then we have

\[H(p + \lambda q) = - \sum_{i=1}^{n} (p_i + \lambda q_i) \log_b (p_i + \lambda q_i)\]

The derivatives with respect to $\lambda$ are

\[\begin{align} \frac{d H}{d \lambda} &= - \sum_{i=1}^{n} \Big[ q_i \log_b (p_i + \lambda q_i) + (p_i + \lambda q_i) \frac{q_i}{(p_i + \lambda q_i) \ln b} \Big] \\ &= - \sum_{i=1}^{n} \Big[ q_i \log_b (p_i + \lambda q_i) + \frac{q_i}{\ln b} \Big] \\ &= - \sum_{i=1}^{n} q_i \log_b (p_i + \lambda q_i) - \frac{1}{\ln b} \sum_{i=1}^{n} q_i \\ &= - \sum_{i=1}^{n} q_i \log_b (p_i + \lambda q_i) \\ \end{align}\]

where the last step uses $\sum_{i=1}^{n} q_i = 0$.

\[\begin{align} \frac{d^2 H}{d \lambda^2} &= - \sum_{i=1}^{n} \Big[ q_i \frac{q_i}{(p_i + \lambda q_i) \ln b} \Big] \\ &= - \frac{1}{\ln b} \sum_{i=1}^{n} \frac{q_i^2}{p_i + \lambda q_i} \\ \end{align}\]

Because $p + \lambda q \in P$, we have $0 < p_i + \lambda q_i < 1$ for every $i \in \{1, \cdots, n\}$. Together with $q \neq \mathbf{0}$ and $\ln b > 0$ for $b > 1$, this gives

\[\begin{align} \frac{d^2 H}{d \lambda^2} < 0 \end{align}\]

This concludes the proof that Shannon entropy is strictly concave with respect to $p$.
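
As a quick numerical sanity check of the sign of the second derivative, a finite-difference estimate along the line $p + \lambda q$ is indeed negative. The particular $p$ and $q$ below are made up for illustration.

```python
import numpy as np

def shannon_entropy(p, base=2):
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p)) / np.log(base)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, -0.05, -0.05])  # sums to 0, so p + lambda * q stays in P
h = 1e-3

# Central finite-difference estimate of d^2 H / d lambda^2 at lambda = 0.
d2 = (shannon_entropy(p + h * q) - 2 * shannon_entropy(p) + shannon_entropy(p - h * q)) / h**2
print(d2)  # ~-0.06, negative, consistent with strict concavity
```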

Deterministic Overconfidence

Deterministic overconfidence states that for classification models, we have

\[\mathbb{E}(H(\mathbf{y})) \leq H(\mathbb{E}(\mathbf{y}))\]

In layman’s terms, this means that the Shannon entropy of a single prediction from the standard deep learning model is, on average, no larger than the Shannon entropy computed from the mean prediction of the Bayesian model ensemble. That is, the standard deep learning model tends to report less uncertainty than the Bayesian model ensemble does.


A concrete example is the Shannon entropy computed from the softmax probabilities shown in Yarin Gal’s paper “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”.

Binary Cross Entropy Deterministic Overconfidence

We are looking at figure (b) in particular. If the input to the softmax function is the logits from the standard deep learning model, the output $y$ will most likely be close to the expected value $\mathbb{E}(\mathbf{y}) \approx 1.0$. Therefore, the Shannon entropy $H(y) \approx 0$ and $\mathbb{E}(H(\mathbf{y})) \approx 0.0$, meaning that the model is almost 100% sure about the predicted class most of the time, even though the model has never seen the input $x$ and has to extrapolate in order to predict. This suggests that applying this kind of uncertainty quantification to the standard deep learning model is erroneous.


Let’s further take a look at the Shannon entropy of the Bayesian model predictions. The Bayesian model predicts the distribution $\mathbf{y}$, and we could compute the Shannon entropy of its expected value, $H(\mathbb{E}(\mathbf{y}))$. We could see from the figure that many samples $y \sim p(\mathbf{y}|x)$ are away from $1.0$; therefore $\mathbb{E}(\mathbf{y})$ should be smaller than $1.0$, and $H(\mathbb{E}(\mathbf{y}))$ should be larger than $0$.


This analysis matches our statement of deterministic overconfidence that

\[\mathbb{E}(H(\mathbf{y})) \leq H(\mathbb{E}(\mathbf{y}))\]
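
As an illustration of the inequality, both sides can be estimated from sampled predictive distributions. The binary predictions below are synthetic sigmoid outputs standing in for, e.g., Monte Carlo dropout samples; they are not from an actual trained model.

```python
import numpy as np

def shannon_entropy(p, base=2):
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p), axis=-1) / np.log(base)

rng = np.random.default_rng(42)

# Synthetic ensemble predictions y ~ p(y|x) for a binary classifier.
logits = rng.normal(loc=3.0, scale=2.0, size=1000)
y_pos = 1.0 / (1.0 + np.exp(-logits))                # sigmoid outputs
y_samples = np.stack([y_pos, 1.0 - y_pos], axis=-1)  # shape (1000, 2)

expected_entropy = shannon_entropy(y_samples).mean()       # E(H(y))
entropy_of_mean = shannon_entropy(y_samples.mean(axis=0))  # H(E(y))
print(expected_entropy <= entropy_of_mean)  # True
```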

Proof to Deterministic Overconfidence

The proof of deterministic overconfidence is extremely simple, given the prerequisites shown earlier in the post.


Because Shannon entropy $H$ is strictly concave, even in the multivariate case, applying the multivariate Jensen’s inequality gives

\[\mathbb{E}(H(\mathbf{y})) \leq H(\mathbb{E}(\mathbf{y}))\]

This concludes the proof.

Extensions

We have discussed deterministic overconfidence for classification models. How about regression models? The short answer is that deterministic overconfidence also exists for regression models. If the uncertainty is measured using variance, then, without showing the proof formally, variance is also a concave function of the predictive distribution. By applying the multivariate Jensen’s inequality as we did for Shannon entropy, we could reach the same conclusion for regression models.
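
A minimal numerical sketch of the regression case is shown below via the law of total variance, which expresses the same Jensen-type gap: the average per-model variance never exceeds the variance of the pooled ensemble. The per-model Gaussian predictions are synthetic, not from a trained regressor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ensemble for regression: each "model" predicts a Gaussian
# with its own mean and standard deviation for the same input x.
n_models, n_draws = 50, 10000
means = rng.normal(loc=0.0, scale=1.0, size=n_models)  # disagreement across models
stds = rng.uniform(low=0.5, high=1.0, size=n_models)   # per-model predictive std

draws = rng.normal(loc=means, scale=stds, size=(n_draws, n_models))

avg_per_model_variance = draws.var(axis=0).mean()  # E[Var(y | model)]
total_variance = draws.reshape(-1).var()           # Var(y) over the pooled ensemble
print(avg_per_model_variance <= total_variance)    # True
```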

Caveats

The proof of deterministic overconfidence from the AWS Prescriptive Guidance is incorrect. In case AWS changes the web content, the PDF version of the document can be found here, in the Appendix A section.


There are two major issues in their argument and proof.


The first issue is that deterministic overconfidence is not restricted to using softmax; it also applies to other activation functions.


The major mistake they made is that the random variable $\mathbf{u}$ is defined on $(-\infty, +\infty)$, and it cannot be separated into two regions with Jensen’s inequality applied to each region separately.

References