Bounding Box Encoding and Decoding in Object Detection

Introduction

In modern object detection models, there is usually an object classifier and a bounding box regressor. The bounding box usually consists of four parameters. Intuitively, they could be the center coordinates, width, and height of the bounding box. I remember that in my very first object detection program, a digit localization task in 2015, I used this kind of naive bounding box and it worked reasonably well. Nowadays, the bounding box still consists of four parameters, but these four parameters are usually encoded. Some encoding methods are obscure and not well discussed in research papers. In this blog post, we are going to look at some of these methods and talk about the motivations behind them.

Bounding Box Regression

Most recent object detection models have the concept of anchor boxes, also called prior boxes, which are pre-defined fixed-size bounding boxes on the image input or feature map. The bounding box regressor, instead of predicting the bounding box location on the image, predicts the offset of the ground-truth/predicted bounding box to the anchor box. For example, if the anchor box representation is [0.2, 0.5, 0.1, 0.2], and the representation of the ground-truth box corresponding to the anchor box is [0.25, 0.55, 0.08, 0.25], the prediction target, which is the offset, would be [0.05, 0.05, -0.02, 0.05]. The bounding box regressor learns to predict this offset. Given the prediction and the corresponding anchor box representation, one could easily calculate back the predicted bounding box representation. This step is also called decoding.
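As a toy illustration, here is a minimal Python sketch of this naive offset arithmetic; the variable names are mine and not taken from any particular implementation. The encodings discussed below refine this idea.

```python
# Boxes in centroids format [cx, cy, w, h]; values from the example above.
anchor = [0.20, 0.50, 0.10, 0.20]
ground_truth = [0.25, 0.55, 0.08, 0.25]

# Encoding: the regression target is the element-wise offset.
offset = [g - a for g, a in zip(ground_truth, anchor)]
print(offset)  # [0.05, 0.05, -0.02, 0.05], up to floating-point error

# Decoding: recover the predicted box from the offset and the anchor.
decoded = [a + o for a, o in zip(anchor, offset)]
print(decoded)  # [0.25, 0.55, 0.08, 0.25]
```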

Bounding Box Representation

The bounding box could be represented in many ways. The most intuitive ones are as follows.

Centroids Representation

A bounding box could be represented as [$x$, $y$, $w$, $h$], where $x$ and $y$ are the coordinates of the bounding box centroid, and $w$ and $h$ are the width and height of the bounding box.

Corners Representation

A bounding box could also be represented as [$x_{\text{min}}$, $y_{\text{min}}$, $x_{\text{max}}$, $y_{\text{max}}$], where $x_{\text{min}}$ and $y_{\text{min}}$ are the coordinates of the bounding box top-left corner, and $x_{\text{max}}$ and $y_{\text{max}}$ are the coordinates of the bounding box bottom-right corner, assuming the common image coordinate convention in which the origin is at the top-left and the $y$ axis points down.

MinMax Representation

Similar to the corners representation, a bounding box could also be represented as [$x_{\text{min}}$, $x_{\text{max}}$, $y_{\text{min}}$, $y_{\text{max}}$], where $x_{\text{min}}$ and $x_{\text{max}}$ are the minimum and maximum of the $x$ coordinates, and $y_{\text{min}}$ and $y_{\text{max}}$ are the minimum and maximum of the $y$ coordinates. It is almost identical to the corners representation; only the ordering of the parameters differs.
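Since these three representations carry the same information, converting between them is straightforward. Below is a small sketch of the conversions, assuming NumPy; the function names are hypothetical.

```python
import numpy as np

def centroids_to_corners(box):
    """[cx, cy, w, h] -> [x_min, y_min, x_max, y_max]."""
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def corners_to_centroids(box):
    """[x_min, y_min, x_max, y_max] -> [cx, cy, w, h]."""
    x_min, y_min, x_max, y_max = box
    return np.array([(x_min + x_max) / 2, (y_min + y_max) / 2,
                     x_max - x_min, y_max - y_min])

def corners_to_minmax(box):
    """[x_min, y_min, x_max, y_max] -> [x_min, x_max, y_min, y_max]."""
    x_min, y_min, x_max, y_max = box
    return np.array([x_min, x_max, y_min, y_max])
```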

Bounding Box Encoding

The above bounding box representations are usually further encoded to produce the final representation of the bounding box used as the regression target.

Centroids Representation Encoding

The encoded representation of a ground-truth bounding box [$x_{\text{gt}}$, $y_{\text{gt}}$, $w_{\text{gt}}$, $h_{\text{gt}}$] with the corresponding anchor box [$x_{\text{anchor}}$, $y_{\text{anchor}}$, $w_{\text{anchor}}$, $h_{\text{anchor}}$] is [$x'$, $y'$, $w'$, $h'$], where

$$
\begin{gather}
x' = \frac{x_{\text{gt}} - x_{\text{anchor}}}{w_{\text{anchor}}} \\
y' = \frac{y_{\text{gt}} - y_{\text{anchor}}}{h_{\text{anchor}}} \\
w' = \ln{\bigg[\frac{w_{\text{gt}}}{w_{\text{anchor}}}\bigg]} \\
h' = \ln{\bigg[\frac{h_{\text{gt}}}{h_{\text{anchor}}}\bigg]} \\
\end{gather}
$$
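A direct translation of these formulas into NumPy might look like the following sketch (the function name is mine):

```python
import numpy as np

def encode_centroids(gt, anchor):
    """Encode a ground-truth box against an anchor box.
    Both boxes are in centroids format [cx, cy, w, h]."""
    x_gt, y_gt, w_gt, h_gt = gt
    x_a, y_a, w_a, h_a = anchor
    return np.array([(x_gt - x_a) / w_a,
                     (y_gt - y_a) / h_a,
                     np.log(w_gt / w_a),
                     np.log(h_gt / h_a)])
```

Note that the logarithm maps the width and height ratios, which are always positive, onto the whole real line, which makes them easier targets for an unconstrained regressor.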

Corners Representation Encoding

The encoded representation of a ground-truth bounding box [$x_{\text{min, gt}}$, $y_{\text{min, gt}}$, $x_{\text{max, gt}}$, $y_{\text{max, gt}}$] with the corresponding anchor box [$x_{\text{min, anchor}}$, $y_{\text{min, anchor}}$, $x_{\text{max, anchor}}$, $y_{\text{max, anchor}}$] is [$x_{\text{min}}'$, $y_{\text{min}}'$, $x_{\text{max}}'$, $y_{\text{max}}'$], where $w_{\text{anchor}} = x_{\text{max, anchor}} - x_{\text{min, anchor}}$, $h_{\text{anchor}} = y_{\text{max, anchor}} - y_{\text{min, anchor}}$, and

$$
\begin{gather}
x_{\text{min}}' = \frac{x_{\text{min, gt}} - x_{\text{min, anchor}}}{w_{\text{anchor}}} \\
y_{\text{min}}' = \frac{y_{\text{min, gt}} - y_{\text{min, anchor}}}{h_{\text{anchor}}} \\
x_{\text{max}}' = \frac{x_{\text{max, gt}} - x_{\text{max, anchor}}}{w_{\text{anchor}}} \\
y_{\text{max}}' = \frac{y_{\text{max, gt}} - y_{\text{max, anchor}}}{h_{\text{anchor}}} \\
\end{gather}
$$
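A corresponding sketch for the corners encoding, with the anchor width and height derived from the anchor corners (names are again mine):

```python
import numpy as np

def encode_corners(gt, anchor):
    """Encode a ground-truth box against an anchor box.
    Both boxes are in corners format [x_min, y_min, x_max, y_max]."""
    w_a = anchor[2] - anchor[0]  # anchor width
    h_a = anchor[3] - anchor[1]  # anchor height
    return (np.asarray(gt) - np.asarray(anchor)) / np.array([w_a, h_a, w_a, h_a])
```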

MinMax Representation Encoding

Similarly, the encoded representation of a ground-truth bounding box [$x_{\text{min, gt}}$, $x_{\text{max, gt}}$, $y_{\text{min, gt}}$, $y_{\text{max, gt}}$] with the corresponding anchor box [$x_{\text{min, anchor}}$, $x_{\text{max, anchor}}$, $y_{\text{min, anchor}}$, $y_{\text{max, anchor}}$] is [$x_{\text{min}}'$, $x_{\text{max}}'$, $y_{\text{min}}'$, $y_{\text{max}}'$], where

$$
\begin{gather}
x_{\text{min}}' = \frac{x_{\text{min, gt}} - x_{\text{min, anchor}}}{w_{\text{anchor}}} \\
x_{\text{max}}' = \frac{x_{\text{max, gt}} - x_{\text{max, anchor}}}{w_{\text{anchor}}} \\
y_{\text{min}}' = \frac{y_{\text{min, gt}} - y_{\text{min, anchor}}}{h_{\text{anchor}}} \\
y_{\text{max}}' = \frac{y_{\text{max, gt}} - y_{\text{max, anchor}}}{h_{\text{anchor}}} \\
\end{gather}
$$
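And a sketch for the MinMax encoding, where only the parameter ordering differs from the corners version:

```python
import numpy as np

def encode_minmax(gt, anchor):
    """Encode a ground-truth box against an anchor box.
    Both boxes are in MinMax format [x_min, x_max, y_min, y_max]."""
    w_a = anchor[1] - anchor[0]  # anchor width
    h_a = anchor[3] - anchor[2]  # anchor height
    return (np.asarray(gt) - np.asarray(anchor)) / np.array([w_a, w_a, h_a, h_a])
```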

Representation Encoding With Variance

The above encoding methods are usually well documented in papers such as the Faster R-CNN paper. However, when you start to read the code of object detection models, you will often see an “unexpected” input called “variance” in the encoding functions, such as [0.1, 0.1, 0.2, 0.2] where the values 0.1, 0.1, 0.2, 0.2 correspond to $x$, $y$, $w$, $h$ respectively, which is never mentioned in the papers. This variance input is extremely misleading. I have to admit that it took me a while to understand how it works, and it is actually very simple. It should not be described so obscurely in the code.

In bounding box encoding with variance, on top of the bounding box encoding methods described above, you will often see in the code of some thousand-star GitHub repositories, such as this one, that each encoded value is further divided by its corresponding “variance”. For example, in centroids representation encoding with variance,

$$
\begin{gather}
x'' = x' / \sigma^2_{x} = \frac{x_{\text{gt}} - x_{\text{anchor}}}{w_{\text{anchor}}} / \sigma^2_{x}\\
y'' = y' / \sigma^2_{y} = \frac{y_{\text{gt}} - y_{\text{anchor}}}{h_{\text{anchor}}} / \sigma^2_{y} \\
w'' = w' / \sigma^2_{w} = \ln{\bigg[\frac{w_{\text{gt}}}{w_{\text{anchor}}}\bigg]} / \sigma^2_{w} \\
h'' = h' / \sigma^2_{h} = \ln{\bigg[\frac{h_{\text{gt}}}{h_{\text{anchor}}}\bigg]} / \sigma^2_{h} \\
\end{gather}
$$

where you will often see the variances $\sigma_{x}^2 = 0.1$, $\sigma_{y}^2 = 0.1$, $\sigma_{w}^2 = 0.2$, $\sigma_{h}^2 = 0.2$. Although the model has probably been implemented correctly, the way this encoding method is described is often wrong and misleading, and nobody knows how those variance numbers were obtained. It should also be noted that the expression $\sigma^2_{x}$ in the code comments is poor notation, because a random variable should be denoted by a capital letter, not a lowercase one.

In my opinion, this is actually a process of standard normalization rather than “encoding with variance”. One first computes the ground-truth bounding box encodings as described in the “Bounding Box Encoding” chapter above. With many such encoded ground-truth bounding box representations, one can calculate the mean and variance of each coordinate. To achieve better machine learning accuracy, one would further normalize the representations by

$$
x'' = \frac{x'-\mu_{X'}}{\sigma_{X'}}
$$

where $\mu_{X'}$ is the mean of the variable $X'$ and $\sigma_{X'}$ is the standard deviation of the variable $X'$. In this way, if the encoded coordinate $X'$ follows some Gaussian distribution, after normalization the distribution becomes the standard normal distribution with a mean of 0 and a variance of 1. This is ideal for machine learning predictions.

In bounding box regression, $\mu_{X'} \approx 0$ in practice. Therefore, we could normalize the representations by

$$
x'' = \frac{x'}{\sigma_{X'}}
$$

So “divided by variance” is actually wrong! It should be division by the standard deviation. If [0.1, 0.1, 0.2, 0.2] are really variances, the centroids representation encoding with variance should be

$$
\begin{gather}
x'' = x' / \sigma_{X'} = \frac{x_{\text{gt}} - x_{\text{anchor}}}{w_{\text{anchor}}} / \sigma_{X'}\\
y'' = y' / \sigma_{Y'} = \frac{y_{\text{gt}} - y_{\text{anchor}}}{h_{\text{anchor}}} / \sigma_{Y'} \\
w'' = w' / \sigma_{W'} = \ln{\bigg[\frac{w_{\text{gt}}}{w_{\text{anchor}}}\bigg]} / \sigma_{W'} \\
h'' = h' / \sigma_{H'} = \ln{\bigg[\frac{h_{\text{gt}}}{h_{\text{anchor}}}\bigg]} / \sigma_{H'} \\
\end{gather}
$$

where $\sigma_{X'} = \sqrt{0.1}$, $\sigma_{Y'} = \sqrt{0.1}$, $\sigma_{W'} = \sqrt{0.2}$, and $\sigma_{H'} = \sqrt{0.2}$.
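To make this interpretation concrete, here is a small NumPy sketch with synthetic data; the numbers are made up so that the per-coordinate standard deviations are roughly $\sqrt{0.1} \approx 0.32$ and $\sqrt{0.2} \approx 0.45$.

```python
import numpy as np

# Synthetic stand-in for the (N, 4) array of encoded ground-truth boxes
# [x', y', w', h'] collected over a training dataset.
rng = np.random.default_rng(0)
encoded_gt = rng.normal(loc=0.0,
                        scale=[np.sqrt(0.1), np.sqrt(0.1),
                               np.sqrt(0.2), np.sqrt(0.2)],
                        size=(100000, 4))

std = encoded_gt.std(axis=0)   # per-coordinate standard deviation
print(std)                     # roughly [0.32, 0.32, 0.45, 0.45]

normalized = encoded_gt / std  # mean is ~0 in practice, so no centering needed
print(normalized.std(axis=0))  # roughly [1, 1, 1, 1]
```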

More concretely, the bounding box representation encodings with variance should be as follows.

Centroids Representation Encoding With Variance

The encoded representation of a ground-truth bounding box [$x_{\text{gt}}$, $y_{\text{gt}}$, $w_{\text{gt}}$, $h_{\text{gt}}$] with the corresponding anchor box [$x_{\text{anchor}}$, $y_{\text{anchor}}$, $w_{\text{anchor}}$, $h_{\text{anchor}}$] is [$x''$, $y''$, $w''$, $h''$], where

$$
\begin{gather}
x'' = x' / \sigma_{X'}\\
y'' = y' / \sigma_{Y'}\\
w'' = w' / \sigma_{W'} \\
h'' = h' / \sigma_{H'} \\
\end{gather}
$$

and the standard deviations were calculated from the centroids representation encodings without variance [$x'$, $y'$, $w'$, $h'$] in the training dataset.
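A sketch of this encoding in NumPy, treating the conventional [0.1, 0.1, 0.2, 0.2] values as variances and therefore dividing by their square roots (function and parameter names are mine):

```python
import numpy as np

def encode_centroids_with_variance(gt, anchor, variance=(0.1, 0.1, 0.2, 0.2)):
    """Encode a ground-truth box [cx, cy, w, h] against an anchor box,
    then divide each coordinate by its standard deviation, i.e., the
    square root of the given variance."""
    x_gt, y_gt, w_gt, h_gt = gt
    x_a, y_a, w_a, h_a = anchor
    encoded = np.array([(x_gt - x_a) / w_a,
                        (y_gt - y_a) / h_a,
                        np.log(w_gt / w_a),
                        np.log(h_gt / h_a)])
    return encoded / np.sqrt(variance)
```

Note that in the popular implementations discussed above, the division is by the variance values themselves rather than by their square roots; the sketch follows the corrected interpretation.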

Corners Representation Encoding With Variance

The encoded representation of a ground-truth bounding box [$x_{\text{min, gt}}$, $y_{\text{min, gt}}$, $x_{\text{max, gt}}$, $y_{\text{max, gt}}$] with the corresponding anchor box [$x_{\text{min, anchor}}$, $y_{\text{min, anchor}}$, $x_{\text{max, anchor}}$, $y_{\text{max, anchor}}$] is [$x_{\text{min}}''$, $y_{\text{min}}''$, $x_{\text{max}}''$, $y_{\text{max}}''$], where

$$
\begin{gather}
x_{\text{min}}'' = x_{\text{min}}' / \sigma_{X_{\text{min}}'}\\
y_{\text{min}}'' = y_{\text{min}}' / \sigma_{Y_{\text{min}}'}\\
x_{\text{max}}'' = x_{\text{max}}' / \sigma_{X_{\text{max}}'}\\
y_{\text{max}}'' = y_{\text{max}}' / \sigma_{Y_{\text{max}}'} \\
\end{gather}
$$

and the standard deviations were calculated from the corners representation encodings without variance [$x_{\text{min}}'$, $y_{\text{min}}'$, $x_{\text{max}}'$, $y_{\text{max}}'$] in the training dataset.

MinMax Representation Encoding With Variance

The encoded representation of a ground-truth bounding box [$x_{\text{min, gt}}$, $x_{\text{max, gt}}$, $y_{\text{min, gt}}$, $y_{\text{max, gt}}$] with the corresponding anchor box [$x_{\text{min, anchor}}$, $x_{\text{max, anchor}}$, $y_{\text{min, anchor}}$, $y_{\text{max, anchor}}$] is [$x_{\text{min}}''$, $x_{\text{max}}''$, $y_{\text{min}}''$, $y_{\text{max}}''$], where

$$
\begin{gather}
x_{\text{min}}'' = x_{\text{min}}' / \sigma_{X_{\text{min}}'}\\
x_{\text{max}}'' = x_{\text{max}}' / \sigma_{X_{\text{max}}'}\\
y_{\text{min}}'' = y_{\text{min}}' / \sigma_{Y_{\text{min}}'}\\
y_{\text{max}}'' = y_{\text{max}}' / \sigma_{Y_{\text{max}}'} \\
\end{gather}
$$

and the standard deviations were calculated from the MinMax representation encodings without variance [$x_{\text{min}}'$, $x_{\text{max}}'$, $y_{\text{min}}'$, $y_{\text{max}}'$] in the training dataset.

Bounding Box Decoding

Once you know how the bounding box encoding works, it is very easy to do bounding box decoding during inference.

Centroids Representation Decoding With Variance

The decoded representation of a predicted bounding box [$x_{\text{pred}}''$, $y_{\text{pred}}''$, $w_{\text{pred}}''$, $h_{\text{pred}}''$] with the corresponding anchor box [$x_{\text{anchor}}$, $y_{\text{anchor}}$, $w_{\text{anchor}}$, $h_{\text{anchor}}$] and pre-calculated variances [$\sigma_{X'}^2$, $\sigma_{Y'}^2$, $\sigma_{W'}^2$, $\sigma_{H'}^2$] is [$x_{\text{pred}}$, $y_{\text{pred}}$, $w_{\text{pred}}$, $h_{\text{pred}}$], where

$$
\begin{gather}
x_{\text{pred}} = x_{\text{pred}}'' \sigma_{X'} w_{\text{anchor}} + x_{\text{anchor}} \\
y_{\text{pred}} = y_{\text{pred}}'' \sigma_{Y'} h_{\text{anchor}} + y_{\text{anchor}} \\
w_{\text{pred}} = \exp({w_{\text{pred}}'' \sigma_{W'}}) w_{\text{anchor}} \\
h_{\text{pred}} = \exp({h_{\text{pred}}'' \sigma_{H'}}) h_{\text{anchor}} \\
\end{gather}
$$
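A matching decoding sketch that inverts the centroids encoding above, again under the interpretation that [0.1, 0.1, 0.2, 0.2] are variances:

```python
import numpy as np

def decode_centroids_with_variance(pred, anchor, variance=(0.1, 0.1, 0.2, 0.2)):
    """Invert the centroids encoding: multiply the prediction back by the
    standard deviation and the anchor scale, then shift by the anchor."""
    x_p, y_p, w_p, h_p = np.asarray(pred) * np.sqrt(variance)
    x_a, y_a, w_a, h_a = anchor
    return np.array([x_p * w_a + x_a,
                     y_p * h_a + y_a,
                     np.exp(w_p) * w_a,
                     np.exp(h_p) * h_a])

# Round-trip sanity check, reusing encode_centroids_with_variance from the
# earlier sketch: encoding followed by decoding recovers the original box.
anchor = np.array([0.20, 0.50, 0.10, 0.20])
gt = np.array([0.25, 0.55, 0.08, 0.25])
encoded = encode_centroids_with_variance(gt, anchor)
print(np.allclose(decode_centroids_with_variance(encoded, anchor), gt))  # True
```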

Corners Representation Decoding With Variance

The decoded representation of a predicted bounding box [$x_{\text{min, pred}}''$, $y_{\text{min, pred}}''$, $x_{\text{max, pred}}''$, $y_{\text{max, pred}}''$] with the corresponding anchor box [$x_{\text{min, anchor}}$, $y_{\text{min, anchor}}$, $x_{\text{max, anchor}}$, $y_{\text{max, anchor}}$] and pre-calculated variances [$\sigma_{X_{\text{min}}'}^2$, $\sigma_{Y_{\text{min}}'}^2$, $\sigma_{X_{\text{max}}'}^2$, $\sigma_{Y_{\text{max}}'}^2$] is [$x_{\text{min, pred}}$, $y_{\text{min, pred}}$, $x_{\text{max, pred}}$, $y_{\text{max, pred}}$], where

$$
\begin{gather}
x_{\text{min, pred}} = x_{\text{min, pred}}'' \sigma_{X_{\text{min}}'} w_{\text{anchor}} + x_{\text{min, anchor}} \\
y_{\text{min, pred}} = y_{\text{min, pred}}'' \sigma_{Y_{\text{min}}'} h_{\text{anchor}} + y_{\text{min, anchor}} \\
x_{\text{max, pred}} = x_{\text{max, pred}}'' \sigma_{X_{\text{max}}'} w_{\text{anchor}} + x_{\text{max, anchor}} \\
y_{\text{max, pred}} = y_{\text{max, pred}}'' \sigma_{Y_{\text{max}}'} h_{\text{anchor}} + y_{\text{max, anchor}} \\
\end{gather}
$$
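For completeness, a corresponding sketch for the corners decoding, with the per-coordinate standard deviations passed in explicitly since the conventional default values only exist for the centroids format:

```python
import numpy as np

def decode_corners_with_variance(pred, anchor, std):
    """Invert the corners encoding. pred and anchor are in corners format
    [x_min, y_min, x_max, y_max]; std holds the four per-coordinate
    standard deviations estimated from the training data."""
    w_a = anchor[2] - anchor[0]  # anchor width
    h_a = anchor[3] - anchor[1]  # anchor height
    scale = np.array([w_a, h_a, w_a, h_a])
    return np.asarray(pred) * np.asarray(std) * scale + np.asarray(anchor)
```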

MinMax Representation Decoding With Variance

The decoded representation of a predicted bounding box [$x_{\text{min, pred}}''$, $x_{\text{max, pred}}''$, $y_{\text{min, pred}}''$, $y_{\text{max, pred}}''$] with the corresponding anchor box [$x_{\text{min, anchor}}$, $x_{\text{max, anchor}}$, $y_{\text{min, anchor}}$, $y_{\text{max, anchor}}$] and pre-calculated variances [$\sigma_{X_{\text{min}}'}^2$, $\sigma_{X_{\text{max}}'}^2$, $\sigma_{Y_{\text{min}}'}^2$, $\sigma_{Y_{\text{max}}'}^2$] is [$x_{\text{min, pred}}$, $x_{\text{max, pred}}$, $y_{\text{min, pred}}$, $y_{\text{max, pred}}$], where

$$
\begin{gather}
x_{\text{min, pred}} = x_{\text{min, pred}}'' \sigma_{X_{\text{min}}'} w_{\text{anchor}} + x_{\text{min, anchor}} \\
x_{\text{max, pred}} = x_{\text{max, pred}}'' \sigma_{X_{\text{max}}'} w_{\text{anchor}} + x_{\text{max, anchor}} \\
y_{\text{min, pred}} = y_{\text{min, pred}}'' \sigma_{Y_{\text{min}}'} h_{\text{anchor}} + y_{\text{min, anchor}} \\
y_{\text{max, pred}} = y_{\text{max, pred}}'' \sigma_{Y_{\text{max}}'} h_{\text{anchor}} + y_{\text{max, anchor}} \\
\end{gather}
$$
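And similarly for the MinMax decoding; as with the encoding, only the parameter ordering differs from the corners version:

```python
import numpy as np

def decode_minmax_with_variance(pred, anchor, std):
    """Invert the MinMax encoding. pred and anchor are in MinMax format
    [x_min, x_max, y_min, y_max]; std holds the four per-coordinate
    standard deviations estimated from the training data."""
    w_a = anchor[1] - anchor[0]  # anchor width
    h_a = anchor[3] - anchor[2]  # anchor height
    scale = np.array([w_a, w_a, h_a, h_a])
    return np.asarray(pred) * np.asarray(std) * scale + np.asarray(anchor)
```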

Final Remarks

Even if the normalization is conducted using an incorrect standard deviation, the distribution after the “incorrect” normalization will still be normal. The only difference is that the variance of the distribution after normalization will not be 1; the mean will still be roughly 0. So the effect of normalization using an incorrect standard deviation is small or even negligible, and that is why those GitHub implementations are conceptually incorrect but still work well in practice.

It is very funny to see how such errors propagate because of the lack of good documentation.
