# YOLO v2 Reorg Layer Explained

## Introduction

The object detection model YOLO v2, implemented in Darknet, uses a “reorg” layer in its network. However, the author did not discuss this layer at all in the paper. There are some blog posts explaining the reorg layer, such as this one and this one. Unfortunately, all of them are superficial. When you actually read the source code and compare it with the illustrations in those blog posts, you will feel even more confused. So I spent a while studying the behavior of this reorg layer and documented how it really works and how to make it work in your own network.

## Purpose of Reorg Layer

I think the purpose of the reorg layer is very simple: we want to combine mid-level features and high-level features to get better classification accuracy. Such a combination can be done in different ways, such as max-pooling and addition. In YOLO v2, the author used a reorg layer to reshape an output tensor so that its H and W match those of another output tensor downstream, and the two tensors can be concatenated.

## How Reorg Layer Works

### Basic Goals

The original implementation of the reorg layer takes a tensor of shape $[N, C, H, W]$, where $N$ is the batch size, $C$ is the number of channels, $H$ is the number of rows, and $W$ is the number of columns. It rearranges the order of the tensor elements, reshapes, and outputs a tensor whose shape is either $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ or $[N, \frac{C}{s^2}, H\times s, W\times s]$, where $s$ is a positive integer, and $\frac{H}{s}$ and $\frac{W}{s}$ (in the first case) or $\frac{C}{s^2}$ (in the second case) have to be positive integers as well. Although the layer supports both output shapes, in the particular YOLO v2 network architecture a tensor of shape $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ is generated by the reorg layer in the forward propagation, and a tensor of shape $[N, \frac{C}{s^2}, H\times s, W\times s]$ is generated by the reorg layer in the backward propagation. The input and output tensors in this implementation also stick to the `NCHW` format.
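The two shape rules can be checked with a few lines of Python. This is only a sketch: the helper name `reorg_output_shape` and its `increase_channels` flag are hypothetical names of mine, not part of Darknet.

```python
def reorg_output_shape(shape, stride, increase_channels=True):
    """Return the NCHW shape a reorg layer would produce (hypothetical helper)."""
    n, c, h, w = shape
    s = stride
    if increase_channels:
        # [N, C, H, W] -> [N, C * s^2, H / s, W / s]
        if h % s or w % s:
            raise ValueError("H and W must be divisible by the stride")
        return (n, c * s * s, h // s, w // s)
    # [N, C, H, W] -> [N, C / s^2, H * s, W * s]
    if c % (s * s):
        raise ValueError("C must be divisible by the stride squared")
    return (n, c // (s * s), h * s, w * s)


print(reorg_output_shape((16, 64, 144, 144), 2))         # (16, 256, 72, 72)
print(reorg_output_shape((16, 256, 72, 72), 2, False))   # (16, 64, 144, 144)
```

The divisibility checks make the integer constraints explicit instead of silently truncating.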

### Confusions

In one blog post, the author only used a single-channel tensor (a matrix) for illustration. It did not say anything about batches or multiple channels, which you will definitely see in real neural networks. If you read the author’s source code, you will also find that it does not support the scenario where $c < s^2$. In another blog post, the authors tried to incorporate multiple channels. Unfortunately, all of these blog posts are incorrect.

If you check the original implementation of the reorg layer, you will find that it takes an argument named `forward`. What is this `forward`? Do we use `forward = True` in forward propagation and `forward = False` in back propagation? The author never mentioned it anywhere.

```c
void reorg_cpu(float *x, int w, int h, int c, int batch, int stride, int forward, float *out)
```

### Basic Ideas

Regardless of its shape, a tensor is laid out in memory as a linear region. The reorg layer is a function that sets up a bijection between the order of the elements in the input tensor and the order of the elements in the output tensor. That is to say, given the position of an element of the input tensor in linear memory, there is a unique position in the output tensor's linear memory that maps to it. In addition to the size of the flattened input tensor, this bijective function is determined by the parameters `w`, `h`, `c`, `batch`, `stride`, and `forward`.
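To make the bijection concrete, here is a minimal sketch of the index pairing. It assumes the index arithmetic of Darknet's `reorg_cpu` as I read it; the helper name `reorg_index_map` is hypothetical. In that reading, `forward` only decides the copy direction between the paired positions (`out[out_index] = x[in_index]` when true, `out[in_index] = x[out_index]` otherwise).

```python
def reorg_index_map(w, h, c, batch, stride):
    """Return a list m where m[in_index] is the paired out_index."""
    out_c = c // (stride * stride)  # assumes c >= stride^2, otherwise out_c is 0
    mapping = []
    for b in range(batch):
        for k in range(c):
            for j in range(h):
                for i in range(w):
                    # in_index = i + w*(j + h*(k + c*b)) increases by one per
                    # iteration, so appending keeps mapping[in_index] aligned.
                    c2 = k % out_c
                    offset = k // out_c
                    w2 = i * stride + offset % stride
                    h2 = j * stride + offset // stride
                    out_index = w2 + w * stride * (h2 + h * stride * (c2 + out_c * b))
                    mapping.append(out_index)
    return mapping


mapping = reorg_index_map(w=3, h=3, c=16, batch=2, stride=2)
# Every position 0..len-1 appears exactly once, so the pairing is a bijection.
print(sorted(mapping) == list(range(len(mapping))))  # True
```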

### How to Use Reorg Layer

I have written a simple Python program simulating the reorg process, and the conclusions in this section were drawn from that simulation. The reorg function ported to Python is shown below. The whole script can be downloaded from my GitHub Gist.

```python
def reorg(arrayIn, batch, C, H, W, stride, forward=False):
```

The `forward` argument in the reorg layer has nothing to do with forward propagation and back propagation. When the tensor is reorganized from shape $[N, C, H, W]$ to shape $[N, \frac{C}{s^2}, H\times s, W\times s]$ (channel decrease), it is called `forward = True`. When the tensor is reorganized from shape $[N, C, H, W]$ to shape $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ (channel increase), it is called `forward = False`. So in YOLO v2, we should use `forward = False` in the forward propagation because we increase the number of channels, while in the back propagation we should use `forward = True`.

What about the `C`, `H`, and `W` arguments? In the author’s implementation, `C`, `H`, and `W` are always the shape of the input tensor in the forward propagation.

```python
# First backward (channel increase) then forward (channel decrease)
```

As a result, when `forward = False` in YOLO v2, the reorganization performed by the reorg layer has no “grid division” effect and is not what those blog posts describe (see `backward_forward_author()`). However, when `forward = True`, the reorganization is what those blog posts describe (see `forward_backward_author()`). I am not sure whether the original author realized that the reorganization behaves differently when the shape of the input tensor is used as the arguments of the reorg layer (please also check the Concrete Example below).

Here I propose a way to unify the reorganization behaviors. The `C`, `H`, and `W` used for the reorg layer should be taken from the tensor with the largest number of channels, regardless of whether that tensor is the input tensor or the output tensor. These parameters can be determined once the network architecture is determined.

```python
# First backward (channel increase) then forward (channel decrease)
```

It should be noted that the `forward_backward_leimao()` function is actually the same as `forward_backward_author()`.
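With `C`, `H`, `W` taken from the tensor with more channels and `forward = False`, the index arithmetic reduces to a plain strided pick from the input, which is the “grid division” picture. The sketch below writes that closed form directly on nested lists. It assumes the Darknet index arithmetic described above, and `reorg_unified_increase` is a hypothetical name of mine.

```python
def reorg_unified_increase(x, stride):
    """x is a nested list [N][C][H][W]; returns [N][C*s*s][H/s][W/s]."""
    s = stride
    n, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [
        [
            [
                # Output channel k reads input channel k % c at the
                # in-block offset (row (k // c) // s, column (k // c) % s).
                [x[b][k % c][j * s + (k // c) // s][i * s + (k // c) % s]
                 for i in range(w // s)]
                for j in range(h // s)
            ]
            for k in range(c * s * s)
        ]
        for b in range(n)
    ]


# Input of shape [2, 4, 6, 6] holding 0..287 in NCHW order.
x = [[[[((b * 4 + c) * 6 + r) * 6 + col for col in range(6)]
       for r in range(6)] for c in range(4)] for b in range(2)]
out = reorg_unified_increase(x, stride=2)
print(out[0][0])  # [[0, 2, 4], [12, 14, 16], [24, 26, 28]]
print(out[0][4])  # [[1, 3, 5], [13, 15, 17], [25, 27, 29]]
```

Output channel 0 is every second row and column of input channel 0, and output channel 4 is the same grid shifted by one column.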

For example, suppose we have an input tensor of shape [16, 64, 144, 144], and with stride = 2 we want to get an output tensor of shape [16, 256, 72, 72] from the reorg layer in the forward propagation. In the forward propagation, we should call `reorg(arrayIn, batch=16, C=256, H=72, W=72, stride=2, forward=False)`, where `arrayIn` is the input tensor of the forward propagation, of shape [16, 64, 144, 144]. In the backward propagation, to calculate the gradients, we should call `reorg(arrayIn, batch=16, C=256, H=72, W=72, stride=2, forward=True)`, where `arrayIn` is the input tensor of the backward propagation, of shape [16, 256, 72, 72].
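The same pair of calls can be tried on a scaled-down tensor. The function below is a pure-Python sketch of the port working on flat lists, assuming the index arithmetic of Darknet's `reorg_cpu`; the exact code in my Gist may differ. It uses a [2, 4, 6, 6] input instead of [16, 64, 144, 144] so that it runs instantly.

```python
def reorg(array_in, batch, C, H, W, stride, forward=False):
    """Reorganize a flat NCHW list; C, H, W describe the many-channel tensor."""
    array_out = [0] * len(array_in)
    out_c = C // (stride * stride)
    for b in range(batch):
        for k in range(C):
            for j in range(H):
                for i in range(W):
                    in_index = i + W * (j + H * (k + C * b))
                    c2 = k % out_c
                    offset = k // out_c
                    w2 = i * stride + offset % stride
                    h2 = j * stride + offset // stride
                    out_index = w2 + W * stride * (h2 + H * stride * (c2 + out_c * b))
                    if forward:
                        array_out[out_index] = array_in[in_index]
                    else:
                        array_out[in_index] = array_in[out_index]
    return array_out


# Forward propagation: [2, 4, 6, 6] -> [2, 16, 3, 3] (channel increase).
x = list(range(2 * 4 * 6 * 6))
y = reorg(x, batch=2, C=16, H=3, W=3, stride=2, forward=False)
# Backward propagation: [2, 16, 3, 3] -> [2, 4, 6, 6] (channel decrease).
x_back = reorg(y, batch=2, C=16, H=3, W=3, stride=2, forward=True)
print(x_back == x)  # True
```

Because both calls share the same `C`, `H`, `W`, and `stride`, the second call undoes the first exactly, so gradients are routed back to the positions their values came from.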

## Concrete Example

I prepared a minimal concrete example showing how the reorg layers work with the parameter settings mentioned above. Please run the script on my GitHub Gist to see how those implementations differ. Here I am showing the backward reorganization of the tensor, which is the one used in the forward propagation in YOLO v2.

### Input

We have an input of shape [2, 4, 6, 6]:

```
[[[[ 0 1 2 3 4 5]
```

### Author’s Reorg Result

The output tensor of shape [2, 16, 3, 3] from the author’s reorg layer:

```
[[[[ 0 2 4]
```

This result is not what people generally expected, which means that all of those explanations were wrong.

### My Reorg Result

The output tensor of shape [2, 16, 3, 3] from my reorg layer:

```
[[[[ 0 2 4]
```

This result matches what people generally expected. Although it looks “elegant”, it is different from the behavior of the reorg layer that the original author was using.
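The difference between the two parameter choices can be reproduced with a compact sketch of the `forward = False` branch alone. It assumes the Darknet index arithmetic and an input filled with 0..287; `reorg_backward` is a hypothetical helper name, not a function from my Gist.

```python
def reorg_backward(x, batch, C, H, W, stride):
    """forward=False branch only: out[in_index] = x[out_index], flat lists."""
    out_c = C // (stride * stride)
    out = []
    for b in range(batch):
        for k in range(C):
            c2, offset = k % out_c, k // out_c
            for j in range(H):
                h2 = j * stride + offset // stride
                for i in range(W):
                    w2 = i * stride + offset % stride
                    out.append(x[w2 + W * stride * (h2 + H * stride * (c2 + out_c * b))])
    return out


x = list(range(2 * 4 * 6 * 6))  # flat [2, 4, 6, 6] input
# Author's convention: C, H, W are the shape of the forward-pass input.
author = reorg_backward(x, batch=2, C=4, H=6, W=6, stride=2)
# Unified convention: C, H, W come from the tensor with more channels.
mine = reorg_backward(x, batch=2, C=16, H=3, W=3, stride=2)

# First 3x3 channel of each [2, 16, 3, 3] output.
print([author[r * 3:(r + 1) * 3] for r in range(3)])  # [[0, 2, 4], [6, 8, 10], [24, 26, 28]]
print([mine[r * 3:(r + 1) * 3] for r in range(3)])    # [[0, 2, 4], [12, 14, 16], [24, 26, 28]]
```

Both outputs start with the row `0 2 4`, but the author's convention then takes `6 8 10` from input row 1, while the unified convention takes `12 14 16` from input row 2, which is the regular every-second-row grid.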

### Reorg Mechanism?

I apologize that I did not make a figure for this because it is too tedious. If someone could make one for me, I would appreciate it very much. To the best of my knowledge, this is the only blog post showing correctly how the reorg layer works.

## Final Remarks

At first, I thought this process was the reverse of the sub-pixel convolutional layer. But it looks slightly different and more complicated. I am not sure whether the YOLO v2 implementations in frameworks such as TensorFlow implement this reorg layer correctly, because people tend to use built-in functions such as `tf.depth_to_space`, and I believe these functions are slightly different from what the original reorg layer does. However, this is probably not going to affect the performance of the neural network, since no information is lost if you implement the reorg layer differently, and a neural network is usually smart enough to figure out how to extract useful information from the right places. The author probably also made the implementation error I mentioned above, so that the reorganization behavior was not what people, including myself, expected, but the neural network could “correct” this “error”.

If I did make a mistake in understanding the author’s reorganization layer implementation, please do let me know.
