The YOLO v2 (Darknet) object detection model uses a "reorg" layer in its network. However, the author did not discuss this layer at all in the paper. There are some blog posts explaining the reorg layer, such as this one and this one. Unfortunately, all of them are superficial. When you actually read the source code and compare it with the illustrations in those blog posts, you will feel even more confused. So I spent a while studying the behavior of the reorg layer, and in this post I document how it really works and how to make it work in your own network.
Purpose of Reorg Layer
I think the purpose of the reorg layer is very simple: we want to combine middle-level features and high-level features to get better classification accuracy, and there are different ways to do so, such as max-pooling and addition. In YOLO v2, the author used a reorg layer to reshape the output tensor so that its H and W match those of another output tensor downstream, and the two tensors can then be concatenated together.
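As a rough sketch of that constraint (using the example shapes that appear later in this post, plus a made-up downstream channel count of 512), the two feature maps can only be concatenated along the channel axis once their H and W agree:

```python
import numpy as np

# Hypothetical shapes for illustration only: the passthrough branch after the
# reorg layer and a deeper feature map whose spatial size already matches it.
reorg_out = np.zeros((1, 256, 72, 72))   # e.g. [N, 64, 144, 144] after reorg
downstream = np.zeros((1, 512, 72, 72))  # made-up downstream feature map

# Concatenation along the channel axis works because H and W now match.
merged = np.concatenate([reorg_out, downstream], axis=1)
print(merged.shape)  # (1, 768, 72, 72)
```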
How Reorg Layer Works
Basic Goals
The original implementation of the reorg layer takes a tensor of shape $[N, C, H, W]$, where $N$ is the batch size, $C$ is the number of channels, $H$ is the number of rows, and $W$ is the number of columns, re-arranges the order of the tensor elements, and outputs a tensor whose shape is either $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ or $[N, \frac{C}{s^2}, H \times s, W \times s]$, where $s$ is a positive integer and $\frac{C}{s^2}$, $\frac{H}{s}$, and $\frac{W}{s}$ have to be positive integers as well. Although the layer supports both output shapes, for the particular YOLO v2 network architecture, a tensor of shape $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ is generated by the reorg layer in the forward propagation, and a tensor of shape $[N, \frac{C}{s^2}, H \times s, W \times s]$ is generated by the reorg layer in the backward propagation. The input and output tensors in this implementation also stick to the NCHW format.
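A tiny, purely illustrative helper (not part of the original implementation) makes this shape arithmetic explicit:

```python
def reorg_output_shapes(n, c, h, w, s):
    """Return the two possible reorg output shapes for an [n, c, h, w] input."""
    assert c % (s * s) == 0 and h % s == 0 and w % s == 0
    channel_increase = (n, c * s * s, h // s, w // s)  # used in the YOLO v2 forward pass
    channel_decrease = (n, c // (s * s), h * s, w * s)
    return channel_increase, channel_decrease

print(reorg_output_shapes(16, 64, 144, 144, 2))
# ((16, 256, 72, 72), (16, 16, 288, 288))
```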
Confusions
In this blog post, the author only used a single-channel tensor (a matrix) for illustration. It said nothing about batches or the multi-channel scenario, which you will definitely see in real neural networks. If you read that author's source code, you would also find that it does not support the scenario where $C < s^2$. In this blog post, the authors tried to incorporate multiple channels. Unfortunately, all of these blog posts are incorrect.
If you check the original implementation of the reorg layer, you will find there is an argument called forward. What is this forward? Do we use forward = True in the forward propagation and forward = False in the back propagation? The author never mentioned it anywhere.
```c
void reorg_cpu(float *x, int w, int h, int c, int batch, int stride, int forward, float *out)
{
    int b, i, j, k;
    int out_c = c / (stride * stride);

    for (b = 0; b < batch; ++b) {
        for (k = 0; k < c; ++k) {
            for (j = 0; j < h; ++j) {
                for (i = 0; i < w; ++i) {
                    int in_index = i + w * (j + h * (k + c * b));
                    int c2 = k % out_c;
                    int offset = k / out_c;
                    int w2 = i * stride + offset % stride;
                    int h2 = j * stride + offset / stride;
                    int out_index = w2 + w * stride * (h2 + h * stride * (c2 + out_c * b));
                    if (forward) out[out_index] = x[in_index];
                    else out[in_index] = x[out_index];
                }
            }
        }
    }
}
```
Basic Ideas
Regardless of the shape of a tensor, it lies in memory as a linear region. The reorg layer is a function that sets up a bijection between the positions of the elements of the input tensor and the positions of the elements of the output tensor in this linear memory. That is to say, given the position of an element of the input tensor in linear memory, there is a unique position of the output tensor in linear memory that maps to it. In addition to the size of the flattened input tensor, this bijective function is determined by the parameters w, h, c, batch, stride, and forward.
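For NCHW tensors, the flat-memory position of element $(b, k, j, i)$ is $i + W(j + H(k + Cb))$, which is exactly the in_index expression in the code above. The following small sketch (with made-up dimensions) simply double-checks that this formula agrees with NumPy's C-order flattening:

```python
import numpy as np

# The flat NCHW index used throughout the reorg code:
#   index = i + W * (j + H * (k + C * b))
# for element (b, k, j, i) of an [N, C, H, W] tensor.
N, C, H, W = 2, 3, 4, 5
b, k, j, i = 1, 2, 3, 4
flat = i + W * (j + H * (k + C * b))
print(flat == np.ravel_multi_index((b, k, j, i), (N, C, H, W)))  # True
```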
How to Use Reorg Layer
I have written a simple Python program simulating the reorg process. After running the simulation, I drew the following conclusions. The reorg function ported to Python is shown below. The whole script can be downloaded from my GitHub Gist.
```python
import numpy as np


def reorg(arrayIn, batch, C, H, W, stride, forward=False):
    arrayLen = len(arrayIn)
    arrayOut = np.zeros(arrayLen)
    out_c = C // (stride * stride)
    for b in range(batch):
        for k in range(C):
            for j in range(H):
                for i in range(W):
                    in_index = i + W * (j + H * (k + C * b))
                    c2 = k % out_c
                    offset = k // out_c
                    w2 = i * stride + offset % stride
                    h2 = j * stride + offset // stride
                    out_index = int(w2 + W * stride * (h2 + H * stride * (c2 + out_c * b)))
                    if forward:
                        arrayOut[out_index] = arrayIn[in_index]
                    else:
                        arrayOut[in_index] = arrayIn[out_index]
    return arrayOut
```
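As a quick sanity check on top of this function (a sketch with made-up shapes, not part of the Gist), running the mapping once with forward = True and once with forward = False, using identical parameters, recovers the original array, which confirms that the mapping is a bijection:

```python
import numpy as np

# With identical parameters, forward=True and forward=False are exact inverses.
x = np.arange(1 * 8 * 4 * 4, dtype=np.float32)                 # flat [1, 8, 4, 4] tensor
y = reorg(x, batch=1, C=8, H=4, W=4, stride=2, forward=True)   # -> flat [1, 2, 8, 8]
z = reorg(y, batch=1, C=8, H=4, W=4, stride=2, forward=False)  # back to flat [1, 8, 4, 4]
print(np.array_equal(x, z))  # True
```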
The forward argument in the reorg layer has nothing to do with forward propagation and back propagation. When the tensor is reorganized from shape $[N, C, H, W]$ to shape $[N, \frac{C}{s^2}, H \times s, W \times s]$ (channel decrease), forward = True is used. When the tensor is reorganized from shape $[N, C, H, W]$ to shape $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ (channel increase), forward = False is used. So in YOLO v2, in the forward propagation we should use forward = False because we increase the number of channels, while in the back propagation we should use forward = True.
How about the C, H, and W arguments? In the author's implementation, C, H, and W are always the shape of the input tensor in the forward propagation.
```python
# First backward (channel increase) then forward (channel decrease)
# Mimicking author's implementation
# Reorganization reversible
# Reorganization result was not expected by the public
def backward_forward_author():
```
```python
# First forward (channel decrease) then backward (channel increase)
# Reorganization reversible
# Reorganization result was expected by the public
def forward_backward_author():
```
With that parameter choice, the reorganization behavior of the reorg layer when forward = False in YOLO v2 produces no "grid division" effect and is not what those blog posts expect (see backward_forward_author()). However, the reorganization behavior of the reorg layer when forward = True is what those blog posts expect (see forward_backward_author()). I am not sure whether the original author realized that the reorganization behaviors would differ when the shape of the input tensor is used as the arguments of the reorg layer (please also check the Concrete Example below).
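To see the difference concretely, here is a small sketch built on the reorg function above (with a made-up flat [1, 4, 4, 4] tensor, not the exact tensors in the Gist). It calls reorg with forward = False twice: once passing the shape of the forward-propagation input tensor as C, H, W, as the author does, and once passing the shape of the tensor with more channels, [1, 16, 2, 2]. The two parameter choices generally produce different permutations of the elements:

```python
import numpy as np

# A flat tensor of shape [1, 4, 4, 4]; the reorg output will be interpreted
# downstream as [1, 16, 2, 2] with stride = 2 (channel increase, forward=False).
x = np.arange(1 * 4 * 4 * 4, dtype=np.float32)

# Darknet-style: pass the shape of the forward-propagation input tensor.
out_author = reorg(x, batch=1, C=4, H=4, W=4, stride=2, forward=False)

# Pass the shape of the tensor with the larger channel count instead.
out_unified = reorg(x, batch=1, C=16, H=2, W=2, stride=2, forward=False)

# The two parameter choices order the elements differently.
print(np.array_equal(out_author, out_unified))  # False for this input
```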
Here I propose a way to unify the reorganization behaviors: the C, H, and W used for the reorg layer should be taken from the tensor with the largest number of channels, regardless of whether that tensor is the input tensor or the output tensor. These parameters can be determined once the network architecture is fixed.
```python
# First backward (channel increase) then forward (channel decrease)
# Reorganization reversible
# Reorganization result was expected by the public
def backward_forward_leimao():
```
```python
# First forward (channel decrease) then backward (channel increase)
# Reorganization reversible
# Reorganization result was expected by the public
def forward_backward_leimao():
```
It should be noted that the forward_backward_leimao() function is actually the same as forward_backward_author().
For example, suppose we have an input tensor of shape [16, 64, 144, 144], we use stride = 2, and we want an output tensor of shape [16, 256, 72, 72] from the reorg layer in the forward propagation. In the forward propagation, we should call reorg(arrayIn, batch=16, C=256, H=72, W=72, stride=2, forward=False), where arrayIn is the input tensor of the forward propagation, of shape [16, 64, 144, 144]. In the backward propagation, to calculate the gradients, we should call reorg(arrayIn, batch=16, C=256, H=72, W=72, stride=2, forward=True), where arrayIn is the input tensor of the backward propagation, of shape [16, 256, 72, 72].
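The same call pattern, scaled down so that the pure-Python loops finish quickly (the flat shapes [1, 4, 8, 8] and [1, 16, 4, 4] play the roles of [16, 64, 144, 144] and [16, 256, 72, 72]; the gradient values are made up), looks roughly like this:

```python
import numpy as np

# Forward propagation: flat [1, 4, 8, 8] input -> flat [1, 16, 4, 4] output.
# C, H, W come from the larger-channel shape [1, 16, 4, 4].
activations = np.arange(1 * 4 * 8 * 8, dtype=np.float32)
outputs = reorg(activations, batch=1, C=16, H=4, W=4, stride=2, forward=False)

# Backward propagation: the gradient w.r.t. the output, flat [1, 16, 4, 4],
# is reorganized back to the input shape [1, 4, 8, 8].
grad_wrt_output = np.ones_like(outputs)
grad_wrt_input = reorg(grad_wrt_output, batch=1, C=16, H=4, W=4, stride=2, forward=True)
print(grad_wrt_input.shape)  # (256,), i.e. flat [1, 4, 8, 8]
```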
Concrete Example
I prepared a concrete minimal example showing how the reorg layer works given the parameter settings mentioned above. Please run the script from my GitHub Gist to see how those implementations differ. Here I am showing the backward reorganization of the tensor, which is also what is used in the forward propagation in YOLO v2.
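If you cannot run the Gist right away, the following miniature sketch built on the reorg function above (with a made-up [1, 1, 4, 4] input, not the exact tensors in the Gist) conveys the idea: every output channel is a stride-2 sub-grid of the input channel.

```python
import numpy as np

# Miniature example: a single-channel 4 x 4 input, stride 2, channel increase.
x = np.arange(1 * 1 * 4 * 4, dtype=np.float32)
print(x.reshape(1, 1, 4, 4)[0, 0])
# [[ 0.  1.  2.  3.]
#  [ 4.  5.  6.  7.]
#  [ 8.  9. 10. 11.]
#  [12. 13. 14. 15.]]

# C, H, W taken from the larger-channel shape [1, 4, 2, 2].
y = reorg(x, batch=1, C=4, H=2, W=2, stride=2, forward=False)
print(y.reshape(1, 4, 2, 2)[0])
# Channel 0: [[ 0.  2.] [ 8. 10.]]   (even rows, even columns)
# Channel 1: [[ 1.  3.] [ 9. 11.]]   (even rows, odd columns)
# Channel 2: [[ 4.  6.] [12. 14.]]   (odd rows, even columns)
# Channel 3: [[ 5.  7.] [13. 15.]]   (odd rows, odd columns)
```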
This result meets the expectation of the public. Although it looks "elegant", it is different from the behavior of the reorg layer that the original author was using.
Reorg Mechanism?
I apologize that I did not make a figure for this because it is too tedious. If someone could make one for me, I would really appreciate it. To the best of my knowledge, this is the only blog post that shows correctly how the reorg layer works.
Final Remarks
At first, I thought this process was the reverse of the sub-pixel convolutional layer. But it looks like it is slightly different and more complicated. I am not sure whether the YOLO v2 implementations in frameworks such as TensorFlow implement this reorg layer correctly, because people tend to use built-in functions such as tf.depth_to_space, and I believe these functions behave slightly differently from the original reorg layer. However, this is probably not going to affect the performance of the neural network, since there is no information loss if you implement the reorg layer differently, and our neural networks are usually smart enough to figure out how to extract useful information from the right places. Probably the original author also made the implementation error I mentioned above, so that the reorganization behavior was not what people, including myself, expected, but the neural network could "correct" this "error".
If I made a mistake in understanding the author's reorg layer implementation, please do let me know.