YOLO v2 Reorg Layer Explained

Introduction

The object detection model YOLO v2, implemented in Darknet, uses a “reorg” layer in its network. However, the author did not discuss this layer at all in the paper. There are some blog posts explaining the reorg layer, such as this one and this one. Unfortunately, all of them are superficial. When you actually start reading the source code and comparing it with the illustrations in those blog posts, you will feel even more confused. So I spent a while studying the behavior of this reorg layer and documented how it really works and how to make it work in your own network.

Purpose of Reorg Layer

I think the purpose of this reorg layer is very simple: we want to combine mid-level features and high-level features to get better classification accuracy, which can be done in different ways, such as max-pooling and addition. In YOLO v2, the author used a reorg layer to reshape the output tensor so that its H and W match those of the other output tensor downstream, and the two tensors can be concatenated together.
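To make the shape bookkeeping concrete, here is a small numpy sketch of mine of the passthrough idea. The channel counts are purely illustrative (they are not necessarily the exact values from the Darknet configuration); the point is that after a stride-2 reorg, the earlier feature map has the same H and W as the later one, so the two can be concatenated along the channel axis.

import numpy as np

# Illustrative shapes only; the exact channel counts differ in the real YOLO v2 config.
mid_level = np.zeros((1, 64, 26, 26))     # earlier, higher-resolution feature map
high_level = np.zeros((1, 1024, 13, 13))  # later, lower-resolution feature map

# A stride-2 reorg turns (1, 64, 26, 26) into (1, 64 * 2 * 2, 26 // 2, 26 // 2).
mid_level_reorg = np.zeros((1, 64 * 2 * 2, 26 // 2, 26 // 2))

# Now H and W match, so the two maps can be concatenated along the channel axis.
combined = np.concatenate([mid_level_reorg, high_level], axis=1)
print(combined.shape)  # (1, 1280, 13, 13)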

How Reorg Layer Works

Basic Goals

The original implementation of the reorg layer takes a tensor of shape $[N, C, H, W]$, where $N$ is the batch size, $C$ is the number of channels, $H$ is the number of rows, and $W$ is the number of columns, re-arranges the order of the tensor elements, reshapes, and outputs a tensor whose shape is either $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ or $[N, \frac{C}{s^2}, H \times s, W \times s]$, where $s$ is a positive integer and $\frac{C}{s^2}$, $\frac{H}{s}$, and $\frac{W}{s}$ have to be positive integers as well. Although the layer supports both output shapes, for the particular YOLO v2 network architecture, a tensor of shape $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ is generated from the reorg layer in the forward propagation, and a tensor of shape $[N, \frac{C}{s^2}, H \times s, W \times s]$ is generated from the reorg layer in the backward propagation. The input and output tensors in this implementation also stick to the NCHW format.
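As a quick sanity check on these shape constraints, here is a tiny helper of my own (not part of Darknet) that computes both admissible output shapes from an input shape and a stride:

# A small helper (my own, not from Darknet) computing both admissible output shapes.
def reorg_output_shapes(N, C, H, W, s):
    assert C % (s * s) == 0 and H % s == 0 and W % s == 0
    channel_increase = (N, C * s * s, H // s, W // s)
    channel_decrease = (N, C // (s * s), H * s, W * s)
    return channel_increase, channel_decrease

print(reorg_output_shapes(16, 64, 144, 144, 2))
# ((16, 256, 72, 72), (16, 16, 288, 288))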

Confusions

In this blog post, the author only used a single-channel tensor (a matrix) for illustration. It did not say anything about the batch and multi-channel scenarios, which you will definitely see in real neural networks. If you read that author’s source code, you will also find that it does not support the scenario where $c < s^2$. In this blog post, the authors tried to incorporate multiple channels. Unfortunately, all of these blog posts are incorrect.

If you check the original implementation of the reorg layer, you will find that it has an argument named forward. What is this forward? Do we use forward = True in the forward propagation and forward = False in the back propagation? The author never mentioned it anywhere.

void reorg_cpu(float *x, int w, int h, int c, int batch, int stride, int forward, float *out)
{
    int b,i,j,k;
    int out_c = c/(stride*stride);

    for(b = 0; b < batch; ++b){
        for(k = 0; k < c; ++k){
            for(j = 0; j < h; ++j){
                for(i = 0; i < w; ++i){
                    int in_index = i + w*(j + h*(k + c*b));
                    int c2 = k % out_c;
                    int offset = k / out_c;
                    int w2 = i*stride + offset % stride;
                    int h2 = j*stride + offset / stride;
                    int out_index = w2 + w*stride*(h2 + h*stride*(c2 + out_c*b));
                    if(forward) out[out_index] = x[in_index];
                    else out[in_index] = x[out_index];
                }
            }
        }
    }
}

Basic Ideas

Regardless of its shape, a tensor is laid out in memory as a linear region. The reorg layer is a function that sets up a bijection between the positions of the elements in the input tensor and the positions of the elements in the output tensor. That is to say, given the position of an element of the input tensor in linear memory, there is a unique position of the output tensor in linear memory that maps to it. In addition to the size of the flattened input tensor, this bijective function is determined by the parameters w, h, c, batch, stride, and forward.
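To convince yourself of this, here is a short sketch of mine that checks the bijection claim on a small example, using the same index arithmetic as reorg_cpu above: every output position in linear memory is hit exactly once.

import numpy as np

# Verify that the index mapping used by reorg_cpu is a permutation of the
# flattened indices for a small example (batch=1, c=4, h=4, w=4, stride=2).
batch, c, h, w, stride = 1, 4, 4, 4, 2
out_c = c // (stride * stride)

hits = np.zeros(batch * c * h * w, dtype=int)
for b in range(batch):
    for k in range(c):
        for j in range(h):
            for i in range(w):
                c2 = k % out_c
                offset = k // out_c
                w2 = i * stride + offset % stride
                h2 = j * stride + offset // stride
                out_index = w2 + w * stride * (h2 + h * stride * (c2 + out_c * b))
                hits[out_index] += 1

assert np.all(hits == 1)  # a bijection: every output position is mapped exactly once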

How to Use Reorg Layer

I have written a simple Python program simulating the reorg process, and the following conclusions were drawn from the simulation. The reorg function ported to Python is shown below. The whole script can be downloaded from my GitHub Gist.

import numpy as np

def reorg(arrayIn, batch, C, H, W, stride, forward=False):
    arrayLen = len(arrayIn)
    arrayOut = np.zeros(arrayLen)
    out_c = C//(stride*stride)
    for b in range(batch):
        for k in range(C):
            for j in range(H):
                for i in range(W):
                    in_index = i + W*(j + H*(k + C*b))
                    c2 = k % out_c
                    offset = k // out_c
                    w2 = i*stride + offset % stride
                    h2 = j*stride + offset // stride
                    out_index = int(w2 + W*stride*(h2 + H*stride*(c2 + out_c*b)))
                    if forward:
                        arrayOut[out_index] = arrayIn[in_index]
                    else:
                        arrayOut[in_index] = arrayIn[out_index]
    return arrayOut

The forward argument in the reorg layer has nothing to do with forward propagation or back propagation. When the tensor is reorganized from shape $[N, C, H, W]$ to shape $[N, \frac{C}{s^2}, H \times s, W \times s]$ (channel decrease), forward = True is used. When the tensor is reorganized from shape $[N, C, H, W]$ to shape $[N, C \times s^2, \frac{H}{s}, \frac{W}{s}]$ (channel increase), forward = False is used. So in YOLO v2, we should use forward = False in the forward propagation because we increase the number of channels, and forward = True in the back propagation.
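As a quick toy illustration of the flag semantics (my own example, reusing the Python reorg function above and passing the input tensor shape as arguments), forward = True shrinks the channel dimension while forward = False grows it:

# Toy example: batch=1, C=4, H=4, W=4, stride=2.
x = np.arange(1 * 4 * 4 * 4)
decreased = reorg(x, batch=1, C=4, H=4, W=4, stride=2, forward=True)
print(decreased.reshape(1, 1, 8, 8).shape)   # channel decrease: (1, 1, 8, 8)
increased = reorg(x, batch=1, C=4, H=4, W=4, stride=2, forward=False)
print(increased.reshape(1, 16, 2, 2).shape)  # channel increase: (1, 16, 2, 2)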

What about the C, H, and W arguments? In the author’s implementation, C, H, and W are always taken from the shape of the input tensor in the forward propagation.

# First backward (channel increase) then forward (channel decrease)
# Mimicking author's implementation
# Reorganization reversible
# Reorganization result was not expected by the public
def backward_forward_author():

    stride = stride_MACRO

    input_batch = batch_MACRO
    input_C = C_MACRO
    input_H = H_MACRO
    input_W = W_MACRO

    output_batch = input_batch
    output_C = input_C * stride * stride
    output_H = input_H // stride
    output_W = input_W // stride

    # Parameters for reorg function
    reorg_stride = stride_MACRO
    reorg_batch = batch_MACRO
    reorg_C = input_C
    reorg_H = input_H
    reorg_W = input_W

    matrix1_linear = np.array(list(range(input_batch * input_C * input_H * input_W))).astype(np.int)
    matrix1 = np.reshape(matrix1_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix1[0,0,:,:])
    print(matrix1)

    # Reorg

    matrix2_linear = reorg(matrix1_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=False).astype(np.int)
    matrix2 = np.reshape(matrix2_linear, (output_batch, output_C, output_H, output_W))

    # print(matrix2[0,0,:,:])
    print(matrix2)

    # Reorg back
    # matrix3 should be identical to matrix1

    matrix3_linear = reorg(matrix2_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=True).astype(np.int)
    matrix3 = np.reshape(matrix3_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix3[0,0,:,:])
    print(matrix3)

    assert(np.array_equal(matrix1, matrix3))

# First forward (channel decrease) then backward (channel increase)
# Reorganization reversible
# Reorganization result was expected by the public
def forward_backward_author():

    stride = stride_MACRO

    input_batch = batch_MACRO
    input_C = C_MACRO
    input_H = H_MACRO
    input_W = W_MACRO

    output_batch = input_batch
    output_C = input_C // stride // stride
    output_H = input_H * stride
    output_W = input_W * stride

    # Parameters for reorg function
    reorg_stride = stride_MACRO
    reorg_batch = batch_MACRO
    reorg_C = input_C
    reorg_H = input_H
    reorg_W = input_W

    matrix1_linear = np.array(list(range(input_batch * input_C * input_H * input_W))).astype(np.int)
    matrix1 = np.reshape(matrix1_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix1[0,0,:,:])
    print(matrix1)

    # Reorg

    matrix2_linear = reorg(matrix1_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=True).astype(np.int)
    matrix2 = np.reshape(matrix2_linear, (output_batch, output_C, output_H, output_W))

    # print(matrix2[0,0,:,:])
    print(matrix2)

    # Reorg back
    # matrix3 should be identical to matrix1

    matrix3_linear = reorg(matrix2_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=False).astype(np.int)
    matrix3 = np.reshape(matrix3_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix3[0,0,:,:])
    print(matrix3)

    assert(np.array_equal(matrix1, matrix3))

Used this way, the reorganization behavior of the reorg layer with forward = False in YOLO v2 has no “grid division” effect and is not what those blog posts expect (see backward_forward_author()). However, the reorganization behavior with forward = True is what those blog posts expect (see forward_backward_author()). I am not sure whether the original author realized that the reorganization behaviors would differ when the shape of the input tensor is always used as the arguments to the reorg layer (please also check the Concrete Example below).

Here I propose a way to unify the reorganization behaviors: the C, H, and W used for the reorg layer should be taken from whichever tensor has the larger number of channels, regardless of whether that tensor is the input or the output. These parameters can be determined once the network architecture is fixed.

# First backward (channel increase) then forward (channel decrease)
# Reorganization reversible
# Reorganization result was expected by the public
def backward_forward_leimao():

    stride = stride_MACRO

    input_batch = batch_MACRO
    input_C = C_MACRO
    input_H = H_MACRO
    input_W = W_MACRO

    output_batch = input_batch
    output_C = input_C * stride * stride
    output_H = input_H // stride
    output_W = input_W // stride

    # Parameters for reorg function
    use_output_shape = True if output_C >= input_C else False
    reorg_stride = stride_MACRO
    reorg_batch = batch_MACRO
    reorg_C = output_C if use_output_shape else input_C
    reorg_H = output_H if use_output_shape else input_H
    reorg_W = output_W if use_output_shape else input_W

    matrix1_linear = np.array(list(range(input_batch * input_C * input_H * input_W))).astype(np.int)
    matrix1 = np.reshape(matrix1_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix1[0,0,:,:])
    print(matrix1)

    # Reorg

    matrix2_linear = reorg(matrix1_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=False).astype(np.int)
    matrix2 = np.reshape(matrix2_linear, (output_batch, output_C, output_H, output_W))

    # print(matrix2[0,0,:,:])
    print(matrix2)

    # Reorg back
    # matrix3 should be identical to matrix1

    matrix3_linear = reorg(matrix2_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=True).astype(np.int)
    matrix3 = np.reshape(matrix3_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix3[0,0,:,:])
    print(matrix3)

    assert(np.array_equal(matrix1, matrix3))


# First forward (channel decrease) then backward (channel increase)
# Reorganization reversible
# Reorganization result was expected by the public
def forward_backward_leimao():

    stride = stride_MACRO

    input_batch = batch_MACRO
    input_C = C_MACRO
    input_H = H_MACRO
    input_W = W_MACRO

    output_batch = input_batch
    output_C = input_C // stride // stride
    output_H = input_H * stride
    output_W = input_W * stride

    # Parameters for reorg function
    use_output_shape = True if output_C >= input_C else False
    reorg_stride = stride_MACRO
    reorg_batch = batch_MACRO
    reorg_C = output_C if use_output_shape else input_C
    reorg_H = output_H if use_output_shape else input_H
    reorg_W = output_W if use_output_shape else input_W

    matrix1_linear = np.array(list(range(input_batch * input_C * input_H * input_W))).astype(np.int)
    matrix1 = np.reshape(matrix1_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix1[0,0,:,:])
    print(matrix1)

    # Reorg

    matrix2_linear = reorg(matrix1_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=True).astype(np.int)
    matrix2 = np.reshape(matrix2_linear, (output_batch, output_C, output_H, output_W))

    # print(matrix2[0,0,:,:])
    print(matrix2)

    # Reorg back
    # matrix3 should be identical to matrix1

    matrix3_linear = reorg(matrix2_linear, batch=reorg_batch, C=reorg_C, H=reorg_H, W=reorg_W, stride=reorg_stride, forward=False).astype(np.int)
    matrix3 = np.reshape(matrix3_linear, (input_batch, input_C, input_H, input_W))

    # print(matrix3[0,0,:,:])
    print(matrix3)

    assert(np.array_equal(matrix1, matrix3))

It should be noted that the forward_backward_leimao() function is actually the same as forward_backward_author().
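The rule can also be distilled into a small wrapper around the reorg function above. The following is only a sketch of mine (the helper reorg_unified is not part of the original code); it picks C, H, and W from whichever shape has more channels and operates on an actual 4-D numpy array:

# Sketch only: a wrapper implementing the unified parameter choice described above.
def reorg_unified(tensor, stride, forward):
    batch, in_C, in_H, in_W = tensor.shape
    if forward:  # channel decrease
        out_C, out_H, out_W = in_C // (stride * stride), in_H * stride, in_W * stride
    else:        # channel increase
        out_C, out_H, out_W = in_C * stride * stride, in_H // stride, in_W // stride
    # Use the shape with the larger channel count for the reorg parameters.
    if out_C >= in_C:
        C, H, W = out_C, out_H, out_W
    else:
        C, H, W = in_C, in_H, in_W
    flat = reorg(tensor.ravel(), batch=batch, C=C, H=H, W=W,
                 stride=stride, forward=forward)
    return flat.reshape((batch, out_C, out_H, out_W))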

For example, suppose we have an input tensor of shape [16, 64, 144, 144], we use stride = 2, and we want an output tensor of shape [16, 256, 72, 72] from the reorg layer in the forward propagation. In the forward propagation, we should call reorg(arrayIn, batch=16, C=256, H=72, W=72, stride=2, forward=False), where arrayIn is the input tensor in the forward propagation, of shape [16, 64, 144, 144]. In the backward propagation, to calculate the gradients, we should call reorg(arrayIn, batch=16, C=256, H=72, W=72, stride=2, forward=True), where arrayIn is the input tensor in the backward propagation, of shape [16, 256, 72, 72].

Concrete Example

I prepared a concrete minimal example showing how the reorg layer works with the parameter settings mentioned above. Please run the script on my GitHub Gist to see how those implementations differ. Here I am showing the backward reorganization of the tensor (forward = False), which is the one used in the forward propagation in YOLO v2.
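The arrays below should be reproducible with the reorg function defined earlier; the following condensed sketch of mine shows the two calls (the full script is in the Gist):

# batch=2, C=4, H=6, W=6, stride=2; forward=False (channel increase), as in the YOLO v2 forward pass.
x = np.arange(2 * 4 * 6 * 6)
print(x.reshape(2, 4, 6, 6))  # the input shown below

# Author-style parameters: C, H, W taken from the input shape [2, 4, 6, 6].
author_result = reorg(x, batch=2, C=4, H=6, W=6, stride=2, forward=False)
print(author_result.astype(int).reshape(2, 16, 3, 3))  # "Author's Reorg Result"

# Unified parameters: C, H, W taken from the shape with more channels, [2, 16, 3, 3].
my_result = reorg(x, batch=2, C=16, H=3, W=3, stride=2, forward=False)
print(my_result.astype(int).reshape(2, 16, 3, 3))  # "My Reorg Result"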

Input

We have an input of shape [2, 4, 6, 6]:

[[[[  0   1   2   3   4   5]
   [  6   7   8   9  10  11]
   [ 12  13  14  15  16  17]
   [ 18  19  20  21  22  23]
   [ 24  25  26  27  28  29]
   [ 30  31  32  33  34  35]]

  [[ 36  37  38  39  40  41]
   [ 42  43  44  45  46  47]
   [ 48  49  50  51  52  53]
   [ 54  55  56  57  58  59]
   [ 60  61  62  63  64  65]
   [ 66  67  68  69  70  71]]

  [[ 72  73  74  75  76  77]
   [ 78  79  80  81  82  83]
   [ 84  85  86  87  88  89]
   [ 90  91  92  93  94  95]
   [ 96  97  98  99 100 101]
   [102 103 104 105 106 107]]

  [[108 109 110 111 112 113]
   [114 115 116 117 118 119]
   [120 121 122 123 124 125]
   [126 127 128 129 130 131]
   [132 133 134 135 136 137]
   [138 139 140 141 142 143]]]


 [[[144 145 146 147 148 149]
   [150 151 152 153 154 155]
   [156 157 158 159 160 161]
   [162 163 164 165 166 167]
   [168 169 170 171 172 173]
   [174 175 176 177 178 179]]

  [[180 181 182 183 184 185]
   [186 187 188 189 190 191]
   [192 193 194 195 196 197]
   [198 199 200 201 202 203]
   [204 205 206 207 208 209]
   [210 211 212 213 214 215]]

  [[216 217 218 219 220 221]
   [222 223 224 225 226 227]
   [228 229 230 231 232 233]
   [234 235 236 237 238 239]
   [240 241 242 243 244 245]
   [246 247 248 249 250 251]]

  [[252 253 254 255 256 257]
   [258 259 260 261 262 263]
   [264 265 266 267 268 269]
   [270 271 272 273 274 275]
   [276 277 278 279 280 281]
   [282 283 284 285 286 287]]]]

Author’s Reorg Result

The output tensor of shape [2, 16, 3, 3] from the author’s reorg layer:

[[[[  0   2   4]
   [  6   8  10]
   [ 24  26  28]]

  [[ 30  32  34]
   [ 48  50  52]
   [ 54  56  58]]

  [[ 72  74  76]
   [ 78  80  82]
   [ 96  98 100]]

  [[102 104 106]
   [120 122 124]
   [126 128 130]]

  [[  1   3   5]
   [  7   9  11]
   [ 25  27  29]]

  [[ 31  33  35]
   [ 49  51  53]
   [ 55  57  59]]

  [[ 73  75  77]
   [ 79  81  83]
   [ 97  99 101]]

  [[103 105 107]
   [121 123 125]
   [127 129 131]]

  [[ 12  14  16]
   [ 18  20  22]
   [ 36  38  40]]

  [[ 42  44  46]
   [ 60  62  64]
   [ 66  68  70]]

  [[ 84  86  88]
   [ 90  92  94]
   [108 110 112]]

  [[114 116 118]
   [132 134 136]
   [138 140 142]]

  [[ 13  15  17]
   [ 19  21  23]
   [ 37  39  41]]

  [[ 43  45  47]
   [ 61  63  65]
   [ 67  69  71]]

  [[ 85  87  89]
   [ 91  93  95]
   [109 111 113]]

  [[115 117 119]
   [133 135 137]
   [139 141 143]]]


 [[[144 146 148]
   [150 152 154]
   [168 170 172]]

  [[174 176 178]
   [192 194 196]
   [198 200 202]]

  [[216 218 220]
   [222 224 226]
   [240 242 244]]

  [[246 248 250]
   [264 266 268]
   [270 272 274]]

  [[145 147 149]
   [151 153 155]
   [169 171 173]]

  [[175 177 179]
   [193 195 197]
   [199 201 203]]

  [[217 219 221]
   [223 225 227]
   [241 243 245]]

  [[247 249 251]
   [265 267 269]
   [271 273 275]]

  [[156 158 160]
   [162 164 166]
   [180 182 184]]

  [[186 188 190]
   [204 206 208]
   [210 212 214]]

  [[228 230 232]
   [234 236 238]
   [252 254 256]]

  [[258 260 262]
   [276 278 280]
   [282 284 286]]

  [[157 159 161]
   [163 165 167]
   [181 183 185]]

  [[187 189 191]
   [205 207 209]
   [211 213 215]]

  [[229 231 233]
   [235 237 239]
   [253 255 257]]

  [[259 261 263]
   [277 279 281]
   [283 285 287]]]]

This result is not what those blog posts were expecting, which means that all of them got the reorg behavior wrong.

My Reorg Result

The output tensor of shape [2, 16, 3, 3] from my reorg layer:

[[[[  0   2   4]
   [ 12  14  16]
   [ 24  26  28]]

  [[ 36  38  40]
   [ 48  50  52]
   [ 60  62  64]]

  [[ 72  74  76]
   [ 84  86  88]
   [ 96  98 100]]

  [[108 110 112]
   [120 122 124]
   [132 134 136]]

  [[  1   3   5]
   [ 13  15  17]
   [ 25  27  29]]

  [[ 37  39  41]
   [ 49  51  53]
   [ 61  63  65]]

  [[ 73  75  77]
   [ 85  87  89]
   [ 97  99 101]]

  [[109 111 113]
   [121 123 125]
   [133 135 137]]

  [[  6   8  10]
   [ 18  20  22]
   [ 30  32  34]]

  [[ 42  44  46]
   [ 54  56  58]
   [ 66  68  70]]

  [[ 78  80  82]
   [ 90  92  94]
   [102 104 106]]

  [[114 116 118]
   [126 128 130]
   [138 140 142]]

  [[  7   9  11]
   [ 19  21  23]
   [ 31  33  35]]

  [[ 43  45  47]
   [ 55  57  59]
   [ 67  69  71]]

  [[ 79  81  83]
   [ 91  93  95]
   [103 105 107]]

  [[115 117 119]
   [127 129 131]
   [139 141 143]]]


 [[[144 146 148]
   [156 158 160]
   [168 170 172]]

  [[180 182 184]
   [192 194 196]
   [204 206 208]]

  [[216 218 220]
   [228 230 232]
   [240 242 244]]

  [[252 254 256]
   [264 266 268]
   [276 278 280]]

  [[145 147 149]
   [157 159 161]
   [169 171 173]]

  [[181 183 185]
   [193 195 197]
   [205 207 209]]

  [[217 219 221]
   [229 231 233]
   [241 243 245]]

  [[253 255 257]
   [265 267 269]
   [277 279 281]]

  [[150 152 154]
   [162 164 166]
   [174 176 178]]

  [[186 188 190]
   [198 200 202]
   [210 212 214]]

  [[222 224 226]
   [234 236 238]
   [246 248 250]]

  [[258 260 262]
   [270 272 274]
   [282 284 286]]

  [[151 153 155]
   [163 165 167]
   [175 177 179]]

  [[187 189 191]
   [199 201 203]
   [211 213 215]]

  [[223 225 227]
   [235 237 239]
   [247 249 251]]

  [[259 261 263]
   [271 273 275]
   [283 285 287]]]]

This result meets the public’s expectation. Although it looks more “elegant”, it is different from the behavior of the reorg layer that the original author was actually using.

Reorg Mechanism?

I apologize for not making a figure for this; it would be too tedious. If someone could make one for me, I would be very grateful. To the best of my knowledge, this is the only blog post showing correctly how the reorg layer works.

Final Remarks

At first, I thought this process was the reverse of the sub-pixel convolutional layer, but it looks like it is slightly different and more complicated. I am not sure whether the YOLO v2 implementations in frameworks such as TensorFlow implement this reorg layer correctly, because people tend to use built-in functions such as tf.depth_to_space, and I believe these functions behave slightly differently from the original reorg layer. However, this is probably not going to affect the performance of the neural network, since there is no information loss even if you implement the reorg layer differently, and the neural network is usually smart enough to figure out how to extract useful information from the right places. Probably the original author also made the implementation mistake I mentioned above, so that the reorganization behavior was not what people, including myself, expected, but the neural network could “correct” this “error”.

If I did make a mistake in understanding the author’s reorganization layer implementation, please do let me know.

Author

Lei Mao

Posted on

04-24-2019

Updated on

04-24-2019
