Neural Network Batch Normalization Fusion
Introduction
Neural network inference is one of the most critical topics for deep learning productization and commercialization. To execute neural network inference, kernels are invoked for neural network layers in order to compute the output tensors given the input tensors. Each kernel call incurs some overhead time. If many kernels are invoked for a neural network, the total overhead time can become very significant in a latency-constrained system. So to achieve high throughput and low latency for neural network inference, the rule of thumb is to have fewer large kernel calls instead of many small kernel calls.
Given a pretrained neural network, all the layers are fixed. In the worst scenario, each layer invokes one kernel, and the total overhead time can be very significant for large neural networks. In order to reduce the number of kernel calls, we have to fuse layers so that one kernel call performs the computation of many neural network layers.
Neural network layer fusion can usually be categorized into horizontal layer fusion and vertical layer fusion. The batch normalization layer is often fused with the convolutional layer before it, which is an instance of vertical layer fusion.
In this blog post, I would like to discuss the mathematics of batch normalization fusion.
Batch Normalization Fusion
Batch normalization has been explained in detail in my previous article “Batch Normalization Explained”.
Convolution and Batch Normalization Fusion
The batch normalization layer has four parameters, $\mu$, $\sigma^2$, $\gamma$, and $\beta$. Specifically for the batch normalization layer after the convolutional layer, $\mu \in \mathbb{R}^{C^{\prime}}$, $\sigma^2 \in \mathbb{R}^{C^{\prime}}$, $\gamma \in \mathbb{R}^{C^{\prime}}$, and $\beta \in \mathbb{R}^{C^{\prime}}$, where $C^{\prime}$ is the number of output channels from the previous convolutional layer.
Suppose $X \in \mathbb{R}^{N \times H \times W \times C}$ is the input tensor to the convolutional layer, $W \in \mathbb{R}^{C \times h \times w \times C^{\prime}}$ is the weight parameter, $b \in \mathbb{R}^{C^{\prime}}$ is the bias parameter, and $Y \in \mathbb{R}^{N \times H \times W \times C^{\prime}}$ is the output tensor from the convolutional layer, assuming “same” padding and a stride of $1$.
$$
Y = X \ast W + b
$$
where $\ast$ is the convolutional operator.
Suppose $X^{\prime}$ is the input tensor to the batch normalization layer, i.e., the output from the previous convolutional layer, $X^{\prime} = Y$. For each channel $c^{\prime} \in \{1,2, \cdots, C^{\prime}\}$, we have
$$
\hat{X}^{\prime}_{i,j,k,c^{\prime}} = \frac{X^{\prime}_{i,j,k,c^{\prime}}-\mu_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}}
$$
$$
Y^{\prime}_{i,j,k,c^{\prime}} = \gamma_{c^{\prime}} \hat{X}^{\prime}_{i,j,k,c^{\prime}} + \beta_{c^{\prime}}
$$
where $Y^{\prime}$ is the output tensor from the batch normalization layer and $Y^{\prime} \in \mathbb{R}^{N \times H \times W \times C^{\prime}}$, the same shape as its input.
Let’s put the mathematics of the convolutional layer and the batch normalization layer together.
$$
\begin{align}
Y^{\prime}_{i,j,k,c^{\prime}} &= \gamma_{c^{\prime}} \hat{X}^{\prime}_{i,j,k,c^{\prime}} + \beta_{c^{\prime}} \\
&= \gamma_{c^{\prime}} \frac{X^{\prime}_{i,j,k,c^{\prime}}-\mu_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} + \beta_{c^{\prime}} \\
&= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} X^{\prime}_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\
&= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} Y_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\
&= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} (X \ast W + b)_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\
&= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \bigg((X \ast W )_{i,j,k,c^{\prime}} + b_{c^{\prime}} \bigg) + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\
&= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} (X \ast W )_{i,j,k,c^{\prime}} + \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} b_{c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\
&= \bigg(X \ast \Big( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W \Big) \bigg)_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} + \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \Big( b_{c^{\prime}} - \mu_{c^{\prime}} \Big) \bigg)\\
&= \bigg(X \ast \Big( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W \Big) + \beta + \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \Big( b - \mu \Big) \bigg)_{i,j,k,c^{\prime}}\\
\end{align}
$$
Therefore, the fusion of the convolutional layer and the batch normalization layer is just a new convolutional layer, where the input tensor $X$ remains unchanged while the weight parameter and the bias parameter change.
$$
Y^{\prime} = X \ast W^{\prime} + b^{\prime}
$$
where
$$
W^{\prime} = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W
$$
and
$$
b^{\prime} = \beta + \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \Big( b - \mu \Big)
$$
More specifically,
$$
W_{:,:,:,c^{\prime}}^{\prime} = \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} W_{:,:,:,c^{\prime}}
$$
and
$$
b_{c^{\prime}}^{\prime} = \beta_{c^{\prime}} + \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \Big( b_{c^{\prime}} - \mu_{c^{\prime}} \Big)
$$
Now, we can use a single convolution kernel call to finish the computation of the original convolution kernel call and the original batch normalization kernel call, without any additional floating-point computation at inference time.
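To make this concrete, here is a minimal PyTorch sketch of the fusion, assuming the NCHW tensor layout and the $(C^{\prime}, C, h, w)$ weight layout that PyTorch uses (rather than the NHWC notation above), and inference-mode batch normalization statistics. The helper name fuse_conv_bn is my own and not a PyTorch built-in.

```python
import torch
import torch.nn as nn


def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an inference-mode BatchNorm2d into the preceding Conv2d."""
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, kernel_size=conv.kernel_size,
        stride=conv.stride, padding=conv.padding, dilation=conv.dilation,
        groups=conv.groups, bias=True,
    )
    # scale = gamma / sqrt(sigma^2 + eps), one value per output channel.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    # W' = scale * W, applied along the output-channel dimension
    # (PyTorch stores the weight as (C', C, h, w)).
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    # b' = beta + scale * (b - mu).
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = bn.bias.data + scale * (conv_bias - bn.running_mean)
    return fused


# Numerical check in inference mode.
torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True).eval()
bn = nn.BatchNorm2d(8).eval()
bn.running_mean.uniform_(-1.0, 1.0)
bn.running_var.uniform_(0.5, 2.0)
bn.weight.data.uniform_(0.5, 1.5)
bn.bias.data.uniform_(-1.0, 1.0)
x = torch.randn(2, 3, 16, 16)
fused_conv = fuse_conv_bn(conv, bn)
assert torch.allclose(bn(conv(x)), fused_conv(x), atol=1e-5)
```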
Batch Normalization and Convolution (Without Padding) Fusion
Suppose $X \in \mathbb{R}^{N \times H \times W \times C}$ is the input tensor to the batch normalization layer. $\mu \in \mathbb{R}^{C}$, $\sigma^2 \in \mathbb{R}^{C}$, $\gamma \in \mathbb{R}^{C}$, $\beta \in \mathbb{R}^{C}$, $\epsilon \in \mathbb{R}$.
$$
\hat{X} = \frac{X-\mu}{\sqrt{\sigma^2 + \epsilon}}
$$
$$
Y = \gamma \hat{X} + \beta
$$
where $Y$ is the output tensor from the batch normalization layer and $Y \in \mathbb{R}^{N \times H \times W \times C}$.
For each channel $c \in \{1,2, \cdots, C\}$ specifically, we have
$$
\hat{X}_{i,j,k,c} = \frac{X_{i,j,k,c}-\mu_{c}}{\sqrt{\sigma_{c}^2 + \epsilon}}
$$
$$
Y_{i,j,k,c} = \gamma_{c} \hat{X}_{i,j,k,c} + \beta_{c}
$$
Suppose $X^{\prime} \in \mathbb{R}^{N \times H \times W \times C}$ is the input tensor to the convolutional layer, $X^{\prime} = Y$, $W^{\prime} \in \mathbb{R}^{C \times h \times w \times C^{\prime}}$ is the weight parameter, and $b^{\prime} \in \mathbb{R}^{C^{\prime}}$ is the bias parameter, and $Y^{\prime}$ is the output tensor from the convolutional layer
$$
Y^{\prime} = X^{\prime} \ast W^{\prime} + b^{\prime}
$$
where $\ast$ is the convolutional operator.
Let’s put the mathematics of the batch normalization layer and the convolutional layer together. Let’s first assume there is no padding for the convolutional layer.
$$
\begin{align}
Y^{\prime} &= X^{\prime} \ast W^{\prime} + b^{\prime}\\
&= Y \ast W^{\prime} + b^{\prime}\\
&= \left( \gamma \hat{X} + \beta \right) \ast W^{\prime} + b^{\prime}\\
&= \left( \gamma \left( \frac{X-\mu}{\sqrt{\sigma^2 + \epsilon}} \right) + \beta \right) \ast W^{\prime} + b^{\prime}\\
&= \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} X + \left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \right) \ast W^{\prime} + b^{\prime}\\
\end{align}
$$
Notice that $\left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right)$ is a term that’s broadcasted such that $\left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \in \mathbb{R}^{N \times H \times W \times C}$, and the following relationship is valid for the convolutional operator.
$$
(A + B) \ast W = A \ast W + B \ast W
$$
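As a quick sanity check (not part of the fusion itself), this distributivity can be verified numerically with torch.nn.functional.conv2d and no padding:

```python
import torch
import torch.nn.functional as F

# Sanity check: (A + B) * W == A * W + B * W for the convolution operator,
# with no padding, matching the assumption in this section.
torch.manual_seed(0)
A = torch.randn(1, 3, 8, 8)
B = torch.randn(1, 3, 8, 8)
W = torch.randn(4, 3, 3, 3)
lhs = F.conv2d(A + B, W)
rhs = F.conv2d(A, W) + F.conv2d(B, W)
assert torch.allclose(lhs, rhs, atol=1e-5)
```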
Thus, we continue to have
$$
\begin{align}
Y^{\prime}
&= \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} X + \left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \right) \ast W^{\prime} + b^{\prime}\\
&= \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} X \right) \ast W^{\prime} + \left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime} + b^{\prime}\\
&= X \ast \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W^{\prime} \right) + \left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime} + b^{\prime}\\
\end{align}
$$
Notice that $\left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime}$ is a constant tensor that can be computed offline. The values at the same channel of this tensor are exactly the same. Therefore, it can be reduced to a vector that can be merged into $b^{\prime}$.
Therefore, assuming no padding, the fusion of the batch normalization layer and the convolutional layer is also just a new convolutional layer.
$$
Y^{\prime} = X^{\prime} \ast W + b
$$
where
$$
W = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W^{\prime}
$$
$$
b = \left(\left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime}\right)_{1,1,1,:} + b^{\prime}
$$
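The following sketch checks this fusion numerically in PyTorch, again assuming the NCHW layout, PyTorch's $(C^{\prime}, C, h, w)$ weight layout, inference-mode batch normalization statistics, and a convolution without padding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
bn = nn.BatchNorm2d(3).eval()
bn.running_mean.uniform_(-1.0, 1.0)
bn.running_var.uniform_(0.5, 2.0)
bn.weight.data.uniform_(0.5, 1.5)
bn.bias.data.uniform_(-1.0, 1.0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=0, bias=True).eval()  # no padding
x = torch.randn(2, 3, 16, 16)

scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(sigma^2 + eps)
shift = bn.bias - scale * bn.running_mean                # beta - gamma * mu / sqrt(sigma^2 + eps)

# New weight: scale W' along its input-channel dimension (dim 1 in PyTorch's layout).
fused_weight = conv.weight.data * scale.view(1, -1, 1, 1)
# New bias: without padding, the constant tensor (beta - gamma * mu / sqrt(...)) * W'
# reduces to one value per output channel, which is merged into b'.
fused_bias = conv.bias.data + (conv.weight.data * shift.view(1, -1, 1, 1)).sum(dim=(1, 2, 3))

reference = conv(bn(x))
fused = F.conv2d(x, fused_weight, bias=fused_bias)
assert torch.allclose(reference, fused, atol=1e-5)
```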
Batch Normalization and Convolution (With Padding) Fusion
Now, what if there is padding in the convolutional layer? In fact, it is quite common for neural networks to have padding in the convolutional layer.
Here are some problems.
$$
\begin{align}
\left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} X \right) \ast W^{\prime} &=
X \ast \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W^{\prime} \right) \\
\end{align}
$$
is not always valid if there is padding. Note that a similar equation
$$
\begin{align}
\frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \left( X \ast W^{\prime} \right) &=
X \ast \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W^{\prime} \right) \\
\end{align}
$$
is always valid. This is also used in the convolutional layer and batch normalization layer fusion.
The values at the same channel of the constant tensor $\left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime}$ are no longer exactly the same because of the padding. Therefore, it cannot be reduced to a vector anymore.
Suppose the padding value at certain position in the original convolutional operation $\ast$ is $v$ and the padding value at the same position in the fused convolutional operation $\ast^{\prime}$ is $v^{\prime}$.
$$
\begin{align}
Y^{\prime}
&= \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} X \right) \ast W^{\prime} + \left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime} + b^{\prime}\\
&= X \ast^{\prime} \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W^{\prime} \right) + \left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime} + b^{\prime}\\
\end{align}
$$
When $v^{\prime} = \frac{\sqrt{\sigma^2 + \epsilon}}{\gamma} v$, we can transform the convolution operation.
$$
\begin{align}
\left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} X \right) \ast W^{\prime} &=
X \ast^{\prime} \left( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W^{\prime} \right) \\
\end{align}
$$
However, the problem that the constant tensor $\left( \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}} \right) \ast W^{\prime}$ cannot be reduced to a vector remains, unless
$$
\frac{\sqrt{\sigma^2 + \epsilon}}{\gamma} v = \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}}
$$
which almost never happens in practice.
Therefore, if there is padding, the batch normalization and convolutional layers cannot be fused into a single convolutional layer. They can, however, be transformed into a convolutional layer followed by an element-wise addition layer. Depending on the hardware and software implementation, this may or may not be faster than the original batch normalization and convolutional layers.
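Below is a rough sketch of this convolution-plus-element-wise-addition decomposition in PyTorch, assuming the common case of zero padding ($v = 0$), so that the modified padding value $v^{\prime}$ is also zero and an ordinary zero-padded convolution can be reused. The offset tensor is precomputed offline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
bn = nn.BatchNorm2d(3).eval()
bn.running_mean.uniform_(-1.0, 1.0)
bn.running_var.uniform_(0.5, 2.0)
bn.weight.data.uniform_(0.5, 1.5)
bn.bias.data.uniform_(-1.0, 1.0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True).eval()  # zero padding
x = torch.randn(2, 3, 16, 16)

scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(sigma^2 + eps)
shift = bn.bias - scale * bn.running_mean                # beta - gamma * mu / sqrt(sigma^2 + eps)

# Offline: the offset tensor (beta - gamma * mu / sqrt(...)) * W'. Because of the
# zero padding, its border values differ from its interior values, so it stays a
# full (1, C', H, W) tensor instead of collapsing to a per-channel bias vector.
shift_map = shift.view(1, -1, 1, 1) * torch.ones(1, 3, 16, 16)
offset = F.conv2d(shift_map, conv.weight.data, bias=None, padding=1)

# Online: one convolution with the rescaled weight, followed by an element-wise
# addition of the precomputed offset (broadcast over the batch dimension).
fused_weight = conv.weight.data * scale.view(1, -1, 1, 1)
fused = F.conv2d(x, fused_weight, bias=conv.bias.data, padding=1) + offset

reference = conv(bn(x))
assert torch.allclose(reference, fused, atol=1e-5)
```

Note that the precomputed offset tensor has the spatial size of the convolution output, so it has to be recomputed if the input spatial size changes, which is another practical limitation of this decomposition.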
Conclusion
We have mathematically derived the layer fusion for the convolutional layer and the batch normalization layer. Other vertical layer fusions can be derived in a similar fashion, such as the fusion of the convolutional layer, the batch normalization layer, and the ReLU layer.
Notes
PyTorch has utility functions, such as torch.quantization.fuse_modules, that allow us to fuse some layers, including the convolution and batch normalization fusion, for inference. In addition, turning on do_constant_folding in torch.onnx.export allows us to fuse some layers, including the convolution and batch normalization fusion, in the exported ONNX model.
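For reference, here is a minimal usage sketch. The module, its submodule names, and the output file name are placeholders, and the exact module paths may differ across PyTorch versions (for example, torch.ao.quantization in newer releases).

```python
import torch
import torch.nn as nn


class ConvBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)

    def forward(self, x):
        return self.bn(self.conv(x))


model = ConvBN().eval()
x = torch.randn(1, 3, 16, 16)

# Fuse the conv and batch normalization modules for inference.
fused_model = torch.quantization.fuse_modules(model, [["conv", "bn"]])
assert torch.allclose(model(x), fused_model(x), atol=1e-5)

# Constant folding during ONNX export can also fold the batch normalization
# parameters into the convolution weights in the exported graph.
torch.onnx.export(model, x, "conv_bn.onnx", do_constant_folding=True)
```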