### Introduction

Neural network inference is a critical topic for deep learning productization and commercialization. To execute neural network inference, kernels are invoked for the neural network layers to compute the output tensors from the input tensors. Each kernel call incurs some overhead. If many kernels are invoked for a neural network, the total overhead can become significant in a latency-constrained system. So, to achieve high throughput and low latency for neural network inference, the rule of thumb is to make a few large kernel calls instead of many small kernel calls.

Given a pretrained neural network, all the layers are fixed. In the worst case, each layer invokes one kernel, and the total overhead can be very significant for a large neural network. To reduce the number of kernel calls, we have to fuse layers so that one kernel call performs the computation for multiple neural network layers.

Neural network layer fusion can usually be categorized into horizontal layer fusion and vertical layer fusion. For example, a batch normalization layer is often fused with the convolutional layer that precedes it; this fusion belongs to vertical layer fusion.

In this blog post, I would like to discuss the mathematics behind batch normalization fusion.

### Batch Normalization Fusion

Batch normalization has been explained in detail in my previous article “Batch Normalization Explained”. The batch normalization layer has four parameters, $\mu$, $\sigma^2$, $\gamma$, and $\beta$. Specifically for the batch normalization layer after the convolutional layer, $\mu \in \mathbb{R}^{C^{\prime}}$, $\sigma^2 \in \mathbb{R}^{C^{\prime}}$, $\gamma \in \mathbb{R}^{C^{\prime}}$, and $\beta \in \mathbb{R}^{C^{\prime}}$, where $C^{\prime}$ is the number of output channels from the previous convolutional layer.

Suppose $X \in \mathbb{R}^{N \times H \times W \times C}$ is the input tensor to the convolutional layer, $W \in \mathbb{R}^{C \times h \times w \times C^{\prime}}$ is the weight parameter, $b \in \mathbb{R}^{C^{\prime}}$ is the bias parameter, and $Y \in \mathbb{R}^{N \times H \times W \times C^{\prime}}$ is the output tensor from the convolutional layer, assuming “same” padding and a stride of $1$.

\[Y = X \ast W + b\]where $\ast$ is the convolutional operator.

Suppose $X^{\prime}$ is the input tensor to the batch normalization layer, i.e., the output from the previous convolutional layer, $X^{\prime} = Y$. For each channel $c^{\prime} \in \{1,2, \cdots, C^{\prime}\}$, we have

\[\hat{X}^{\prime}_{i,j,k,c^{\prime}} = \frac{X^{\prime}_{i,j,k,c^{\prime}}-\mu_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}}\] \[Y^{\prime}_{i,j,k,c^{\prime}} = \gamma_{c^{\prime}} \hat{X}^{\prime}_{i,j,k,c^{\prime}} + \beta_{c^{\prime}}\]where $Y^{\prime}$ is the output tensor from the batch normalization layer. Since batch normalization preserves the tensor shape, $Y^{\prime} \in \mathbb{R}^{N \times H \times W \times C^{\prime}}$.
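The per-channel normalization and affine transform above can be sketched in a few lines of PyTorch. This is a minimal illustration with made-up parameter values (the channel count and tensor sizes are arbitrary); broadcasting over the last (channel) axis of an NHWC tensor applies each scalar parameter to its channel. The final line checks the observation exploited by the fusion: batch normalization at inference is just a per-channel scale and shift.

```python
import torch

# Hypothetical per-channel batch normalization parameters for C' = 4 channels.
C_out = 4
mu = torch.randn(C_out)          # running mean
var = torch.rand(C_out) + 0.5    # running variance
gamma = torch.randn(C_out)       # scale parameter
beta = torch.randn(C_out)        # shift parameter
eps = 1e-5

# NHWC input tensor, matching the notation in the text.
x = torch.randn(2, 8, 8, C_out)

# Channel-wise normalization followed by the affine transform.
x_hat = (x - mu) / torch.sqrt(var + eps)
y = gamma * x_hat + beta

# At inference, the whole layer collapses to one scale and one shift per channel.
scale = gamma / torch.sqrt(var + eps)
shift = beta - scale * mu
assert torch.allclose(y, scale * x + shift, atol=1e-6)
```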

Let’s put the mathematics of the convolutional layer and the batch normalization layer together.

\[\begin{align} Y^{\prime}_{i,j,k,c^{\prime}} &= \gamma_{c^{\prime}} \hat{X}^{\prime}_{i,j,k,c^{\prime}} + \beta_{c^{\prime}} \\ &= \gamma_{c^{\prime}} \frac{X^{\prime}_{i,j,k,c^{\prime}}-\mu_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} + \beta_{c^{\prime}} \\ &= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} X^{\prime}_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\ &= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} Y_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\ &= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} (X \ast W + b)_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\ &= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \bigg((X \ast W )_{i,j,k,c^{\prime}} + b_{c^{\prime}} \bigg) + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\ &= \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} (X \ast W )_{i,j,k,c^{\prime}} + \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} b_{c^{\prime}} + \bigg( \beta_{c^{\prime}} - \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \mu_{c^{\prime}} \bigg)\\ &= \bigg(X \ast \Big( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W \Big) \bigg)_{i,j,k,c^{\prime}} + \bigg( \beta_{c^{\prime}} + \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \Big( b_{c^{\prime}} - \mu_{c^{\prime}} \Big) \bigg)\\ &= \bigg(X \ast \Big( \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W \Big) + \beta + \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \Big( b - \mu \Big) \bigg)_{i,j,k,c^{\prime}}\\ \end{align}\]Therefore, the fusion of the convolutional layer and 
the batch normalization layer is simply a new convolutional layer, where the input tensor $X$ remains unchanged while the weight parameter and the bias parameter change.

\[Y^{\prime} = X \ast W^{\prime} + b^{\prime}\]where

\[W^{\prime} = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W\]and

\[b^{\prime} = \beta + \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \Big( b - \mu \Big)\]More specifically,

\[W_{:,:,:,c^{\prime}}^{\prime} = \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} W_{:,:,:,c^{\prime}}\]and

\[b_{c^{\prime}}^{\prime} = \beta_{c^{\prime}} + \frac{\gamma_{c^{\prime}}}{\sqrt{\sigma_{c^{\prime}}^2 + \epsilon}} \Big( b_{c^{\prime}} - \mu_{c^{\prime}} \Big)\]Now, we can use a single convolution kernel call to perform the computation of the original convolution kernel call and the original batch normalization kernel call, without any additional floating-point operations.
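The derivation can be verified numerically. The sketch below builds a toy `Conv2d` + `BatchNorm2d` pair in PyTorch (which uses NCHW layout and weight shape $C^{\prime} \times C \times h \times w$ rather than the NHWC notation above, but the per-output-channel fusion is identical), folds the batch normalization parameters into a new convolutional layer using the formulas for $W^{\prime}$ and $b^{\prime}$, and checks that the outputs agree. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy convolution followed by batch normalization, in eval mode so that
# the BN layer uses its running statistics.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(8).eval()

# Give the BN layer non-trivial running statistics and affine parameters.
bn.running_mean.uniform_(-1.0, 1.0)
bn.running_var.uniform_(0.5, 1.5)
bn.weight.data.uniform_(0.5, 1.5)   # gamma
bn.bias.data.uniform_(-1.0, 1.0)    # beta

# Fold BN into the convolution:
#   W' = (gamma / sqrt(var + eps)) * W
#   b' = beta + (gamma / sqrt(var + eps)) * (b - mu)
with torch.no_grad():
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused_conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
    # Weight shape is (C', C, h, w), so the per-channel scale broadcasts
    # over the first axis.
    fused_conv.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    fused_conv.bias.copy_(bn.bias + scale * (conv.bias - bn.running_mean))

x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    y_ref = bn(conv(x))
    y_fused = fused_conv(x)
assert torch.allclose(y_ref, y_fused, atol=1e-5)
```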

### Conclusion

We have mathematically derived the layer fusion for the convolutional layer and the batch normalization layer. Other vertical layer fusions can be derived similarly, such as the fusion of a convolutional layer, a batch normalization layer, and a ReLU layer.

### Notes

PyTorch has utility functions, such as `torch.quantization.fuse_modules`, that allow us to fuse some layers, including the convolution and batch normalization fusion, for inference. In addition, turning on `do_constant_folding` in `torch.onnx.export` allows us to fuse some layers, including the convolution and batch normalization fusion, in the exported ONNX model.
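As a sketch of the PyTorch utility, the example below fuses a convolution and a batch normalization layer with `torch.quantization.fuse_modules`. The model class and the module names `"conv"` and `"bn"` are illustrative, not prescribed by the API; fusion for inference requires the model to be in eval mode.

```python
import torch
import torch.nn as nn
from torch.quantization import fuse_modules

# A minimal Conv-BN model; the module names "conv" and "bn" are
# illustrative and are referenced by name in the fusion call below.
class ConvBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)

    def forward(self, x):
        return self.bn(self.conv(x))

model = ConvBN().eval()  # fusion for inference requires eval mode
fused = fuse_modules(model, [["conv", "bn"]])

# After fusion, the BN module is replaced with an identity and the
# convolution holds the folded parameters, so the outputs agree.
x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    assert torch.allclose(model(x), fused(x), atol=1e-5)
```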