Neural Network 1 x 1 Convolution Horizontal Fusion
Introduction
Neural network inference is the most critical topic for deep learning productization and commercialization. To execute neural network inference, kernels are invoked for the neural network layers in order to compute the output tensors from the input tensors. Each kernel call brings some overhead time. If many kernels are invoked for a neural network, the total overhead time can become very significant in a latency-constrained system. So, to achieve high throughput and low latency for neural network inference, the rule of thumb is to have a few large kernel calls instead of many small kernel calls.
Given a pretrained neural network, all the layers are fixed. In the worst scenario, each layer invokes its own kernel, and the total overhead time becomes very significant for large neural networks. To reduce the number of kernel calls, we have to fuse layers so that one kernel call performs the computation of many neural network layers.
Neural network layer fusion can usually be categorized into horizontal layer fusion and vertical layer fusion. In my previous blog post “Neural Network Batch Normalization Fusion”, we discussed vertical layer fusion, batch normalization fusion in particular. Vertical layer fusions are very common in neural network inference optimizations. Horizontal layer fusions, however, are comparatively rare. In this blog post, I would like to discuss horizontal layer fusion, and more specifically the horizontal fusion of 1 x 1 convolutions.
Inception Neural Networks
Neural networks, such as Google’s Inception neural networks, sometimes apply branched convolutions to a single input tensor, i.e., several convolution layers consume the same input in parallel.
[Figure: Inception module (with dimension reduction)]
Note that the inception module (with dimension reduction) has four branches from a single input. Among three of the four branches, a 1 x 1 convolution is applied directly to the same input tensor. Because these 1 x 1 convolutions share exactly the same input, they are natural candidates for horizontal fusion, as sketched below.
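To make the branch structure concrete, here is a minimal PyTorch sketch of such a module. The channel sizes are illustrative only (borrowed from the GoogLeNet “inception (3a)” configuration) and are not required for the fusion argument:

```python
import torch
import torch.nn as nn


class InceptionModule(nn.Module):
    """Simplified Inception module (with dimension reduction).

    Branches 1-3 each begin with a 1 x 1 convolution applied directly
    to the same input tensor, which is what makes them fusible.
    """

    def __init__(self, in_channels: int = 192) -> None:
        super().__init__()
        # Branch 1: 1 x 1 convolution.
        self.branch1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        # Branch 2: 1 x 1 reduction followed by 3 x 3 convolution.
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        # Branch 3: 1 x 1 reduction followed by 5 x 5 convolution.
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        # Branch 4: 3 x 3 max pooling followed by 1 x 1 projection.
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The three 1 x 1 convolutions in branches 1-3 all consume x
        # directly and could be served by one fused kernel call.
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)],
            dim=1,
        )
```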
Convolution Fusion
Suppose we have an input tensor $X \in \mathbb{R}^{C \times HW}$, obtained by flattening the two spatial dimensions $H$ and $W$ of a feature map with $C$ channels, and two 1 x 1 convolution branches that both consume $X$:

$$
Y_1 = W_1 X + b_1 \mathbf{1}^{\top}, \qquad Y_2 = W_2 X + b_2 \mathbf{1}^{\top},
$$

where $W_1 \in \mathbb{R}^{C_1 \times C}$ and $W_2 \in \mathbb{R}^{C_2 \times C}$ are the convolution weights, $b_1 \in \mathbb{R}^{C_1}$ and $b_2 \in \mathbb{R}^{C_2}$ are the biases, and $\mathbf{1} \in \mathbb{R}^{HW}$ is a vector of ones that broadcasts the biases across all spatial locations.

The two matrix products make explicit that a 1 x 1 convolution is simply a linear map applied independently at every spatial location; the outputs $Y_1 \in \mathbb{R}^{C_1 \times HW}$ and $Y_2 \in \mathbb{R}^{C_2 \times HW}$ are reshaped back to $C_1 \times H \times W$ and $C_2 \times H \times W$ afterwards.

Consider the situation in the Inception module: the two branches consume exactly the same input $X$ and differ only in their weights and biases, yet they are executed as two separate kernel calls.

Now let’s start the horizontal layer fusion optimization. We define a new matrix $W \in \mathbb{R}^{(C_1 + C_2) \times C}$ by concatenating the two weight matrices along the output channel dimension,

$$
W = \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}.
$$

We further define another new vector $b \in \mathbb{R}^{C_1 + C_2}$ by concatenating the two biases,

$$
b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}.
$$

The fused 1 x 1 convolution with weight $W$ and bias $b$ then computes

$$
Y = W X + b \mathbf{1}^{\top} = \begin{bmatrix} W_1 X + b_1 \mathbf{1}^{\top} \\ W_2 X + b_2 \mathbf{1}^{\top} \end{bmatrix} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix},
$$

which is mathematically equivalent to running the two branches separately, but costs only one kernel call. In order to obtain each $Y_i$, we simply slice the fused output $Y$ along the channel dimension: $Y_1$ consists of the first $C_1$ rows of $Y$ and $Y_2$ of the remaining $C_2$ rows. Slicing along the channel dimension is cheap compared to an extra kernel call.
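The equivalence is easy to check numerically. The following is a minimal sketch in PyTorch, assuming illustrative channel sizes $C = 64$, $C_1 = 96$, and $C_2 = 16$; it builds the fused convolution by concatenating weights and biases along the output channel dimension and verifies that slicing the fused output recovers the branch outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two independent 1 x 1 convolution branches on the same input.
conv1 = nn.Conv2d(in_channels=64, out_channels=96, kernel_size=1)
conv2 = nn.Conv2d(in_channels=64, out_channels=16, kernel_size=1)

# Fused 1 x 1 convolution: concatenate the weights and the biases
# along the output channel dimension (dim 0 of the weight tensor).
conv_fused = nn.Conv2d(in_channels=64, out_channels=96 + 16, kernel_size=1)
with torch.no_grad():
    conv_fused.weight.copy_(torch.cat([conv1.weight, conv2.weight], dim=0))
    conv_fused.bias.copy_(torch.cat([conv1.bias, conv2.bias], dim=0))

x = torch.randn(1, 64, 28, 28)
y = conv_fused(x)  # One kernel call instead of two.

# Slicing the fused output along the channel dimension recovers the
# outputs of the two original branches.
y1, y2 = y[:, :96], y[:, 96:]
assert torch.allclose(y1, conv1(x), atol=1e-6)
assert torch.allclose(y2, conv2(x), atol=1e-6)
```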
More General Convolution Fusion?
Theoretically, all the convolutions could be equivalently transformed to matrix multiplications. However, except for 1 x 1 convolutions, this transformation (the im2col trick) requires rearranging and duplicating the input tensor in memory, so it does not come for free. That said, the concatenation argument above still holds for convolutions of any kernel size, as long as they consume the same input and share the same kernel size, stride, padding, and dilation: their filters can be stacked along the output channel dimension in exactly the same way. Convolutions with different configurations, however, have incompatible weight tensors and possibly different output spatial sizes, so they cannot be fused this simply.
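As a quick illustration of the same-configuration case (again a hedged sketch with made-up channel sizes, not from the original derivation), the same weight concatenation fuses two 3 x 3 convolutions that share stride and padding:

```python
import torch
import torch.nn as nn

# Two 3 x 3 convolutions with identical kernel size, stride, and
# padding, applied to the same input: still horizontally fusible.
conv1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
conv2 = nn.Conv2d(64, 32, kernel_size=3, padding=1)

conv_fused = nn.Conv2d(64, 128 + 32, kernel_size=3, padding=1)
with torch.no_grad():
    conv_fused.weight.copy_(torch.cat([conv1.weight, conv2.weight], dim=0))
    conv_fused.bias.copy_(torch.cat([conv1.bias, conv2.bias], dim=0))

x = torch.randn(1, 64, 28, 28)
y = conv_fused(x)
assert torch.allclose(y[:, :128], conv1(x), atol=1e-5)
assert torch.allclose(y[:, 128:], conv2(x), atol=1e-5)
```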
Conclusions
We have mathematically derived the validity of fusing several 1 x 1 convolutions that share the same input tensor into one single, larger 1 x 1 convolution. This horizontal layer fusion replaces several small kernel calls with one larger kernel call and thus reduces the kernel call overhead during neural network inference.