# Depthwise Separable Convolution

## Introduction

Depthwise separable convolution reduces the memory and math bandwidth requirements for convolution in neural networks. Therefore, it is widely used for neural networks that are intended to run on edge devices.

In this blog post, I would like to briefly discuss depthwise separable convolution and compare its computation cost with that of ordinary convolution.

## Depthwise Separable Convolution

We define $(K, C, R, S)$ as a convolution kernel that has a kernel shape of $(R, S)$, input channels of $C$, and output channels of $K$.

Depthwise separable convolution, sometimes referred to as separable conv, consists of two steps. First, a $(1, 1, R, S)$ convolution is performed on each of the $C$ input channels separately, and the $C$ single-channel outputs are concatenated along the channel dimension to form the intermediate output. Second, a $(K, C, 1, 1)$ convolution is performed on the intermediate output.

If there is no bias term, ordinary convolution has $K \times C \times R \times S$ parameters, whereas depthwise separable convolution has $C \times R \times S + K \times C$ parameters. If there is a bias term, ordinary convolution needs $K$ additional parameters and depthwise separable convolution needs $C + K$ additional parameters.
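The parameter counts above can be checked with a few lines of plain Python. This is only a sketch: the function names `conv_params` and `separable_params` and the example sizes ($C = 64$, $K = 128$, $R = S = 3$) are illustrative assumptions, not taken from any library.

```python
def conv_params(K, C, R, S, bias=False):
    """Number of parameters in an ordinary (K, C, R, S) convolution."""
    return K * C * R * S + (K if bias else 0)


def separable_params(K, C, R, S, bias=False):
    """Number of parameters in a depthwise separable convolution."""
    # C depthwise filters of shape (1, 1, R, S), plus C biases if enabled.
    depthwise = C * R * S + (C if bias else 0)
    # One pointwise (K, C, 1, 1) convolution, plus K biases if enabled.
    pointwise = K * C + (K if bias else 0)
    return depthwise + pointwise


# Illustrative sizes (assumed): C = 64 input channels, K = 128 output channels, 3 x 3 kernel.
print(conv_params(128, 64, 3, 3))                  # 73728
print(separable_params(128, 64, 3, 3))             # 8768
print(conv_params(128, 64, 3, 3, bias=True))       # 73856
print(separable_params(128, 64, 3, 3, bias=True))  # 8960
```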

Let’s take a further look at the ratio of the number of parameters in ordinary convolution to the number of parameters in depthwise separable convolution, with bias terms included. Assuming $R \times S \ll \min(K, C)$ and $1 \ll \min(K, C)$,

\begin{align} \frac{K \times C \times R \times S + K}{C \times R \times S + K \times C + C + K} &\approx \frac{K \times C \times R \times S + K}{K \times C + C + K} \\ &= \frac{R \times S + \frac{1}{C}}{1 + \frac{1}{K} + \frac{1}{C}} \\ &\approx R \times S \end{align}

Therefore, under these assumptions, depthwise separable convolution has approximately $R \times S$ times fewer parameters than ordinary convolution.
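To get a feel for how good this approximation is, the exact ratio (with bias terms) can be evaluated for a few channel counts. The following is a small sketch; the helper name `param_ratio` and the choice of $K = C$ with a $3 \times 3$ kernel are assumptions made purely for illustration.

```python
def param_ratio(K, C, R, S):
    """Exact parameter ratio, ordinary over depthwise separable, with bias terms."""
    ordinary = K * C * R * S + K
    separable = C * R * S + K * C + C + K
    return ordinary / separable


# With a 3 x 3 kernel the approximation predicts a ratio of R * S = 9.
for channels in (16, 64, 256, 1024):
    print(channels, round(param_ratio(channels, channels, 3, 3), 2))
```

The exact ratio stays below $R \times S$ and only gets close to it once the channel counts are much larger than $R \times S$, so for layers with few channels the saving is noticeably smaller than the approximation suggests.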

## Convolution VS Depthwise Separable Convolution

We implemented depthwise separable convolution using basic convolution operators in PyTorch, and measured the number of parameters and MACs for convolution and depthwise separable convolution that have exactly the same input shape and output shape.
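A minimal sketch of what such an implementation might look like is shown below. It is not the original code: the class name `DepthwiseSeparableConv2d`, the helper `count_parameters`, and the example sizes are illustrative assumptions, and the per-channel convolution plus concatenation is expressed here as a single grouped `nn.Conv2d` with `groups=in_channels`, which computes the same thing.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise separable convolution built from two nn.Conv2d layers (sketch)."""

    def __init__(self, in_channels, out_channels, kernel_size, padding=0, bias=False):
        super().__init__()
        # Depthwise step: with groups=in_channels, every input channel is
        # convolved with its own (1, 1, R, S) filter.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels, bias=bias)
        # Pointwise step: a (K, C, 1, 1) convolution that mixes the channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=bias)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


def count_parameters(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


# Illustrative sizes (assumed): C = 64, K = 128, 3 x 3 kernel, 56 x 56 input.
C, K, R = 64, 128, 3
x = torch.randn(1, C, 56, 56)
ordinary = nn.Conv2d(C, K, kernel_size=R, padding=1, bias=False)
separable = DepthwiseSeparableConv2d(C, K, kernel_size=R, padding=1, bias=False)

with torch.no_grad():
    y_ordinary = ordinary(x)
    y_separable = separable(x)

# Both layers map the same input shape to the same output shape.
print(tuple(y_ordinary.shape), tuple(y_separable.shape))
print(count_parameters(ordinary), count_parameters(separable))
```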

We can see that under common settings, depthwise separable convolution uses far fewer parameters and MACs than ordinary convolution.
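The MAC counts can also be estimated analytically, since every output element of a convolution costs one multiply-accumulate per weight that contributes to it. The sketch below assumes an output feature map of size $H_{\text{out}} \times W_{\text{out}}$ (symbols not used elsewhere in this post), a stride-1 pointwise step, and ignores bias additions; the function names and example sizes are illustrative.

```python
def conv_macs(K, C, R, S, H_out, W_out):
    """MACs of an ordinary (K, C, R, S) convolution, bias ignored."""
    return K * C * R * S * H_out * W_out


def separable_macs(K, C, R, S, H_out, W_out):
    """MACs of a depthwise separable convolution, bias ignored."""
    # Depthwise step: R * S MACs per element of the C-channel intermediate output.
    depthwise = C * R * S * H_out * W_out
    # Pointwise step: C MACs per element of the K-channel final output.
    pointwise = K * C * H_out * W_out
    return depthwise + pointwise


# Illustrative sizes (assumed): C = 64, K = 128, 3 x 3 kernel, 56 x 56 output.
ordinary = conv_macs(128, 64, 3, 3, 56, 56)
separable = separable_macs(128, 64, 3, 3, 56, 56)
print(ordinary, separable, round(ordinary / separable, 2))
```

Because both counts scale with the same $H_{\text{out}} \times W_{\text{out}}$ factor, the MAC ratio equals the bias-free parameter ratio.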

Lei Mao

11-08-2021
