Convolution Shape Inference

Introduction

When designing a neural network, we often need to configure each convolution so that its output tensor shape meets certain requirements, given the input tensor shape and the convolution configuration.

PyTorch gives the formula for computing the convolution output shape in the torch.nn.Conv2d documentation. In this blog post, let’s discuss how the formula is derived.

Convolution Shape Inference

Formula

The convolution output size along one dimension can be computed using the following formula.

$$L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} + 2p - d(k - 1) - 1}{s} + 1 \bigg\rfloor$$

where

$L_{\text{in}}$ is the input size,

$L_{\text{out}}$ is the output size,

$p$ is the padding size,

$k$ is the kernel size,

$d$ is the dilation size,

$s$ is the stride size.
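The formula above translates directly into a few lines of Python. The following is a minimal sketch, not PyTorch's own implementation; the function name `conv_output_size` and its argument names are chosen here for illustration only.

```python
def conv_output_size(l_in: int, kernel_size: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    """Output size along one dimension, following the formula above."""
    # For integers, floor(x / s + 1) equals x // s + 1, because Python's
    # // operator rounds toward negative infinity.
    numerator = l_in + 2 * padding - dilation * (kernel_size - 1) - 1
    return numerator // stride + 1


print(conv_output_size(224, kernel_size=7, stride=2, padding=3))  # 112
print(conv_output_size(32, kernel_size=3, padding=1))  # 32
```

For a multi-dimensional convolution, the same function would be applied to each spatial dimension independently with that dimension's own kernel size, stride, padding, and dilation.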

Derivation

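Let’s first consider the scenario where there is no dilation and no padding, i.e., $d = 1$ and $p = 0$. The last kernel window ends at position $k + s(L_{\text{out}} - 1)$. It must fit inside the input, whereas a window one stride further would not fit. We must have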

$$L_{\text{in}} - s < k + s(L_{\text{out}} - 1) \leq L_{\text{in}}$$

Rearranging the terms to isolate $L_{\text{out}}$, we get

$$\frac{L_{\text{in}} - s - k}{s} + 1 < L_{\text{out}} \leq \frac{L_{\text{in}} - k}{s} + 1$$

Simplifying the lower bound gives

$$\frac{L_{\text{in}} - k}{s} < L_{\text{out}} \leq \frac{L_{\text{in}} - k}{s} + 1$$

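Because $L_{\text{out}}$ is an integer and this half-open interval of length 1 contains exactly one integer, we have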

$$L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} - k}{s} + 1 \bigg\rfloor$$
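One way to convince ourselves of the no-dilation, no-padding result is to count the valid kernel placements by brute force and compare the count against the closed form. The sketch below does this over a small range of sizes; the helper names `count_windows` and `closed_form` are hypothetical and exist only for this check, which assumes $k \leq L_{\text{in}}$.

```python
def count_windows(l_in: int, kernel_size: int, stride: int) -> int:
    """Count kernel placements that fit entirely inside the input."""
    count = 0
    start = 0
    # A window starting at `start` covers [start, start + kernel_size).
    while start + kernel_size <= l_in:
        count += 1
        start += stride
    return count


def closed_form(l_in: int, kernel_size: int, stride: int) -> int:
    """floor((L_in - k) / s + 1), written with integer floor division."""
    return (l_in - kernel_size) // stride + 1


# Exhaustively compare both for small sizes with kernel_size <= l_in.
for l_in in range(1, 40):
    for k in range(1, l_in + 1):
        for s in range(1, 6):
            assert count_windows(l_in, k, s) == closed_form(l_in, k, s)

print(count_windows(7, 3, 2), closed_form(7, 3, 2))  # 3 3
```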

When there is padding of size $p$ on each side, the input size becomes $L_{\text{in}} + 2p$, so

$$L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} + 2p - k}{s} + 1 \bigg\rfloor$$

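When there is dilation, $d - 1$ empty positions are inserted into each of the $k - 1$ gaps between adjacent kernel elements, so the effective kernel size becomes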

$$(k - 1) (d - 1) + k = d(k - 1) + 1$$
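The effective kernel size can also be read off from the positions that a dilated kernel touches. The sketch below lists those positions and measures the span they cover; `dilated_tap_offsets` and `effective_kernel_size` are illustrative names, not part of any library.

```python
def dilated_tap_offsets(kernel_size: int, dilation: int) -> list:
    """Input offsets touched by a dilated kernel, relative to its first element."""
    return [i * dilation for i in range(kernel_size)]


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by the dilated kernel, from first to last touched offset."""
    offsets = dilated_tap_offsets(kernel_size, dilation)
    return offsets[-1] - offsets[0] + 1


print(dilated_tap_offsets(3, 2))    # [0, 2, 4]
print(effective_kernel_size(3, 2))  # 5

# The measured span agrees with d(k - 1) + 1 for a range of small values.
for k in range(1, 10):
    for d in range(1, 6):
        assert effective_kernel_size(k, d) == d * (k - 1) + 1
```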

Substituting the effective kernel size for $k$ in the padded formula, the final formula for convolution shape inference is

$$L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} + 2p - d(k - 1) - 1}{s} + 1 \bigg\rfloor$$
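As a worked example, with values chosen here purely for illustration, take $L_{\text{in}} = 10$, $k = 3$, $s = 2$, $p = 1$, and $d = 2$. Then

$$L_{\text{out}} = \bigg\lfloor \frac{10 + 2 \cdot 1 - 2(3 - 1) - 1}{2} + 1 \bigg\rfloor = \bigg\lfloor \frac{7}{2} + 1 \bigg\rfloor = \lfloor 4.5 \rfloor = 4$$

This agrees with a direct count. The padded input has size $12$ and the effective kernel size is $2(3 - 1) + 1 = 5$, so the kernel can start at positions $0$, $2$, $4$, and $6$, whereas a start at position $8$ would end at position $13$ and no longer fit.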

Lei Mao

01-17-2022
