Convolution Shape Inference
Introduction
When designing neural networks, we often need to configure the convolutions so that the output tensor shapes meet certain requirements, given the input tensor shapes and the convolution configurations.
PyTorch gives the formula for computing the convolution output shape in the torch.nn.Conv2d documentation. In this blog post, let’s discuss how the formula is derived.
Convolution Shape Inference
Formula
The convolution output size along each spatial dimension can be computed using the following formula.
$$
L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} + 2p - d(k - 1) - 1}{s} + 1 \bigg\rfloor
$$
where
$L_{\text{in}}$ is the input size,
$L_{\text{out}}$ is the output size,
$p$ is the padding size,
$k$ is the kernel size,
$d$ is the dilation size,
$s$ is the stride size.
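The formula translates directly into a few lines of code. The following is a minimal sketch in plain Python; the function name `conv_output_size` and its argument names are chosen here for illustration and are not part of the PyTorch API.

```python
def conv_output_size(l_in: int, k: int, s: int = 1, p: int = 0, d: int = 1) -> int:
    """Output size along one spatial dimension, following the formula above.

    Hypothetical helper for illustration; not a PyTorch function.
    """
    # Python's // is floor division, and adding the integer 1 outside the
    # floor is equivalent to adding it inside.
    return (l_in + 2 * p - d * (k - 1) - 1) // s + 1


# Input size 224, kernel size 7, stride 2, padding 3
print(conv_output_size(224, k=7, s=2, p=3))  # 112
# Input size 32, kernel size 3, stride 1, padding 1: the size is preserved
print(conv_output_size(32, k=3, p=1))  # 32
```

For a 2D convolution, the same function would be applied once to the height and once to the width, each with its own kernel size, stride, padding, and dilation.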
Derivation
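Let’s first consider the scenario where there is no dilation and no padding, i.e., $d = 1$ and $p = 0$. The last kernel placement covers positions up to $k + s(L_{\text{out}} - 1)$, which must fit inside the input, whereas one more stride would no longer fit. We must have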
$$
L_{\text{in}} - s < k + s(L_{\text{out}} - 1) \leq L_{\text{in}}
$$
Rearranging the terms gives
$$
\frac{L_{\text{in}} - s - k}{s} + 1 < L_{\text{out}} \leq \frac{L_{\text{in}} - k}{s} + 1
$$
which simplifies to
$$
\frac{L_{\text{in}} - k}{s} < L_{\text{out}} \leq \frac{L_{\text{in}} - k}{s} + 1
$$
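Because $L_{\text{out}}$ is an integer and this half-open interval of length 1 contains exactly one integer, we have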
$$
L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} - k}{s} + 1 \bigg\rfloor
$$
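One way to check this result is to count kernel placements by brute force and compare the count with the closed form. The sketch below does that in plain Python, assuming $L_{\text{in}} \geq k$; the helper name `count_windows` is made up for this illustration.

```python
def count_windows(l_in: int, k: int, s: int) -> int:
    """Count kernel placements by sliding a window (no padding, no dilation)."""
    count = 0
    start = 0
    # A placement is valid while the window [start, start + k) stays inside the input.
    while start + k <= l_in:
        count += 1
        start += s
    return count


# Compare the brute-force count with floor((L_in - k) / s + 1) on small cases.
for l_in in range(1, 30):
    for k in range(1, l_in + 1):
        for s in range(1, 6):
            assert count_windows(l_in, k, s) == (l_in - k) // s + 1
```

The loop condition `start + k <= l_in` is the right-hand inequality of the derivation, and the loop stopping on the next stride corresponds to the left-hand one.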
When there is padding of size $p$ on each side, the effective input size becomes $L_{\text{in}} + 2p$, so
$$
L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} + 2p - k}{s} + 1 \bigg\rfloor
$$
When there is dilation, $d - 1$ gaps are inserted between each of the $k - 1$ pairs of adjacent kernel elements, so the effective kernel size becomes
$$
(k - 1) (d - 1) + k = d(k - 1) + 1
$$
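To make the effective kernel size concrete, the short sketch below lists the input offsets touched by a dilated kernel and measures the span they cover. The helper name `dilated_taps` is hypothetical and used only for this illustration.

```python
def dilated_taps(k: int, d: int) -> list[int]:
    """Input offsets touched by a kernel of size k with dilation d."""
    return [i * d for i in range(k)]


taps = dilated_taps(k=3, d=2)
# The span runs from the first touched offset to the last, inclusive.
span = taps[-1] - taps[0] + 1
print(taps, span)  # [0, 2, 4] 5
```

Here the span of 5 agrees with $d(k - 1) + 1 = 2 \cdot 2 + 1$.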
Substituting this effective kernel size for $k$ in the padded formula, the final formula for convolution shape inference is
$$
L_{\text{out}} = \bigg\lfloor \frac{L_{\text{in}} + 2p - d(k - 1) - 1}{s} + 1 \bigg\rfloor
$$
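As a quick sanity check, here is one worked instance, with values chosen purely for illustration: $L_{\text{in}} = 32$, $k = 3$, $s = 2$, $p = 2$, and $d = 2$.
$$
L_{\text{out}} = \bigg\lfloor \frac{32 + 2 \cdot 2 - 2(3 - 1) - 1}{2} + 1 \bigg\rfloor = \bigg\lfloor \frac{31}{2} + 1 \bigg\rfloor = \lfloor 16.5 \rfloor = 16
$$
This matches a direct count: the padded input has size 36, the dilated kernel spans 5 positions, and the kernel can start at offsets $0, 2, \ldots, 30$, which gives 16 placements.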