Convolution and Transposed Convolution as Matrix Multiplication
Introduction
In deep learning, convolution and transposed convolution are commonly used in neural networks. Unlike convolution, transposed convolution is often confusing, and not many people know why it is called transposed convolution.
In this blog post, I would like to discuss how to view convolution and transposed convolution as matrix multiplication, and how to understand the name of transposed convolution.
Convolution and Transposed Convolution as Matrix Multiplication
At each step of the convolution operation, we apply the kernel tensor to a portion of the elements in the input tensor, compute their element-wise products, and sum the products together. We repeat this step until the kernel has convolved over the entire input tensor. With this in mind, we can formulate the convolution operation equivalently using matrix operations.
Concretely, any convolution operation in deep learning, $Y = K \ast X$, can be formulated as a matrix multiplication
$$
Y^{\prime} = WX^{\prime}
$$
where $X^{\prime}$ is the flattened representation of $X$, which might have already been padded and dilated, $W$ is the sparse matrix representation of the kernel $K$, and $Y^{\prime}$ is the flattened representation of the output $Y$.
More specifically, when $K$ and $X$ are 2D matrices, $K \in \mathbb{R}^{h_K \times w_K}$, $X \in \mathbb{R}^{h_X \times w_X}$, $X^{\prime} \in \mathbb{R}^{h_{X} w_{X}}$, $Y \in \mathbb{R}^{h_Y \times w_Y}$, and $Y^{\prime} \in \mathbb{R}^{h_Y w_Y}$, we should have a sparse matrix $W \in \mathbb{R}^{h_Y w_Y \times h_{X} w_{X}}$ that performs the same transformation as the convolution.
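As a concrete illustration, suppose $K \in \mathbb{R}^{2 \times 2}$ and $X \in \mathbb{R}^{3 \times 3}$ with stride 1 and no padding, so that $Y \in \mathbb{R}^{2 \times 2}$. Denoting the kernel entries by $k_{ij}$ and flattening row by row (both are conventions assumed here for illustration, following the deep learning cross-correlation convention), the sparse matrix is
$$
W =
\begin{bmatrix}
k_{11} & k_{12} & 0 & k_{21} & k_{22} & 0 & 0 & 0 & 0 \\
0 & k_{11} & k_{12} & 0 & k_{21} & k_{22} & 0 & 0 & 0 \\
0 & 0 & 0 & k_{11} & k_{12} & 0 & k_{21} & k_{22} & 0 \\
0 & 0 & 0 & 0 & k_{11} & k_{12} & 0 & k_{21} & k_{22}
\end{bmatrix}
$$
and, for example, the first entry of $W X^{\prime}$ is $k_{11} x_{11} + k_{12} x_{12} + k_{21} x_{21} + k_{22} x_{22}$, which is exactly $y_{11}$.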
Similarly, any transposed convolution operation in deep learning, $Z = K \star Y$, can also be formulated as a matrix multiplication
$$
Z^{\prime} = W^{\top} Y^{\prime}
$$
where $Z^{\prime}$ is the flattened representation of the output $Z$, and $Y$ and $Y^{\prime}$ are the ones we just obtained from the previous convolution operation $Y = K \ast X$.
We must have $Z \in \mathbb{R}^{h_X \times w_X}$ and $Z^{\prime} \in \mathbb{R}^{h_{X} w_{X}}$, as if the shape had been reverted back to the one before the convolution operation $Y = K \ast X$. Notice that the kernel $K$ used for the convolution and the transposed convolution is exactly the same, and the weight matrices used in the convolution and transposed convolution matrix multiplications, $W$ and $W^{\top}$, are just transposes of each other. $Z$, however, usually does not equal $X$. Because $Z \neq X$, we cannot call transposed convolution deconvolution.
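Continuing the small example above, $W^{\top} \in \mathbb{R}^{9 \times 4}$ maps $Y^{\prime} \in \mathbb{R}^{4}$ back to $Z^{\prime} \in \mathbb{R}^{9}$. For instance, the center entry of $Z$ is
$$
z_{22} = k_{22} y_{11} + k_{21} y_{12} + k_{12} y_{21} + k_{11} y_{22}
$$
which is the fifth column of $W$, applied as a row of $W^{\top}$, multiplied with $Y^{\prime}$: each output position accumulates contributions from all the kernel positions that touched it during the convolution.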
Implementing Convolution and Transposed Convolution as Matrix Operation
Let’s ignore the channel dimension and the bias term for convolution and transposed convolution for now, and implement convolution and transposed convolution as matrix operations. We also assume the stride is 1 for both convolution and transposed convolution.
```python
import torch
```
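A minimal single-channel sketch of such an implementation might look like the following. It assumes stride 1 and no padding, uses a hypothetical helper `conv2d_weight_matrix` to build the sparse weight matrix $W$, and verifies the matrix multiplication results against PyTorch's `F.conv2d` and `F.conv_transpose2d`.

```python
import torch
import torch.nn.functional as F


def conv2d_weight_matrix(kernel: torch.Tensor, input_size: tuple) -> torch.Tensor:
    # Build the sparse weight matrix W of shape (h_Y * w_Y, h_X * w_X) for a
    # single-channel 2D convolution with stride 1 and no padding.
    h_k, w_k = kernel.shape
    h_x, w_x = input_size
    h_y, w_y = h_x - h_k + 1, w_x - w_k + 1
    w_matrix = torch.zeros(h_y * w_y, h_x * w_x)
    for i in range(h_y):
        for j in range(w_y):
            row = i * w_y + j
            for p in range(h_k):
                for q in range(w_k):
                    # Output element (i, j) reads input element (i + p, j + q).
                    w_matrix[row, (i + p) * w_x + (j + q)] = kernel[p, q]
    return w_matrix


if __name__ == "__main__":

    torch.manual_seed(0)
    x = torch.randn(3, 3)
    k = torch.randn(2, 2)
    w = conv2d_weight_matrix(kernel=k, input_size=x.shape)

    # Convolution as matrix multiplication: Y' = W X'.
    y = F.conv2d(input=x[None, None], weight=k[None, None]).reshape(-1)
    assert torch.allclose(y, w @ x.reshape(-1), atol=1e-6)

    # Transposed convolution as matrix multiplication: Z' = W^T Y'.
    z = F.conv_transpose2d(input=y.reshape(1, 1, 2, 2), weight=k[None, None]).reshape(-1)
    assert torch.allclose(z, w.t() @ y, atol=1e-6)

    print("Matrix multiplication results match PyTorch convolutions.")
```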
We can see that the outputs from the matrix multiplication implementations match the expectations.
```
$ python conv_as_gemm.py
```
For multi-channel kernels, input tensors, and output tensors, the derivation of the weight matrix for matrix multiplication is more complicated, but it does not change the fact that convolution and transposed convolution can be viewed as matrix multiplications.
Backward Propagation
Consider the convolution as a matrix multiplication.
Given the flattened input vector $x \in \mathbb{R}^{m_1}$, weight matrix $W \in \mathbb{R}^{m_2 \times m_1}$, and the flattened output vector $y \in \mathbb{R}^{m_2}$, the matrix multiplication is
$$
y = Wx
$$
The gradient matrix of $y$ with respect to the input $x$ is
$$
\nabla_{x}y = W^{\top}
$$
where $\nabla_{x}y \in \mathbb{R}^{m_1 \times m_2}$.
In the backward propagation, the input is $\frac{\partial L}{\partial y} \in \mathbb{R}^{m_2}$, and the output of the backward propagation is
$$
\begin{align}
\frac{\partial L}{\partial x} &= \nabla_{x}y \frac{\partial L}{\partial y} \\
&= W^{\top} \frac{\partial L}{\partial y}
\end{align}
$$
where $\frac{\partial L}{\partial x} \in \mathbb{R}^{m_1}$.
Therefore, in the forward propagation and the backward propagation of the convolution operation, the weight matrices are $W$ and $W^{\top}$, respectively.
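As a quick numerical check, continuing the earlier sketch and reusing its hypothetical `conv2d_weight_matrix` helper, the gradient with respect to the input of a convolution computed by autograd indeed equals $W^{\top} \frac{\partial L}{\partial y}$.

```python
torch.manual_seed(0)
x = torch.randn(3, 3, requires_grad=True)
k = torch.randn(2, 2)
w = conv2d_weight_matrix(kernel=k, input_size=x.shape)  # helper from the sketch above

# Forward propagation of the convolution, then backward with an arbitrary upstream gradient.
y = F.conv2d(input=x[None, None], weight=k[None, None])
grad_y = torch.randn_like(y)
y.backward(grad_y)

# The backward propagation of the convolution multiplies by W^T.
assert torch.allclose(x.grad.reshape(-1), w.t() @ grad_y.reshape(-1), atol=1e-6)
```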
Similarly, consider the transposed convolution as a matrix multiplication. In the forward propagation and the backward propagation of the transposed convolution operation, the weight matrices are $W^{\top}$ and $W$, respectively.
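The analogous check for the transposed convolution, again reusing the variables from the sketch above:

```python
# Forward propagation of the transposed convolution, then backward.
y = torch.randn(2, 2, requires_grad=True)
z = F.conv_transpose2d(input=y[None, None], weight=k[None, None])
grad_z = torch.randn_like(z)
z.backward(grad_z)

# The backward propagation of the transposed convolution multiplies by W.
assert torch.allclose(y.grad.reshape(-1), w @ grad_z.reshape(-1), atol=1e-6)
```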
What does this imply? Suppose the convolution and transposed convolution were implemented as matrix multiplications. To compute the gradient matrix $W_1$ for the transposed convolution operation $Z = K \star Y$, if we happen to have already computed the gradient matrix $W_2$ for the convolution $X = K \ast Z$, we must have $W_1 = W_2^{\top}$. However, I think in practice this property is less useful because the kernels used in different layers are usually different.
The weight matrices used in the forward propagation and the backward propagation of a transposed convolution are just the transposes of those used in the forward propagation and the backward propagation of a convolution with the same kernel parameters. That is probably why transposed convolution is called transposed convolution.
Miscellaneous
We can see that, without non-linear activation functions, a sequence of convolution operations is just a linear function. Therefore, having non-linear activations is also important for convolutional neural networks.
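As a small sketch of this point, again reusing the hypothetical `conv2d_weight_matrix` helper, two stacked convolutions without an activation in between collapse into a single linear map $W_2 W_1$.

```python
# Two stacked convolutions without a non-linear activation collapse into
# a single linear map W2 @ W1.
x = torch.randn(4, 4)
k1 = torch.randn(2, 2)
k2 = torch.randn(2, 2)
w1 = conv2d_weight_matrix(kernel=k1, input_size=(4, 4))  # shape (9, 16)
w2 = conv2d_weight_matrix(kernel=k2, input_size=(3, 3))  # shape (4, 9)

y = F.conv2d(input=x[None, None], weight=k1[None, None])
z = F.conv2d(input=y, weight=k2[None, None]).reshape(-1)
assert torch.allclose(z, (w2 @ w1) @ x.reshape(-1), atol=1e-5)
```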