Lei Mao bio photo

Lei Mao

Machine Learning, Artificial Intelligence, Computer Science.

Twitter Facebook LinkedIn GitHub   G. Scholar E-Mail RSS

Introduction

The convolution and cross-correlation computation for two discrete sequences using the definition is costly because the asymptotic complexity is $O(N^2)$ where $N$ is the length of the sequence. The convolution theorem suggests that the convolution and cross-correlation could be computed using Fourier transform. There are asymptotically faster $O(N \log N)$ faster Fourier transform algorithms to compute Fourier transform which makes convolution and cross-correlation computation also asymptotically faster.


In this blog post, I would like to discuss Fourier transform, convolution theorem, and why convolution in neural networks could be computed asymptotically faster using faster Fourier transform.

Fourier Transform

Fourier Transform

Continuous Fourier transform.

\[\begin{align} F(k) &= \mathcal{F}_{x} \{f(x)\}(k) \\ &= \int_{-\infty}^{\infty} f(x) e^{-2\pi ikx} dx \\ \end{align}\]

Discrete-time Fourier transform.

\[\begin{align} F[k] &= \mathcal{F}_{x} \{f[x]\}[k] \\ &= \sum_{x = -\infty}^{\infty} f[x] e^{-2\pi ikx} \\ \end{align}\]

Inverse Fourier Transform

Continuous inverse Fourier transform.

\[\begin{align} f(x) &= \mathcal{F}^{-1}_{k} \{F(k)\}(x) \\ &= \int_{-\infty}^{\infty} F(k) e^{2\pi ikx} dk \\ \end{align}\]

Discrete-time inverse Fourier transform.

\[\begin{align} f[x] &= \mathcal{F}^{-1}_{k} \{F[k]\}[x] \\ &= \sum_{k = -\infty}^{\infty} F[k] e^{2\pi ikx} \\ \end{align}\]

Convolution Theorem

Convolution

For continuous complex-valued functions $f$ and $g$, the convolution is defined as

\[\begin{gather} (f \star g)(x) = \int_{-\infty}^{\infty} f(\tau) g(x - \tau) d \tau \end{gather}\]

Similarly, for discrete sequences, the convolution is defined as

\[\begin{gather} (f \star g)[n] = \sum_{m = -\infty}^{\infty} f[m] g[n - m] \end{gather}\]

Here we denote the convolution operation using $\star$.

Convolution Theorem

Consider two continuous functions $g(x)$ and $h(x)$ with Fourier transforms $G$ and $H$

\[\begin{align} G(k) &= \mathcal{F}_{x} \{g(x)\}(k) \\ &= \int_{-\infty}^{\infty} g(x) e^{-2\pi ikx} dx \\ \end{align}\] \[\begin{align} H(k) &= \mathcal{F}_{x} \{h(x)\}(k) \\ &= \int_{-\infty}^{\infty} h(x) e^{-2\pi ikx} dx \\ \end{align}\]

We define the convolution outcome $r(x)$

\[\begin{align} r(x) &= (g \star h)(x) \\ &= \int_{-\infty}^{\infty} g(\tau) h(x - \tau) d \tau \\ \end{align}\]

The convolution theorem states that

\[\begin{align} R(k) &= \mathcal{F}_{x} \{r(x)\}(k) \\ &= G(k) H(k) \\ &= (G \cdot H)(k) \\ \end{align}\]

With the inverse Fourier transform, we have

\[\begin{align} r(x) &= (g \star h)(x) \\ &= \mathcal{F}^{-1}_{k} \{R(k)\}(x) \\ &= \mathcal{F}^{-1}_{k} \{ R(k) \}(x) \\ &= \mathcal{F}^{-1}_{k} \{ (G \cdot H)(k) \}(x) \\ \end{align}\]

where

\[\begin{align} r(x) &= \mathcal{F}^{-1}_{k} \{R(k)\}(x) \\ &= \int_{-\infty}^{\infty} R(k) e^{2\pi ikx} dk \\ \end{align}\]

This means that to compute the convolution $(g \star h)(x)$, instead of doing direct summation using the convolution definition, we can also do inverse Fourier transform on the function $(G \cdot H)(k)$.


We would like to show a quick proof for the convolution theorem.


Proof


The convolution theorem could be proved using Fubini’s Theorem.

\[\begin{align} R(k) &= \mathcal{F}_{x} \{r(x)\}(k) \\ &= \int_{-\infty}^{\infty} r(x) e^{-2\pi ikx} dx \\ &= \int_{-\infty}^{\infty} \Big( \int_{-\infty}^{\infty} g(\tau) h(x - \tau) d \tau \Big) e^{-2\pi ikx} dx \\ &= \int_{-\infty}^{\infty} g(\tau) \Big( \int_{-\infty}^{\infty} h(x - \tau) e^{-2\pi ikx} d x \Big) d \tau \\ &= \int_{-\infty}^{\infty} g(\tau) \Big( \int_{-\infty}^{\infty} h(x - \tau) e^{-2\pi ik(x - \tau)} e^{-2\pi ik\tau} d (x - \tau) \Big) d \tau \\ &= \int_{-\infty}^{\infty} g(\tau) e^{-2\pi ik\tau} \Big( \int_{-\infty}^{\infty} h(x - \tau) e^{-2\pi ik(x - \tau)} d (x - \tau) \Big) d \tau \\ &= \int_{-\infty}^{\infty} g(\tau) e^{-2\pi ik\tau} \Big( \int_{-\infty}^{\infty} h(x^{\prime}) e^{-2\pi ikx^{\prime}} d x^{\prime} \Big) d \tau \\ &= \int_{-\infty}^{\infty} g(\tau) e^{-2\pi ik\tau} H(k) d \tau \\ &= H(k) \int_{-\infty}^{\infty} g(\tau) e^{-2\pi ik\tau} d \tau \\ &= G(k) H(k) \\ &= (G \cdot H)(k) \\ \end{align}\]

This concludes the proof.


Similarly, for two discrete sequences $g[x]$ and $h[x]$ with discrete-time Fourier transforms (DTFT) $G$ and $H$

\[\begin{align} G[k] &= \mathcal{F}_{x} \{g[x]\}[k] \\ &= \sum_{x = -\infty}^{\infty} g[x] e^{-2\pi ikx} \\ \end{align}\] \[\begin{align} H[k] &= \mathcal{F}_{x} \{h[x]\}[k] \\ &= \sum_{x = -\infty}^{\infty} h[k] e^{-2\pi ikx} \\ \end{align}\]

We define the convolution outcome $r[x]$

\[\begin{align} r[x] &= (g \star h)[x] \\ &= \sum_{\tau = -\infty}^{\infty} g[\tau] h[x - \tau] \\ \end{align}\]

The convolution theorem states that

\[\begin{align} R[k] &= \mathcal{F}_{x} \{r[x]\}(k) \\ &= G[k] H[k] \\ &= (G \cdot H)[k] \\ \end{align}\]

With the inverse Fourier transform, we have

\[\begin{align} r[x] &= (g \star h)[x] \\ &= \mathcal{F}^{-1}_{k} \{R[k]\}[x] \\ &= \mathcal{F}^{-1}_{k} \{ R[k] \}[x] \\ &= \mathcal{F}^{-1}_{k} \{ (G \cdot H)[k] \}[x] \\ \end{align}\]

where

\[\begin{align} r[x] &= \mathcal{F}^{-1}_{k} \{R[k]\}(x) \\ &= \sum_{k = -\infty}^{\infty} R[k] e^{2\pi ikx} \\ \end{align}\]

We will skip the proof for the discrete sequences case, as the proof is almost the same as the one for the continuous function case.


So, in short, the convolution theorem states that

\[\mathcal{F}\{g \star h\} = \mathcal{F}\{g\} \cdot \mathcal{F}\{h\}\]

and

\[g \star h = \mathcal{F}^{-1} \{ \mathcal{F}\{g\} \cdot \mathcal{F}\{h\} \}\]

“Cross-Correlation” Theorem

Cross-Correlation

For continuous complex-valued functions $f$ and $g$, the cross-correlation is defined as

\[\begin{gather} (f \ast g)(x) = \int_{-\infty}^{\infty} \overline{f(\tau)} g(x + \tau) d \tau \end{gather}\]

Similarly, for discrete sequences, the cross-correlation is defined as

\[\begin{gather} (f \ast g)[n] = \sum_{m = -\infty}^{\infty} \overline{f[m]} g[n + m] \end{gather}\]

“Cross-Correlation” Theorem

Because the “cross-correlation” theorem can be derived analogous to the convolution theorem, we will skip most of the details and derivations. The “cross-correlation” theorem states that

\[\mathcal{F}\{g \ast h\} = \overline{\mathcal{F}\{g\}} \cdot \mathcal{F}\{h\}\]

and

\[g \ast h = \mathcal{F}^{-1} \{ \overline{\mathcal{F}\{g\}} \cdot \mathcal{F}\{h\} \}\]

Fast Fourier Transform for Convolution

Given a discrete sequence $f[x]$ of length $N$, to compute its discrete-time Fourier transform sequence $F[k]$ of length $N$ using the direct summation, it is not difficult to see that it takes at least $N \times N$ multiplication and summation. So the asymptotic complexity for the vanilla Fourier transform is $O(N^2)$.


There are fast Fourier transform (FFT) which can do the same transform in an asymptotic complexity of $O(N \log N)$. Therefore, when $N$ is very large, doing fast Fourier transform will be much faster than doing vanilla discrete-time Fourier transform. We will not talk about the fast Fourier transform algorithms in this blog post.


Given two discrete sequences $g[x]$ and $h[x]$, to compute the convolution using fast Fourier transform, we have to do two Fourier transform for the two discrete sequences $g[x]$ and $h[x]$ ($O(N \log N)$), compute their dot product ($O(N)$), and do another inverse fast Fourier transform $O(N \log N)$.

\[g \star h = \mathcal{F}^{-1} \{ \mathcal{F}\{g\} \cdot \mathcal{F}\{h\} \}\]

The asymptotic complexity is $O(N \log N)$, whereas computing the convolution using convolution definition takes $O(N^2)$.

Conclusion

For the convolution in neural networks, when the input size and the kernel size are very large, computing convolution (actually it is cross-correlation) using fast Fourier transform will be much faster than computing using the definition. However, for most of the neural networks we have seen nowadays, the input size and the kernel size are usually large enough for fast Fourier transform to outperform the vanilla computation using the convolution definition.

References