# Automatic Differentiation

## Introduction

In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to evaluate the derivative of a function specified by a computer program. It is also the key optimization technique that makes deep learning and neural network successful.

Automatic differentiation has two distinct modes, forward mode (or forward accumulation) and reverse mode (or reverse accumulation). In this article, I would like to discuss about the two distinct modes of automatic differentiation in detail.

## Prerequisites

### Jacobian Matrix

The Jacobian matrix for function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$, denoted by $\mathbf{J}_{f} \in \mathbb{R}^{m \times n}$, is defined as follows.

\begin{align} \mathbf{J}_{f} &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \\ \end{bmatrix} \end{align}

### Directional Derivative

Given the function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the directional derivative at point $\mathbf{x}$ in direction $\mathbf{h}$ is defined as

$$\nabla_{\mathbf{h}} f(\mathbf{x}) = \lim_{t \rightarrow 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t}$$

If the function is differentiable at $x$, we have

\begin{align} \nabla_{\mathbf{h}} f(\mathbf{x}) &= \lim_{t \rightarrow 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t} \\ &= \lim_{t \rightarrow 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t\mathbf{h}} \mathbf{h} \\ &= \bigg( \lim_{t \rightarrow 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t\mathbf{h}} \bigg) \mathbf{h}\\ &= \bigg( \lim_{t\mathbf{h} \rightarrow 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t\mathbf{h}} \bigg) \mathbf{h}\\ &= \nabla f(\mathbf{x})^{\top} \mathbf{h}\\ \end{align}

Namely,

\begin{align} \nabla_{\mathbf{h}} f(\mathbf{x}) &= \nabla f(\mathbf{x})^{\top} \mathbf{h}\\ &= \nabla f(\mathbf{x}) \cdot \mathbf{h}\\ \end{align}

Notice that $\nabla f(\mathbf{x}) \cdot \mathbf{h} \in \mathbb{R}$.

## Automatic Differentiation

Automatic differentiation has two distinct modes, forward mode (or forward accumulation) and reverse mode (or reverse accumulation). Sometimes, people who are familiar with deep learning but not familiar with automatic differentiation would thought automatic differentiation forward mode and reverse mode are just the forward propagation and backward propagation commonly mentioned in neural network training. However, this is completely wrong. The common automatic differentiation we benefited from deep learning frameworks, such as PyTorch and TensorFlow, is actually the reverse mode of automatic differentiation. The forward propagation and backward propagation only exist in the context of automatic differentiation reverse mode.

The fundamentals of automatic differentiation is the gradient decomposition provided by the chain rule. Suppose we have the following computation that consists of three differentiable functions $f$, $g$ and $h$,

$$y = f(g(h(x)))$$

We define new variables

\begin{align} w_0 &= x \\ w_1 &= h(w_0) \\ w_2 &= g(w_1) \\ w_3 &= f(w_2) = y \\ \end{align}

According to the chain rule, using Leibniz’s notation,

\begin{align} \frac{dy}{dx} &= \frac{d{w_3}}{d{w_0}} \\ &= \frac{d{w_3}}{d{w_2}} \frac{d{w_2}}{d{w_1}} \frac{d{w_1}}{d{w_0}} \\ &= \frac{d{f(w_2)}}{d{w_2}} \frac{d{g(w_1)}}{d{w_1}} \frac{d{h(w_0)}}{d{w_0}} \\ \end{align}

Notice that $\frac{d{w_3}}{d{w_2}}$ is a function of $w_2$, $\frac{d{w_2}}{d{w_1}}$ is a function of $w_1$, and $\frac{d{w_1}}{d{w_0}}$ is a function of $w_0$.

The chain rule, however, did not specify the order of computing each of the gradients in the decomposition. The order of computing each of the gradients in automatic differentiation determines if it is forward mode or reverse mode.

In this example, the forward mode computes the recursive relation

$$\frac{d{w_i}}{d{w_0}} = \frac{d{w_i}}{d{w_{i-1}}} \frac{d{w_{i-1}}}{d{w_0}}$$

starting from $i = 3$. More concretely,

\begin{align} \frac{dy}{dx} &= \left(\frac{d{y}}{d{w_3}} \left(\frac{d{w_3}}{dx}\right)\right)\\ &= \left(\frac{d{y}}{d{w_3}} \left(\frac{d{w_3}}{d{w_2}} \left(\frac{d{w_2}}{dx} \right) \right) \right)\\ &= \left(\frac{d{y}}{d{w_3}} \left(\frac{d{w_3}}{d{w_2}} \left(\frac{d{w_2}}{d{w_1}} \left( \frac{d{w_1}}{dx} \right) \right) \right) \right)\\ &= \left(\frac{d{y}}{d{w_3}} \left(\frac{d{w_3}}{d{w_2}} \left(\frac{d{w_2}}{d{w_1}} \left( \frac{d{w_1}}{d{w_0}} \left( \frac{d{w_0}}{dx} \right) \right) \right) \right) \right)\\ \end{align}

So in the computation graph, the computation order is as follows.

$$\left\{x, 1, 1\right\} \rightarrow \left\{w_0, \frac{d{w_0}}{d{x}}, \frac{d{w_0}}{d{x}}\right\} \rightarrow \left\{ w_1 , \frac{d{w_1}}{d{w_0}}, \frac{d{w_1}}{d{x}} \right\} \rightarrow \left\{ w_2 , \frac{d{w_2}}{d{w_1}}, \frac{d{w_2}}{d{x}} \right\} \rightarrow \left\{ w_3 , \frac{d{w_3}}{d{w_2}}, \frac{d{w_3}}{d{x}} \right\} \rightarrow \left\{y, \frac{dy}{d{w_3}}, \frac{dy}{dx}\right\}$$

where the three values in the tuples are the local input value to the local function, derivative of the local function with respect to the local input value, and the accumulated derivative of the global function with respect to the global input.

We could see that the gradient computation is completed in one sweep completely following the evaluation computation flow. That’s why it is called forward mode.

The reverse mode computes the recursive relation

$$\frac{d{w_3}}{d{w_i}} = \frac{d{w_3}}{d{w_{i+1}}} \frac{d{w_{i+1}}}{d{w_i}}$$

starting from $i = 0$. More concretely,

\begin{align} \frac{dy}{dx} &= \left(\left(\frac{d{y}}{d{w_0}} \right) \frac{d{w_0}}{dx}\right)\\ &= \left(\left(\left(\frac{d{y}}{d{w_1}} \right) \frac{d{w_1}}{d{w_0}}\right) \frac{d{w_0}}{dx}\right)\\ &= \left(\left(\left( \left(\frac{d{y}}{d{w_2}} \right) \frac{d{w_2}}{d{w_1}} \right) \frac{d{w_1}}{d{w_0}}\right) \frac{d{w_0}}{dx}\right)\\ &= \left(\left(\left( \left(\left(\frac{d{y}}{d{w_3}} \right) \frac{d{w_3}}{d{w_2}} \right) \frac{d{w_2}}{d{w_1}} \right) \frac{d{w_1}}{d{w_0}}\right) \frac{d{w_0}}{dx}\right)\\ \end{align}

So in the computation graph, the computation order has to be separated into two phases, forward phase (forward propagation) and backward phase (backward propagation), and the computation order is as follows.

Forward Phase

$$\left\{x\right\} \rightarrow \left\{w_0\right\} \rightarrow \left\{ w_1 \right\} \rightarrow \left\{ w_2 \right\} \rightarrow \left\{ w_3 \right\} \rightarrow \left\{y \right\}$$

where the value in the tuple is the local input value to the local function. The output values computed from the forward phase have to be stored and used in the backward phase.

Backward Phase

$$\left\{\frac{dy}{d{w_3}}, \frac{dy}{d{w_3}}\right\} \rightarrow \left\{\frac{d{w_3}}{d{w_2}}, \frac{dy}{d{w_2}}\right\} \rightarrow \left\{\frac{d{w_2}}{d{w_1}}, \frac{dy}{d{w_1}}\right\} \rightarrow \left\{\frac{d{w_1}}{d{w_0}}, \frac{dy}{d{w_0}}\right\} \rightarrow \left\{\frac{d{w_0}}{d{x}}, \frac{dy}{d{x}}\right\}$$

where the tuples in the tuple are the derivative of the local function with respect to the local input value, and the accumulated derivative of the global function with respect to the global input.

We have considered a case for a simple function $f: \mathbb{R} \rightarrow \mathbb{R}$. However, in practice, $f$ could be more complicated $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$. Let’s discuss about the more general cases.

## Forward Mode

### Evaluation Efficiency

Let’s first consider the case where $n = 1$, i.e., $f: \mathbb{R} \rightarrow \mathbb{R}^{m}$. In the forward mode evaluation, we would only need to compute the accumulation of all the derivatives $\left\{ \frac{d{y_1}}{dx}, \frac{d{y_2}}{dx}, \cdots, \frac{d{y_m}}{dx} \right\}$ at the output nodes on the computation graph. There is no additional cost of intermediate node memory.

Let’s then consider the case where $m = 1$, i.e., $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. In the forward mode evaluation, we would have to keep the accumulated derivatives for different inputs in the intermediate nodes, for instance $\left\{ \frac{d{w_3}}{d{x_1}}, \frac{d{w_3}}{d{x_4}}, \frac{d{w_3}}{d{x_8}} \right\}$. Depending on the implementation, this additional cost of intermediate node memory could be $O(n)$ for one node. So in practice, a more cost-effective way is to do $n$ forward mode evaluations to evaluate $\frac{d{y}}{d{x_1}}, \frac{d{y}}{d{x_2}}, \cdots, \frac{d{y}}{d{x_n}}$ individually. This would not bring additional cost of intermediate node memory.

Let’s finally consider the general case of function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$. Based on the previous two cases, it is not hard to find out that the cost-effective way is to do $n$ forward mode evaluations to evaluate $\left\{ \frac{d{y_1}}{d{x_1}}, \frac{d{y_2}}{d{x_1}}, \cdots, \frac{d{y_m}}{d{x_1}} \right\}$, $\left\{ \frac{d{y_1}}{d{x_2}}, \frac{d{y_2}}{d{x_2}}, \cdots, \frac{d{y_m}}{d{x_2}} \right\}$, $\cdots$, $\left\{ \frac{d{y_1}}{d{x_n}}, \frac{d{y_2}}{d{x_n}}, \cdots, \frac{d{y_m}}{d{x_n}} \right\}$, individually.

### Forward Mode Evaluation is Jacobian-Vector Product

The forward mode computation graph framework evaluates $\mathbf{J}_{f} \mathbf{v}$ where $\mathbf{J}_{f}$ is the Jacobian matrix mentioned in the prerequisites and $\mathbf{v} \in \mathbb{R}^{n}$. Essentially, if we give the forward mode computation graph framework an input of $\mathbf{v} = \left[ v_1, v_2, \cdots, v_n \right]^{\top}$, the framework returns us the Jacobian-vector product $\mathbf{J}_{f} \mathbf{v}$.

\begin{align} \mathbf{J}_{f} \mathbf{v} &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \\ \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \\ \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i=1}^{n} \frac{\partial f_1}{\partial x_i} v_i \\ \sum_{i=1}^{n} \frac{\partial f_2}{\partial x_i} v_i \\ \vdots \\ \sum_{i=1}^{n} \frac{\partial f_m}{\partial x_i} v_i \\ \end{bmatrix} \\ \end{align}

It should be worth noting that the forward mode computation graph framework implementation never computes the Jacobian matrix $\mathbf{J}_{f}$ and could return the Jacobian-vector product $\mathbf{J}_{f} \mathbf{v}$. This is sometimes referred as the matrix-free computation.

Of course, the full Jacobian could be computed using the forward mode computation graph framework by running the evaluation $n$ times using the standard basis (one-hot) vectors $\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_n$ where $\mathbb{e}_{i,j} \in \mathbf{R}^n$, $\mathbf{e}_{i,j} = 1$ for $j = i$ and $\mathbf{e}_{i,j} = 0$ for $j \neq i$. Alternatively, we could say

$$\mathbf{e}_{i} = \begin{bmatrix} \frac{\partial x_1}{\partial x_i} \\ \frac{\partial x_2}{\partial x_i} \\ \vdots \\ \frac{\partial x_n}{\partial x_i} \\ \end{bmatrix} \\$$

For example,

\begin{align} \mathbf{J}_{f} \mathbf{\mathbf{e}}_1 &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \\ \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \\ \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} \\ \frac{\partial f_2}{\partial x_1} \\ \vdots \\ \frac{\partial f_m}{\partial x_1} \\ \end{bmatrix} \end{align}

### Machine Learning Implications

In machine learning, usually there is only one output in the objective function $f$ during training (model optimization), i.e., $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. To optimize model, we will have to compute the Jacobian matrix $\mathbf{J}_{f}$. The inputs to the objective function, which sometimes might be confused by beginners, are the model parameters $\boldsymbol{\theta} \in \mathbb{R}^{n}$. In practice, $n$ could be extremely large, especially in deep learning. Therefore, computing the accurate Jacobian matrix $\mathbf{J}_{f}(\boldsymbol{\theta})$ using automatic differentiation forward mode would be asymptotically slow.

\begin{align} \mathbf{J}_{f}(\boldsymbol{\theta}) &= \nabla f(\boldsymbol{\theta})^{\top} \\ &= \begin{bmatrix} \frac{\partial f}{\partial \theta_1} & \frac{\partial f}{\partial \theta_2} & \dots & \frac{\partial f}{\partial \theta_n} \\ \end{bmatrix} \end{align}

where we define

\begin{align} \nabla f(\boldsymbol{\theta}) &= \begin{bmatrix} \frac{\partial f}{\partial \theta_1} \\ \frac{\partial f}{\partial \theta_2} \\ \vdots \\ \frac{\partial f}{\partial \theta_n} \\ \end{bmatrix} \end{align}

The Jacobian-vector project becomes

\begin{align} \mathbf{J}_{f}(\boldsymbol{\theta})\mathbf{v} &= \nabla f(\boldsymbol{\theta})^{\top}\mathbf{v} \\ &= \nabla f(\boldsymbol{\theta}) \cdot \mathbf{v} \end{align}

By the definition, this is just the directional derivative of function $f(\boldsymbol{\theta})$ at point $\mathbf{x}$ in direction $\mathbf{v}$, and $\mathbf{J}_{f}(\boldsymbol{\theta})\mathbf{v} \in \mathbb{R}$.

## Reverse Mode

### Evaluation Efficiency

Let’s first consider the case where $m = 1$, i.e., $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. Just like the $n = 1$ case for the forward mode evaluation, in the reverse mode evaluation, we would only need to compute the accumulation of all the derivatives $\left\{ \frac{d{y}}{d{x_1}}, \frac{d{y}}{d{x_2}}, \cdots, \frac{d{y}}{d{x_n}} \right\}$ at the output nodes on the computation graph. There is no additional cost of intermediate node memory.

Let’s then consider the case where $n = 1$, i.e., $f: \mathbb{R} \rightarrow \mathbb{R}^{m}$. Just like the $m = 1$ case for the forward mode evaluation, in the reverse mode evaluation, we would have to keep the accumulated derivatives for different outputs in the intermediate nodes. This additional cost of intermediate node memory, similarly, could be $O(m)$ for one node. So in practice, a more cost-effective way is to do $m$ reverse mode evaluations to evaluate $\frac{d{y_1}}{d{x}}, \frac{d{y_2}}{d{x}}, \cdots, \frac{d{y_m}}{d{x}}$ individually. This would not bring additional cost of intermediate node memory.

Let’s finally consider the general case of function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$. Based on the previous two cases, it is not hard to find out that the cost-effective way is to do $m$ reverse mode evaluations to evaluate $\left\{ \frac{d{y_1}}{d{x_1}}, \frac{d{y_1}}{d{x_2}}, \cdots, \frac{d{y_1}}{d{x_n}} \right\}$, $\left\{ \frac{d{y_2}}{d{x_1}}, \frac{d{y_2}}{d{x_2}}, \cdots, \frac{d{y_2}}{d{x_n}} \right\}$, $\cdots$, $\left\{ \frac{d{y_m}}{d{x_1}}, \frac{d{y_m}}{d{x_2}}, \cdots, \frac{d{y_m}}{d{x_n}} \right\}$, individually.

### Reverse Mode Evaluation is Vector-Jacobian Product

The reverse mode computation graph framework evaluates $\mathbf{v}^{\top} \mathbf{J}_{f}$ where $\mathbf{J}_{f}$ is the Jacobian matrix mentioned in the prerequisites and $\mathbf{v} \in \mathbb{R}^{m}$. Essentially, if we give the reverse mode computation graph framework an input of $\mathbf{v} = \left[ v_1, v_2, \cdots, v_m \right]^{\top}$, the framework returns us the Jacobian-vector product $\mathbf{v}^{\top} \mathbf{J}_{f}$.

\begin{align} \mathbf{v}^{\top} \mathbf{J}_{f} &= \begin{bmatrix} v_1 & v_2 & \dots & v_m \end{bmatrix} \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \\ \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i=1}^{m} \frac{\partial f_i}{\partial x_1} v_i & \sum_{i=1}^{m} \frac{\partial f_i}{\partial x_2} v_i & \dots & \sum_{i=1}^{m} \frac{\partial f_i}{\partial x_m} v_i \end{bmatrix} \\ \end{align}

Again, it should be worth noting that the reverse mode computation graph framework implementation never computes the Jacobian matrix $\mathbf{J}_{f}$ and could return the vector-Jacobian product $\mathbf{v}^{\top} \mathbf{J}_{f}$.

Of course, the full Jacobian could be computed using the reversed mode computation graph framework by running the evaluation $m$ times using the standard basis (one-hot) vectors $\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_m$ where $\mathbb{e}_{i,j} \in \mathbf{R}^m$, $\mathbf{e}_{i,j} = 1$ for $j = i$ and $\mathbf{e}_{i,j} = 0$ for $j \neq i$. Alternatively, we could say

$$\mathbf{e}_{i} = \begin{bmatrix} \frac{\partial f_1}{\partial f_i} \\ \frac{\partial f_2}{\partial f_i} \\ \vdots \\ \frac{\partial f_m}{\partial f_i} \\ \end{bmatrix} \\$$

For example,

\begin{align} \mathbf{\mathbf{e}}_1 ^{\top} \mathbf{J}_{f} &= \begin{bmatrix} 1 & 0 & \dots & 0 \end{bmatrix} \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \\ \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \end{bmatrix} \end{align}

### Machine Learning Implications

In machine learning, usually there is only one output in the objective function $f$ during training (model optimization), i.e., $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. This time, even though $n$ could be extremely large, we could compute the accurate Jacobian matrix $\mathbf{J}_{f}(\boldsymbol{\theta})$ using automatic differentiation reverse mode in just one run. This is why in almost all the deep learning frameworks, the computation of the gradients were implemented as the automatic differentiation reverse mode.

\begin{align} \mathbf{J}_{f}(\boldsymbol{\theta}) &= \nabla f(\boldsymbol{\theta})^{\top} \\ &= \begin{bmatrix} \frac{\partial f}{\partial \theta_1} & \frac{\partial f}{\partial \theta_2} & \dots & \frac{\partial f}{\partial \theta_n} \\ \end{bmatrix} \end{align}

where we define

\begin{align} \nabla f(\boldsymbol{\theta}) &= \begin{bmatrix} \frac{\partial f}{\partial \theta_1} \\ \frac{\partial f}{\partial \theta_2} \\ \vdots \\ \frac{\partial f}{\partial \theta_n} \\ \end{bmatrix} \end{align}

The vector-Jacobian product becomes

\begin{align} \mathbf{\mathbf{v}} ^{\top} \mathbf{J}_{f}(\boldsymbol{\theta}) &= v \nabla f(\boldsymbol{\theta})^{\top} \\ \end{align}

When $v = 1$, the vector-Jacobian product $\mathbf{\mathbf{v}} ^{\top} \mathbf{J}_{f}(\boldsymbol{\theta})$ is just the $f(\boldsymbol{\theta})^{\top}$.

## Forward Mode VS Reverse Mode

In general, for $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, suppose the function $f(\mathbf{x})$ could be evaluated in time $t$, the Jacobian matrix $\mathbf{J}_{f}(\mathbf{x})$ could be evaluated in

• $O(nt)$ with forward mode
• $O(mt)$ with reverse mode

The reverse mode performs much better when $n \gg m$, which is usually the case in modern machine learning, including both the discriminative and generative models.

## Automatic Differentiation Implementation for Neural Networks

During neural network optimization, the optimization function usually only has one scalar output, i.e., $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. Suppose the neural network has parameters parameters $\boldsymbol{\theta} \in \mathbb{R}^{n}$.

Given a vector $\mathbf{v} \in \mathbb{R}^{n}$, the goal for the forward mode is to compute a scalar $\mathbf{J}_{f}(\boldsymbol{\theta})\mathbf{v} \in \mathbb{R}$.

\begin{align} \mathbf{J}_{f}(\boldsymbol{\theta})\mathbf{v} &= \nabla f(\boldsymbol{\theta})^{\top}\mathbf{v} \\ &= \nabla f(\boldsymbol{\theta}) \cdot \mathbf{v} \\ &= \sum_{i=1}^{n} \frac{\partial f}{\partial \theta_i} v_i \\ \end{align}

Given a scalar $v \in \mathbb{R}$, the goal for the reverse mode is to compute a vector $v\mathbf{J}_{f}(\boldsymbol{\theta}) \in \mathbb{R}^n$.

\begin{align} \mathbf{\mathbf{v}} ^{\top} \mathbf{J}_{f}(\boldsymbol{\theta}) &= v \nabla f(\boldsymbol{\theta})^{\top} \\ &= \begin{bmatrix} \frac{\partial f}{\partial \theta_1} v & \frac{\partial f}{\partial \theta_2} v & \dots & \frac{\partial f}{\partial \theta_n} v \\ \end{bmatrix} \end{align}

Let’s see how the gradients were propagated during optimization for both the forward mode and the reverse mode.

### Forward Mode

For any neural network layer $i$, it is a function $f_{i}: \mathbb{R}^{a_{i}} \times \mathbb{R}^{c_{i}} \rightarrow \mathbb{R}^{b_{i}}$. It has a layer primal input $\mathbf{x}_{i} \in \mathbb{R}^{a_{i}}$, optionally a layer primal parameter $\boldsymbol{\theta}_{i} \in \mathbb{R}^{c_{i}}$. The layer evaluates $f(\mathbf{x}_{i})$ and produce a layer primal output $\mathbf{y}_{i} \in \mathbb{R}^{b_{i}}$. In the forward mode, we also have a layer tangent input $\mathbf{x}^{\prime}_{i} \in \mathbb{R}^{a_{i}}$, optionally a layer tangent parameter $\mathbf{v}_{i} \in \mathbb{R}^{c_{i}}$ and produce a layer tangent output $\mathbf{y}^{\prime}_{i} \in \mathbb{R}^{b_{i}}$.

More specifically, suppose the neural network layers are sequential, the layer tangent input $\mathbf{x}^{\prime}_{i} \in \mathbb{R}^{a_{i}}$ is the Jacobian-vector product computed from the previous layers $\{1, 2, \cdots, i-1\}$.

\begin{align} \mathbf{x}^{\prime}_{i} &= \mathbf{J}_{f_{\{1,2,\cdots, i-1\}}}(\boldsymbol{\theta}_{1}, \boldsymbol{\theta}_{2}, \cdots, \boldsymbol{\theta}_{i-1}) \begin{bmatrix} \mathbf{v}_{1} \\ \mathbf{v}_{2} \\ \vdots \\ \mathbf{v}_{i-1} \\ \end{bmatrix} \\ &= \begin{bmatrix} \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{x}_{i, 1}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{x}_{i, 2}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \vdots \\ \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{x}_{i, a_i}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \end{bmatrix} \end{align}

The layer tangent output $\mathbf{y}^{\prime}_{i} \in \mathbb{R}^{b_{i}}$ is the Jacobian-vector product computed from the layers $\{1, 2, \cdots, i-1, i\}$.

\begin{align} \mathbf{y}^{\prime}_{i} &= \mathbf{J}_{f_{\{1,2,\cdots, i-1, i\}}}(\boldsymbol{\theta}_{1}, \boldsymbol{\theta}_{2}, \cdots, \boldsymbol{\theta}_{i-1}, \boldsymbol{\theta}_{i}) \begin{bmatrix} \mathbf{v}_{1} \\ \mathbf{v}_{2} \\ \vdots \\ \mathbf{v}_{i-1} \\ \mathbf{v}_{i} \\ \end{bmatrix} \\ &= \begin{bmatrix} \sum_{j=1}^{i} \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, 1}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \sum_{j=1}^{i} \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, 2}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \vdots \\ \sum_{j=1}^{i} \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, b_i}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \end{bmatrix} \\ &= \begin{bmatrix} \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, 1}}{\partial \boldsymbol{\theta}_{i,k}} \mathbf{v}_{i,k} + \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{y}_{i, 1}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, 2}}{\partial \boldsymbol{\theta}_{i,k}} \mathbf{v}_{i,k} + \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{y}_{i, 2}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \vdots \\ \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, b_i}}{\partial \boldsymbol{\theta}_{i,k}} \mathbf{v}_{i,k} + \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{y}_{i, b_i}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \end{bmatrix} \\ &= \begin{bmatrix} \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, 1}}{\partial \boldsymbol{\theta}_{i,k}} \mathbf{v}_{i,k} \\ \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, 2}}{\partial \boldsymbol{\theta}_{i,k}} \mathbf{v}_{i,k} \\ \vdots \\ \sum_{k=1}^{c_{i}} \frac{\partial \mathbf{y}_{i, b_i}}{\partial \boldsymbol{\theta}_{i,k}} \mathbf{v}_{i,k} \\ \end{bmatrix} + \begin{bmatrix} \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{y}_{i, 1}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{y}_{i, 2}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \vdots \\ \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{y}_{i, b_i}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \end{bmatrix} \\ &= \mathbf{J}_{f_i}(\boldsymbol{\theta}_i)\mathbf{v}_{i} + \begin{bmatrix} \frac{\partial \mathbf{y}_{i, 1}}{\partial \mathbf{x}_{i, 1}} & \frac{\partial \mathbf{y}_{i, 1}}{\partial \mathbf{x}_{i, 2}} & \dots & \frac{\partial \mathbf{y}_{i, 1}}{\partial \mathbf{x}_{i, a_i}} \\ \frac{\partial \mathbf{y}_{i, 2}}{\partial \mathbf{x}_{i, 1}} & \frac{\partial \mathbf{y}_{i, 2}}{\partial \mathbf{x}_{i, 2}} & \dots & \frac{\partial \mathbf{y}_{i, 2}}{\partial \mathbf{x}_{i, a_i}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \mathbf{y}_{i, b_i}}{\partial \mathbf{x}_{i, 1}} & \frac{\partial \mathbf{y}_{i, b_i}}{\partial \mathbf{x}_{i, 2}} & \dots & \frac{\partial \mathbf{y}_{i, b_i}}{\partial \mathbf{x}_{i, a_i}} \\ \end{bmatrix} \begin{bmatrix} \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{x}_{i, 1}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{x}_{i, 2}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \vdots \\ \sum_{j=1}^{i-1} \sum_{k=1}^{c_{i-1}} \frac{\partial \mathbf{x}_{i, a_i}}{\partial \boldsymbol{\theta}_{j,k}} \mathbf{v}_{j,k} \\ \end{bmatrix} \\ &= \mathbf{J}_{f_i}(\boldsymbol{\theta}_i)\mathbf{v}_{i} + \mathbf{J}_{f_i}(\mathbf{x}_i)\mathbf{x}^{\prime}_{i} \end{align}

In fact, it is just the sum of the two Jacobian-vector product from the layer $i$. One treats the layer parameter $\boldsymbol{\theta}_{i}$ as function input variables and the layer input $\mathbf{x}_{i}$ as constant. The other one treats the layer parameter $\boldsymbol{\theta}_{i}$ as constant and the layer input $\mathbf{x}_{i}$ as function input variables. This also extends naturally to situations where there are multiple layer inputs and layer parameters.

With this equation, we could compute the Jacobian-vector product $\mathbf{J}_{f}(\boldsymbol{\theta})\mathbf{v}$ as the neural network gets evaluated forward. One final caveat is that, for the first layer, layer $1$, the tangent layer input $\mathbf{x}^{\prime}_{1}$ has to be $\mathbf{0}$.

### Reverse Mode

For any neural network layer $i$, it is a function $f_{i}: \mathbb{R}^{a_{i}} \times \mathbb{R}^{c_{i}} \rightarrow \mathbb{R}^{b_{i}}$. It has a layer primal input $\mathbf{x}_{i} \in \mathbb{R}^{a_{i}}$, optionally a layer primal parameter $\boldsymbol{\theta}_{i} \in \mathbb{R}^{c_{i}}$. There are no tangent input, parameter, and output required for reverse mode. Instead, there are gradients. We have a layer input gradient $\mathbf{y}^{\prime}_{i} \in \mathbb{R}^{b_{i}}$, optionally a layer parameter gradient $\mathbf{\boldsymbol{\theta}}_{i} \in \mathbb{R}^{c_{i}}$ and produce a layer output gradient $\mathbf{x}^{\prime}_{i} \in \mathbb{R}^{a_{i}}$.

More specifically,

\begin{align} \mathbf{y}^{\prime \top}_{i} &= v \nabla f_{\{i+1, i+2, \cdots\}}(\mathbf{y}_{i} )^{\top} \\ &= \begin{bmatrix} \frac{\partial f}{\partial \mathbf{y}_{i, 1}} v & \frac{\partial f}{\partial \mathbf{y}_{i, 2}} v & \dots & \frac{\partial f}{\partial \mathbf{y}_{i, b_{i}}} v \\ \end{bmatrix} \end{align}

The layer output gradient $\mathbf{x}^{\prime}_{i} \in \mathbb{R}^{a_{i}}$ and layer parameter gradient $\mathbf{\boldsymbol{\theta}}_{i} \in \mathbb{R}^{c_{i}}$ are the vector-Jacobian product computed from the layers $\{i, i+1, \cdots \}$.

\begin{align} \mathbf{x}^{\prime \top}_{i} &= v \nabla f_{\{i, i+1, i+2, \cdots\}}(\mathbf{x}_{i} )^{\top} \\ &= \begin{bmatrix} \frac{\partial f}{\partial \mathbf{x}_{i, 1}} v & \frac{\partial f}{\partial \mathbf{x}_{i, 2}} v & \dots & \frac{\partial f}{\partial \mathbf{x}_{i, a_{i}}} v \\ \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial f}{\partial \mathbf{y}_{i, 1}} v & \frac{\partial f}{\partial \mathbf{y}_{i, 2}} v & \dots & \frac{\partial f}{\partial \mathbf{y}_{i, b_{i}}} v \\ \end{bmatrix} \begin{bmatrix} \frac{\partial \mathbf{y}_{i, 1}}{\partial \mathbf{x}_{i, 1}} & \frac{\partial \mathbf{y}_{i, 1}}{\partial \mathbf{x}_{i, 2}} & \dots & \frac{\partial \mathbf{y}_{i, 1}}{\partial \mathbf{x}_{i, a_i}} \\ \frac{\partial \mathbf{y}_{i, 2}}{\partial \mathbf{x}_{i, 1}} & \frac{\partial \mathbf{y}_{i, 2}}{\partial \mathbf{x}_{i, 2}} & \dots & \frac{\partial \mathbf{y}_{i, 2}}{\partial \mathbf{x}_{i, a_i}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \mathbf{y}_{i, b_i}}{\partial \mathbf{x}_{i, 1}} & \frac{\partial \mathbf{y}_{i, b_i}}{\partial \mathbf{x}_{i, 2}} & \dots & \frac{\partial \mathbf{y}_{i, b_i}}{\partial \mathbf{x}_{i, a_i}} \\ \end{bmatrix} \\ &= \mathbf{y}^{\prime \top}_{i} \mathbf{J}_{f_i}(\mathbf{x}_i) \end{align}

Similarly,

\begin{align} \mathbf{\boldsymbol{\theta}}^{\prime \top}_{i} &= v \nabla f_{\{i, i+1, i+2, \cdots\}}(\mathbf{\boldsymbol{\theta}}_{i} )^{\top} \\ &= \begin{bmatrix} \frac{\partial f}{\partial \boldsymbol{\theta}_{i, 1}} v & \frac{\partial f}{\partial \boldsymbol{\theta}_{i, 2}} v & \dots & \frac{\partial f}{\partial \boldsymbol{\theta}_{i, c_{i}}} v \\ \end{bmatrix} \\ &= \mathbf{y}^{\prime \top}_{i} \mathbf{J}_{f_i}(\boldsymbol{\theta}_i) \end{align}

This also extends naturally to situations where there are multiple layer inputs and layer parameters.

With this equation, we could compute the vector-Jacobian product $v\mathbf{J}_{f}(\boldsymbol{\theta})$ as the neural network gets evaluated backward. Also notice that during optimization, usually $v = 1$.

Lei Mao

02-21-2022

03-17-2022