Automatic Differentiation Revisited
Introduction
Previously, I wrote a very lengthy article, “Automatic Differentiation”, to discuss how automatic differentiation, including the forward mode and the reverse mode, works in modern deep learning frameworks. That discussion can be significantly simplified by working with the chain rule expressed directly in terms of Jacobian matrices, without going into the details of how each matrix element is computed using dot products.
In this blog post, I will revisit automatic differentiation and provide a more concise explanation of how it works in modern deep learning frameworks using Jacobian matrices, the chain rule, the Jacobian-vector product, and the vector-Jacobian product.
Automatic Differentiation
In deep learning, we often have the following composition of functions
$$
\begin{align}
h(\mathbf{x}) &= f(\mathbf{g}(\mathbf{x})) \\
\end{align}
$$
where $f: \mathbb{R}^{m} \to \mathbb{R}$ is a loss function, $\mathbf{g}: \mathbb{R}^{n} \to \mathbb{R}^{m}$ is a neural network, $\mathbf{x} \in \mathbb{R}^{n}$ is the vector of neural network parameter variables to be optimized, $\mathbf{g}(\mathbf{x}) \in \mathbb{R}^{m}$ is the output of the neural network, and $f(\mathbf{g}(\mathbf{x})) \in \mathbb{R}$ is the loss value. All the data, including the reference inputs and the labels, are constants and are not included in the function composition expression.
The chain rule becomes
$$
\begin{align}
\nabla_{\mathbf{x}} h(\mathbf{x}) &= \nabla_{\mathbf{g}} f(\mathbf{g}(\mathbf{x})) \mathbf{J}_{\mathbf{g}}(\mathbf{x}) \\
\end{align}
$$
where $\nabla_{\mathbf{x}} h(\mathbf{x})$ is the gradient of $h$ with respect to $\mathbf{x}$, which is a vector in $\mathbb{R}^{n}$. $\nabla_{\mathbf{g}} f(\mathbf{g}(\mathbf{x}))$ is the gradient of $f$ with respect to $\mathbf{g}(\mathbf{x})$, which is a vector in $\mathbb{R}^{m}$. $\mathbf{J}_{\mathbf{g}}(\mathbf{x})$ is the Jacobian matrix of $\mathbf{g}$ with respect to $\mathbf{x}$, which is a matrix in $\mathbb{R}^{m \times n}$. Here the gradients are treated as row vectors so that they can be multiplied on the left of the Jacobian matrices.
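As a quick sanity check of this chain rule, here is a minimal NumPy sketch with a made-up $\mathbf{g}: \mathbb{R}^{3} \to \mathbb{R}^{2}$ and $f: \mathbb{R}^{2} \to \mathbb{R}$ chosen purely for illustration; the analytical gradient-Jacobian product is compared against finite differences.

```python
import numpy as np

# Toy example: g: R^3 -> R^2, f: R^2 -> R, h = f(g(x)).
def g(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(y):
    return y[0] ** 2 + 3.0 * y[1]

def jacobian_g(x):
    # Analytical Jacobian of g, shape (2, 3).
    return np.array([
        [x[1], x[0], 0.0],
        [0.0, 0.0, np.cos(x[2])],
    ])

def grad_f(y):
    # Analytical gradient of f, shape (2,).
    return np.array([2.0 * y[0], 3.0])

x = np.array([1.0, 2.0, 0.5])

# Chain rule: gradient of f (row vector) times Jacobian of g.
grad_chain = grad_f(g(x)) @ jacobian_g(x)

# Finite-difference check of the gradient of h = f(g(x)).
eps = 1e-6
grad_fd = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2.0 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_chain, grad_fd, atol=1e-5))  # True
```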
Because the neural network typically has multiple layers, we have $\mathbf{x} = [\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{L}]$, where $L$ is the number of layers and the parameters $\mathbf{x}_{i}$ are listed in the evaluation order of the layers. Accordingly, the layer functions of the neural network are $\mathbf{g}_{1}, \mathbf{g}_{2}, \ldots, \mathbf{g}_{L}$, where $\mathbf{g}_{i}$ is the function computed by the $i$-th layer and $\mathbf{x}_{i}$ is the vector of its parameters.
For simplicity, assuming there are no branches in the neural network, we have
$$
\begin{align}
\mathbf{y}_{1} &= \mathbf{g}_{1}(\mathbf{x}_{1}) \\
\mathbf{y}_{2} &= \mathbf{g}_{2}(\mathbf{x}_{2}, \mathbf{y}_{1}) \\
&\vdots \\
\mathbf{y}_{L} &= \mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \\
h(\mathbf{x}) &= f(\mathbf{y}_{L}) \\
\end{align}
$$
where $\mathbf{y}_{i}$ is the output, sometimes referred to as the activation, of the $i$-th layer, and $\mathbf{x}_{i}$ is the vector of parameter variables of the $i$-th layer. The output of the last layer is the input to the loss function $f$.
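To make the composition concrete, here is a minimal NumPy sketch of a three-layer chain following the structure above; the layer functions, sizes, and loss are made up for illustration, and the constant reference input is folded into the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# The reference input is a constant folded into the first layer g_1.
constant_input = rng.normal(size=4)

# Parameters x_i of each layer: here simply one weight matrix per layer.
x1 = rng.normal(size=(5, 4))
x2 = rng.normal(size=(3, 5))
x3 = rng.normal(size=(2, 3))

def g1(x1):
    return np.tanh(x1 @ constant_input)   # y_1 = g_1(x_1)

def g2(x2, y1):
    return np.tanh(x2 @ y1)               # y_2 = g_2(x_2, y_1)

def g3(x3, y2):
    return np.tanh(x3 @ y2)               # y_3 = g_3(x_3, y_2)

def f(y3):
    return float(np.sum(y3 ** 2))         # loss = f(y_3)

# Forward evaluation in layer order; the activations y_i are exactly
# what the reverse mode later needs to save.
y1 = g1(x1)
y2 = g2(x2, y1)
y3 = g3(x3, y2)
loss = f(y3)
print(loss)
```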
For the parameter variables $\mathbf{x}_{i}$ of the $i$-th layer, note that
$$
\begin{align}
h(\mathbf{x}_{i}) &= f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \\
\end{align}
$$
Using the chain rule, we have
$$
\begin{align}
\nabla_{\mathbf{x}_{i}} h(\mathbf{x}) &= \nabla_{\mathbf{g}_{L}} f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \mathbf{J}_{\mathbf{g}_{L}, \mathbf{x}_{i}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \\
\end{align}
$$
However, the Jacobian matrix $\mathbf{J}_{\mathbf{g}_{L}, \mathbf{x}_{i}}(\mathbf{x}_{L}, \mathbf{y}_{L-1})$ cannot be computed directly for $i < L$ because $\mathbf{g}_{L}$ does not directly depend on $\mathbf{x}_{i}$; it depends on $\mathbf{x}_{i}$ only through $\mathbf{y}_{L-1}$.
We notice that
$$
\begin{align}
\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) &= \mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{g}_{L-1}(\mathbf{x}_{L-1}, \mathbf{y}_{L-2})) \\
\end{align}
$$
Using the chain rule, we have
$$
\begin{align}
\mathbf{J}_{\mathbf{g}_{L}, \mathbf{x}_{i}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) &= \mathbf{J}_{\mathbf{g}_{L}, \mathbf{y}_{L-1}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \mathbf{J}_{\mathbf{g}_{L-1}, \mathbf{x}_{i}}(\mathbf{x}_{L-1}, \mathbf{y}_{L-2}) \\
\end{align}
$$
Now the Jacobian matrix $\mathbf{J}_{\mathbf{g}_{L}, \mathbf{y}_{L-1}}(\mathbf{x}_{L}, \mathbf{y}_{L-1})$ can be computed because the layer function $\mathbf{g}_{L}$ directly depends on the output of the previous layer $\mathbf{y}_{L-1}$.
We will recursively perform this process until we reach the layer $i$, where the Jacobian matrix $\mathbf{J}_{\mathbf{g}_{i}, \mathbf{x}_{i}}(\mathbf{x}_{i}, \mathbf{y}_{i-1})$ can be computed directly because the layer function $\mathbf{g}_{i}$ directly depends on the parameter variables $\mathbf{x}_{i}$.
Putting all the chain rule pieces together, we have derived the formula for computing the gradient of the loss function with respect to the parameter variables $\mathbf{x}_{i}$ of the $i$-th layer of the neural network, which is essentially a series of matrix multiplications of the Jacobian matrices of the layer functions.
$$
\begin{align}
\nabla_{\mathbf{x}_{i}} h(\mathbf{x}) &= \nabla_{\mathbf{g}_{L}} f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \mathbf{J}_{\mathbf{g}_{L}, \mathbf{x}_{i}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \\
&= \nabla_{\mathbf{g}_{L}} f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \mathbf{J}_{\mathbf{g}_{L}, \mathbf{y}_{L-1}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \mathbf{J}_{\mathbf{g}_{L-1}, \mathbf{x}_{i}}(\mathbf{x}_{L-1}, \mathbf{y}_{L-2}) \\
&= \nabla_{\mathbf{g}_{L}} f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \mathbf{J}_{\mathbf{g}_{L}, \mathbf{y}_{L-1}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \mathbf{J}_{\mathbf{g}_{L-1}, \mathbf{y}_{L-2}}(\mathbf{x}_{L-1}, \mathbf{y}_{L-2}) \mathbf{J}_{\mathbf{g}_{L-2}, \mathbf{x}_{i}}(\mathbf{x}_{L-2}, \mathbf{y}_{L-3}) \\
&\vdots \\
&= \nabla_{\mathbf{g}_{L}} f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \mathbf{J}_{\mathbf{g}_{L}, \mathbf{y}_{L-1}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \mathbf{J}_{\mathbf{g}_{L-1}, \mathbf{y}_{L-2}}(\mathbf{x}_{L-1}, \mathbf{y}_{L-2}) \cdots \mathbf{J}_{\mathbf{g}_{i+1}, \mathbf{y}_{i}}(\mathbf{x}_{i+1}, \mathbf{y}_{i}) \mathbf{J}_{\mathbf{g}_{i}, \mathbf{x}_{i}}(\mathbf{x}_{i}, \mathbf{y}_{i-1}) \\
\end{align}
$$
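This formula can be verified numerically on a tiny example. The sketch below uses made-up elementwise layers so that every Jacobian is a diagonal matrix, and compares the chained Jacobian product for $\mathbf{x}_{1}$ against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Toy layers with elementwise products so the Jacobians are just diagonal
# matrices (sizes and functions are made up for illustration).
c = rng.normal(size=n)                  # constant reference input
x1, x2, x3 = rng.normal(size=(3, n))    # per-layer parameters

def forward(x1, x2, x3):
    y1 = x1 * c          # g_1(x_1)
    y2 = x2 * y1         # g_2(x_2, y_1)
    y3 = x3 * y2         # g_3(x_3, y_2)
    return y1, y2, y3, float(np.sum(y3 ** 2))   # f(y_3) = ||y_3||^2

y1, y2, y3, loss = forward(x1, x2, x3)

# Jacobians of each layer (all diagonal for these elementwise layers).
J_g3_y2 = np.diag(x3)    # d g_3 / d y_2
J_g2_y1 = np.diag(x2)    # d g_2 / d y_1
J_g1_x1 = np.diag(c)     # d g_1 / d x_1
grad_f = 2.0 * y3        # d f / d y_3, treated as a row vector

# The derived chain: grad_f * J_{g3,y2} * J_{g2,y1} * J_{g1,x1}.
grad_x1 = grad_f @ J_g3_y2 @ J_g2_y1 @ J_g1_x1

# Finite-difference check on h(x_1) with x_2 and x_3 held fixed.
eps = 1e-6
grad_fd = np.array([
    (forward(x1 + eps * e, x2, x3)[-1] - forward(x1 - eps * e, x2, x3)[-1])
    / (2.0 * eps)
    for e in np.eye(n)
])
print(np.allclose(grad_x1, grad_fd, atol=1e-5))  # True
```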
More complex neural networks have branches, and a layer can depend on the outputs of multiple previous layers. The chain rule can be derived in a similar way, but the Jacobian matrices will be larger and more complex.
For example, suppose we have
$$
\begin{align}
\mathbf{y}_{1} &= \mathbf{g}_{1}(\mathbf{x}_{1}) \\
\mathbf{y}_{2} &= \mathbf{g}_{2}(\mathbf{x}_{2}, \mathbf{y}_{1}) \\
\mathbf{y}_{3} &= \mathbf{g}_{3}(\mathbf{x}_{3}, \mathbf{y}_{1}, \mathbf{y}_{2}) \\
&\vdots \\
\mathbf{y}_{L} &= \mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \\
h(\mathbf{x}) &= f(\mathbf{y}_{L}) \\
\end{align}
$$
We can rewrite the expression by grouping $\mathbf{y}_{1}$ and $\mathbf{y}_{2}$ into a single output $\mathbf{y}'_{2}$ so that the previously derived chain rule formula can be reused.
$$
\begin{align}
\mathbf{y}'_{2}
&= \mathbf{g}'_{2}(\mathbf{x}_{1}) \\
&= [\mathbf{y}_{1}, \mathbf{y}_{2}]^{\top} \\
&= [\mathbf{g}_{1}(\mathbf{x}_{1}), \mathbf{g}_{2}(\mathbf{x}_{2}, \mathbf{y}_{1})]^{\top} \\
\mathbf{y}_{3} &= \mathbf{g}_{3}(\mathbf{x}_{3}, \mathbf{y}'_{2}) \\
&\vdots \\
\mathbf{y}_{L} &= \mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \\
h(\mathbf{x}) &= f(\mathbf{y}_{L}) \\
\end{align}
$$
The Jacobian matrix $\mathbf{J}_{\mathbf{g}_{3}, \mathbf{x}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1}, \mathbf{y}_{2})$ using the previously derived chain rule formula can be computed as follows.
$$
\begin{align}
\mathbf{J}_{\mathbf{g}_{3}, \mathbf{x}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1}, \mathbf{y}_{2})
&= \mathbf{J}_{\mathbf{g}_{3}, \mathbf{x}_{1}}(\mathbf{x}_{3}, \mathbf{y}'_{2}) \\
&= \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}'_{2}}(\mathbf{x}_{3}, \mathbf{y}'_{2}) \mathbf{J}_{\mathbf{g}'_{2}, \mathbf{x}_{1}}(\mathbf{x}_{1}) \\
\end{align}
$$
The Jacobian matrix $\mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}'_{2}}(\mathbf{x}_{3}, \mathbf{y}'_{2})$ is just the horizontal (column-wise) concatenation of the Jacobian matrices $\mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1})$ and $\mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{2}}(\mathbf{x}_{3}, \mathbf{y}_{2})$.
The Jacobian matrix $\mathbf{J}_{\mathbf{g}'_{2}, \mathbf{x}_{1}}(\mathbf{x}_{1})$ is just the vertical (row-wise) stacking of the Jacobian matrices $\mathbf{J}_{\mathbf{g}_{1}, \mathbf{x}_{1}}(\mathbf{x}_{1})$ and $\mathbf{J}_{\mathbf{g}_{2}, \mathbf{x}_{1}}(\mathbf{x}_{1}) = \mathbf{J}_{\mathbf{g}_{2}, \mathbf{y}_{1}}(\mathbf{x}_{2}, \mathbf{y}_{1}) \mathbf{J}_{\mathbf{g}_{1}, \mathbf{x}_{1}}(\mathbf{x}_{1})$.
The Jacobian matrix $\mathbf{J}_{\mathbf{g}_{3}, \mathbf{x}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1}, \mathbf{y}_{2})$ then becomes a block matrix product, which expands into a sum of matrix products.
$$
\begin{align}
\mathbf{J}_{\mathbf{g}_{3}, \mathbf{x}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1}, \mathbf{y}_{2})
&= \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}'_{2}}(\mathbf{x}_{3}, \mathbf{y}'_{2}) \mathbf{J}_{\mathbf{g}'_{2}, \mathbf{x}_{1}}(\mathbf{x}_{1}) \\
&= \begin{bmatrix} \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1}) & \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{2}}(\mathbf{x}_{3}, \mathbf{y}_{2}) \end{bmatrix} \begin{bmatrix} \mathbf{J}_{\mathbf{g}_{1}, \mathbf{x}_{1}}(\mathbf{x}_{1}) \\ \mathbf{J}_{\mathbf{g}_{2}, \mathbf{x}_{1}}(\mathbf{x}_{1}) \end{bmatrix} \\
&= \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1}) \mathbf{J}_{\mathbf{g}_{1}, \mathbf{x}_{1}}(\mathbf{x}_{1}) + \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{2}}(\mathbf{x}_{3}, \mathbf{y}_{2}) \mathbf{J}_{\mathbf{g}_{2}, \mathbf{x}_{1}}(\mathbf{x}_{1}) \\
&= \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{1}}(\mathbf{x}_{3}, \mathbf{y}_{1}) \mathbf{J}_{\mathbf{g}_{1}, \mathbf{x}_{1}}(\mathbf{x}_{1}) + \mathbf{J}_{\mathbf{g}_{3}, \mathbf{y}_{2}}(\mathbf{x}_{3}, \mathbf{y}_{2}) \mathbf{J}_{\mathbf{g}_{2}, \mathbf{y}_{1}}(\mathbf{x}_{2}, \mathbf{y}_{1}) \mathbf{J}_{\mathbf{g}_{1}, \mathbf{x}_{1}}(\mathbf{x}_{1}) \\
\end{align}
$$
Therefore, no matter how complex the neural network is, the formula for computing the gradient of the loss function with respect to the parameter variables of the neural network is always a series of matrix multiplications of the Jacobian matrices of the layer functions, summed over the branches when a layer feeds into multiple later layers.
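The branched formula above can also be checked numerically. The following sketch uses made-up elementwise layers, so every Jacobian is diagonal, and compares the sum-of-products expression for $\mathbf{J}_{\mathbf{g}_{3}, \mathbf{x}_{1}}$ against a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3

# Toy branched network (made-up elementwise layers, so every Jacobian is
# diagonal): y_1 = x_1 * c, y_2 = x_2 * y_1, y_3 = x_3 * (y_1 + y_2).
c = rng.normal(size=n)
x1, x2, x3 = rng.normal(size=(3, n))

def y3_of_x1(x1):
    y1 = x1 * c
    y2 = x2 * y1
    return x3 * (y1 + y2)    # g_3 depends on both y_1 and y_2

# Per-branch Jacobians.
J_g3_y1 = np.diag(x3)
J_g3_y2 = np.diag(x3)
J_g2_y1 = np.diag(x2)
J_g1_x1 = np.diag(c)

# Sum-of-products formula derived above.
J_sum = J_g3_y1 @ J_g1_x1 + J_g3_y2 @ J_g2_y1 @ J_g1_x1

# Finite-difference Jacobian of y_3 with respect to x_1 as a reference.
eps = 1e-6
J_fd = np.stack([
    (y3_of_x1(x1 + eps * e) - y3_of_x1(x1 - eps * e)) / (2.0 * eps)
    for e in np.eye(n)
], axis=1)
print(np.allclose(J_sum, J_fd, atol=1e-5))  # True
```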
Forward Mode and Reverse Mode
This chain rule formula for computing the gradient of the loss function with respect to the parameter variables of the neural network does not dictate the order in which the matrix multiplications are carried out.
Consider the gradient of the loss function with respect to the parameter variables $\mathbf{x}_{i}$ that we derived for our example previously,
$$
\begin{align}
\nabla_{\mathbf{x}_{i}} h(\mathbf{x})
&= \nabla_{\mathbf{g}_{L}} f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \mathbf{J}_{\mathbf{g}_{L}, \mathbf{y}_{L-1}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \mathbf{J}_{\mathbf{g}_{L-1}, \mathbf{y}_{L-2}}(\mathbf{x}_{L-1}, \mathbf{y}_{L-2}) \cdots \mathbf{J}_{\mathbf{g}_{i+1}, \mathbf{y}_{i}}(\mathbf{x}_{i+1}, \mathbf{y}_{i}) \mathbf{J}_{\mathbf{g}_{i}, \mathbf{x}_{i}}(\mathbf{x}_{i}, \mathbf{y}_{i-1}) \\
\end{align}
$$
Technically, we could compute the matrix multiplications from left to right or from right to left. However, the order of the matrix multiplications will affect the computation and storage complexity.
If the matrix multiplications are computed from left to right, all the intermediate layer outputs $\mathbf{y}_{i}$ have to be computed and saved before the matrix multiplications, and they are consumed in the reverse of the order in which they were computed during the forward evaluation of the neural network. This is the reverse mode of automatic differentiation.
When the matrix multiplications are computed from left to right, the running result is always a single vector of partial derivatives, so every step multiplies a vector with a Jacobian matrix. Denoting this vector as a column vector $\mathbf{v}$, each step computes $\mathbf{v}^{\top} \mathbf{J}$, which is why the reverse mode of automatic differentiation is also called the vector-Jacobian product (VJP).
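The following NumPy sketch illustrates the left-to-right sweep with made-up Jacobian shapes: the running value stays a vector, and every step is a vector-Jacobian product.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up shapes for illustration: the gradient of f (a row vector) followed
# by the layer Jacobians, ordered from the last layer down to layer i.
grad_f = rng.normal(size=6)                       # in R^{m_L}
jacobians = [
    rng.normal(size=(6, 5)),   # J_{g_L, y_{L-1}}
    rng.normal(size=(5, 7)),   # J_{g_{L-1}, y_{L-2}}
    rng.normal(size=(7, 4)),   # J_{g_i, x_i}
]

# Reverse mode: sweep left to right, so the running value v stays a vector
# and every step is a vector-Jacobian product v^T J.
v = grad_f
for J in jacobians:
    v = v @ J          # the shape changes but always stays 1-D
print(v.shape)         # (4,) -- the gradient with respect to x_i
```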
If the matrix multiplications are computed from right to left, the products of the Jacobian matrices can be accumulated as the layers of the neural network are evaluated in the forward order. This is the forward mode of automatic differentiation.
Although the forward mode does not have to save all the intermediate layer outputs $\mathbf{y}_{i}$, computing the Jacobian matrix multiplications from right to left means the running product is a matrix rather than a vector, because the right-most Jacobian $\mathbf{J}_{\mathbf{g}_{i}, \mathbf{x}_{i}}(\mathbf{x}_{i}, \mathbf{y}_{i-1})$ has as many columns as there are parameters in $\mathbf{x}_{i}$. Every step is therefore a matrix-matrix product, which is much more expensive in computation and storage than the vector-Jacobian products performed in the reverse mode for the same purpose.
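To make the cost difference concrete, the sketch below counts the multiply-adds of both sweep orders for the same made-up chain of Jacobian shapes; the left-to-right (reverse mode) sweep only ever multiplies a vector with a matrix, while the right-to-left (forward mode) sweep multiplies matrices with matrices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Same made-up chain as above; count multiply-adds for both sweep orders.
grad_f = rng.normal(size=6)
jacobians = [rng.normal(size=(6, 5)), rng.normal(size=(5, 7)),
             rng.normal(size=(7, 4))]

def matmul_cost(a_shape, b_shape):
    # Multiply-adds of a dense (p x q) @ (q x r) product.
    p, q = a_shape
    _, r = b_shape
    return p * q * r

# Left to right (reverse mode): vector times matrix at every step.
cost_reverse, shape = 0, (1, grad_f.shape[0])
for J in jacobians:
    cost_reverse += matmul_cost(shape, J.shape)
    shape = (1, J.shape[1])

# Right to left (forward mode): matrix times matrix until the very end.
cost_forward, shape = 0, jacobians[-1].shape
for J in reversed(jacobians[:-1]):
    cost_forward += matmul_cost(J.shape, shape)
    shape = (J.shape[0], shape[1])
cost_forward += matmul_cost((1, grad_f.shape[0]), shape)

print(cost_reverse, cost_forward)  # the reverse sweep needs fewer multiply-adds
```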
To reduce the storage complexity of the forward mode of automatic differentiation, instead of computing $\nabla_{\mathbf{x}_{i}} h(\mathbf{x}) = [\frac{\partial h}{\partial \mathbf{x}_{i, 1}}, \frac{\partial h}{\partial \mathbf{x}_{i, 2}}, \ldots, \frac{\partial h}{\partial \mathbf{x}_{i, n}}]$ in one forward pass using full Jacobian matrix multiplications, we can compute a single partial derivative $\frac{\partial h}{\partial \mathbf{x}_{i, j}}$ per forward pass and perform multiple forward passes.
$$
\begin{align}
\frac{\partial h}{\partial \mathbf{x}_{i, j}}
&= \nabla_{\mathbf{x}_{i}} h(\mathbf{x}) \mathbf{e}_{j} \\
&= \nabla_{\mathbf{g}_{L}} f(\mathbf{g}_{L}(\mathbf{x}_{L}, \mathbf{y}_{L-1})) \mathbf{J}_{\mathbf{g}_{L}, \mathbf{y}_{L-1}}(\mathbf{x}_{L}, \mathbf{y}_{L-1}) \mathbf{J}_{\mathbf{g}_{L-1}, \mathbf{y}_{L-2}}(\mathbf{x}_{L-1}, \mathbf{y}_{L-2}) \cdots \mathbf{J}_{\mathbf{g}_{i+1}, \mathbf{y}_{i}}(\mathbf{x}_{i+1}, \mathbf{y}_{i}) \mathbf{J}_{\mathbf{g}_{i}, \mathbf{x}_{i}}(\mathbf{x}_{i}, \mathbf{y}_{i-1}) \mathbf{e}_{j} \\
\end{align}
$$
where $\mathbf{e}_{j}$ is the $j$-th standard basis vector whose dimension equals the number of parameter variables in $\mathbf{x}_{i}$.
Because the running product is now a single column vector of derivatives when the matrix multiplications are computed from right to left, and this vector is often denoted as $\mathbf{v}$, each step computes $\mathbf{J} \mathbf{v}$, which is why the forward mode of automatic differentiation is also called the Jacobian-vector product (JVP).
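In practice, deep learning frameworks expose these two primitives directly. As an illustration, the sketch below uses JAX, whose `jax.jvp` and `jax.vjp` implement the Jacobian-vector product and the vector-Jacobian product respectively; the toy scalar-valued function `h` is made up for illustration and stands in for the loss as a function of the parameters.

```python
import jax
import jax.numpy as jnp

# A toy scalar-valued h; x stands in for the parameters being differentiated.
def h(x):
    return jnp.sum(jnp.tanh(x) ** 2)

x = jnp.array([0.3, -1.2, 0.7])

# Forward mode: one JVP per standard basis vector e_j yields one partial
# derivative per forward pass.
partials = []
for j in range(x.shape[0]):
    e_j = jnp.zeros_like(x).at[j].set(1.0)
    _, dh_dxj = jax.jvp(h, (x,), (e_j,))
    partials.append(dh_dxj)

# Reverse mode: a single VJP recovers the whole gradient in one backward sweep.
y, vjp_fn = jax.vjp(h, x)
(grad,) = vjp_fn(jnp.ones_like(y))

print(jnp.allclose(jnp.array(partials), grad))  # True
```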
References
Automatic Differentiation Revisited
https://leimao.github.io/blog/Automatic-Differentiation-Revisited/