Computing Hessian Matrix Via Automatic Differentiation
Introduction
In many optimization algorithms, especially those involving Taylor series expansions, we often need to compute the Hessian matrix of a function to improve the effectiveness of the optimization. In fact, automatic differentiation frameworks such as PyTorch and TensorFlow, although primarily designed to compute the Jacobian matrix of a function, can also be used to compute the Hessian matrix, or even higher-order derivatives, of a function.
In this blog post, I will discuss how to compute the Hessian matrix of a function using automatic differentiation frameworks. The article focuses on the mathematical principles and computational concepts, rather than the implementation details and usage of specific automatic differentiation frameworks.
Prerequisites
Jacobian Matrix
Suppose $f: \mathbb{R}^n \to \mathbb{R}^m$ is a function taking a vector $\mathbf{x} \in \mathbb{R}^n$ as input and producing a vector $f(\mathbf{x}) \in \mathbb{R}^m$ as output. The Jacobian matrix $\mathbf{J}$ of $f$, $\mathbf{J}_{f}: \mathbb{R}^n \to \mathbb{R}^{m \times n}$, is defined as the $m \times n$ matrix of first-order partial derivatives:
$$
\begin{align}
\mathbf{J}_{f}
&=
\begin{bmatrix}
\frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} & \cdots & \frac{\partial f_{1}}{\partial x_{n}} \\
\frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} & \cdots & \frac{\partial f_{2}}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_{m}}{\partial x_{1}} & \frac{\partial f_{m}}{\partial x_{2}} & \cdots & \frac{\partial f_{m}}{\partial x_{n}} \\
\end{bmatrix}
\end{align}
$$
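For example, for $f: \mathbb{R}^2 \to \mathbb{R}^2$ with $f(\mathbf{x}) = \left( x_{1} x_{2}, x_{1} + x_{2} \right)$, the Jacobian matrix is
$$
\begin{align}
\mathbf{J}_{f}
&=
\begin{bmatrix}
\frac{\partial (x_{1} x_{2})}{\partial x_{1}} & \frac{\partial (x_{1} x_{2})}{\partial x_{2}} \\
\frac{\partial (x_{1} + x_{2})}{\partial x_{1}} & \frac{\partial (x_{1} + x_{2})}{\partial x_{2}} \\
\end{bmatrix}
=
\begin{bmatrix}
x_{2} & x_{1} \\
1 & 1 \\
\end{bmatrix}
\end{align}
$$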
If $m = 1$, the Jacobian matrix $\mathbf{J}_{f}$ is a row vector, which is often squeezed to a vector and denoted as the gradient $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ of $f$.
$$
\begin{align}
\nabla f
&=
\left( \frac{\partial f}{\partial x_{1}}, \frac{\partial f}{\partial x_{2}}, \cdots, \frac{\partial f}{\partial x_{n}} \right)
\end{align}
$$
This is widely used in deep learning, where $f$ is a neural network composed with a loss function and $\mathbf{x}$ is the vector of parameters of the neural network. The goal is to compute the gradient $\nabla f$ with respect to the parameters $\mathbf{x}$, which is then used in the optimization process to update the parameters.
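For example, in PyTorch the gradient of a simple scalar-valued function can be computed with a single reverse-mode pass. The following is a minimal sketch with an arbitrary example function:

import torch

# f(x) = x_1^2 + x_2^2 + x_3^2, a scalar-valued function of a vector input.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
f = (x ** 2).sum()
# Reverse-mode automatic differentiation populates x.grad with the gradient of f.
f.backward()
print(x.grad)  # tensor([2., 4., 6.]), i.e., 2 * x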
Hessian Matrix
Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a function taking a vector $\mathbf{x} \in \mathbb{R}^n$ as input and producing a scalar $f(\mathbf{x}) \in \mathbb{R}$ as output. The Hessian matrix $\mathbf{H}$ of $f$ is defined as the $n \times n$ square matrix of second-order partial derivatives:
$$
\begin{align}
\mathbf{H}_{f}
&=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_{1}^2} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}} \\
\frac{\partial^2 f}{\partial x_{2} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{2}^2} & \cdots & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_{n} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{n}^2} \\
\end{bmatrix}
\end{align}
$$
Evaluated at a specific point $\mathbf{x} \in \mathbb{R}^n$, the Hessian matrix is
$$
\begin{align}
\mathbf{H}_{f}(\mathbf{x})
&=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_{1}^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}}(\mathbf{x}) \\
\frac{\partial^2 f}{\partial x_{2} \partial x_{1}}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_{2}^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}}(\mathbf{x}) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_{n} \partial x_{1}}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_{n} \partial x_{2}}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_{n}^2}(\mathbf{x}) \\
\end{bmatrix} \\
\end{align}
$$
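For example, for $f: \mathbb{R}^2 \to \mathbb{R}$ with $f(\mathbf{x}) = x_{1}^2 x_{2} + \sin(x_{2})$, the first-order partial derivatives are $\frac{\partial f}{\partial x_{1}} = 2 x_{1} x_{2}$ and $\frac{\partial f}{\partial x_{2}} = x_{1}^2 + \cos(x_{2})$, so the Hessian matrix is
$$
\begin{align}
\mathbf{H}_{f}
&=
\begin{bmatrix}
2 x_{2} & 2 x_{1} \\
2 x_{1} & -\sin(x_{2}) \\
\end{bmatrix}
\end{align}
$$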
Suppose $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ is the gradient of $f$, i.e., the vector of first-order partial derivatives. Then the Hessian matrix of $f$ is the Jacobian matrix of the gradient $\nabla f$ of $f$, i.e., $\mathbf{H}_{f} = \nabla^2 f = \mathbf{J}_{\nabla f}: \mathbb{R}^n \to \mathbb{R}^{n \times n}$ and $\mathbf{H}_{f}(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \mathbf{J}_{\nabla f}(\mathbf{x}) \in \mathbb{R}^{n \times n}$.
$$
\begin{align}
\mathbf{H}_{f}
&=
\mathbf{J}_{\nabla f} \\
&=
\begin{bmatrix}
\frac{\partial \left( \nabla f \right)_{1}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{n}} \\
\frac{\partial \left( \nabla f \right)_{2}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \nabla f \right)_{n}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{n}} \\
\frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_{1}^2} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}} \\
\frac{\partial^2 f}{\partial x_{2} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{2}^2} & \cdots & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_{n} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{n}^2} \\
\end{bmatrix}
\end{align}
$$
Because the gradient $\nabla f \in \mathbb{R}^n$ is just the Jacobian matrix $\mathbf{J}_{f} \in \mathbb{R}^{1 \times n}$ squeezed to a vector, we could also say that the Hessian matrix $\mathbf{H}_{f} \in \mathbb{R}^{n \times n}$ is the Jacobian matrix of the Jacobian matrix of $f$, i.e., $\mathbf{H}_{f} = \mathbf{J}_{\mathbf{J}_{f}}: \mathbb{R}^n \to \mathbb{R}^{n \times n}$ and $\mathbf{H}_{f}(\mathbf{x}) = \mathbf{J}_{\mathbf{J}_{f}}(\mathbf{x}) \in \mathbb{R}^{n \times n}$.
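As a sanity check, this relationship can be verified numerically in PyTorch with the torch.autograd.functional utilities. The following is a minimal sketch using the example function $f(\mathbf{x}) = x_{1}^2 x_{2} + \sin(x_{2})$ from above:

import torch

def f(x):
    # f(x) = x_1^2 * x_2 + sin(x_2)
    return x[0] ** 2 * x[1] + torch.sin(x[1])

x = torch.tensor([1.0, 2.0])

# The Jacobian of the scalar-valued f is its gradient (a vector of length 2 here).
gradient = torch.autograd.functional.jacobian(f, x)
# The Hessian of f, i.e., the Jacobian of the gradient of f.
hessian = torch.autograd.functional.hessian(f, x)

print(gradient)  # approximately [4.0000, 0.5839], i.e., [2 * x_1 * x_2, x_1^2 + cos(x_2)]
print(hessian)   # approximately [[4.0000, 2.0000], [2.0000, -0.9093]], i.e., [[2 * x_2, 2 * x_1], [2 * x_1, -sin(x_2)]]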
Computing Jacobian Matrix In Automatic Differentiation Framework
An automatic differentiation framework such as PyTorch or TensorFlow provides a way to compute the Jacobian matrix of a function using the chain rule. A function is defined as a composition of operators. Once each operator's forward evaluation and gradient evaluation are defined, either explicitly or implicitly, the framework can compute the Jacobian matrix by applying the chain rule to each operator in the function.
Operators In Automatic Differentiation Framework
In the context of automatic differentiation, an operator is a function that takes one or more inputs and produces an output. Operators can be simple mathematical functions, such as addition and multiplication, or more complex functions, such as neural network layers that involve multiple operations.
If a completely new operator, which cannot be expressed as a composition of existing operators, is needed, it can be implemented in the automatic differentiation framework. The operator must implement two functions: forward and backward. The forward function computes the output of the operator given its inputs, while the backward function computes the gradient of the output with respect to its inputs.
For example, suppose the operator Exp has never been defined in PyTorch. We can define it as follows:
class Exp(torch.autograd.Function):
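Only the first line of the class survives in the snippet above. A complete, runnable version might look like the following sketch, which follows PyTorch's standard torch.autograd.Function extension pattern:

import torch

class Exp(torch.autograd.Function):

    @staticmethod
    def forward(ctx, i):
        # Forward evaluation: compute e^i and save the result for the backward pass.
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        # Backward evaluation: d(e^i)/di = e^i, so the incoming gradient is scaled by the saved output.
        (result,) = ctx.saved_tensors
        return grad_output * result

The custom operator is then invoked as Exp.apply(x) rather than by instantiating the class.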
A composite operator, which can be expressed as a composition of existing operators and is sometimes referred to as a layer in a neural network, can also be implemented in the automatic differentiation framework. In this case, only the forward function must be implemented, and the backward function will be automatically generated by the framework by applying the chain rule to each operator in the composite operator.
For example, suppose we want to implement a composite operator ReLU, which is a composition of the Exp operator and the Max operator. We can implement it as follows:
class ReLU(torch.nn.Module):
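The snippet above is likewise truncated. One way to realize the composition described, assuming the identity $\mathrm{ReLU}(x) = \log \left( \max \left( e^{x}, 1 \right) \right)$ is what ties the Exp and Max operators together, is the following sketch:

import torch

class ReLU(torch.nn.Module):

    def forward(self, x):
        # ReLU(x) = log(max(exp(x), 1)), composed from the custom Exp operator defined above
        # and an element-wise maximum. No backward method is needed: autograd generates it
        # by applying the chain rule to each elementary operator in this composition.
        return torch.log(torch.maximum(Exp.apply(x), torch.ones_like(x)))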
The backward function can also be implemented explicitly. This is sometimes useful when the user wants to approximate the gradient of the output with respect to the inputs, for example because the composite operator is not differentiable, or when the backward function's performance can be improved by using a different algorithm or implementation.
For a composite operator, whether the backward function is implemented explicitly makes a difference in how the operator is represented in the computation graph. Specifically, if the backward function is not implemented explicitly, the composite operator is represented as a subgraph consisting of the elementary operators that compose it. If the backward function is implemented explicitly, the composite operator is represented as a single node in the computation graph. Either way, the framework can compute the Jacobian matrix of the composite operator by applying the chain rule to each operator while traversing the computation graph.
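For illustration, the same composite could be wrapped in a single torch.autograd.Function with an explicitly implemented backward function, so that it appears as a single node in the computation graph. The following is a sketch; the class name ReLUWithExplicitBackward is hypothetical:

import torch

class ReLUWithExplicitBackward(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x):
        # The same composite forward computation as before, now recorded as a single node.
        ctx.save_for_backward(x)
        return torch.log(torch.maximum(torch.exp(x), torch.ones_like(x)))

    @staticmethod
    def backward(ctx, grad_output):
        # Explicit gradient: d ReLU(x)/dx is 1 where x > 0 and 0 elsewhere.
        (x,) = ctx.saved_tensors
        return grad_output * (x > 0).to(grad_output.dtype)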
Finally, note that to compute the Jacobian matrix, the forward function must be implemented regardless of whether the automatic differentiation framework uses forward-mode or reverse-mode automatic differentiation. This is because the backward function requires the output of the forward function to compute the gradient of the output with respect to the inputs. That is why the forward function must be implemented whereas the backward function is optional.
Computing Hessian Matrix In Automatic Differentiation Framework
The Hessian matrix cannot be computed directly using the chain rule in an automatic differentiation framework, because, as mentioned previously, the framework is usually only designed to compute the Jacobian matrix of a function. However, because the Hessian matrix is the Jacobian matrix of the gradient of a function, we can compute the Hessian matrix by first computing the gradient of the function and then computing the Jacobian matrix of that gradient function.
Concretely, we can compute the Hessian matrix of a function $f$ as follows:
$$
\begin{align}
\mathbf{H}_{f}
&=
\mathbf{J}_{\nabla f} \\
&=
\begin{bmatrix}
\frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{n}} \\
\frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial \left( \nabla f \right)_{1}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{n}} \\
\frac{\partial \left( \nabla f \right)_{2}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \nabla f \right)_{n}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\nabla \left( \nabla f \right)_{1} \\
\nabla \left( \nabla f \right)_{2} \\
\vdots \\
\nabla \left( \nabla f \right)_{n} \\
\end{bmatrix} \\
&=
\nabla \left( \nabla f \right) \\
\end{align}
$$
Each of $\nabla \left( \nabla f \right)_{1}$, $\nabla \left( \nabla f \right)_{2}$, …, $\nabla \left( \nabla f \right)_{n}$ can be computed using the automatic differentiation framework, provided that the forward functions of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ are defined and the input values are provided.
Since we usually only define the forward function of $f$ in an automatic differentiation framework, how could we define the forward functions of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$? Their mathematical formulations can be very complicated, and we would not be able to derive each of them by hand if $f$ is complicated and $n$ is large. It turns out that we get those forward functions for free from the backward function of $f$. When the computation graph is traversed to compute the gradient $\nabla f$ of the function with respect to its inputs, the mathematical formulations of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ are recorded in the computation graph. That is to say, the backward function of $f$ used for computing $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ is exactly the forward function of the functions $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$, so those forward functions become defined in the computation graph. As I mentioned previously, the backward function is implicitly defined once the forward function is defined, so the backward functions of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ also become defined in the computation graph.
The input values needed to evaluate the functions $\nabla \left( \nabla f \right)_{1}$, $\nabla \left( \nabla f \right)_{2}$, …, $\nabla \left( \nabla f \right)_{n}$ are also straightforward to obtain. The second backward pass simply starts from the output values of the functions $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$, which have already been computed at the same input $\mathbf{x}$ during the first backward pass.
Using this approach, one can compute not only the Hessian matrix of a function $f$ but also derivatives of any order of the function $f$ by applying the automatic differentiation framework repeatedly.
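In PyTorch, for example, this corresponds to calling the reverse-mode pass twice: the first call uses create_graph=True so that the gradient computation itself is recorded in the computation graph, and the second call differentiates each gradient component. The following is a minimal sketch, reusing the example function from above:

import torch

def f(x):
    # f(x) = x_1^2 * x_2 + sin(x_2)
    return x[0] ** 2 * x[1] + torch.sin(x[1])

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = f(x)

# First backward pass: compute the gradient and record its computation graph (create_graph=True),
# so that the gradient components themselves can be differentiated again.
(gradient,) = torch.autograd.grad(y, x, create_graph=True)

# Second backward passes: differentiate each gradient component to obtain one row of the Hessian.
hessian_rows = []
for i in range(x.numel()):
    (row,) = torch.autograd.grad(gradient[i], x, retain_graph=True)
    hessian_rows.append(row)
hessian = torch.stack(hessian_rows)

print(hessian)  # approximately [[4.0000, 2.0000], [2.0000, -0.9093]]

Note that the loop performs one reverse-mode pass per gradient component, which is exactly the cost issue discussed next.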
Finally, computing the Hessian matrix of a function $f$ using an automatic differentiation framework is expensive, because the number of parameters $n$ is usually large. In a neural network, even though evaluating some of the functions $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ can be vectorized, it would not help much because the number of parameters $n$ is still large. To compute the Jacobian matrix of the function $f$ in an automatic differentiation framework, we only need to run the forward-mode or reverse-mode pass once. To compute the Hessian matrix of the function $f$, even if evaluating every $m$ of the functions $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ can be vectorized, we still need to run the forward-mode or reverse-mode pass $\frac{n}{m}$ times. This is often impractical when $f$ is complicated and $n$ is large. This is also why we only see the Hessian matrix being computed in optimization algorithms when $f$ is simple and $n$ is small.