Computing Hessian Matrix Via Automatic Differentiation

Introduction

In many optimization algorithms, especially those involving Taylor series expansions, we often need to compute the Hessian matrix of a function to improve the effectiveness of the optimization. In fact, automatic differentiation frameworks such as PyTorch and TensorFlow, even though they are primarily designed to compute the Jacobian matrix of a function, can also be used to compute the Hessian matrix of a function, or even its higher-order derivatives.

In this blog post, I will discuss how to compute the Hessian matrix of a function using automatic differentiation frameworks. The article focuses on the mathematical principles and the computational concepts, rather than the implementations and usages of specific automatic differentiation frameworks.

Prerequisites

Jacobian Matrix

Suppose $f: \mathbb{R}^n \to \mathbb{R}^m$ is a function taking a vector $\mathbf{x} \in \mathbb{R}^n$ as input and producing a vector $f(\mathbf{x}) \in \mathbb{R}^m$ as output. The Jacobian matrix $\mathbf{J}$ of $f$, $\mathbf{J}_{f}: \mathbb{R}^n \to \mathbb{R}^{m \times n}$, is defined as the $m \times n$ matrix of first-order partial derivatives:

$$
\begin{align}
\mathbf{J}_{f}
&=
\begin{bmatrix}
\frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} & \cdots & \frac{\partial f_{1}}{\partial x_{n}} \\
\frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} & \cdots & \frac{\partial f_{2}}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_{m}}{\partial x_{1}} & \frac{\partial f_{m}}{\partial x_{2}} & \cdots & \frac{\partial f_{m}}{\partial x_{n}} \\
\end{bmatrix}
\end{align}
$$

If $m = 1$, the Jacobian matrix $\mathbf{J}_{f}$ is a row vector, and it is often squeezed into a vector, denoted as the gradient $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ of $f$.

$$
\begin{align}
\nabla f
&=
\left[\frac{\partial f}{\partial x_{1}}, \frac{\partial f}{\partial x_{2}}, \cdots, \frac{\partial f}{\partial x_{n}}\right]
\end{align}
$$

This is widely used in deep learning, where $f$ is a neural network combined with a loss function and $\mathbf{x}$ contains the parameters of the neural network. The goal is to compute the gradient $\nabla f$ with respect to the parameters $\mathbf{x}$, which is then used in the optimization process to update the parameters.
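
For example, in PyTorch, the gradient of a scalar-valued function and the Jacobian matrix of a vector-valued function can be obtained as follows. This is only a minimal sketch; the functions f and g are made up for illustration.

```python
import torch


def f(x):
    # Scalar-valued function f: R^n -> R.
    return (x ** 2).sum()


def g(x):
    # Vector-valued function g: R^n -> R^m with m = 2.
    return torch.stack([x.sum(), (x ** 3).sum()])


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Gradient of the scalar-valued function f, a vector of shape (3,).
grad_f = torch.autograd.grad(f(x), x)[0]
print(grad_f)  # tensor([2., 4., 6.]), i.e., 2x.

# Jacobian matrix of the vector-valued function g, a matrix of shape (2, 3).
jacobian_g = torch.autograd.functional.jacobian(g, x)
print(jacobian_g)
```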

Hessian Matrix

Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a function taking a vector $\mathbf{x} \in \mathbb{R}^n$ as input and producing a scalar $f(\mathbf{x}) \in \mathbb{R}$ as output. The Hessian matrix $\mathbf{H}$ of $f$ is defined as the $n \times n$ square matrix of second-order partial derivatives:

$$
\begin{align}
\mathbf{H}_{f}
&=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_{1}^2} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}} \\
\frac{\partial^2 f}{\partial x_{2} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{2}^2} & \cdots & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_{n} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{n}^2} \\
\end{bmatrix}
\end{align}
$$

When evaluated at a specific input $\mathbf{x} \in \mathbb{R}^n$, the Hessian matrix is

$$
\begin{align}
\mathbf{H}_{f}(\mathbf{x})
&=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_{1}^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}}(\mathbf{x}) \\
\frac{\partial^2 f}{\partial x_{2} \partial x_{1}}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_{2}^2}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}}(\mathbf{x}) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_{n} \partial x_{1}}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_{n} \partial x_{2}}(\mathbf{x}) & \cdots & \frac{\partial^2 f}{\partial x_{n}^2}(\mathbf{x}) \\
\end{bmatrix} \\
\end{align}
$$

Suppose $\nabla f: \mathbb{R}^n \to \mathbb{R}^n$ is the gradient of $f$, i.e., the vector of first-order partial derivatives. The Hessian matrix of $f$ is the Jacobian matrix of the gradient $\nabla f$ of $f$, i.e., $\mathbf{H}_{f} = \nabla^2 f = \mathbf{J}_{\nabla f}: \mathbb{R}^n \to \mathbb{R}^{n \times n}$ and $\mathbf{H}_{f}(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \mathbf{J}_{\nabla f}(\mathbf{x}) \in \mathbb{R}^{n \times n}$.

$$
\begin{align}
\mathbf{H}_{f}
&=
\mathbf{J}_{\nabla f} \\
&=
\begin{bmatrix}
\frac{\partial \left( \nabla f \right)_{1}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{n}} \\
\frac{\partial \left( \nabla f \right)_{2}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \nabla f \right)_{n}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{n}} \\
\frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_{1}^2} & \frac{\partial^2 f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{1} \partial x_{n}} \\
\frac{\partial^2 f}{\partial x_{2} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{2}^2} & \cdots & \frac{\partial^2 f}{\partial x_{2} \partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_{n} \partial x_{1}} & \frac{\partial^2 f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^2 f}{\partial x_{n}^2} \\
\end{bmatrix}
\end{align}
$$

Because the gradient $\nabla f \in \mathbb{R}^n$ is essentially the Jacobian matrix $\mathbf{J}_{f} \in \mathbb{R}^{1 \times n}$, we could also say that the Hessian matrix $\mathbf{H}_{f} \in \mathbb{R}^{n \times n}$ is the Jacobian matrix of the Jacobian matrix of $f$, i.e., $\mathbf{H}_{f} = \mathbf{J}_{\mathbf{J}_{f}}: \mathbb{R}^n \to \mathbb{R}^{n \times n}$ and $\mathbf{H}_{f}(\mathbf{x}) = \mathbf{J}_{\mathbf{J}_{f}}(\mathbf{x}) \in \mathbb{R}^{n \times n}$.
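
As a quick sanity check, the identity $\mathbf{H}_{f} = \mathbf{J}_{\nabla f}$ can be verified numerically. The following PyTorch sketch uses a made-up toy function; the helper grad_f is an illustrative assumption, not part of any framework API.

```python
import torch


def f(x):
    # Toy scalar-valued function f: R^3 -> R.
    return (x[0] ** 2) * x[1] + torch.sin(x[2])


def grad_f(x):
    # Gradient of f, computed with create_graph=True so that the
    # gradient itself remains differentiable.
    return torch.autograd.grad(f(x), x, create_graph=True)[0]


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Hessian computed directly by the framework.
hessian = torch.autograd.functional.hessian(f, x)
# Hessian computed as the Jacobian matrix of the gradient.
jacobian_of_gradient = torch.autograd.functional.jacobian(grad_f, x)

assert torch.allclose(hessian, jacobian_of_gradient)
```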

Computing Jacobian Matrix In Automatic Differentiation Framework

An automatic differentiation framework such as PyTorch or TensorFlow provides a way to compute the Jacobian matrix of a function using the chain rule. A function is defined as a composition of operators. Once each operator is defined, including its forward evaluation and gradient evaluation, either explicitly or implicitly, the framework can compute the Jacobian matrix by applying the chain rule to each operator in the function.

Operators In Automatic Differentiation Framework

In the context of automatic differentiation, an operator is a function that takes one or more inputs and produces an output. Operators can be simple mathematical functions, such as addition and multiplication, or more complex functions, such as sophisticated neural network layers that involve multiple operations.

If a completely new operator, which cannot be expressed as a composition of existing operators, is needed, it can be implemented in the automatic differentiation framework. The operator must implement two functions: forward and backward. The forward function computes the output of the operator given its inputs, while the backward function computes the gradient of the output with respect to its inputs.

For example, suppose the operator Exp has never been defined in PyTorch. We can define it as follows:

exp.py

```python
import torch


class Exp(torch.autograd.Function):

    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result
```
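
A minimal usage sketch, assuming the Exp definition above: the custom operator is invoked via its static apply method, and its hand-written backward function can be checked against numerical differentiation with torch.autograd.gradcheck.

```python
x = torch.randn(5, dtype=torch.double, requires_grad=True)

# Custom autograd Functions are invoked via their static `apply` method.
y = Exp.apply(x)
y.sum().backward()
print(x.grad)  # Equals exp(x), as expected.

# gradcheck compares the analytical backward against finite differences.
assert torch.autograd.gradcheck(Exp.apply, (x,))
```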

A composite operator, which can be expressed as a composition of existing operators and is sometimes referred to as a layer in neural networks, can also be implemented in the automatic differentiation framework. In this case, only the forward function must be implemented, and the backward function will be automatically generated by the framework by applying the chain rule to each operator in the composite operator.

For example, suppose we want to implement a composite operator ReLU, which is a composition of the Exp operator and the Max operator. We can implement it as follows:

relu.py

```python
class ReLU(torch.nn.Module):

    def __init__(self):
        super(ReLU, self).__init__()
        # Custom autograd Functions are used via their static `apply` method.
        self.exp = Exp.apply
        # PyTorch has no torch.nn.Max module; torch.maximum plays the role of
        # the Max operator here.
        self.max = torch.maximum

    def forward(self, i):
        return self.max(self.exp(i), torch.zeros_like(i))
```
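
A brief usage sketch, assuming the Exp and ReLU definitions above; because no explicit backward function is provided, the framework derives it from the forward computation by the chain rule.

```python
relu = ReLU()

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
y = relu(x)
y.sum().backward()

# The gradient flows through Exp and the maximum automatically.
print(y)       # exp(x), since exp(x) > 0 everywhere.
print(x.grad)  # Also exp(x), by the chain rule.
```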

The backward function of a composite operator can also be implemented explicitly. This is sometimes useful when the user wants to approximate the gradient of the output with respect to its inputs, for example because the composite operator is not differentiable, or when the performance of the backward function can be improved by using a different algorithm or implementation.

For a composite operator, whether the backward function is implemented explicitly makes a difference in how the operator is represented in the computation graph. Specifically, if the backward function is not implemented explicitly, the composite operator is represented as a subgraph consisting of the elementary operators that compose it. If the backward function is implemented explicitly, the composite operator is represented as a single node in the computation graph. Either way, the framework can compute the Jacobian matrix of the composite operator by applying the chain rule to each operator while traversing the computation graph.
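
For illustration, the following is a hypothetical sketch (the operator name and implementation are made up) of a composite operator whose backward function is implemented explicitly; it therefore appears as a single node in the computation graph.

```python
import torch


class FastSigmoid(torch.autograd.Function):
    # A composite operator: the forward function is composed of several
    # elementary operations, but because the backward function is written
    # explicitly, the operator is recorded as a single node in the graph.

    @staticmethod
    def forward(ctx, i):
        result = 1.0 / (1.0 + torch.exp(-i))
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        # Hand-written gradient: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
        return grad_output * result * (1.0 - result)
```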

Finally, we should note that, to compute the Jacobian matrix, the forward function must be implemented regardless of whether the automatic differentiation framework uses forward mode or reverse mode. This is because the backward function requires the values computed by the forward function to compute the gradient of the output with respect to the inputs. That is why the forward function must be implemented whereas the backward function is optional.

Computing Hessian Matrix In Automatic Differentiation Framework

The Hessian matrix cannot be computed directly using the chain rule in an automatic differentiation framework, because the framework is usually only designed to compute the Jacobian matrix of a function, as mentioned previously. However, because the Hessian matrix is the Jacobian matrix of the gradient of a function, we can compute the Hessian matrix by first computing the gradient of the function and then computing the gradient of each component of the gradient.

Concretely, we can compute the Hessian matrix of a function $f$ as follows:

$$
\begin{align}
\mathbf{H}_{f}
&=
\mathbf{J}_{\nabla f} \\
&=
\begin{bmatrix}
\frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{1}} \right)}{\partial x_{n}} \\
\frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{2}} \right)}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{1}} & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{2}} & \cdots & \frac{\partial \left( \frac{\partial f}{\partial x_{n}} \right)}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial \left( \nabla f \right)_{1}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{1}}{\partial x_{n}} \\
\frac{\partial \left( \nabla f \right)_{2}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{2}}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \left( \nabla f \right)_{n}}{\partial x_{1}} & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{2}} & \cdots & \frac{\partial \left( \nabla f \right)_{n}}{\partial x_{n}} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\nabla \left( \nabla f \right)_{1} \\
\nabla \left( \nabla f \right)_{2} \\
\vdots \\
\nabla \left( \nabla f \right)_{n} \\
\end{bmatrix} \\
&=
\nabla \left( \nabla f \right) \\
\end{align}
$$

Each of $\nabla \left( \nabla f \right)_{1}$, $\nabla \left( \nabla f \right)_{2}$, …, $\nabla \left( \nabla f \right)_{n}$ can be computed using the automatic differentiation framework, provided that the forward functions of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ are defined and the input values are provided.

Since we usually have only defined the forward function of the function $f$ in an automatic differentiation framework, how could we define the forward functions of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$? Their mathematical formulations can be very complicated, and we would not be able to derive each of them by hand if $f$ is complicated and $n$ is large. It turns out that we can get those forward functions for free by using the backward function of the function $f$. When the framework traverses the computation graph to compute the gradient $\nabla f$ of $f$ with respect to its inputs, the mathematical formulations of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ are recorded in the computation graph. That is to say, the backward functions of $f$ used for computing $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ are exactly the forward functions of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$, so these forward functions become defined in the computation graph. As mentioned previously, the backward function is implicitly defined once the forward function is defined, so the backward functions of $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ also become defined in the computation graph.

The input values to the functions $\nabla \left( \nabla f \right)_{1}$, $\nabla \left( \nabla f \right)_{2}$, …, $\nabla \left( \nabla f \right)_{n}$ are also straightforward to obtain: they are just the output values of the functions $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$, which have already been computed and recorded in the computation graph.
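
In PyTorch, for example, this corresponds to computing the gradient with create_graph=True, so that the gradient computation itself is recorded in the computation graph, and then differentiating each gradient component once more. The following is a minimal sketch for a made-up toy function; torch.autograd.functional.hessian wraps essentially the same procedure.

```python
import torch


def f(x):
    # Toy scalar-valued function f: R^n -> R.
    return torch.exp(x).sum() + x[0] * x[1]


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
n = x.shape[0]

# First backward pass: compute the gradient of f, keeping the computation
# graph of the gradient itself so that it can be differentiated again.
gradient = torch.autograd.grad(f(x), x, create_graph=True)[0]

# Additional backward passes: the i-th row of the Hessian is the gradient of
# the i-th component of the gradient, so n more passes are needed.
hessian_rows = []
for i in range(n):
    row = torch.autograd.grad(gradient[i], x, retain_graph=True)[0]
    hessian_rows.append(row)
hessian = torch.stack(hessian_rows)

print(hessian)
assert torch.allclose(hessian, torch.autograd.functional.hessian(f, x))
```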

Using this approach, one can compute not only the Hessian matrix of a function $f$ but also derivatives of any order of the function $f$ using the automatic differentiation framework.
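
For example, a third-order derivative of a simple one-dimensional function can be obtained by applying the same idea one more time. This is a minimal sketch with a made-up function:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# f(x) = x^4, so f'(x) = 4x^3, f''(x) = 12x^2, f'''(x) = 24x.
y = x ** 4

first = torch.autograd.grad(y, x, create_graph=True)[0]
second = torch.autograd.grad(first, x, create_graph=True)[0]
third = torch.autograd.grad(second, x)[0]

print(first, second, third)  # 32, 48, 48 at x = 2.
```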

Finally, computing the Hessian matrix of a function $f$ using an automatic differentiation framework is expensive, because the number of parameters $n$ is usually large. In a neural network, even though evaluating some of the functions $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ can be vectorized, it would not help much because the number of parameters $n$ is still large. To compute the Jacobian matrix of the function $f$ in an automatic differentiation framework, we only need to run the forward-mode or reverse-mode pass once. To compute the Hessian matrix of the function $f$, even if every $m$ of the functions $\left( \nabla f \right)_{1}$, $\left( \nabla f \right)_{2}$, …, $\left( \nabla f \right)_{n}$ can be evaluated together in a vectorized fashion, we will still need to run the forward-mode or reverse-mode pass $\frac{n}{m}$ times. This is often impractical when $f$ is complicated and $n$ is large, which is also why computing the Hessian matrix of a function $f$ is typically only seen in optimization algorithms where $f$ is simple and $n$ is small.
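
For reference, recent versions of PyTorch provide torch.autograd.functional.hessian with a vectorize option that batches the extra backward passes; the cost still grows with $n$, though. A small sketch with a made-up function:

```python
import torch


def f(x):
    # Toy scalar-valued function; in practice n would be much larger.
    return torch.exp(x).sum() + (x ** 2).sum()


x = torch.randn(100)

# Loop-based computation: roughly n extra backward passes.
hessian_loop = torch.autograd.functional.hessian(f, x)

# Vectorized computation: the extra backward passes are batched,
# but the asymptotic cost in n remains.
hessian_vectorized = torch.autograd.functional.hessian(f, x, vectorize=True)

assert torch.allclose(hessian_loop, hessian_vectorized)
```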

