Chain Rule

Introduction

Extending the univariable chain rule to multivariable functions can sometimes be confusing. Using the de novo chain rule expression, which is based on the matrix multiplication of Jacobian matrices, the chain rule can be expressed in a more intuitive and unified way.

In this blog post, I would like to discuss the de novo chain rule expression, how it unifies the univariable chain rule and multivariable chain rule, and how it can be applied to different areas of mathematics.

The Confusing Chain Rule Expressions

Univariable Chain Rule

Let $w = f(x)$ be a differentiable function of $x$ and $x = g(t)$ be a differentiable function of $t$. Then $w = f(g(t))$ is a differentiable function of $t$ and

$$
\begin{align}
\frac{dw}{dt} &= \frac{df}{dx} \frac{dx}{dt} \\
\end{align}
$$
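As a quick sanity check, here is a minimal NumPy sketch that verifies this numerically. The choices $f(x) = \sin x$ and $g(t) = t^{2}$ are arbitrary illustrative examples, and a central finite difference of the composition serves as the reference.

```python
import numpy as np

# Univariable chain rule check: d/dt f(g(t)) = f'(g(t)) * g'(t).
# f(x) = sin(x) and g(t) = t^2 are arbitrary illustrative choices.
f, f_prime = np.sin, np.cos
g = lambda t: t ** 2
g_prime = lambda t: 2 * t

t = 0.7
chain_rule = f_prime(g(t)) * g_prime(t)

# Central finite difference of the composition as a reference.
eps = 1e-6
finite_diff = (f(g(t + eps)) - f(g(t - eps))) / (2 * eps)

assert np.isclose(chain_rule, finite_diff)
```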

Multivariable Chain Rule

We could often see the following expression of multivariable chain rule.

Let $w = f(x_{1}, x_{2}, \ldots, x_{m})$ be a differentiable function of $m$ independent variables, and for each $i \in [1, m]$, let $x_{i} = g_{i}(t_{1}, t_{2}, \ldots, t_{n})$ be a differentiable function of $n$ independent variables. Then $w = f(g_{1}(t_{1}, t_{2}, \ldots, t_{n}), g_{2}(t_{1}, t_{2}, \ldots, t_{n}), \ldots, g_{m}(t_{1}, t_{2}, \ldots, t_{n}))$ is a differentiable function of $n$ independent variables and

$$
\begin{align}
\frac{\partial w}{\partial t_{j}} &= \sum_{i=1}^{m} \frac{\partial w}{\partial x_{i}} \frac{\partial x_{i}}{\partial t_{j}} \\
\end{align}
$$

for each $j \in [1, n]$.
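The summation form can be verified numerically in the same way. In the minimal NumPy sketch below, $w = x_{1} x_{2}$, $x_{1} = t_{1} + t_{2}$, and $x_{2} = t_{1} t_{2}$ are arbitrary illustrative choices with $m = n = 2$.

```python
import numpy as np

# Multivariable chain rule check with m = 2, n = 2.
# w = x1 * x2, x1 = t1 + t2, x2 = t1 * t2 are arbitrary illustrative choices.
def f(x):
    return x[0] * x[1]

def g(t):
    return np.array([t[0] + t[1], t[0] * t[1]])

def w(t):
    return f(g(t))

t = np.array([0.3, -1.2])
x = g(t)

dw_dx = np.array([x[1], x[0]])                # dw/dx_i
dx_dt = np.array([[1.0, 1.0], [t[1], t[0]]])  # dx_i/dt_j

# dw/dt_j = sum_i dw/dx_i * dx_i/dt_j, i.e., a vector-matrix product.
dw_dt = dw_dx @ dx_dt

# Central finite differences of the composition as a reference.
eps = 1e-6
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    assert np.isclose(dw_dt[j], (w(t + e) - w(t - e)) / (2 * eps))
```

Note that the summation over $i$ is exactly a $1 \times m$ by $m \times n$ matrix product, which already hints at the Jacobian formulation below.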

There is nothing wrong with this expression. However, its form differs from the univariable chain rule expression. When it comes to vector calculus, which is often used in neural networks, these componentwise expressions become confusing and less useful.

The De Novo Chain Rule Expression

The de novo chain rule expression is more intuitive and more applicable to different areas of mathematics.

In calculus, the chain rule is a formula that expresses the derivative of the composition of two differentiable functions $f: Y \to Z$ and $g: X \to Y$ in terms of the derivatives of $f$ and $g$. More precisely, if $h = f \circ g: X \to Z$ is the composition of $f$ and $g$ such that $h(x) = f(g(x))$, the chain rule states that

$$
\begin{align}
h'(x) &= f'(g(x)) g'(x) \\
\end{align}
$$

or equivalently

$$
\begin{align}
h' &= (f' \circ g) g' \\
\end{align}
$$

Note that the domains $X$, $Y$, and $Z$ are not limited to real numbers.

More generally, suppose $X \subseteq \mathbb{R}^{n}$, $Y \subseteq \mathbb{R}^{m}$, and $Z \subseteq \mathbb{R}^{k}$. Then we have

$$
\begin{align}
\mathbf{h}(\mathbf{x}) &= \mathbf{f}(\mathbf{g}(\mathbf{x})) \\
\end{align}
$$

where $\mathbf{x} \in \mathbb{R}^{n}$, $\mathbf{g}(\mathbf{x}) \in \mathbb{R}^{m}$, and $\mathbf{f}(\mathbf{g}(\mathbf{x})) \in \mathbb{R}^{k}$.

The chain rule becomes

$$
\begin{align}
\mathbf{h}'(\mathbf{x}) &= \mathbf{f}'(\mathbf{g}(\mathbf{x})) \mathbf{g}'(\mathbf{x}) \\
\end{align}
$$

The chain rule can be equivalently expressed using the matrix multiplication of Jacobian matrices

$$
\begin{align}
\mathbf{J}_{\mathbf{h}}(\mathbf{x}) &= \mathbf{J}_{\mathbf{f}}(\mathbf{g}(\mathbf{x})) \mathbf{J}_{\mathbf{g}}(\mathbf{x}) \\
\end{align}
$$

where $\mathbf{J}_{\mathbf{h}}(\mathbf{x})$ is the Jacobian matrix of $\mathbf{h}$ with respect to $\mathbf{x}$, which is a matrix in $\mathbb{R}^{k \times n}$. $\mathbf{J}_{\mathbf{f}}(\mathbf{g}(\mathbf{x}))$ is the Jacobian matrix of $\mathbf{f}$ with respect to $\mathbf{g}(\mathbf{x})$, which is a matrix in $\mathbb{R}^{k \times m}$. $\mathbf{J}_{\mathbf{g}}(\mathbf{x})$ is the Jacobian matrix of $\mathbf{g}$ with respect to $\mathbf{x}$, which is a matrix in $\mathbb{R}^{m \times n}$.

The chain rule using the Jacobian matrix unifies the univariable chain rule and multivariable chain rule.
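As a numerical sanity check of this unified form, the minimal NumPy sketch below compares a finite-difference Jacobian of the composition against the product of the finite-difference Jacobians of the two component functions. The maps $\mathbf{g}: \mathbb{R}^{3} \to \mathbb{R}^{2}$ and $\mathbf{f}: \mathbb{R}^{2} \to \mathbb{R}^{4}$ are arbitrary smooth choices for illustration.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    # Central finite-difference Jacobian of func evaluated at x.
    y = func(x)
    jac = np.zeros((y.shape[0], x.shape[0]))
    for j in range(x.shape[0]):
        e = np.zeros_like(x)
        e[j] = eps
        jac[:, j] = (func(x + e) - func(x - e)) / (2 * eps)
    return jac

# Arbitrary smooth maps: g: R^3 -> R^2 and f: R^2 -> R^4.
def g(x):
    return np.array([np.sin(x[0]) + x[1] * x[2], x[0] ** 2 - x[2]])

def f(y):
    return np.array([y[0] * y[1], np.exp(y[0]), np.cos(y[1]), y[0] + y[1]])

def h(x):
    return f(g(x))

x = np.array([0.5, -1.0, 2.0])

# J_h(x) should equal J_f(g(x)) @ J_g(x), i.e., a (4, 3) matrix.
lhs = numerical_jacobian(h, x)
rhs = numerical_jacobian(f, g(x)) @ numerical_jacobian(g, x)

assert np.allclose(lhs, rhs, atol=1e-5)
```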

Examples

Gradient of Linear Functions

Consider a linear function $f(\mathbf{x}) = \mathbf{a}^{\top} \mathbf{x}$, where $\mathbf{a} \in \mathbb{R}^{n}$ and $\mathbf{x} \in \mathbb{R}^{n}$. The gradient of $f$ with respect to $\mathbf{x}$ is given by

$$
\begin{align}
\nabla f(\mathbf{x}) &= \mathbf{a} \\
\end{align}
$$

This is very easy and straightforward to verify using the scalar form of the linear function.
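For a concrete numerical check, here is a minimal NumPy sketch with randomly generated $\mathbf{a}$ and $\mathbf{x}$; a central finite-difference gradient of $f$ should reproduce $\mathbf{a}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.normal(size=n)
x = rng.normal(size=n)

f = lambda x: a @ x

# Central finite-difference gradient of the linear function.
eps = 1e-6
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

assert np.allclose(grad, a)
```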

Consider linear functions $\mathbf{f}(\mathbf{x}) = \mathbf{A} \mathbf{x}$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^{n}$. The Jacobian matrix of $\mathbf{f}$ with respect to $\mathbf{x}$ is given by

$$
\begin{align}
\mathbf{J}_{\mathbf{f}}(\mathbf{x})
&= [\nabla f_{1}(\mathbf{x}), \nabla f_{2}(\mathbf{x}), \ldots, \nabla f_{m}(\mathbf{x})]^{\top} \\
\end{align}
$$

Because $f_{i}(\mathbf{x}) = \mathbf{A}_{i} \mathbf{x}$, where $\mathbf{A}_{i}$ is the $i$-th row of $\mathbf{A}$, we have

$$
\begin{align}
\nabla f_{i}(\mathbf{x}) &= \mathbf{A}_{i}^{\top} \\
\end{align}
$$

Therefore, the Jacobian matrix of $\mathbf{f}$ with respect to $\mathbf{x}$ is given by

$$
\begin{align}
\mathbf{J}_{\mathbf{f}}(\mathbf{x})
&= [\nabla f_{1}(\mathbf{x}), \nabla f_{2}(\mathbf{x}), \ldots, \nabla f_{m}(\mathbf{x})]^{\top} \\
&= [\mathbf{A}_{1}^{\top}, \mathbf{A}_{2}^{\top}, \ldots, \mathbf{A}_{m}^{\top}]^{\top} \\
&= {\mathbf{A}^{\top}}^{\top} \\
&= \mathbf{A} \\
\end{align}
$$
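Again, this can be checked numerically. In the minimal NumPy sketch below, with randomly generated $\mathbf{A}$ and $\mathbf{x}$, the finite-difference Jacobian of $\mathbf{f}(\mathbf{x}) = \mathbf{A} \mathbf{x}$ reproduces $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.normal(size=(m, n))
x = rng.normal(size=n)

f = lambda x: A @ x

# Finite-difference Jacobian, assembled column by column; it should equal A.
eps = 1e-6
jac = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                       for e in np.eye(n)])

assert np.allclose(jac, A)
```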

Consider linear functions $\mathbf{f}(\mathbf{X}) = \mathbf{A} \mathbf{X}$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{X} \in \mathbb{R}^{n \times k}$, which is also known as matrix multiplication. Because

$$
\begin{align}
\mathbf{f}(\mathbf{X}) &= \left[ \mathbf{f}(\mathbf{X}_{:,1}), \mathbf{f}(\mathbf{X}_{:,2}), \ldots, \mathbf{f}(\mathbf{X}_{:,k}) \right] \\
\end{align}
$$

where $\mathbf{X}_{:,i}$ is the $i$-th column of $\mathbf{X}$ and $\mathbf{f}(\mathbf{X}_{:,i}) = \mathbf{A} \mathbf{X}_{:,i}$, the Jacobian matrix of $\mathbf{f}$ with respect to $\mathbf{X}_{:,i}$ is given by

$$
\begin{align}
\mathbf{J}_{\mathbf{f}}(\mathbf{X}_{:,i}) &= \mathbf{A} \\
\end{align}
$$
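The column-wise view can also be checked numerically. The minimal NumPy sketch below, with randomly generated $\mathbf{A}$ and $\mathbf{X}$, verifies both that $\mathbf{A} \mathbf{X}$ acts column by column and that the per-column Jacobian equals $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 2
A = rng.normal(size=(m, n))
X = rng.normal(size=(n, k))

# The i-th column of A X is A X[:, i], so the Jacobian of the i-th output
# column with respect to X[:, i] is simply A.
AX = A @ X
eps = 1e-6
for i in range(k):
    assert np.allclose(AX[:, i], A @ X[:, i])
    jac_i = np.column_stack([
        (A @ (X[:, i] + eps * e) - A @ (X[:, i] - eps * e)) / (2 * eps)
        for e in np.eye(n)
    ])
    assert np.allclose(jac_i, A)
```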

Gradient of Quadratic Functions

Consider a quadratic function $f(\mathbf{x}) = \mathbf{x}^{\top} \mathbf{A} \mathbf{x}$, where $\mathbf{A} \in \mathbb{R}^{n \times n}$ and $\mathbf{x} \in \mathbb{R}^{n}$. We can express $f$ as a functional composition instead.

We define the following new functions $g$ and $h$, such that $f = g \circ h$.

We first define the function $g$ as

$$
\begin{align}
g(\mathbf{x}) &= \mathbf{x}_{1}^{\top} \mathbf{x}_{2}
\end{align}
$$

where $\mathbf{x} = [\mathbf{x}_{1}, \mathbf{x}_{2}]^{\top}$ and $\mathbf{x}_{1}, \mathbf{x}_{2} \in \mathbb{R}^{n}$.

The gradient of $g$ with respect to $\mathbf{x}$ is given by

$$
\begin{align}
\nabla g(\mathbf{x}) &= [ \nabla_{\mathbf{x}_{1}} g(\mathbf{x}), \nabla_{\mathbf{x}_{2}} g(\mathbf{x}) ]^{\top} \\
&= [\mathbf{x}_{2}, \mathbf{x}_{1}]^{\top} \\
\end{align}
$$

We then define the function $h$ as

$$
\begin{align}
h(\mathbf{x})
&= [h_{1}(\mathbf{x}), h_{2}(\mathbf{x})]^{\top} \\
&= [\mathbf{I} \mathbf{x}, \mathbf{A} \mathbf{x}]^{\top} \\
&= [\mathbf{x}, \mathbf{A} \mathbf{x}]^{\top} \\
\end{align}
$$

where $\mathbf{x} \in \mathbb{R}^{n}$, $h_{1}(\mathbf{x}) = \mathbf{I} \mathbf{x}$, and $h_{2}(\mathbf{x}) = \mathbf{A} \mathbf{x}$.

The Jacobian matrix of $h$ with respect to $\mathbf{x}$ is given by

$$
\begin{align}
\mathbf{J}_{\mathbf{h}}(\mathbf{x}) &= [\mathbf{J}_{\mathbf{h}_{1}}(\mathbf{x}), \mathbf{J}_{\mathbf{h}_{2}}(\mathbf{x})]^{\top} \\
&= [\mathbf{I}, \mathbf{A}]^{\top} \\
\end{align}
$$

Then we have

$$
\begin{align}
f(\mathbf{x}) &= g(h(\mathbf{x})) \\
&= g([\mathbf{x}, \mathbf{A} \mathbf{x}]^{\top}) \\
&= \mathbf{x}^{\top} \mathbf{A} \mathbf{x} \\
\end{align}
$$

Using the chain rule, we have

$$
\begin{align}
\nabla f(\mathbf{x})^{\top} &= \nabla g(h(\mathbf{x}))^{\top} \mathbf{J}_{\mathbf{h}}(\mathbf{x}) \\
&= {[\mathbf{A} \mathbf{x}, \mathbf{x}]^{\top}}^{\top} [\mathbf{I}, \mathbf{A}]^{\top} \\
&= [(\mathbf{A} \mathbf{x})^{\top}, \mathbf{x}^{\top}] [\mathbf{I}, \mathbf{A}]^{\top} \\
&= (\mathbf{A} \mathbf{x})^{\top} \mathbf{I} + \mathbf{x}^{\top} \mathbf{A} \\
&= \mathbf{x}^{\top} \mathbf{A}^{\top} + \mathbf{x}^{\top} \mathbf{A} \\
&= \mathbf{x}^{\top} (\mathbf{A}^{\top} + \mathbf{A}) \\
\end{align}
$$

Therefore,

$$
\begin{align}
\nabla f(\mathbf{x}) &= {\nabla f(\mathbf{x})^{\top}}^{\top} \\
&= \left(\mathbf{x}^{\top} (\mathbf{A}^{\top} + \mathbf{A}) \right)^{\top} \\
&= (\mathbf{A}^{\top} + \mathbf{A})^{\top} {\mathbf{x}^{\top}}^{\top} \\
&= (\mathbf{A} + \mathbf{A}^{\top}) \mathbf{x} \\
\end{align}
$$
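The derivation can be double-checked numerically. The minimal NumPy sketch below uses a randomly generated (not necessarily symmetric) $\mathbf{A}$ and compares the closed form $(\mathbf{A} + \mathbf{A}^{\top}) \mathbf{x}$ against both a finite-difference gradient and the chain-rule factorization $\nabla g(h(\mathbf{x}))^{\top} \mathbf{J}_{\mathbf{h}}(\mathbf{x})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))  # not necessarily symmetric
x = rng.normal(size=n)

f = lambda x: x @ A @ x

# Central finite-difference gradient of the quadratic form.
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

# Closed form from the derivation.
grad_closed = (A + A.T) @ x

# Chain-rule factorization: nabla g(h(x))^T J_h(x)
# with nabla g(h(x)) = [A x, x] stacked and J_h(x) = [I, A] stacked.
grad_chain = np.concatenate([A @ x, x]) @ np.vstack([np.eye(n), A])

assert np.allclose(grad_fd, grad_closed, atol=1e-5)
assert np.allclose(grad_chain, grad_closed)
```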

Hessian of Quadratic Functions

Because we have already derived the gradient of quadratic functions, we could easily derive the Hessian of quadratic functions.

The gradient of quadratic functions is given by

$$
\begin{align}
\nabla f(\mathbf{x}) &= (\mathbf{A} + \mathbf{A}^{\top}) \mathbf{x} \\
\end{align}
$$

The Hessian of quadratic functions is the Jacobian matrix of the gradient of quadratic functions with respect to $\mathbf{x}$.

The Hessian of quadratic functions is given by

$$
\begin{align}
\mathbf{H}(\mathbf{x}) &= \mathbf{J}_{\nabla f}(\mathbf{x}) \\
&= \mathbf{A} + \mathbf{A}^{\top} \\
\end{align}
$$
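A minimal NumPy sketch can confirm this: taking finite differences of the gradient derived above recovers $\mathbf{A} + \mathbf{A}^{\top}$ at any point, since the gradient is linear in $\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

grad = lambda x: (A + A.T) @ x  # gradient of the quadratic function

# The Hessian is the Jacobian of the gradient; finite differences of the
# gradient recover A + A^T regardless of where they are evaluated.
eps = 1e-6
H = np.column_stack([(grad(x + eps * e) - grad(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

assert np.allclose(H, A + A.T)
```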

Least Squares

The least squares problem is a common optimization problem in machine learning and statistics. It can be formulated as minimizing the sum of squared differences between the observed values and the predicted values.

The least squares problem objective function can be defined as

$$
\begin{align}
f(\mathbf{w}) &= \lVert \mathbf{X} \mathbf{w} - \mathbf{y} \rVert^{2} \\
\end{align}
$$

where $\mathbf{X} \in \mathbb{R}^{m \times n}$ is the design matrix, $\mathbf{w} \in \mathbb{R}^{n}$ is the vector of coefficients, $\mathbf{y} \in \mathbb{R}^{m}$ is the vector of observed values, and $\lVert \cdot \rVert$ is the L2 norm.

The least squares problem objective function can be rewritten as

$$
\begin{align}
f(\mathbf{w}) &= (\mathbf{X} \mathbf{w} - \mathbf{y})^{\top} (\mathbf{X} \mathbf{w} - \mathbf{y}) \\
&= \left(\left(\mathbf{X} \mathbf{w} \right)^{\top} - \mathbf{y}^{\top}\right) \left(\mathbf{X} \mathbf{w} - \mathbf{y}\right) \\
&= \left(\mathbf{w}^{\top} \mathbf{X}^{\top} - \mathbf{y}^{\top}\right) \left(\mathbf{X} \mathbf{w} - \mathbf{y}\right) \\
&= \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w} - \mathbf{y}^{\top} \mathbf{X} \mathbf{w} - \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{y} + \mathbf{y}^{\top} \mathbf{y} \\
&= \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w} - \mathbf{y}^{\top} \mathbf{X} \mathbf{w} - \mathbf{y}^{\top} \mathbf{X} \mathbf{w} + \mathbf{y}^{\top} \mathbf{y} \\
&= \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w} - 2 \mathbf{y}^{\top} \mathbf{X} \mathbf{w} + \mathbf{y}^{\top} \mathbf{y} \\
\end{align}
$$

To minimize $f$, we find the optimal $\mathbf{w}^{\ast}$ such that the gradient of $f$ with respect to $\mathbf{w}$ equals zero.

$$
\begin{align}
\nabla f(\mathbf{w}) &= \nabla \left( \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w} - 2 \mathbf{y}^{\top} \mathbf{X} \mathbf{w} + \mathbf{y}^{\top} \mathbf{y} \right) \\
&= \nabla \left( \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{w} \right) - \nabla \left( 2\mathbf{y}^{\top} \mathbf{X} \mathbf{w} \right) + 0 \\
&= \left( \mathbf{X}^{\top} \mathbf{X} + \left(\mathbf{X}^{\top} \mathbf{X}\right)^{\top} \right) \mathbf{w} - 2 \mathbf{X}^{\top} \mathbf{y} \\
&= \left( \mathbf{X}^{\top} \mathbf{X} + \mathbf{X}^{\top} \mathbf{X} \right) \mathbf{w} - 2 \mathbf{X}^{\top} \mathbf{y} \\
&= 2 \mathbf{X}^{\top} \mathbf{X} \mathbf{w} - 2 \mathbf{X}^{\top} \mathbf{y} \\
\end{align}
$$
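Before moving on, this gradient can be sanity-checked numerically. The minimal NumPy sketch below uses randomly generated $\mathbf{X}$, $\mathbf{y}$, and $\mathbf{w}$ and compares the closed form against a finite-difference gradient of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
w = rng.normal(size=n)

f = lambda w: np.sum((X @ w - y) ** 2)  # least squares objective

# Central finite-difference gradient of the objective.
eps = 1e-6
grad_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

# Closed form derived from the expansion.
grad_closed = 2 * X.T @ X @ w - 2 * X.T @ y

assert np.allclose(grad_fd, grad_closed, atol=1e-5)
```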

It is also feasible to use the chain rule to derive the gradient of the least squares problem objective function.

We define the following new functions $g$ and $h$, such that $f = g \circ h$.

We first define the function $g$ as

$$
\begin{align}
g(\mathbf{x}) &= \lVert \mathbf{x} \rVert^{2} \\
&= \mathbf{x}^{\top} \mathbf{x} \\
\end{align}
$$

where $\mathbf{x} \in \mathbb{R}^{m}$.

The gradient of $g$ with respect to $\mathbf{x}$ is given by

$$
\begin{align}
\nabla g(\mathbf{x}) &= 2 \mathbf{x} \\
\end{align}
$$

This is easy and straightforward to verify using the scalar form of the function.

We then define the function $h$ as

$$
\begin{align}
h(\mathbf{w}) &= \mathbf{X} \mathbf{w} - \mathbf{y} \\
\end{align}
$$

where $\mathbf{w} \in \mathbb{R}^{n}$.

The Jacobian matrix of $h$ with respect to $\mathbf{w}$ is given by

$$
\begin{align}
\mathbf{J}_{\mathbf{h}}(\mathbf{w}) &= \mathbf{X} \\
\end{align}
$$

Then we have

$$
\begin{align}
f(\mathbf{w}) &= g(h(\mathbf{w})) \\
&= g(\mathbf{X} \mathbf{w} - \mathbf{y}) \\
&= \lVert \mathbf{X} \mathbf{w} - \mathbf{y} \rVert^{2} \\
\end{align}
$$

Using the chain rule, we have

$$
\begin{align}
\nabla f(\mathbf{w})^{\top} &= \nabla g(h(\mathbf{w}))^{\top} \mathbf{J}_{\mathbf{h}}(\mathbf{w}) \\
&= 2 (\mathbf{X} \mathbf{w} - \mathbf{y})^{\top} \mathbf{X} \\
&= 2 (\mathbf{w}^{\top} \mathbf{X}^{\top} - \mathbf{y}^{\top}) \mathbf{X} \\
&= 2 \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} - 2 \mathbf{y}^{\top} \mathbf{X} \\
\end{align}
$$

Therefore,

$$
\begin{align}
\nabla f(\mathbf{w}) &= {\nabla f(\mathbf{w})^{\top}}^{\top} \\
&= \left(2 \mathbf{w}^{\top} \mathbf{X}^{\top} \mathbf{X} - 2 \mathbf{y}^{\top} \mathbf{X}\right)^{\top} \\
&= 2 \mathbf{X}^{\top} \mathbf{X} \mathbf{w} - 2 \mathbf{X}^{\top} \mathbf{y} \\
\end{align}
$$
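Setting this gradient to zero yields the normal equations $\mathbf{X}^{\top} \mathbf{X} \mathbf{w} = \mathbf{X}^{\top} \mathbf{y}$. As a final sanity check, the minimal NumPy sketch below, with randomly generated data, confirms that solving the normal equations matches NumPy's built-in least squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# Setting the gradient 2 X^T X w - 2 X^T y to zero gives the normal
# equations X^T X w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```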
