Chain Rule
Introduction
Extending the univariable chain rule to multivariable functions can sometimes be confusing. Using the de novo chain rule expression, which is based on matrix multiplication of Jacobian matrices, the chain rule can be expressed in a more intuitive way.
In this blog post, I would like to discuss the de novo chain rule expression, how it unifies the univariable chain rule and multivariable chain rule, and how it can be applied to different areas of mathematics.
The Confusing Chain Rule Expressions
Univariable Chain Rule
Let $y = f(x)$ and $z = g(y)$ be differentiable univariable functions. The univariable chain rule is usually written in Leibniz's notation as

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
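As a quick sanity check (my own sketch, not part of the original derivation), the snippet below compares the chain rule derivative of a hypothetical composition $h(x) = \sin(x^2)$ against a central finite difference.

```python
import numpy as np

# Hypothetical example: h(x) = g(f(x)) with f(x) = x^2 and g(y) = sin(y).
# Chain rule: h'(x) = g'(f(x)) * f'(x) = cos(x^2) * 2x.
def h(x):
    return np.sin(x ** 2)

def h_prime_chain_rule(x):
    return np.cos(x ** 2) * 2.0 * x

x = 1.3
eps = 1e-6
# Central finite difference approximation of h'(x).
h_prime_numerical = (h(x + eps) - h(x - eps)) / (2.0 * eps)
assert np.isclose(h_prime_chain_rule(x), h_prime_numerical, atol=1e-5)
```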
Multivariable Chain Rule
We often see the following expression of the multivariable chain rule.
Let $f(x_1, x_2, \cdots, x_n)$ be a differentiable function of $n$ variables, and let each variable be a differentiable function $x_i = x_i(t_1, t_2, \cdots, t_m)$ of $m$ variables $t_1, t_2, \cdots, t_m$. Then

$$\frac{\partial f}{\partial t_j} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{\partial x_i}{\partial t_j}$$

for each $j \in \{1, 2, \cdots, m\}$.
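For concreteness, here is a small numerical sketch (my own illustration with a hypothetical choice of $f$, $x_1$, and $x_2$) that checks the summation form of $\frac{\partial f}{\partial t_1}$ against a finite difference.

```python
import numpy as np

# Hypothetical example: f(x1, x2) = x1 * x2, with x1(t1, t2) = t1 + t2 and x2(t1, t2) = t1 * t2.
def f(x1, x2):
    return x1 * x2

def x1(t1, t2):
    return t1 + t2

def x2(t1, t2):
    return t1 * t2

t1, t2 = 0.7, -1.2
# Summation form: df/dt1 = (df/dx1)(dx1/dt1) + (df/dx2)(dx2/dt1).
df_dx1 = x2(t1, t2)   # partial f / partial x1 = x2
df_dx2 = x1(t1, t2)   # partial f / partial x2 = x1
dx1_dt1 = 1.0         # partial x1 / partial t1
dx2_dt1 = t2          # partial x2 / partial t1
df_dt1_chain = df_dx1 * dx1_dt1 + df_dx2 * dx2_dt1

eps = 1e-6
df_dt1_numerical = (
    f(x1(t1 + eps, t2), x2(t1 + eps, t2)) - f(x1(t1 - eps, t2), x2(t1 - eps, t2))
) / (2.0 * eps)
assert np.isclose(df_dt1_chain, df_dt1_numerical, atol=1e-5)
```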
There is nothing wrong with this summation expression, but its form is different from the univariable chain rule expression. When it comes to vector calculus, which is widely used in neural networks, these expressions become confusing and less useful.
The De Novo Chain Rule Expression
The de novo chain rule expression is more intuitive and more applicable to different areas of mathematics.
In calculus, the chain rule is a formula that expresses the derivative of the composition of two differentiable functions $f$ and $g$ in terms of the derivatives of $f$ and $g$. If $h = g \circ f$ is the function such that $h(x) = g(f(x))$ for every $x$, then

$$h'(x) = g'(f(x)) \, f'(x)$$

or equivalently, with $y = f(x)$ and $z = g(y)$,

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Note that the domains and codomains of $f$, $g$, and $h$ here are (subsets of) $\mathbb{R}$, i.e., the functions are univariable.

More generally, suppose $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^k$ are differentiable functions, and $h = g \circ f: \mathbb{R}^n \to \mathbb{R}^k$ is their composition,

$$h(\mathbf{x}) = g(f(\mathbf{x}))$$

where $\mathbf{x} \in \mathbb{R}^n$.

The chain rule becomes

$$\frac{\partial h_i}{\partial x_j}(\mathbf{x}) = \sum_{l=1}^{m} \frac{\partial g_i}{\partial y_l}(f(\mathbf{x})) \, \frac{\partial f_l}{\partial x_j}(\mathbf{x})$$

for each $i \in \{1, 2, \cdots, k\}$ and $j \in \{1, 2, \cdots, n\}$, where $\mathbf{y} = f(\mathbf{x})$.

The chain rule can be equivalently expressed using the matrix multiplication of Jacobian matrices,

$$\mathbf{J}_{h}(\mathbf{x}) = \mathbf{J}_{g \circ f}(\mathbf{x}) = \mathbf{J}_{g}(f(\mathbf{x})) \, \mathbf{J}_{f}(\mathbf{x})$$

where $\mathbf{J}_{f}(\mathbf{x}) \in \mathbb{R}^{m \times n}$ is the Jacobian matrix of $f$ evaluated at $\mathbf{x}$, whose $(i, j)$ entry is $\frac{\partial f_i}{\partial x_j}(\mathbf{x})$, and $\mathbf{J}_{g}(f(\mathbf{x})) \in \mathbb{R}^{k \times m}$ is the Jacobian matrix of $g$ evaluated at $f(\mathbf{x})$.
The chain rule using Jacobian matrices unifies the univariable chain rule and the multivariable chain rule: when $n = m = k = 1$, the Jacobian matrices are $1 \times 1$ and the expression reduces to $h'(x) = g'(f(x)) f'(x)$, and when the outer function is scalar-valued, reading off the entries of the matrix product recovers the summation form above.
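To make the Jacobian form concrete, the following sketch (my own, using arbitrary hypothetical functions $f: \mathbb{R}^3 \to \mathbb{R}^2$ and $g: \mathbb{R}^2 \to \mathbb{R}^2$) checks that the product of finite-difference Jacobians matches the finite-difference Jacobian of the composition.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    # Central finite-difference Jacobian of a vector-valued function at x.
    y = func(x)
    jacobian = np.zeros((y.shape[0], x.shape[0]))
    for j in range(x.shape[0]):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += eps
        x_minus[j] -= eps
        jacobian[:, j] = (func(x_plus) - func(x_minus)) / (2.0 * eps)
    return jacobian

# Hypothetical differentiable functions f: R^3 -> R^2 and g: R^2 -> R^2.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def g(y):
    return np.array([y[0] + y[1] ** 2, np.exp(y[0])])

def h(x):
    return g(f(x))

x = np.array([0.5, -1.0, 0.3])
# Chain rule: J_h(x) = J_g(f(x)) @ J_f(x).
j_chain = numerical_jacobian(g, f(x)) @ numerical_jacobian(f, x)
j_direct = numerical_jacobian(h, x)
assert np.allclose(j_chain, j_direct, atol=1e-4)
```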
Examples
Gradient of Linear Functions
Consider a linear function $f: \mathbb{R}^n \to \mathbb{R}$,

$$f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x}$$

where $\mathbf{a}, \mathbf{x} \in \mathbb{R}^n$. The gradient of $f$ with respect to $\mathbf{x}$ is

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{a}$$
This is very easy and straightforward to verify using the scalar form of the linear function: since $f(\mathbf{x}) = \sum_{i=1}^{n} a_i x_i$, we have $\frac{\partial f}{\partial x_i} = a_i$.
Consider linear functions $f: \mathbb{R}^n \to \mathbb{R}^m$,

$$f(\mathbf{x}) = \mathbf{A}\mathbf{x}$$

where $\mathbf{A} \in \mathbb{R}^{m \times n}$.

Because the $i$-th component is $f_i(\mathbf{x}) = \mathbf{a}_i^\top \mathbf{x}$, where $\mathbf{a}_i^\top$ is the $i$-th row of $\mathbf{A}$, the gradient of each $f_i$ is $\mathbf{a}_i$.

Therefore, the gradient of each component stacks into the Jacobian matrix

$$\mathbf{J}_f(\mathbf{x}) = \mathbf{A}$$
Consider linear functions with a constant offset (affine functions),

$$f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$$

where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{b} \in \mathbb{R}^m$ is a constant vector. Because $\mathbf{b}$ does not depend on $\mathbf{x}$, the Jacobian matrix is still

$$\mathbf{J}_f(\mathbf{x}) = \mathbf{A}$$
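The sketch below (random data, my own illustration) verifies these results numerically with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
a = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
eps = 1e-6
basis = np.eye(n)

# Gradient of f(x) = a^T x via finite differences; expected to equal a.
grad = np.array([
    (a @ (x + eps * basis[i]) - a @ (x - eps * basis[i])) / (2.0 * eps) for i in range(n)
])
assert np.allclose(grad, a, atol=1e-5)

# Jacobian of f(x) = A x + b via finite differences; expected to equal A.
jac = np.column_stack([
    ((A @ (x + eps * basis[j]) + b) - (A @ (x - eps * basis[j]) + b)) / (2.0 * eps) for j in range(n)
])
assert np.allclose(jac, A, atol=1e-5)
```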
Gradient of Quadratic Functions
Consider a quadratic function $f: \mathbb{R}^n \to \mathbb{R}$,

$$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$$

where $\mathbf{A} \in \mathbb{R}^{n \times n}$ and $\mathbf{x} \in \mathbb{R}^n$. We would like to derive the gradient of $f$ with respect to $\mathbf{x}$ using the chain rule.
We define the following new functions so that $f$ can be written as their composition.

We first define the function $g: \mathbb{R}^{2n} \to \mathbb{R}$,

$$g(\mathbf{y}) = g\left( \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \right) = \mathbf{y}_1^\top \mathbf{A} \mathbf{y}_2 = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} \, y_{1,i} \, y_{2,j}$$

where $\mathbf{y} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \in \mathbb{R}^{2n}$ and $\mathbf{y}_1, \mathbf{y}_2 \in \mathbb{R}^n$.

The partial derivatives of $g$ with respect to $\mathbf{y}_1$ and $\mathbf{y}_2$ are

$$\frac{\partial g}{\partial y_{1,i}} = \sum_{j=1}^{n} A_{ij} \, y_{2,j} = (\mathbf{A}\mathbf{y}_2)_i, \qquad \frac{\partial g}{\partial y_{2,j}} = \sum_{i=1}^{n} A_{ij} \, y_{1,i} = (\mathbf{A}^\top \mathbf{y}_1)_j$$

so the Jacobian matrix of $g$ is

$$\mathbf{J}_g(\mathbf{y}) = \begin{bmatrix} (\mathbf{A}\mathbf{y}_2)^\top & (\mathbf{A}^\top \mathbf{y}_1)^\top \end{bmatrix} \in \mathbb{R}^{1 \times 2n}$$

We then define the function $u: \mathbb{R}^n \to \mathbb{R}^{2n}$,

$$u(\mathbf{x}) = \begin{bmatrix} \mathbf{x} \\ \mathbf{x} \end{bmatrix}$$

where $\mathbf{x} \in \mathbb{R}^n$.

The Jacobian matrix of $u$ with respect to $\mathbf{x}$ is

$$\mathbf{J}_u(\mathbf{x}) = \begin{bmatrix} \mathbf{I} \\ \mathbf{I} \end{bmatrix} \in \mathbb{R}^{2n \times n}$$

where $\mathbf{I}$ is the $n \times n$ identity matrix.

Then we have

$$f(\mathbf{x}) = g(u(\mathbf{x})) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$$

i.e., $f = g \circ u$.

Using the chain rule, we have

$$\mathbf{J}_f(\mathbf{x}) = \mathbf{J}_g(u(\mathbf{x})) \, \mathbf{J}_u(\mathbf{x}) = \begin{bmatrix} (\mathbf{A}\mathbf{x})^\top & (\mathbf{A}^\top \mathbf{x})^\top \end{bmatrix} \begin{bmatrix} \mathbf{I} \\ \mathbf{I} \end{bmatrix} = (\mathbf{A}\mathbf{x})^\top + (\mathbf{A}^\top \mathbf{x})^\top = \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top)$$

Therefore,

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{J}_f(\mathbf{x})^\top = (\mathbf{A} + \mathbf{A}^\top) \mathbf{x}$$

If $\mathbf{A}$ is symmetric, this simplifies to $2\mathbf{A}\mathbf{x}$.
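A quick numerical check of this result (my own sketch with a random, not necessarily symmetric, matrix $\mathbf{A}$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))  # not necessarily symmetric
x = rng.standard_normal(n)
basis = np.eye(n)
eps = 1e-6

def f(x):
    # Quadratic function f(x) = x^T A x.
    return x @ A @ x

# Finite-difference gradient of f compared against (A + A^T) x.
grad_numerical = np.array([
    (f(x + eps * basis[i]) - f(x - eps * basis[i])) / (2.0 * eps) for i in range(n)
])
grad_chain_rule = (A + A.T) @ x
assert np.allclose(grad_numerical, grad_chain_rule, atol=1e-4)
```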
Hessian of Quadratic Functions
Because we have already derived the gradient of quadratic functions, we can easily derive their Hessian.

The gradient of quadratic functions is given by

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top) \mathbf{x}$$

The Hessian of a quadratic function is the Jacobian matrix of its gradient with respect to $\mathbf{x}$. Since the gradient is a linear function of $\mathbf{x}$, its Jacobian follows from the result for linear functions.

The Hessian of quadratic functions is given by

$$\nabla_{\mathbf{x}}^2 f(\mathbf{x}) = \mathbf{A} + \mathbf{A}^\top$$
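Again, a finite-difference sketch (my own, with random data) confirming that the Jacobian of the gradient is $\mathbf{A} + \mathbf{A}^\top$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
basis = np.eye(n)
eps = 1e-6

def grad_f(x):
    # Gradient of f(x) = x^T A x derived above.
    return (A + A.T) @ x

# The Hessian is the Jacobian of the gradient; approximate it column by column.
hessian_numerical = np.column_stack([
    (grad_f(x + eps * basis[j]) - grad_f(x - eps * basis[j])) / (2.0 * eps) for j in range(n)
])
assert np.allclose(hessian_numerical, A + A.T, atol=1e-4)
```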
Least Squares
The least squares problem is a common optimization problem in machine learning and statistics. It can be formulated as minimizing the sum of squared differences between the observed values and the predicted values.
The least squares objective function can be defined as

$$f(\mathbf{x}) = \lVert \mathbf{A}\mathbf{x} - \mathbf{b} \rVert_2^2$$

where $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the design matrix, $\mathbf{x} \in \mathbb{R}^n$ is the parameter vector, and $\mathbf{b} \in \mathbb{R}^m$ is the vector of observed values.

The least squares objective function can also be rewritten as

$$f(\mathbf{x}) = (\mathbf{A}\mathbf{x} - \mathbf{b})^\top (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{x}^\top \mathbf{A}^\top \mathbf{A} \mathbf{x} - 2 \mathbf{b}^\top \mathbf{A} \mathbf{x} + \mathbf{b}^\top \mathbf{b}$$

We have to find the optimal $\mathbf{x}^\ast$ that minimizes the objective function, which requires the gradient of $f$ with respect to $\mathbf{x}$. Using the expanded form together with the gradients of quadratic and linear functions derived above,

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left( \mathbf{A}^\top \mathbf{A} + (\mathbf{A}^\top \mathbf{A})^\top \right) \mathbf{x} - 2 \mathbf{A}^\top \mathbf{b} = 2 \mathbf{A}^\top (\mathbf{A}\mathbf{x} - \mathbf{b})$$
It is also feasible to use the chain rule to derive the gradient of the least squares objective function.
We define the following new functions so that the objective can be written as their composition.

We first define the function $g: \mathbb{R}^m \to \mathbb{R}$,

$$g(\mathbf{y}) = \lVert \mathbf{y} \rVert_2^2 = \mathbf{y}^\top \mathbf{y}$$

where $\mathbf{y} \in \mathbb{R}^m$.

The gradient of $g$ with respect to $\mathbf{y}$ is

$$\nabla_{\mathbf{y}} g(\mathbf{y}) = 2\mathbf{y}$$

so $\mathbf{J}_g(\mathbf{y}) = 2\mathbf{y}^\top$.

This is very easy and straightforward to verify using the scalar form of the function: since $g(\mathbf{y}) = \sum_{i=1}^{m} y_i^2$, we have $\frac{\partial g}{\partial y_i} = 2 y_i$. It also follows from the gradient of quadratic functions with $\mathbf{A} = \mathbf{I}$.
We then define the function $u: \mathbb{R}^n \to \mathbb{R}^m$,

$$u(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$$

where $\mathbf{x} \in \mathbb{R}^n$.

The Jacobian matrix of $u$ with respect to $\mathbf{x}$ is

$$\mathbf{J}_u(\mathbf{x}) = \mathbf{A}$$

which follows from the result for linear functions, since the constant vector $\mathbf{b}$ does not depend on $\mathbf{x}$.

Then we have

$$f(\mathbf{x}) = g(u(\mathbf{x}))$$

i.e., $f = g \circ u$.

Using the chain rule, we have

$$\mathbf{J}_f(\mathbf{x}) = \mathbf{J}_g(u(\mathbf{x})) \, \mathbf{J}_u(\mathbf{x}) = 2 (\mathbf{A}\mathbf{x} - \mathbf{b})^\top \mathbf{A}$$

Therefore,

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{J}_f(\mathbf{x})^\top = 2 \mathbf{A}^\top (\mathbf{A}\mathbf{x} - \mathbf{b})$$

which is consistent with the gradient derived from the expanded form. Setting the gradient to zero yields the normal equations $\mathbf{A}^\top \mathbf{A} \mathbf{x}^\ast = \mathbf{A}^\top \mathbf{b}$.
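Finally, a sketch (random data, my own illustration) that checks the gradient $2\mathbf{A}^\top(\mathbf{A}\mathbf{x} - \mathbf{b})$ against finite differences and confirms that the normal-equations solution agrees with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
basis = np.eye(n)
eps = 1e-6

def f(x):
    # Least squares objective f(x) = ||A x - b||_2^2.
    return np.sum((A @ x - b) ** 2)

# Finite-difference gradient compared against 2 A^T (A x - b).
grad_numerical = np.array([
    (f(x + eps * basis[i]) - f(x - eps * basis[i])) / (2.0 * eps) for i in range(n)
])
grad_chain_rule = 2.0 * A.T @ (A @ x - b)
assert np.allclose(grad_numerical, grad_chain_rule, atol=1e-4)

# Setting the gradient to zero gives the normal equations A^T A x = A^T b.
x_normal_equations = np.linalg.solve(A.T @ A, A.T @ b)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_normal_equations, x_lstsq, atol=1e-8)
```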