Optimal Brain Surgeon
Introduction
Many pruning methods for modern neural networks, including large language models, draw inspiration from the classical Optimal Brain Surgeon algorithm.
In this blog post, we will derive the Optimal Brain Surgeon algorithm mathematically from scratch.
Mathematical Prerequisites
Positive Definite Matrix Is Invertible
We will show a quick proof that a positive semi-definite matrix is positive definite if and only if it is invertible.
Proof
Let $A$ be a positive semi-definite $n \times n$ matrix.
We will first prove by contradiction that if $A$ is positive definite, then $A$ is invertible.
Because $A$ is positive definite, for any non-zero vector $x \in \mathbb{R}^n$, we have $x^{\top} A x > 0$.
Suppose $A$ is not invertible. This means that the null space of $A$, denoted as $\text{Null}(A) = \{x \in \mathbb{R}^n \mid Ax = 0\}$, contains at least one non-zero vector $v \neq 0$.
Then we must have $Av = 0$ and $v^{\top} A v = v^{\top} (Av) = v^{\top} (0) = 0$. This raises a contradiction because $v$ is a non-zero vector and $A$ is positive definite, which means $v^{\top} A v > 0$.
Therefore, if $A$ is positive definite, then $A$ must be invertible.
Now we will prove again by contradiction that if $A$ is invertible, then $A$ is positive definite.
Suppose $A$ is positive semi-definite and invertible but not positive definite. Then there must exist a non-zero vector $w \neq 0$ such that $w^{\top} A w = 0$.
Because $A$ is positive semi-definite, $A$ is symmetric by definition. Therefore, there exists an orthogonal matrix $P$ and a diagonal matrix $D$ with non-negative eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ on the diagonal such that $A = P D P^{\top}$.
Then we have $w^{\top} A w = w^{\top} (P D P^{\top}) w = (P^{\top} w)^{\top} D (P^{\top} w) = y^{\top} D y$, where $y = P^{\top} w$. Since $P$ is an orthogonal matrix and $w \neq 0$, $y = P^{\top} w$ must also be a non-zero vector.
So we have $y^{\top} D y = \sum_{i=1}^n \lambda_i y_i^2 = 0$. Since all $\lambda_i \ge 0$ and $y_i^2 \ge 0$, this equality can only hold if for all $i$ where $y_i \neq 0$, we have $\lambda_i = 0$.
This implies that $D$ has at least one zero eigenvalue (since $y$ is non-zero, at least one $y_i \neq 0$). The determinant of $D$ is the product of its eigenvalues: $\det(D) = \lambda_1 \lambda_2 \dots \lambda_n = 0$.
Since $\det(A) = \det(P D P^{\top}) = \det(P) \det(D) \det(P^{\top}) = \det(D) \det(P)^2 = 0$, $A$ is not invertible. This contradicts the assumption that $A$ is invertible.
Therefore, if $A$ is invertible, then $A$ is positive definite.
This concludes the proof. $\square$
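As a quick numerical sanity check, the following minimal NumPy sketch (the specific matrices and the $10^{-10}$ eigenvalue tolerance are arbitrary illustrative choices) compares a positive definite matrix with a rank-deficient positive semi-definite one:

```python
import numpy as np

# A full-rank Gram matrix is positive definite; a rank-deficient Gram matrix
# is only positive semi-definite and has a zero eigenvalue.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
A_pd = X @ X.T + 1e-3 * np.eye(5)   # positive definite
Y = rng.standard_normal((5, 3))
A_psd = Y @ Y.T                     # positive semi-definite, rank 3

for name, A in [("positive definite", A_pd), ("rank-deficient PSD", A_psd)]:
    eigvals = np.linalg.eigvalsh(A)                        # eigenvalues of a symmetric matrix
    is_pd = bool(np.all(eigvals > 1e-10))                  # strictly positive spectrum
    is_invertible = np.linalg.matrix_rank(A) == A.shape[0]
    print(f"{name}: positive definite = {is_pd}, invertible = {is_invertible}")
```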
The Inverse of a Positive Definite Matrix Is Positive Definite
We will show a quick proof that the inverse of a positive definite matrix is also positive definite.
Proof
Let $A$ be a positive definite $n \times n$ matrix. We will show that $A^{-1}$ is also positive definite.
Since $A$ is positive definite, for any non-zero vector $x \in \mathbb{R}^n$, we have $x^{\top} A x > 0$.
For any non-zero vector $y \in \mathbb{R}^n$, because $A$ is invertible, we can write $y = Ax$ for some non-zero vector $x = A^{-1} y$. Then we have
$$
\begin{align}
y^{\top} A^{-1} y &= (Ax)^{\top} A^{-1} (Ax) \\
&= x^{\top} A^{\top} A^{-1} Ax \\
&= x^{\top} A^{\top} x \\
\end{align}
$$
Because $A$ is positive definite and positive definite matrices are symmetric by definition, we have $A^{\top} = A$. Then we have
$$
\begin{align}
y^{\top} A^{-1} y
&= x^{\top} A^{\top} x \\
&= x^{\top} A x \\
&> 0 \\
\end{align}
$$
Therefore, $A^{-1}$ is positive definite.
This concludes the proof. $\square$
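Again, a minimal NumPy sketch (the matrix here is an arbitrary illustrative choice) shows that the eigenvalues of a positive definite matrix and of its inverse are all positive:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4))
A = X @ X.T + 0.1 * np.eye(4)      # positive definite by construction

A_inv = np.linalg.inv(A)
print(np.linalg.eigvalsh(A))       # all eigenvalues are positive
print(np.linalg.eigvalsh(A_inv))   # the eigenvalues of the inverse are positive as well
```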
Taylor Expansion
Suppose we have a function $f: \mathbb{R}^n \to \mathbb{R}$ that is three times continuously differentiable. The second-order Taylor expansion of $f$ around a point $\mathbf{a} \in \mathbb{R}^n$ is given by:
$$
\begin{align}
f\left(\mathbf{x}\right) &= f\left(\mathbf{a}\right) + \nabla f\left(\mathbf{a}\right)^{\top} \left(\mathbf{x}-\mathbf{a}\right) + \frac{1}{2} \left(\mathbf{x}-\mathbf{a}\right)^{\top} \mathbf{H}\left(\mathbf{a}\right) \left(\mathbf{x}-\mathbf{a}\right) + O\left(\lVert \mathbf{x}-\mathbf{a} \rVert^3\right) \\
\end{align}
$$
In our case, $f$ is the loss of a neural network as a function of its weights, $\mathbf{w}$ is the weight vector of the trained neural network, and $\mathbf{w}^{\prime}$ is the weight vector after pruning. The Taylor expansion allows us to approximate the change in the loss function due to a small perturbation of the weights.
$$
\begin{align}
f\left(\mathbf{w}^{\prime}\right) &= f\left(\mathbf{w}\right) + \nabla f\left(\mathbf{w}\right)^{\top} \left(\mathbf{w}^{\prime}-\mathbf{w}\right) + \frac{1}{2} \left(\mathbf{w}^{\prime}-\mathbf{w}\right)^{\top} \mathbf{H}\left(\mathbf{w}\right) \left(\mathbf{w}^{\prime}-\mathbf{w}\right) + O\left(\lVert \mathbf{w}^{\prime}-\mathbf{w} \rVert^3\right) \\
\end{align}
$$
$$
\begin{align}
f\left(\mathbf{w}^{\prime}\right) - f\left(\mathbf{w}\right)
&= \nabla f\left(\mathbf{w}\right)^{\top} \left(\mathbf{w}^{\prime}-\mathbf{w}\right) + \frac{1}{2} \left(\mathbf{w}^{\prime}-\mathbf{w}\right)^{\top} \mathbf{H}\left(\mathbf{w}\right) \left(\mathbf{w}^{\prime}-\mathbf{w}\right) + O\left(\lVert \mathbf{w}^{\prime}-\mathbf{w} \rVert^3\right) \\
\end{align}
$$
We further define $\delta \mathbf{w} = \mathbf{w}^{\prime} - \mathbf{w}$ and $\delta E = f\left(\mathbf{w}^{\prime}\right) - f\left(\mathbf{w}\right)$. This allows us to rewrite the Taylor expansion as:
$$
\begin{align}
\delta E
&= \nabla f\left(\mathbf{w}\right)^{\top} \delta \mathbf{w} + \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + O\left(\lVert \delta \mathbf{w} \rVert^3\right) \\
\end{align}
$$
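To make the approximation concrete, here is a minimal sketch (the toy loss $f(\mathbf{w}) = \sum_i \left(w_i^4 + w_i^2\right)$ is an arbitrary choice for illustration, not from the original paper) comparing the exact loss change with the quadratic term at a local minimum, where the gradient term vanishes:

```python
import numpy as np

# Toy loss f(w) = sum_i (w_i^4 + w_i^2): local minimum at w = 0,
# with zero gradient and Hessian H = 2 * I there.
def f(w):
    return np.sum(w**4 + w**2)

n = 6
w = np.zeros(n)                      # "trained" weights at a local minimum
H = 2.0 * np.eye(n)                  # Hessian at the minimum

rng = np.random.default_rng(2)
dw = 1e-2 * rng.standard_normal(n)   # small perturbation delta w

dE_exact = f(w + dw) - f(w)
dE_quadratic = 0.5 * dw @ H @ dw     # second-order Taylor term
print(dE_exact, dE_quadratic)        # nearly identical for a small perturbation
```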
Woodbury Matrix Identity
Given an invertible matrix $A$, the Woodbury matrix identity states that for matrices $B$, $C$, and $D$ of appropriate dimensions, the following holds:
$$
\begin{align}
\left(A + B C D\right)^{-1} &= A^{-1} - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1} \\
\end{align}
$$
Proof
Let $A$ be an invertible matrix, and let $B$, $C$, and $D$ be matrices of appropriate dimensions. We will show that the formula holds.
$$
\begin{align}
\left( A^{-1} - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1} \right) \left(A + B C D\right)
&= A^{-1} A + A^{-1} B C D - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1} A - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1} B C D \\
&= I + A^{-1} B C D - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1} B C D \\
&= I + A^{-1} B C D - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} \left(D + D A^{-1} B C D\right) \\
&= I + A^{-1} B C D - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} \left(C^{-1} + D A^{-1} B\right) C D \\
&= I + A^{-1} B C D - A^{-1} B I C D \\
&= I + A^{-1} B C D - A^{-1} B C D \\
&= I \\
\end{align}
$$
Therefore, we have
$$
\begin{align}
\left(A + B C D\right)^{-1} &= A^{-1} - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1} \\
\end{align}
$$
This concludes the proof. $\square$
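A quick numerical verification of the identity (a minimal NumPy sketch with arbitrarily chosen, well-conditioned matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally shifted, invertible
B = rng.standard_normal((n, k))
C = rng.standard_normal((k, k)) + k * np.eye(k)   # invertible
D = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + B @ C @ D)
A_inv = np.linalg.inv(A)
rhs = A_inv - A_inv @ B @ np.linalg.inv(np.linalg.inv(C) + D @ A_inv @ B) @ D @ A_inv
print(np.allclose(lhs, rhs))   # True
```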
Optimal Brain Surgeon
Optimal Brain Surgeon Assumptions
Based on the Taylor expansion, the Optimal Brain Surgeon algorithm makes the following assumptions:
- The neural network with loss function $f$ is trained to a local minimum, i.e., $\nabla f\left(\mathbf{w}\right) = 0$.
- The higher-order terms in the Taylor expansion are negligible, i.e., $O\left(\lVert \delta \mathbf{w} \rVert^3\right) \approx 0$.
- The quadratic function $\frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w}$ is convex, i.e., the Hessian matrix $\mathbf{H}\left(\mathbf{w}\right)$ is positive semi-definite. Note that this assumption was never explicitly stated in the original paper, but it was used implicitly.
- The Hessian matrix $\mathbf{H}\left(\mathbf{w}\right)$, which is symmetric, is not only positive semi-definite but also positive definite. This implies that the Hessian matrix is invertible. Note that this assumption was also never explicitly stated in the original paper, but it was used implicitly.
Unstructured Pruning and Fine-Tuning Weights
Based on the assumptions, the Taylor expansion can be simplified to the following:
$$
\begin{align}
\delta E
&= \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} \\
\end{align}
$$
The goal of the Optimal Brain Surgeon algorithm is to minimize the loss difference $\delta E$ by selecting the optimal weights to prune and fine-tuning the remaining weights. Therefore, additional constraints are added so that weight pruning is accounted for in the optimization.
Suppose $\mathbf{w}$ and $\delta \mathbf{w}$ are column vectors of size $n$. Then we have
$$
\begin{align}
\mathbf{w} = \begin{bmatrix}
w_1 \\
w_2 \\
\vdots \\
w_n
\end{bmatrix}
\quad
\delta \mathbf{w} = \begin{bmatrix}
\delta w_1 \\
\delta w_2 \\
\vdots \\
\delta w_n
\end{bmatrix} \\
\end{align}
$$
Suppose $w_{q}$ is pruned. Then the constraint becomes
$$
\begin{align}
\delta w_{q} + w_{q} = 0 \\
\end{align}
$$
or more generally
$$
\begin{align}
\mathbf{e}_q^{\top} \delta \mathbf{w} + w_q = 0 \\
\end{align}
$$
where $\mathbf{e}_q$ is a natural basis column vector of size $n$ with all elements equal to 0 except for the $q$-th element, which is equal to 1.
Therefore, the optimization problem can be formulated as follows:
$$
\begin{align}
\min_{q} \left\{\min_{\delta \mathbf{w}} \left\{ \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} \right\} \text{ s.t. } \mathbf{e}_q^{\top} \delta \mathbf{w} + w_q = 0 \right\}\\
\end{align}
$$
If $n$ is small, the outer minimization can be solved by brute force, i.e., by iterating over all $q \in \{1, 2, \ldots, n\}$ and selecting the $q$ that yields the smallest value of the inner minimization.
To solve the inner minimization, note that the Hessian of the objective with respect to $\delta \mathbf{w}$ is also $\mathbf{H}\left(\mathbf{w}\right)$, which is positive semi-definite by assumption, so the objective is a convex function of $\delta \mathbf{w}$. Therefore, the optimal solution can be found from the first-order optimality conditions.
To account for the constraint $\mathbf{e}_q^{\top} \delta \mathbf{w} + w_q = 0$, which is a linear constraint, we can use the method of Lagrange multipliers. We introduce a Lagrange multiplier $\lambda$ and define the Lagrangian function as follows:
$$
\begin{align}
\mathcal{L}_{q}(\delta \mathbf{w}, \lambda) &= \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \lambda \left(\mathbf{e}_q^{\top} \delta \mathbf{w} + w_q\right) \\
\end{align}
$$
Taking the gradient of the Lagrangian with respect to $\delta \mathbf{w}$ and setting it to zero gives us:
$$
\begin{align}
\nabla_{\delta \mathbf{w}} \mathcal{L}_{q}(\delta \mathbf{w}, \lambda) &= \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \lambda \mathbf{e}_q = \mathbf{0} \\
\end{align}
$$
Using the assumption that the Hessian matrix is invertible, we can rearrange this equation to solve for $\delta \mathbf{w}$:
$$
\begin{align}
\delta \mathbf{w} &= -\lambda \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\\
\end{align}
$$
Taking the gradient of the Lagrangian with respect to $\lambda$ and setting it to zero gives us the constraint:
$$
\begin{align}
\nabla_{\lambda} \mathcal{L}_{q}(\delta \mathbf{w}, \lambda) = \mathbf{e}_q^{\top} \delta \mathbf{w} + w_q = 0\\
\end{align}
$$
Substituting the expression for $\delta \mathbf{w}$ into this constraint gives us:
$$
\begin{align}
\nabla_{\lambda} \mathcal{L}_{q}(\delta \mathbf{w}, \lambda)
&= \mathbf{e}_q^{\top} \left(-\lambda \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\right) + w_q \\
&= -\lambda \mathbf{e}_q^{\top} \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q + w_q \\
&= -\lambda \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq} + w_q \\
&= 0 \\
\end{align}
$$
Therefore, we can solve for $\lambda$:
$$
\begin{align}
\lambda &= \frac{w_q}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}} \\
\end{align}
$$
where $\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}$ is the $q$-th diagonal element of the inverse of the Hessian matrix $\mathbf{H}\left(\mathbf{w}\right)$.
Substituting this expression for $\lambda$ back into the expression for $\delta \mathbf{w}$ gives us:
$$
\begin{align}
\delta \mathbf{w}
&= -\lambda \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\\
&= -\frac{w_q}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}} \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q \\
\end{align}
$$
The resulting optimal value of $\mathcal{L}_{q}(\delta \mathbf{w}, \lambda)$, referred to as the “saliency” of the $q$-th weight, is given by:
$$
\begin{align}
\mathcal{L}_{q}(\delta \mathbf{w}, \lambda)
&= \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \lambda \left(\mathbf{e}_q^{\top} \delta \mathbf{w} + w_q\right) \\
&= \frac{1}{2} \left(-\frac{w_q}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}} \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\right)^{\top} \mathbf{H}\left(\mathbf{w}\right) \left(-\frac{w_q}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}} \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\right) + \lambda \left(\mathbf{e}_q^{\top} \delta \mathbf{w} + w_q\right) \\
&= \frac{1}{2} \left(\frac{w_q^2}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}^2}\right) \left(\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\right)^{\top} \mathbf{H}\left(\mathbf{w}\right) \left(\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\right) + 0 \\
&= \frac{1}{2} \left(\frac{w_q^2}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}^2}\right) \left(\mathbf{e}_q^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)^{\top} \mathbf{H}\left(\mathbf{w}\right) \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_q\right) \\
&= \frac{1}{2} \left(\frac{w_q^2}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}^2}\right) \left(\mathbf{e}_q^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)^{\top} \mathbf{e}_q\right) \\
&= \frac{1}{2} \left(\frac{w_q^2}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}^2}\right) \left(\mathbf{e}_q^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right) \mathbf{e}_q\right)^{\top} \\
&= \frac{1}{2} \left(\frac{w_q^2}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}^2}\right) \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}^{\top} \\
&= \frac{1}{2} \left(\frac{w_q^2}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}^2}\right) \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq} \\
&= \frac{1}{2} \frac{w_q^2}{\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{qq}} \\
\end{align}
$$
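Putting the pieces together, here is a minimal sketch of one unstructured OBS pruning step (the function name obs_prune_one_weight and the toy Hessian are illustrative assumptions; in practice the inverse Hessian would be obtained as described later in this post):

```python
import numpy as np

def obs_prune_one_weight(w, H_inv):
    """One unstructured OBS step: pick the weight with the smallest saliency
    L_q = w_q^2 / (2 (H^{-1})_qq), zero it out, and update the remaining
    weights with delta_w = -(w_q / (H^{-1})_qq) * H^{-1} e_q."""
    saliency = 0.5 * w**2 / np.diag(H_inv)
    q = int(np.argmin(saliency))
    dw = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # H^{-1} e_q is the q-th column of H^{-1}
    return w + dw, q, saliency[q]              # the q-th entry of w + dw is zero

# Toy example with an arbitrary positive definite Hessian.
rng = np.random.default_rng(4)
X = rng.standard_normal((5, 5))
H = X @ X.T + np.eye(5)
w = rng.standard_normal(5)
w_new, q, s = obs_prune_one_weight(w, np.linalg.inv(H))
print(q, s, w_new)
```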
The above derivation is straightforward when $w_{q}$ is a scalar, i.e., $w_{q} \in \mathbb{R}$, which is the common setting for unstructured pruning. In fact, the original Optimal Brain Surgeon paper only discussed the scalar case. However, in the case of structured pruning, $w_{q}$ is typically a vector, i.e., $w_{q} \in \mathbb{R}^m$, where $m$ is the number of weights in the structure, or even a higher-dimensional tensor. At first glance, this makes the optimization more complicated. It turns out that the forms of the mathematical expressions are the same for the vector case, as long as the shapes of the different components are handled carefully.
Structured Pruning and Fine-Tuning Weights
In the case of unstructured pruning where $w_{q} \in \mathbb{R}$, we have $\mathbf{w} \in \mathbb{R}^{n \times 1}$, $\delta \mathbf{w} \in \mathbb{R}^{n \times 1}$, $\mathbf{H}\left(\mathbf{w}\right) \in \mathbb{R}^{n \times n}$, $\mathbf{e}_q \in \mathbb{R}^{n \times 1}$, and $\lambda \in \mathbb{R}$. In the case of structured pruning, the dimensions of those variables become different.
To formulate the optimization problem for structured pruning, one straightforward way is to group the weights into $n$ groups, where each group contains $m$ weights. The optimization problem can be formulated as follows:
$$
\begin{align}
\min_{\mathbf{q}} \left\{\min_{\delta \mathbf{w}} \left\{ \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} \right\} \text{ s.t. } \mathbf{e}_{q_{i}}^{\top} \delta \mathbf{w} + w_{q_{i}} = 0, \forall i \in [1, m] \right\}\\
\end{align}
$$
where $\mathbf{q} = \left\{q_1, q_2, \ldots, q_m\right\}$ is the set of indices of the weights to be pruned, and $\mathbf{w} \in \mathbb{R}^{nm \times 1}$ and $\delta \mathbf{w} \in \mathbb{R}^{nm \times 1}$ remain column vectors.
We define the vector selection matrix $\mathbf{e}_{\mathbf{q}} = \left[\mathbf{e}_{q_{1}}, \mathbf{e}_{q_{2}}, \ldots, \mathbf{e}_{q_{m}}\right] \in \mathbb{R}^{nm \times m}$, where $\mathbf{e}_{q_{i}}$ is a natural basis column vector of size $nm$ with all elements equal to 0 except for the $q_i$-th element, which is equal to 1, and $\mathbf{w}_{\mathbf{q}} = \left[w_{q_{1}}, w_{q_{2}}, \ldots, w_{q_{m}}\right]^{\top} \in \mathbb{R}^{m \times 1}$ is a column vector of the weights to be pruned. The above formulation can be further vectorized as follows:
$$
\begin{align}
\min_{\mathbf{q}} \left\{\min_{\delta \mathbf{w}} \left\{ \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} \right\} \text{ s.t. } \mathbf{e}_{\mathbf{q}}^{\top} \delta \mathbf{w} + \mathbf{w}_{\mathbf{q}} = \mathbf{0} \right\}\\
\end{align}
$$
In this way, the optimization problem formulation for the structured pruning case has exactly the same form as the unstructured pruning case.
After introducing the Lagrange multipliers $\boldsymbol{\lambda} = \left[\lambda_1, \lambda_2, \ldots, \lambda_m\right]^{\top} \in \mathbb{R}^{m \times 1}$, we can define the Lagrangian function as follows:
$$
\begin{align}
\mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})
&= \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \sum_{i=1}^{m} \lambda_i \left(\mathbf{e}_{q_{i}}^{\top} \delta \mathbf{w} + w_{q_{i}}\right)\\
&= \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \boldsymbol{\lambda}^{\top} \left(\mathbf{e}_{\mathbf{q}}^{\top} \delta \mathbf{w} + \mathbf{w}_{\mathbf{q}}\right)\\
\end{align}
$$
In this way, the Lagrangian function for the structured pruning case has exactly the same form as the unstructured pruning case.
Taking the gradient of the Lagrangian with respect to $\delta \mathbf{w}$, setting it to zero, and applying vector calculus, we have:
$$
\begin{align}
\nabla_{\delta \mathbf{w}} \mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})
&= \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \left( \boldsymbol{\lambda}^{\top} \mathbf{e}_{\mathbf{q}}^{\top} \right)^{\top} \\
&= \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \mathbf{e}_{\mathbf{q}} \boldsymbol{\lambda} = \mathbf{0} \\
\end{align}
$$
Using the assumption that the Hessian matrix is invertible, we can rearrange this equation to solve for $\delta \mathbf{w}$:
$$
\begin{align}
\delta \mathbf{w}
&= -\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \boldsymbol{\lambda}\\
\end{align}
$$
In this way, the expression for $\delta \mathbf{w}$ in the structured pruning case has exactly the same form as the unstructured pruning case.
Taking the gradient of the Lagrangian with respect to $\boldsymbol{\lambda}$, setting it to zero, and applying vector calculus, we have:
$$
\begin{align}
\nabla_{\boldsymbol{\lambda}} \mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})
&= \mathbf{e}_{\mathbf{q}}^{\top} \delta \mathbf{w} + \mathbf{w}_{\mathbf{q}} = \mathbf{0} \\
\end{align}
$$
Substituting the expression for $\delta \mathbf{w}$ into this constraint gives us:
$$
\begin{align}
\nabla_{\boldsymbol{\lambda}} \mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})
&= \mathbf{e}_{\mathbf{q}}^{\top} \left(-\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \boldsymbol{\lambda}\right) + \mathbf{w}_{\mathbf{q}} \\
&= -\mathbf{e}_{\mathbf{q}}^{\top} \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \boldsymbol{\lambda} + \mathbf{w}_{\mathbf{q}} \\
&= -\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}} \boldsymbol{\lambda} + \mathbf{w}_{\mathbf{q}} \\
&= \mathbf{0} \\
\end{align}
$$
Therefore, we can solve for $\boldsymbol{\lambda}$:
$$
\begin{align}
\boldsymbol{\lambda}
&= \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
\end{align}
$$
where $\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}} \in \mathbb{R}^{m \times m}$ is the submatrix of $\mathbf{H}\left(\mathbf{w}\right)^{-1}$ whose rows and columns are indexed by $\mathbf{q}$.
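As an aside, this kind of row-and-column submatrix can be extracted in NumPy with np.ix_ (a minimal illustration; the values below are arbitrary placeholders):

```python
import numpy as np

H_inv = np.arange(36, dtype=float).reshape(6, 6)   # stand-in for H(w)^{-1}
q = [1, 4]                                          # indices of the pruned group
H_inv_qq = H_inv[np.ix_(q, q)]                      # rows and columns indexed by q
print(H_inv_qq)                                     # 2 x 2 submatrix
```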
Substituting this expression for $\boldsymbol{\lambda}$ back into the expression for $\delta \mathbf{w}$ gives us:
$$
\begin{align}
\delta \mathbf{w}
&= -\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \boldsymbol{\lambda} \\
&= -\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
\end{align}
$$
In this way, the expression for $\delta \mathbf{w}$ in the structured pruning case has exactly the same form as the unstructured pruning case.
The resulting optimal value of $\mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})$, referred to as the “saliency” of the weights indexed by $\mathbf{q}$, is given by:
$$
\begin{align}
\mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})
&= \frac{1}{2} \delta \mathbf{w}^{\top} \mathbf{H}\left(\mathbf{w}\right) \delta \mathbf{w} + \boldsymbol{\lambda}^{\top} \left(\mathbf{e}_{\mathbf{q}}^{\top} \delta \mathbf{w} + \mathbf{w}_{\mathbf{q}}\right) \\
&= \frac{1}{2} \left(-\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}}\right)^{\top} \mathbf{H}\left(\mathbf{w}\right) \left(-\mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}}\right) + \mathbf{0} \\
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}\right)^{\top} \mathbf{e}_{\mathbf{q}}^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)^{\top} \mathbf{H}\left(\mathbf{w}\right) \mathbf{H}\left(\mathbf{w}\right)^{-1} \mathbf{e}_{\mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}\right)^{\top} \mathbf{e}_{\mathbf{q}}^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)^{\top} \mathbf{e}_{\mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
\end{align}
$$
Using the property that the inverse of a positive definite matrix is also positive definite, we know that $\mathbf{H}\left(\mathbf{w}\right)^{-1}$ is also positive definite and symmetric. Therefore, we can further simplify the above expression as follows:
$$
\begin{align}
\mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}\right)^{\top} \mathbf{e}_{\mathbf{q}}^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)^{\top} \mathbf{e}_{\mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}\right)^{\top} \mathbf{e}_{\mathbf{q}}^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right) \mathbf{e}_{\mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}\right)^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}\right)^{\top} \mathbf{w}_{\mathbf{q}} \\
\end{align}
$$
There is another lemma that can be used to further simplify the above expression. If $A$ is a positive definite matrix, then its submatrix $A_{\mathbf{q} \mathbf{q}}$ whose rows and columns are indexed by $\mathbf{q}$ is also positive definite. This follows because for any non-zero vector $z \in \mathbb{R}^m$, we have $z^{\top} A_{\mathbf{q} \mathbf{q}} z = \tilde{z}^{\top} A \tilde{z} > 0$, where $\tilde{z}$ places the entries of $z$ at the positions indexed by $\mathbf{q}$ and zeros elsewhere.
Because $\mathbf{H}\left(\mathbf{w}\right)^{-1}$ is positive definite and symmetric, it follows that $\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}$ is positive definite and symmetric, and $\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}$ is also positive definite and symmetric. Therefore, we can further simplify the above expression as follows:
$$
\begin{align}
\mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1}\right)^{\top} \mathbf{w}_{\mathbf{q}} \\
&= \frac{1}{2} \mathbf{w}_{\mathbf{q}}^{\top} \left(\mathbf{H}\left(\mathbf{w}\right)^{-1}\right)_{\mathbf{q} \mathbf{q}}^{-1} \mathbf{w}_{\mathbf{q}} \\
\end{align}
$$
In this way, the resulting $\mathcal{L}_{\mathbf{q}}(\delta \mathbf{w}, \boldsymbol{\lambda})$ has exactly the same form as the unstructured pruning case.
Therefore, we have completed deriving the saliency of the weights indexed by $\mathbf{q}$ and the loss difference $\delta E$ for the structured pruning case, which was not discussed or derived in the original paper.
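Here is a minimal sketch of one structured OBS pruning step based on the formulas above (the function name obs_prune_group, the toy Hessian, and the grouping into pairs are illustrative assumptions):

```python
import numpy as np

def obs_prune_group(w, H_inv, q):
    """Structured OBS for one candidate group q (a list of m indices):
    saliency L_q = 0.5 * w_q^T ((H^{-1})_qq)^{-1} w_q and
    delta_w = -H^{-1} e_q ((H^{-1})_qq)^{-1} w_q."""
    w_q = w[q]
    H_inv_qq = H_inv[np.ix_(q, q)]        # m x m submatrix of H^{-1}
    tmp = np.linalg.solve(H_inv_qq, w_q)  # ((H^{-1})_qq)^{-1} w_q
    saliency = 0.5 * w_q @ tmp
    dw = -H_inv[:, q] @ tmp               # H^{-1} e_q is the columns of H^{-1} indexed by q
    return w + dw, saliency

# Toy example: three groups of two weights each, prune the group with the
# smallest saliency (brute force over the candidate groups).
rng = np.random.default_rng(5)
X = rng.standard_normal((6, 6))
H_inv = np.linalg.inv(X @ X.T + np.eye(6))
w = rng.standard_normal(6)
groups = [[0, 1], [2, 3], [4, 5]]
results = [(q, *obs_prune_group(w, H_inv, q)) for q in groups]
q_best, w_new, s_best = min(results, key=lambda r: r[2])
print(q_best, s_best, w_new)   # the entries of w_new at q_best are zero
```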
Computing the saliency of the weights and updating the weights requires the inverse of the Hessian matrix and the inverse of the Hessian-inverse submatrix corresponding to the weights to be pruned. Because this submatrix is small ($m \times m$), inverting it is usually fast. However, because the Hessian matrix itself is typically large, especially in a modern neural network with millions or even billions of parameters, computing the inverse of the Hessian matrix becomes extremely computationally expensive.
Computing The Inverse Of The Hessian Matrix
Computing the Hessian matrix of a neural network that consists of millions or even billions of parameters is computationally expensive. With $n$ parameters, the Hessian matrix is of size $n \times n$, and computing it exactly requires on the order of $n$ backpropagation passes.
The inverse of the Hessian matrix is also extremely expensive to compute. A naive way to compute the inverse of the Hessian matrix is Gaussian elimination, which has a time complexity of $O(n^3)$. There are other methods that are asymptotically faster, but their complexity is still worse than $O(n^2)$.
By making some assumptions, the Optimal Brain Surgeon paper proposed to compute the inverse of the Hessian matrix by iteratively accumulating outer products of the per-sample gradients (Jacobians) over a calibration dataset. Suppose we have $P$ samples in the calibration dataset; then we only have to perform the backpropagation algorithm $P$ times. As long as $P \ll n$, this is already asymptotically cheaper than computing the exact Hessian matrix with the naive method, let alone the additional cost of inverting it.
Consider a neural network $f$ with a weight vector $\mathbf{w}$ that maps an input vector $\mathbf{x}$ to an output vector $\mathbf{y}$.
$$
\begin{align}
\mathbf{y} &= f\left(\mathbf{w}, \mathbf{x}\right) \\
\end{align}
$$
We assume $\mathbf{w} \in \mathbb{R}^{n \times 1}$, $\mathbf{x} \in \mathbb{R}^{n_{\text{in}} \times 1}$, and $\mathbf{y} \in \mathbb{R}^{n_{\text{out}} \times 1}$, where $n_{\text{in}}$ is the number of input features and $n_{\text{out}}$ is the number of output features.
Given a calibration dataset of $P$ samples, $\{(\mathbf{x}_{i}, \mathbf{y}_{i})\}_{i=1}^{P}$, the mean squared error (MSE) loss function is defined as follows:
$$
\begin{align}
E\left(\mathbf{w}\right)
&= \frac{1}{2P} \sum_{i=1}^{P} \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right)^{\top} \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right) \\
\end{align}
$$
Using vector calculus chain rule, the first-order derivative, i.e., the gradient, of the loss function with respect to the weight vector $\mathbf{w}$ is given by:
$$
\begin{align}
\nabla_{\mathbf{w}} E\left(\mathbf{w}\right)
&= \frac{\partial E\left(\mathbf{w}\right)}{\partial \mathbf{w}} \\
&= \frac{1}{2P} \sum_{i=1}^{P} \frac{\partial \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right)^{\top}}{\partial \mathbf{w}} \left(2 \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right) \right) \\
&= \frac{1}{P} \sum_{i=1}^{P} \frac{\partial f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top}}{\partial \mathbf{w}} \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right) \\
&= \frac{1}{P} \sum_{i=1}^{P} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top} \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right) \\
\end{align}
$$
where $\nabla_{\mathbf{w}} E\left(\mathbf{w}\right) \in \mathbb{R}^{n \times 1}$ is the gradient of the loss function with respect to the weight vector $\mathbf{w}$, and $\nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right) \in \mathbb{R}^{n_{\text{out}} \times n}$ is the Jacobian matrix of the neural network with respect to the weight vector $\mathbf{w}$.
Using vector calculus chain rule, the second-order derivative, i.e., the Hessian matrix, of the loss function with respect to the weight vector $\mathbf{w}$ is given by:
$$
\begin{align}
\mathbf{H}\left(\mathbf{w}\right)
&= \frac{\partial^2 E\left(\mathbf{w}\right)}{\partial \mathbf{w}^2} \\
&= \frac{\partial \left( \frac{ \partial E\left(\mathbf{w}\right)}{\partial \mathbf{w}} \right)}{\partial \mathbf{w}} \\
&= \frac{\partial \nabla_{\mathbf{w}} E\left(\mathbf{w}\right)}{\partial \mathbf{w}} \\
&= \frac{1}{P} \sum_{i=1}^{P} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top} \frac{\partial \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right)}{\partial \mathbf{w}} + \frac{\partial \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top}}{\partial \mathbf{w}} \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right) \\
&= \frac{1}{P} \sum_{i=1}^{P} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right) + \frac{\partial \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top}}{\partial \mathbf{w}} \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right)\\
&= \frac{1}{P} \sum_{i=1}^{P} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right) + \frac{\partial^2 f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top}}{\partial \mathbf{w}^2} \left(f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i}\right)\\
\end{align}
$$
where $\mathbf{H}\left(\mathbf{w}\right) \in \mathbb{R}^{n \times n}$ is the Hessian matrix of the loss function with respect to the weight vector $\mathbf{w}$.
Computing $\frac{\partial^2 f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top}}{\partial \mathbf{w}^2}$ is even more expensive than computing the Hessian matrix $\mathbf{H}\left(\mathbf{w}\right)$ because the output of the neural network is a vector instead of a scalar. However, the Optimal Brain Surgeon paper assumes $f\left(\mathbf{w}, \mathbf{x}_{i}\right) - \mathbf{y}_{i} \approx \mathbf{0}$ when the neural network is well trained, so the second term in the above equation becomes negligible. Therefore, the Hessian matrix $\mathbf{H}\left(\mathbf{w}\right)$ can be approximated as follows:
$$
\begin{align}
\mathbf{H}\left(\mathbf{w}\right)
&\approx \frac{1}{P} \sum_{i=1}^{P} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)^{\top} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right)\\
\end{align}
$$
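A minimal sketch of this approximation (the helper name approximate_hessian and the random stand-in Jacobians are illustrative assumptions; in practice each Jacobian $\nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{i}\right) \in \mathbb{R}^{n_{\text{out}} \times n}$ would come from one backpropagation pass):

```python
import numpy as np

def approximate_hessian(jacobians):
    """Gauss-Newton style approximation used by OBS:
    H ~= (1/P) * sum_i J_i^T J_i, where J_i is the Jacobian of the network
    output with respect to the weights for sample i (shape n_out x n)."""
    P = len(jacobians)
    n = jacobians[0].shape[1]
    H = np.zeros((n, n))
    for J in jacobians:
        H += J.T @ J / P
    return H

# Toy example with random stand-in Jacobians (n_out = 3 outputs, n = 8 weights).
rng = np.random.default_rng(6)
jacobians = [rng.standard_normal((3, 8)) for _ in range(16)]
H_approx = approximate_hessian(jacobians)
print(H_approx.shape, np.allclose(H_approx, H_approx.T))   # (8, 8) True
```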
Now it seems that we can compute the Hessian matrix $\mathbf{H}\left(\mathbf{w}\right)$ by accumulating the outer products of the per-sample Jacobians over the calibration dataset, which requires performing the backpropagation algorithm $P$ times, and then invert it using Gaussian elimination. It turns out that this explicit matrix inversion can be avoided entirely.
We will rewrite the above equation in an iterative accumulation form:
$$
\begin{align}
\mathbf{H}\left(\mathbf{w}\right)_{t+1} = \mathbf{H}\left(\mathbf{w}\right)_{t} + \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)^{\top} \frac{\mathbf{I}}{P} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)\\
\end{align}
$$
where $\mathbf{H}\left(\mathbf{w}\right)_{t}$ is the Hessian estimate after accumulating the first $t$ samples, $\mathbf{H}\left(\mathbf{w}\right)_{0} = \alpha \mathbf{I}$, and $\alpha$ is a small positive constant so that $\mathbf{H}\left(\mathbf{w}\right)_{0}$ is non-zero and invertible. $\mathbf{H}\left(\mathbf{w}\right) = \mathbf{H}\left(\mathbf{w}\right)_{P}$ is the final Hessian matrix after $P$ iterations.
Recall the Woodbury matrix identity:
$$
\begin{align}
\left(A + B C D\right)^{-1} &= A^{-1} - A^{-1} B \left(C^{-1} + D A^{-1} B\right)^{-1} D A^{-1} \\
\end{align}
$$
In our case, we have $A = \mathbf{H}\left(\mathbf{w}\right)_{t}$, $B = \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)^{\top}$, $C = \frac{\mathbf{I}}{P}$, and $D = \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)$.
Using the Woodbury matrix identity, we can compute the inverse of the Hessian matrix as follows:
$$
\begin{align}
\mathbf{H}\left(\mathbf{w}\right)_{t+1}^{-1} &= \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1} - \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)^{\top} \left( \left(\frac{\mathbf{I}}{P}\right)^{-1} + \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right) \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)^{\top} \right)^{-1} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right) \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1}\\
&= \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1} - \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)^{\top} \left( P \mathbf{I} + \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right) \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right)^{\top} \right)^{-1} \nabla_{\mathbf{w}} f\left(\mathbf{w}, \mathbf{x}_{t+1}\right) \mathbf{H}\left(\mathbf{w}\right)_{t}^{-1}\\
\end{align}
$$
where $\mathbf{H}\left(\mathbf{w}\right)_{t}^{-1}$ is the inverse of the Hessian matrix at the $t$-th iteration, $\mathbf{H}\left(\mathbf{w}\right)_{0}^{-1} = \frac{1}{\alpha} \mathbf{I}$, and $\mathbf{H}\left(\mathbf{w}\right)^{-1} = \mathbf{H}\left(\mathbf{w}\right)_{P}^{-1}$ is the final inverse of the Hessian matrix after $P$ iterations.
Note that because the inverse of the Hessian matrix is built up iteratively from gradients with the above formula, the full $n \times n$ matrix inverse is never computed explicitly (only the small $n_{\text{out}} \times n_{\text{out}}$ matrix inside the parentheses is inverted), so the inverse of the Hessian matrix always “exists” numerically. This also makes the Optimal Brain Surgeon assumptions that the Hessian matrix is positive definite and invertible less critical in practice.
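Here is a minimal sketch of the iterative inverse-Hessian estimation (the helper name iterative_hessian_inverse, the value of $\alpha$, and the random stand-in Jacobians are illustrative assumptions), together with a sanity check against direct inversion:

```python
import numpy as np

def iterative_hessian_inverse(jacobians, alpha=1e-4):
    """Iteratively build H^{-1} with the Woodbury identity, starting from
    H_0^{-1} = (1/alpha) * I, without ever inverting the full n x n Hessian.
    Each update only solves a small n_out x n_out linear system."""
    P = len(jacobians)
    n = jacobians[0].shape[1]
    H_inv = np.eye(n) / alpha
    for J in jacobians:                        # J has shape (n_out, n)
        JH = J @ H_inv                         # J H^{-1}, shape (n_out, n)
        S = P * np.eye(J.shape[0]) + JH @ J.T  # small n_out x n_out matrix
        H_inv = H_inv - JH.T @ np.linalg.solve(S, JH)
    return H_inv

# Sanity check against direct inversion of alpha * I + (1/P) * sum_i J_i^T J_i.
rng = np.random.default_rng(7)
jacobians = [rng.standard_normal((2, 6)) for _ in range(10)]
H_direct = 1e-4 * np.eye(6) + sum(J.T @ J for J in jacobians) / len(jacobians)
print(np.allclose(iterative_hessian_inverse(jacobians), np.linalg.inv(H_direct)))   # True
```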
Optimal Brain Surgeon VS Optimal Brain Damage
Optimal Brain Surgeon and Optimal Brain Damage are two methods for pruning neural networks. They both use the second-order derivative of the loss function with respect to the weights, i.e., the Hessian matrix, to compute the saliency of the weights. To some extent, they are similar to each other. However, because the assumptions of the two algorithms are slightly different, their mathematical formulations are also different.
The Optimal Brain Damage algorithm assumes the Hessian matrix is diagonal with non-negative entries. Such a Hessian matrix is positive semi-definite.
The Optimal Brain Surgeon algorithm assumes the Hessian matrix is positive definite, but does not assume that the Hessian matrix is diagonal. The assumption used in the Optimal Brain Surgeon algorithm is less strict than the assumption used in the Optimal Brain Damage algorithm.
References
Hassibi, B., & Stork, D. G. (1993). Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. Advances in Neural Information Processing Systems (NIPS 1992).