Parameter Importance Approximation Via Taylor Expansion In Neural Network Pruning

Introduction

In neural network pruning, we usually evaluate the importance of the parameters in a neural network using some criterion and remove the parameters with the smallest importance. In addition to simply using the absolute value of a parameter as its importance, we can also evaluate the importance as the change in the loss function when the parameter is removed, which is more intuitively meaningful.

However, such evaluation is usually computationally expensive. In this blog post, I would like to discuss how to approximate the importance of the parameters in a neural network using Taylor expansion to accelerate the parameter importance evaluation process.

Parameter Importance Evaluation

The importance of a parameter in a neural network can be quantified by the change in the loss function when the parameter is removed.

The importance of a parameter $w_m$ in a neural network is a function of $w_m$, denoted as $I(w_m)$. It can be formulated as follows.

$$I(w_m) = \big( L(D, W) - L(D, W \mid w_m = 0) \big)^2$$

where $L(D, W)$ is the loss function of the neural network with all parameters $W$, and $L(D, W \mid w_m = 0)$ is the loss function of the neural network with the parameter $w_m$ set to zero.
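
For example, with purely hypothetical numbers, if the loss is $0.50$ when $w_m$ is kept at its trained value and $0.56$ when $w_m$ is set to zero, then $I(w_m) = (0.50 - 0.56)^2 = 0.0036$.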

More generally, the parameters can be grouped, and the importance of a group of parameters can be quantified by the change in the loss function when the group of parameters $W_S$ is removed.

$$I(W_S) = \big( L(D, W) - L(D, W \mid W_S = 0) \big)^2$$

where $L(D, W)$ is the loss function of the neural network with all parameters $W$, and $L(D, W \mid W_S = 0)$ is the loss function of the neural network with the group of parameters $W_S$ set to zero.

Quantifying the importance of all the parameters in a neural network this way can be computationally expensive, because the formula above has to be applied to each parameter or each group of parameters, each application requires evaluating the loss over the dataset, and the number of parameters or parameter groups in a neural network can be very large.

We need to find a way to approximate the importance of the parameters in a neural network so that we can reduce the computational cost.
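
To make the cost concrete, below is a minimal sketch of the brute-force evaluation in PyTorch. The model, the parameter grouping, and the `compute_loss` helper (which evaluates the loss over the dataset $D$) are assumptions for illustration, not part of the original derivation.

```python
import torch


@torch.no_grad()
def exact_group_importance(model, groups, compute_loss):
    """Exact importance I(W_S) = (L(D, W) - L(D, W | W_S = 0))^2 per group.

    `groups` is a list of parameter tensors (one group per entry), and
    `compute_loss` is a hypothetical callable that evaluates the loss of
    `model` over the dataset D and returns a scalar. One extra full loss
    evaluation is needed for every group, which is what makes this slow.
    """
    baseline_loss = compute_loss(model)  # L(D, W)
    importances = []
    for group in groups:
        original = group.detach().clone()  # remember W_S
        group.zero_()                      # set W_S = 0
        pruned_loss = compute_loss(model)  # L(D, W | W_S = 0)
        group.copy_(original)              # restore the original values
        importances.append(float((baseline_loss - pruned_loss) ** 2))
    return importances
```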

Parameter Importance Approximation Via Taylor Expansion

We define the difference between the loss function of the neural network with all original parameters and the loss function of the neural network in which the group of parameters $W_S$ is subject to change as a function of $W_S$, denoted as $f(W_S)$.

$$f(W_S) = L(D, W) - L(D, W \mid W_S)$$

Note that

$$f(0) = L(D, W) - L(D, W \mid W_S = 0)$$

We can approximate the function $f(W_S)$ using a Taylor expansion around the original values of $W_S$ in $W$, which we denote as $W_S^{\ast}$.

$$f(W_S) = f(W_S^{\ast}) + \nabla f(W_S^{\ast})^{\top} (W_S - W_S^{\ast}) + \frac{1}{2} (W_S - W_S^{\ast})^{\top} \nabla^2 f(W_S^{\ast}) (W_S - W_S^{\ast}) + \cdots$$

We notice that $f(W_S^{\ast}) = L(D, W) - L(D, W \mid W_S = W_S^{\ast}) = 0$. Therefore, we have

$$f(W_S) = \nabla f(W_S^{\ast})^{\top} (W_S - W_S^{\ast}) + \frac{1}{2} (W_S - W_S^{\ast})^{\top} \nabla^2 f(W_S^{\ast}) (W_S - W_S^{\ast}) + \cdots$$

This function evaluated at $W_S = 0$ is

$$f(0) = -\nabla f(W_S^{\ast})^{\top} W_S^{\ast} + \frac{1}{2} {W_S^{\ast}}^{\top} \nabla^2 f(W_S^{\ast}) W_S^{\ast} + \cdots$$

The first-order derivative $\nabla f(W_S)$ is the negative of the gradient of the loss function $L$ with respect to the parameters $W_S$.

$$
\begin{aligned}
\nabla f(W_S)
&= \frac{\partial f(W_S)}{\partial W_S} \\
&= \frac{\partial \big( L(D, W) - L(D, W \mid W_S) \big)}{\partial W_S} \\
&= \frac{\partial L(D, W)}{\partial W_S} - \frac{\partial L(D, W \mid W_S)}{\partial W_S} \\
&= 0 - \frac{\partial L(D, W \mid W_S)}{\partial W_S} \\
&= -\nabla L(D, W \mid W_S)
\end{aligned}
$$

The first-order derivative evaluated at $W_S^{\ast}$, $\nabla f(W_S^{\ast}) = -\nabla L(D, W \mid W_S = W_S^{\ast})$, happens to have already been computed by backpropagation during neural network training, so we can get it for free.
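
For illustration only (a sketch, assuming a toy PyTorch model and a single batch standing in for the dataset $D$), the gradient needed for this term is exactly what the usual backward pass already leaves in `param.grad`:

```python
import torch

# Toy model and batch; in practice these come from the existing training loop.
model = torch.nn.Linear(8, 4)
inputs, targets = torch.randn(16, 8), torch.randn(16, 4)

loss = torch.nn.functional.mse_loss(model(inputs), targets)  # L(D, W)
loss.backward()  # the same backpropagation used for training

# After backpropagation, the gradient of the loss with respect to each
# parameter group, evaluated at the current values W_S = W_S^*, is already
# stored in `param.grad` at no extra cost.
for name, param in model.named_parameters():
    print(name, param.grad.shape)
```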

Similarly, the second-order derivative $\nabla^2 f(W_S)$ is the negative of the Hessian of the loss function $L$ with respect to the parameters $W_S$.

$$\nabla^2 f(W_S) = -\nabla^2 L(D, W \mid W_S)$$

The second-order derivative $\nabla^2 f(W_S)$ evaluated at $W_S^{\ast}$, however, is not free to compute. So usually we only keep the first-order term of the Taylor expansion for the approximation.

Thus, we have

$$f(0) \approx \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} W_S^{\ast}$$

and the importance of the group of parameters WS can be approximated as

$$I(W_S) = \big( f(0) \big)^2 \approx \big( \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} W_S^{\ast} \big)^2$$

Note that this is just the square of the dot product between the gradient vector of the loss with respect to the group of parameters $W_S$ and the vector of the original values $W_S^{\ast}$ of that group, which is straightforward to compute.
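
As a sketch of how cheap this is (assuming, for illustration, that every parameter tensor of a toy PyTorch model forms one group), the approximated importance is one dot product and one square per group:

```python
import torch

# Toy setup; in practice the model and the loss come from the training loop.
model = torch.nn.Linear(8, 4)
inputs, targets = torch.randn(16, 8), torch.randn(16, 4)

loss = torch.nn.functional.mse_loss(model(inputs), targets)  # L(D, W)
loss.backward()  # gradients are now stored in param.grad

# First-order Taylor approximation of the group importance:
# I(W_S) ~ (gradient of L with respect to W_S, dotted with W_S^*)^2,
# treating each parameter tensor as one group for illustration.
approximate_importance = {
    name: float((param.grad * param).sum() ** 2)
    for name, param in model.named_parameters()
}
print(approximate_importance)
```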

An alternative way to derive these formulas is to perform the Taylor expansion directly on the loss function $L$ with respect to the group of parameters $W_S$.

$$L(D, W \mid W_S) = L(D, W \mid W_S = W_S^{\ast}) + \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} (W_S - W_S^{\ast}) + \frac{1}{2} (W_S - W_S^{\ast})^{\top} \nabla^2 L(D, W \mid W_S = W_S^{\ast}) (W_S - W_S^{\ast}) + \cdots$$

When the group of parameters $W_S$ is set to zero, we have

$$
\begin{aligned}
L(D, W \mid W_S = 0)
&= L(D, W \mid W_S = W_S^{\ast}) + \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} (-W_S^{\ast}) + \frac{1}{2} (-W_S^{\ast})^{\top} \nabla^2 L(D, W \mid W_S = W_S^{\ast}) (-W_S^{\ast}) + \cdots \\
&= L(D, W \mid W_S = W_S^{\ast}) - \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} W_S^{\ast} + \frac{1}{2} {W_S^{\ast}}^{\top} \nabla^2 L(D, W \mid W_S = W_S^{\ast}) W_S^{\ast} + \cdots
\end{aligned}
$$

Thus, the change in the loss function f(0) can be approximated as

$$
\begin{aligned}
f(0)
&= L(D, W) - L(D, W \mid W_S = 0) \\
&= L(D, W \mid W_S = W_S^{\ast}) - L(D, W \mid W_S = 0) \\
&= L(D, W \mid W_S = W_S^{\ast}) - \Big( L(D, W \mid W_S = W_S^{\ast}) - \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} W_S^{\ast} + \frac{1}{2} {W_S^{\ast}}^{\top} \nabla^2 L(D, W \mid W_S = W_S^{\ast}) W_S^{\ast} + \cdots \Big) \\
&= \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} W_S^{\ast} - \frac{1}{2} {W_S^{\ast}}^{\top} \nabla^2 L(D, W \mid W_S = W_S^{\ast}) W_S^{\ast} - \cdots \\
&\approx \nabla L(D, W \mid W_S = W_S^{\ast})^{\top} W_S^{\ast}
\end{aligned}
$$

Therefore, computing the importance of the group of parameters $W_S$ using this approximation can be much faster than computing it using the original formula.
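
As a small sanity check under the same toy assumptions as the sketches above, one can compare the exact change-in-loss importance of a single group (here, the bias of a toy linear layer) against the first-order Taylor estimate:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 4)
inputs, targets = torch.randn(16, 8), torch.randn(16, 4)


def loss_fn():
    # The loss on a fixed toy batch standing in for the dataset D.
    return torch.nn.functional.mse_loss(model(inputs), targets)


# Exact importance of the bias group: zero it out and re-evaluate the loss.
with torch.no_grad():
    baseline_loss = loss_fn()               # L(D, W)
    original_bias = model.bias.clone()
    model.bias.zero_()
    pruned_loss = loss_fn()                 # L(D, W | W_S = 0)
    model.bias.copy_(original_bias)
exact_importance = float((baseline_loss - pruned_loss) ** 2)

# First-order Taylor estimate of the same quantity from one backward pass.
loss_fn().backward()
taylor_importance = float((model.bias.grad * model.bias).sum() ** 2)

# The estimate should roughly track the exact value for this small group.
print(exact_importance, taylor_importance)
```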

Conclusions

Structured neural network pruning usually requires grouping the parameters in the neural network and pruning the groups of parameters with the smallest importance. The importance of a group of parameters can be approximated using a first-order Taylor expansion, which is much faster to compute than the original non-approximated formula.

References

Parameter Importance Approximation Via Taylor Expansion In Neural Network Pruning

https://leimao.github.io/blog/Parameter-Importance-Approximation-Via-Taylor-Expansion-In-Neural-Network-Pruning/

Author

Lei Mao

Posted on

06-20-2024

Updated on

07-07-2024
