TensorRT Implicit Weight Quantization
Introduction
Before TensorRT 10 introduced explicit quantization, which allows the user to specify exactly where to quantize in the neural network and what quantization parameters to use, TensorRT performed implicit quantization, which is rule based and whose behavior is sometimes not what the user expects. Specifically, for weight quantization, the scale factor was computed using a fixed formula. The user could not specify the scale factor for weight quantization, so some advanced quantization techniques could not be accelerated by TensorRT.
In this blog post, I would like to share a trick that allows the user to overcome this limitation and use custom scale factors for TensorRT implicit weight quantization.
TensorRT Implicit Weight Quantization
Given a weight tensor $x$, the way TensorRT quantizes weights can be described as computing the scale factor $s$ for symmetric quantization and then quantizing the weight tensor to a quantized weight tensor $x_{q}$ using the scale factor $s$.
Concretely, for INT8 symmetric quantization, the scale factor $s$ is computed as
$$
s = \frac{\max\left(\left\lvert x \right\rvert\right)}{127}
$$
The quantized weight tensor $x_{q}$ is computed as
$$
x_{q} = \text{round}\left(\text{clip}\left(\frac{x}{s}, -128, 127\right)\right)
$$
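As an illustration, the rule above can be written in a few lines of NumPy. This is a minimal sketch of the described behavior rather than TensorRT's actual implementation, and the function name is made up for this post.

```python
import numpy as np

def implicit_weight_quantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric INT8 scale factor: s = max(|x|) / 127.
    s = float(np.max(np.abs(x))) / 127.0
    # Quantize: clip to the INT8 range, then round to the nearest integer.
    x_q = np.round(np.clip(x / s, -128, 127)).astype(np.int8)
    return x_q, s

x = np.array([-1.27, -0.5, 0.3, 1.0], dtype=np.float32)
x_q, s = implicit_weight_quantize(x)
print(s)    # ~0.01
print(x_q)  # [-127  -50   30  100]
```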
What if we want to use a custom scale factor $s^{\prime} > 0$ which is different from the scale factor $s$ for quantization? There are two scenarios, $s^{\prime} < s$ and $s^{\prime} > s$.
When $s^{\prime} < s$, this implies that there are “outliers” in the weight tensor $x$ whose absolute values are larger than $127 s^{\prime}$. In this case, we have two proposed approaches.
One approach is to clip the weight tensor $x$ to a new tensor $x^{\prime}$ whose range is $[-127 s^{\prime}, 127 s^{\prime}]$. Concretely, the new weight tensor $x^{\prime}$ is computed as
$$
x^{\prime} = \text{clip}\left(x, -127 s^{\prime}, 127 s^{\prime}\right)
$$
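A minimal sketch of this clipping approach (the helper name is hypothetical, not a TensorRT API):

```python
import numpy as np

def clip_weights_to_custom_scale(x: np.ndarray, s_prime: float) -> np.ndarray:
    # Clip the weights to [-127 * s', 127 * s'] so that max(|x'|) / 127,
    # the scale factor TensorRT computes from x', equals s'.
    return np.clip(x, -127.0 * s_prime, 127.0 * s_prime)
```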
Proof
Because $s^{\prime} < s$, we must have $\max\left(\left\lvert x^{\prime} \right\rvert\right) = 127 s^{\prime} \leq \max\left(\left\lvert x \right\rvert\right)$. Therefore, the scale factor that TensorRT computes from the new weight tensor $x^{\prime}$ is $s^{\prime}$, i.e., $s^{\prime} = \frac{\max\left(\left\lvert x^{\prime} \right\rvert\right)}{127}$.
In addition, we would like the following equation to hold, so that quantizing the clipped weight tensor $x^{\prime}$ with the scale factor $s^{\prime}$ is (approximately) equivalent to quantizing the original weight tensor $x$ with $s^{\prime}$.
$$
s^{\prime} \text{round}\left(\text{clip}\left(\frac{x^{\prime}}{s^{\prime}}, -128, 127\right)\right) = s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right)
$$
To see this, on the left hand side, we have
$$
\begin{align}
s^{\prime} \text{round}\left(\text{clip}\left(\frac{x^{\prime}}{s^{\prime}}, -128, 127\right)\right)
&= s^{\prime} \text{round}\left(\text{clip}\left(\frac{\text{clip}\left(x, -127 s^{\prime}, 127 s^{\prime}\right)}{s^{\prime}}, -128, 127\right)\right) \\
&= s^{\prime} \text{round}\left(\text{clip}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -127, 127\right), -128, 127\right)\right) \\
&= s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -127, 127\right)\right) \\
\end{align}
$$
On the right hand side, however, we do not in general have
$$
\begin{align}
s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right)
&= s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -127, 127\right)\right) \\
\end{align}
$$
This is because $\max\left(\left\lvert x \right\rvert\right) = 127s > 127s^{\prime}$, so $\frac{x}{s^{\prime}}$ ranges over $[-127 \frac{s}{s^{\prime}}, 127 \frac{s}{s^{\prime}}]$, which extends beyond $[-128, 127]$. Some negative values whose magnitude exceeds $127 s^{\prime}$ can therefore be quantized to $-128$ on the right hand side, but only to $-127$ on the left hand side. Still, for most of the values in the weight tensor $x$, the two sides are equal.
This concludes the proof. $\square$
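As a concrete illustration of the corner case above, consider a single negative outlier weight (the numbers below are made up for illustration):

```python
import numpy as np

s_prime = 0.01  # custom scale factor
x = -1.279      # a negative outlier with |x| > 127 * s_prime = 1.27

# First approach: clip the weight first, then quantize and dequantize with s'.
x_clipped = np.clip(x, -127 * s_prime, 127 * s_prime)
lhs = s_prime * np.round(np.clip(x_clipped / s_prime, -128, 127))  # ≈ -1.27 (integer -127)

# Quantizing the original weight directly with s' gives a different value.
rhs = s_prime * np.round(np.clip(x / s_prime, -128, 127))          # ≈ -1.28 (integer -128)
```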
The other approach is to quantize and dequantize the weight tensor $x$ using the scale factor $s^{\prime}$. Concretely, the new weight tensor $x^{\prime}$ is computed as
$$
x^{\prime} = s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right)
$$
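A minimal sketch of this second, quantize-dequantize approach (again, the helper name is hypothetical):

```python
import numpy as np

def quantize_dequantize_weights(x: np.ndarray, s_prime: float) -> np.ndarray:
    # Quantize with the custom scale factor s' and immediately dequantize,
    # so that every weight lies exactly on the grid {k * s' : k = -128, ..., 127}.
    return s_prime * np.round(np.clip(x / s_prime, -128, 127))
```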
Proof
We first show that the following equation holds.
$$
s^{\prime} \text{round}\left(\text{clip}\left(\frac{x^{\prime}}{s^{\prime}}, -128, 127\right)\right) = s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right)
$$
On the left hand side, we have
$$
\begin{align}
s^{\prime} \text{round}\left(\text{clip}\left(\frac{x^{\prime}}{s^{\prime}}, -128, 127\right)\right)
&= s^{\prime} \text{round}\left(\text{clip}\left(\frac{s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right)}{s^{\prime}}, -128, 127\right)\right) \\
&= s^{\prime} \text{round}\left(\text{clip}\left(\text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right), -128, 127\right)\right) \\
&= s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right) \\
\end{align}
$$
which is exactly the right hand side.
In addition, we have to show that the scale factor that TensorRT computes from the new weight tensor $x^{\prime}$ is $s^{\prime}$, i.e., $s^{\prime} = \frac{\max\left(\left\lvert x^{\prime} \right\rvert\right)}{127}$.
$$
\begin{align}
\max\left(\lvert x^{\prime} \rvert\right) &= \max\left(\left\lvert s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right) \right\rvert\right) \\
&= s^{\prime} \max\left(\left\lvert \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right) \right\rvert\right) \\
&= s^{\prime} \left(\text{round}\left(\text{clip}\left(\frac{\max\left(\left\lvert x \right\rvert\right)}{s^{\prime}}, -128, 127\right)\right) \right) \\
&= s^{\prime} \left(\text{round}\left(\text{clip}\left(\frac{127s}{s^{\prime}}, -128, 127\right)\right) \right) \\
&= s^{\prime} \left(\text{round}\left(127\right) \right) \\
&= 127 s^{\prime}
\end{align}
$$
Therefore, the scale factor that TensorRT computes from the new weight tensor $x^{\prime}$ is $s^{\prime}$, i.e., $s^{\prime} = \frac{\max\left(\left\lvert x^{\prime} \right\rvert\right)}{127}$.
This concludes the proof. $\square$
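The two properties proved above can also be checked numerically with a small example (the weight values and the custom scale factor below are arbitrary choices for illustration):

```python
import numpy as np

x = np.array([-0.4, -0.3, 0.2, 1.0])
s = np.max(np.abs(x)) / 127.0        # the scale factor TensorRT would compute from x
s_prime = 0.005                      # a custom, smaller scale factor (s' < s)

# Quantize and dequantize x with the custom scale factor.
x_prime = s_prime * np.round(np.clip(x / s_prime, -128, 127))

# TensorRT's formula applied to x' recovers s' ...
print(np.isclose(np.max(np.abs(x_prime)) / 127.0, s_prime))  # True

# ... and quantizing x' with s' gives the same integers as quantizing x with s'.
q_from_x = np.round(np.clip(x / s_prime, -128, 127))
q_from_x_prime = np.round(np.clip(x_prime / s_prime, -128, 127))
print(np.array_equal(q_from_x, q_from_x_prime))              # True
```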
Therefore, the second approach should be preferred, because it is always correct, whereas the first approach is only approximately correct.
When $s^{\prime} > s$, this implies that the quantized weight tensor $x_q$ would occupy only part of the INT8 range $[-128, 127]$. This usually means something is incorrect and some of the available bit width is wasted. In this case, it is impossible to make TensorRT implicit quantization use a custom scale factor $s^{\prime} > s$. However, if we use the weight tensor $x$ as it is, TensorRT will use the scale factor $s$ instead of $s^{\prime}$ for quantization, and the resolution of the weight quantization will be finer than it would be with the scale factor $s^{\prime}$. This usually does not reduce the accuracy of the quantized operation. To keep the quantization behaviors as close as possible, we can optionally quantize and dequantize the weight tensor $x$ using $s^{\prime}$ and use the resulting weight tensor $x^{\prime}$ for TensorRT implicit quantization. Concretely, the new weight tensor $x^{\prime}$ is computed as
$$
x^{\prime} = s^{\prime} \text{round}\left(\text{clip}\left(\frac{x}{s^{\prime}}, -128, 127\right)\right)
$$
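For completeness, here is a small numerical illustration of this $s^{\prime} > s$ case (again with made-up values):

```python
import numpy as np

x = np.array([-0.412, -0.3, 0.234, 1.0])
s = np.max(np.abs(x)) / 127.0   # ≈ 0.00787, the scale factor TensorRT computes from x
s_prime = 0.02                  # a custom, larger scale factor (s' > s)

# With s', the quantized integers cover only a small part of [-128, 127].
print(np.round(np.clip(x / s_prime, -128, 127)))  # [-21. -15.  12.  50.]

# The optional quantize-dequantize snaps the weights onto the s' grid; TensorRT then
# computes its own scale factor from x', but the quantization behavior stays close.
x_prime = s_prime * np.round(np.clip(x / s_prime, -128, 127))
print(x_prime)                                    # ≈ [-0.42 -0.3   0.24  1.  ]
```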
Conclusions
To use any custom scale factor $s^{\prime}$ for TensorRT implicit weight quantization before TensorRT 10, given the weight tensor $x$, we could quantize and dequantize the weight tensor $x$ using the scale factor $s^{\prime}$ and use the resulting weight tensor $x^{\prime}$ for TensorRT implicit quantization. This trick preserves the desired quantization behaviors as much as possible.
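Putting it together, the whole workflow can be sketched as preparing every layer's weights with its desired custom scale factor before handing them to the TensorRT network definition. The function, layer names, and dictionaries below are hypothetical and only illustrate the trick described in this post:

```python
import numpy as np

def prepare_weights_for_custom_scales(weights: dict, scales: dict) -> dict:
    # For every layer, quantize and dequantize its weight tensor with the desired
    # custom scale factor. The resulting tensors are then set as the layer weights
    # before building the engine with INT8 implicit quantization.
    prepared = {}
    for name, w in weights.items():
        s_prime = scales[name]
        prepared[name] = (s_prime * np.round(np.clip(w / s_prime, -128, 127))).astype(w.dtype)
    return prepared

# Hypothetical usage with two layers and per-layer custom scale factors.
weights = {"conv1": np.random.randn(8, 3, 3, 3).astype(np.float32),
           "conv2": np.random.randn(16, 8, 3, 3).astype(np.float32)}
scales = {"conv1": 0.01, "conv2": 0.02}
prepared = prepare_weights_for_custom_scales(weights, scales)
```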
References
TensorRT Implicit Weight Quantization
https://leimao.github.io/blog/TensorRT-Implicit-Weight-Quantization/