# Quantization Unit Test

## Introduction

When implementing quantization operations, creating unit tests is often a headache, because most of the time there is no ground truth quantized operation reference implementation to compare the results against.

In this blog post, I would like to quickly discuss how to test quantization operations with, and sometimes without, floating point operation reference implementations.

## Quantization Unit Test

### The Correct Approach

To test quantization using floating point operation reference implementations, the idea is similar to the fake quantization used in quantization aware training.

1. Create floating point input tensors $x$, usually filled with random values, and compute their scaling factors $s_{x}$.
2. Quantize the floating point input tensors $x$, resulting in the quantized input tensors $x_{q}$.
3. Dequantize the quantized input tensors, resulting in the dequantized input tensors $x^{\prime}$.
4. Feed the dequantized input tensors to the floating point operation reference implementation $f$, and collect the floating point reference output tensors $y^{\prime} = f(x^{\prime})$, and compute their scaling factors $s_{y}$.
5. Quantize the floating point reference output tensors $y^{\prime}$, resulting in the quantized output tensors $y_{q}$.

To unit test the quantization operation implementation $f_{q}$, we feed the reference quantized input tensors $x_{q}$, the input tensor scaling factors $s_{x}$, and the output tensor scaling factors $s_{y}$ to $f_{q}$, and compare its quantized output tensors $y^{\prime}_{q} = f_{q}(x_{q}, s_{x}, s_{y})$ with the reference quantized output tensors $y_{q}$.

If $f_{q}$ is implemented correctly, the quantized output tensors $y^{\prime}_{q} = f_{q}(x_{q}, s_{x}, s_{y})$ should ideally be bitwise-identical to the reference quantized output tensors $y_{q}$.
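The five steps above, plus the bitwise comparison, can be sketched as follows. The helper functions and the quantized ReLU standing in for $f_{q}$ are hypothetical placeholders assuming symmetric int8 quantization with per-tensor scales; the actual quantization scheme depends on the implementation under test.

```python
import numpy as np


def compute_scale(x: np.ndarray, num_bits: int = 8) -> np.floating:
    # Symmetric quantization: the scale maps max |x| to the int8 range.
    return np.abs(x).max() / (2 ** (num_bits - 1) - 1)


def quantize(x: np.ndarray, scale: np.floating) -> np.ndarray:
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)


def dequantize(x_q: np.ndarray, scale: np.floating) -> np.ndarray:
    return x_q.astype(np.float32) * scale


def f(x: np.ndarray) -> np.ndarray:
    # Floating point reference implementation; ReLU as a placeholder.
    return np.maximum(x, 0.0)


def f_q(x_q: np.ndarray, s_x: np.floating, s_y: np.floating) -> np.ndarray:
    # Hypothetical quantized ReLU under test: clamp negatives to zero,
    # then rescale from the input scale to the output scale.
    y = np.maximum(x_q.astype(np.float32), 0.0) * s_x / s_y
    return np.clip(np.round(y), -128, 127).astype(np.int8)


rng = np.random.default_rng(0)

# Step 1: random floating point inputs and their scaling factor.
x = rng.standard_normal((4, 4)).astype(np.float32)
s_x = compute_scale(x)
# Step 2: quantize the inputs.
x_q = quantize(x, s_x)
# Step 3: dequantize the quantized inputs.
x_prime = dequantize(x_q, s_x)
# Step 4: run the floating point reference on the dequantized inputs.
y_prime = f(x_prime)
s_y = compute_scale(y_prime)
# Step 5: quantize the reference outputs.
y_q = quantize(y_prime, s_y)

# Unit test: the quantized op must reproduce the reference bitwise.
y_prime_q = f_q(x_q, s_x, s_y)
assert np.array_equal(y_prime_q, y_q)
```

Note that the comparison uses exact equality, not a tolerance: because $f_{q}$ consumes the same $x_{q}$, $s_{x}$, and $s_{y}$ that produced $y_{q}$, a correct implementation leaves no room for quantization error to creep in between the two sides.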

### The Incorrect Approach

The above approach is the correct way to unit test quantization operation implementations. However, it is quite possible that the developer does not have access to the floating point reference implementation $f$; all the developer has are the floating point reference input tensors $x$ and the floating point reference output tensors $y$ produced by $f$. Can the developer still do anything to test the quantization operation implementation $f_{q}$?

Intuitively, the developer will do the following.

1. Given the floating point input tensors $x$, compute their scaling factors $s_{x}$.
2. Given the floating point output tensors $y$, compute their scaling factors $s_{y}$.
3. Quantize the floating point input tensors $x$, resulting in the quantized input tensors $x_{q}$.
4. Feed the quantized input tensors $x_{q}$ to the quantization operation implementation $f_{q}$, and collect the quantized output tensors $y_{q} = f_{q}(x_{q}, s_{x}, s_{y})$.
5. Dequantize the quantized output tensors $y_{q}$, resulting in the floating point output tensors $y^{\prime}$.
6. Compare the floating point output tensors $y^{\prime}$ with the floating point reference output tensors $y$.
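The six steps above can be sketched as follows. As before, the helpers and the quantized ReLU standing in for $f_{q}$ are hypothetical placeholders assuming symmetric int8 quantization, and the "reference" outputs $y$ are simulated here because, by assumption, only the tensors are available, not $f$ itself.

```python
import numpy as np


def compute_scale(x: np.ndarray, num_bits: int = 8) -> np.floating:
    # Symmetric quantization: the scale maps max |x| to the int8 range.
    return np.abs(x).max() / (2 ** (num_bits - 1) - 1)


def quantize(x: np.ndarray, scale: np.floating) -> np.ndarray:
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)


def dequantize(x_q: np.ndarray, scale: np.floating) -> np.ndarray:
    return x_q.astype(np.float32) * scale


def f_q(x_q: np.ndarray, s_x: np.floating, s_y: np.floating) -> np.ndarray:
    # Hypothetical quantized ReLU under test.
    y = np.maximum(x_q.astype(np.float32), 0.0) * s_x / s_y
    return np.clip(np.round(y), -128, 127).astype(np.int8)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4)).astype(np.float32)
# Pretend these reference outputs came from the inaccessible f.
y = np.maximum(x, 0.0)

# Steps 1-3: scaling factors and quantized inputs.
s_x = compute_scale(x)
s_y = compute_scale(y)
x_q = quantize(x, s_x)
# Steps 4-5: run the quantized op and dequantize its outputs.
y_prime = dequantize(f_q(x_q, s_x, s_y), s_y)
# Step 6: the comparison yields a nonzero test error even though
# f_q is correct, because quantization error accumulates end to end.
delta_y = np.abs(y_prime - y).max()
```

Even with this correct $f_{q}$, `delta_y` is nonzero; the test can only check it against an arbitrary tolerance, which is exactly the ambiguity discussed below.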

Unless the developer is extremely lucky, the floating point output tensors $y^{\prime}$ will not be bitwise-identical to the floating point reference output tensors $y$. The difference between the two is the test error $\Delta y = | y^{\prime} - y |$. I have seen developers use the test error $\Delta y$ to determine whether the quantization operation implementation $f_{q}$ is implemented correctly.

If the quantization operation implementation $f_{q}$ is implemented correctly, the test error $\Delta y$ is just the accumulated quantization error, which is completely normal and can still be very large. If $f_{q}$ is implemented incorrectly, $\Delta y$ additionally contains the error due to the incorrect implementation. Conversely, even a very small $\Delta y$ does not guarantee that $f_{q}$ is implemented correctly. Using the value of $\Delta y$ to decide whether $f_{q}$ is implemented correctly is therefore inherently ambiguous.

So instead of struggling with the test error $\Delta y$, especially when $\Delta y$ is large, the developer should switch to the correct approach mentioned above, either by asking other developers to generate the reference quantized tensors and scaling factors or by obtaining the floating point reference implementation $f$.

Lei Mao

03-25-2024
