# How To Debug Deep Learning Inference Applications

## Introduction

A deep learning inference application is a software system that takes a set of inputs and produces a set of outputs using a set of operations whose parameters are from the deep learning models trained by deep learning training applications. The application inputs are usually a set of tensors, and the application outputs are usually also a set of tensors which human beings can interpret. However, it’s not always necessary that human beings can interpret the inputs and the outputs from the intermediate operations.

As mentioned in my previous blog post “Numerical Errors In HPC and Deep Learning”, the deep learning inference application usually uses different software from the deep learning training application, so the numerical behaviors of the two can differ for various reasons, such as software bugs, implementation differences, and numerical errors. It’s quite common that a deep learning model that was evaluated well in the evaluation phase of the deep learning training application produces unexpected results when it is deployed in the deep learning inference application, which is not acceptable.

In this blog post, I would like to discuss the first principles of evaluating deep learning inference applications and how to debug deep learning inference applications.

## Deep Learning Inference Application Software Architecture

Just like any other software system, the deep learning inference application can be viewed as a directed graph of operations in which each operation takes a set of tensors as inputs and produces a set of tensors as outputs. In the simplest scenario, the directed graph of operations degenerates to a sequence of operations where each operation only takes the output tensors of the previous operation as inputs.

If we describe each operation using mathematical notations, we can have the following equation.

\begin{aligned} y = f(x, \theta) \end{aligned}

where $x$ and $y$ are the input and output tensors produced during the application runtime, respectively, and $\theta$ is the operation parameters which are hard-coded or learned from the training phase of the deep learning training applications.

Given the above equation, we can have the following equation for the whole deep learning inference application.

\begin{aligned} y = f_n(f_{n-1}(\cdots f_1(x, \theta_1), \theta_2) \cdots, \theta_n) \end{aligned}

where $f_i$ is the $i$-th operation, and $\theta_i$ is the parameters of the $i$-th operation.

If we expand the above equation, we can have the following equation.

\begin{aligned} x_1 &= x \\ x_2 &= f_1(x_1, \theta_1) \\ x_3 &= f_2(x_2, \theta_2) \\ &\vdots \\ x_n &= f_{n-1}(x_{n-1}, \theta_{n-1}) \\ x_{n+1} &= f_n(x_n, \theta_n) \\ y &= x_{n+1} \\ \end{aligned}

where $x_i$ is the input of the $i$-th operation, and $x_{i+1}$ is the output of the $i$-th operation, which is also the input of the $(i+1)$-th operation.
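The expanded form maps directly onto a loop that keeps every intermediate tensor. Below is a minimal sketch in Python with NumPy; the helper `run_pipeline` and the toy operations `scale`, `shift`, and `relu` are hypothetical names chosen only for illustration and do not come from any particular inference framework.

```python
from typing import Callable, List, Sequence

import numpy as np

# An operation maps (input tensor, parameters) to an output tensor.
Operation = Callable[[np.ndarray, np.ndarray], np.ndarray]


def run_pipeline(
    x: np.ndarray,
    operations: Sequence[Operation],
    parameters: Sequence[np.ndarray],
) -> List[np.ndarray]:
    """Return [x_1, x_2, ..., x_{n+1}] where x_{i+1} = f_i(x_i, theta_i)."""
    tensors = [x]  # x_1 = x
    for f_i, theta_i in zip(operations, parameters):
        tensors.append(f_i(tensors[-1], theta_i))
    return tensors


# Hypothetical toy operations, chosen only for illustration.
def scale(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    return x * theta


def shift(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    return x + theta


def relu(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)  # theta is unused for a parameter-free operation


x = np.array([1.0, -2.0, 3.0])
tensors = run_pipeline(
    x, [scale, shift, relu], [np.array(2.0), np.array(1.0), np.array(0.0)]
)
y = tensors[-1]  # y = x_{n+1}
print(y)
```

Keeping the whole list of tensors, rather than only the final output, is what later makes it possible to inspect or reuse any intermediate $x_i$.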

## Deep Learning Training Application Evaluation Phase Software Architecture

The deep learning training application corresponding to the deep learning inference application can also be viewed as a directed graph of operations in which each operation takes a set of tensors as inputs and produces a set of tensors as outputs. If we ignore the model training operations and only focus on the model evaluation operations, the mathematical description of the deep learning training application can be the same as the deep learning inference application.

Concretely, if we describe each operation using mathematical notations, we can have the following equation.

\begin{aligned} y = g(x, \theta) \end{aligned}

where $x$ and $y$ are the input and output tensors produced during the application runtime, respectively, and $\theta$ is the operation parameters which are hard-coded or learned from the training phase of the deep learning training applications.

Given the above equation, we can have the following equation for the whole evaluation phase of the deep learning training application.

\begin{aligned} y = g_n(g_{n-1}(\cdots g_1(x, \theta_1), \theta_2) \cdots, \theta_n) \end{aligned}

where $g_i$ is the $i$-th operation, and $\theta_i$ is the parameters of the $i$-th operation.

If we expand the above equation, we can have the following equation.

\begin{aligned} x_1 &= x \\ x_2 &= g_1(x_1, \theta_1) \\ x_3 &= g_2(x_2, \theta_2) \\ &\vdots \\ x_n &= g_{n-1}(x_{n-1}, \theta_{n-1}) \\ x_{n+1} &= g_n(x_n, \theta_n) \\ y &= x_{n+1} \\ \end{aligned}

where $x_i$ is the input of the $i$-th operation, and $x_{i+1}$ is the output of the $i$-th operation, which is also the input of the $(i+1)$-th operation.

## First Principles of Evaluating Deep Learning Inference Applications

After the neural network is trained in the training phase of the deep learning training application, it is evaluated in the evaluation phase of the deep learning training application using the trained model parameters, the evaluation dataset, and some evaluation metrics. The evaluation metrics can be naive or comprehensive or a combination of both. For example, the naive evaluation metric can just be human examination of the visualization of the model outputs, and the comprehensive evaluation metric can be the accuracy of the model outputs on the entire evaluation dataset. When the neural network is evaluated to have sufficiently good performance, it is deployed to the deep learning inference application.

Ideally, the goal of the deep learning inference application is to produce the same outputs as the deep learning training application evaluation phase given the same inputs, so that the deep learning inference application would have exactly the same performance as the deep learning training application evaluation phase by the evaluation metrics. However, because of various reasons we will mention later, the deep learning inference application can produce different outputs from the deep learning training application evaluation phase given the same inputs. In that case, simply comparing the outputs of the deep learning inference application and the deep learning training application evaluation phase could hardly help us analyze the differences between the two applications.

Therefore, the practical goal of the deep learning inference application is to produce outputs that are as close as possible to the outputs of the deep learning training application evaluation phase given the same inputs so that the deep learning inference application would have similar performance as the deep learning training application evaluation phase by the evaluation metrics. Because it is difficult to define how close is close enough, we should primarily focus on the deep learning inference application performance evaluation using evaluation metrics used for the deep learning training application evaluation phase rather than comparing the outputs from the deep learning inference application and the deep learning training application evaluation phase.

Suppose we have two functions, $f$ and $g$, and an input $x$. When the same input $x$ is given to the two functions, the difference between the outputs of the two functions can be described as follows.

\begin{aligned} \Delta y &= f(x) - g(x) \\ \end{aligned}

where $\Delta y$ is the difference between the outputs of the two functions.

If $f$ and $g$ are not exactly the same, even if they are similar, usually $\Delta y \neq 0$. Without any further knowledge about $f$ and $g$, it is difficult to analyze whether $\Delta y$ is small enough or not.

Suppose we have another function $h$, an input $x$ and a similar input $x^{\prime} = x + \Delta x$. When the input $x$ and the similar input $x^{\prime}$ are given to the function $h$, the difference between the outputs of the function $h$ can be described as follows.

\begin{aligned} \Delta y &= h(x) - h(x^{\prime}) \\ &= h(x) - h(x + \Delta x) \\ \end{aligned}

where $\Delta y$ is the difference between the outputs of the function $h$. If $\Delta x \neq 0$, even if $\Delta x$ is small, usually $\Delta y \neq 0$. Without any further knowledge about $h$, it is also difficult to analyze whether $\Delta y$ is small enough or not.

In the context of evaluating the performance of the deep learning inference application, $f$ describes the deep learning inference application, $g$ describes the deep learning training application evaluation phase, $h$ describes the evaluation metrics. The performance difference between the deep learning inference application and the deep learning training application evaluation phase can be described as follows.

\begin{aligned} \Delta y &= h(f(x)) - h(g(x)) \\ &= h(f(x)) - h(f(x) + (g(x) - f(x))) \\ \end{aligned}

Even if the deep learning inference application and the deep learning training application evaluation phase are similar so that $g(x) - f(x)$ is very small but non-zero, without further analyzing the evaluation metrics $h$, we would not know whether the performance difference $\Delta y$ is small enough or not. This suggests that we could not evaluate the performance of the deep learning inference application by just comparing the outputs of the deep learning inference application and the deep learning training application evaluation phase.
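To make this concrete, here is a small sketch, assuming that the evaluation metric $h$ is top-1 accuracy and using made-up logits. The function `top1_accuracy` and all of the numbers are hypothetical and only serve the illustration.

```python
import numpy as np


def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Evaluation metric h: fraction of rows whose argmax equals the label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))


labels = np.array([0, 1])

# g(x): outputs of the evaluation phase. The second row is a near-tie.
g_out = np.array([[5.0, 1.0], [2.0, 2.001]])

# f(x): outputs of the inference application, off by about 0.002 per element.
f_out = g_out + np.array([[0.002, -0.002], [0.002, -0.002]])

max_abs_diff = float(np.max(np.abs(f_out - g_out)))
acc_g = top1_accuracy(g_out, labels)
acc_f = top1_accuracy(f_out, labels)
print(max_abs_diff, acc_g, acc_f)
```

The same tiny perturbation leaves the first prediction untouched and flips the second one, so the accuracy drops from 1.0 to 0.5 even though every element of $g(x) - f(x)$ is about $0.002$ in magnitude. Whether a given output difference matters depends on $h$ and on where the outputs sit relative to its decision boundaries.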

In the context of evaluating the correctness of the operators in the deep learning inference application, $f$ describes an operator in the deep learning inference application, $g$ describes the corresponding operator in the deep learning training application evaluation phase, and $h$, unlike in the previous context, describes the next operator in the deep learning training application evaluation phase. Given the same input $x$, the difference between the outputs from the operators $f \rightarrow h$ and $g \rightarrow h$ can be described as follows.

\begin{aligned} \Delta y &= h(f(x)) - h(g(x)) \\ &= h(f(x)) - h(f(x) + (g(x) - f(x))) \\ \end{aligned}

Even if the two operators $f$ and $g$ are similar so that $g(x) - f(x)$ is very small but non-zero, without further analyzing the next operator $h$, we would not know whether the numerical difference $\Delta y$ is small enough or not. In addition, in practice, the $h$ operator used for the deep learning training application evaluation phase is usually different from the $h$ operator used for the deep learning inference application, which can make the numerical difference $\Delta y$ even larger. Moreover, given there are many operators in the application, the numerical difference $\Delta y$ can be accumulated and amplified, which eventually affects the performance of the deep learning inference application. This suggests that we cannot evaluate the correctness of the operators in the deep learning inference application by just comparing the outputs of the operator from the deep learning inference application and the corresponding operator from the deep learning training application evaluation phase.
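The dependence on $h$ can be made explicit with a first-order expansion. This is only a sketch under an additional assumption that the text above does not make, namely that $h$ is differentiable at $f(x)$. The symbols $\delta = g(x) - f(x)$ and $J_h$ (the Jacobian of $h$) are introduced here for illustration.

\begin{aligned} \Delta y &= h(f(x)) - h(f(x) + \delta) \\ &\approx - J_h(f(x)) \, \delta \\ \end{aligned}

Under this assumption, the same $\delta$ can be harmless when $J_h(f(x))$ is small and harmful when it is large. For a chain of operators, the Jacobians multiply, which is one way to see how a small difference can be amplified along the way. When $h$ is not differentiable, as with an argmax or a threshold, even this estimate is unavailable.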

## Deep Learning Inference Application Bugs

Suppose the deep learning inference application $y = f(x)$ can be described using mathematical notations as follows.

\begin{aligned} x_1 &= x \\ x_2 &= f_1(x_1, \theta_1) \\ x_3 &= f_2(x_2, \theta_2) \\ &\vdots \\ x_n &= f_{n-1}(x_{n-1}, \theta_{n-1}) \\ x_{n+1} &= f_n(x_n, \theta_n) \\ y &= x_{n+1} \\ \end{aligned}

The corresponding deep learning training application evaluation phase $y^{\prime} = g(x^{\prime})$ can be described using mathematical notations as follows.

\begin{aligned} x_1^{\prime} &= x^{\prime} \\ x_2^{\prime} &= g_1(x_1^{\prime}, \theta_1^{\prime}) \\ x_3^{\prime} &= g_2(x_2^{\prime}, \theta_2^{\prime}) \\ &\vdots \\ x_m^{\prime} &= g_{m-1}(x_{m-1}^{\prime}, \theta_{m-1}^{\prime}) \\ x_{m+1}^{\prime} &= g_m(x_m^{\prime}, \theta_m^{\prime}) \\ y^{\prime} &= x_{m+1}^{\prime} \\ \end{aligned}

where $x^{\prime}$ is the input of the deep learning training application evaluation phase, $x_i^{\prime}$ is the input of the $i$-th operation in the deep learning training application evaluation phase, $x_{i+1}^{\prime}$ is the output of that operation, $y^{\prime}$ is the output of the deep learning training application evaluation phase, $g_i$ is the $i$-th operation in the deep learning training application evaluation phase, and $\theta_i^{\prime}$ is the parameters of the $i$-th operation in the deep learning training application evaluation phase.

Note that the numbers of operators in the deep learning inference application and the deep learning training application evaluation phase, $n$ and $m$ respectively, can be different. This is because the deep learning inference application can perform a lot of optimizations, such as operator fusion, which results in fewer operators in the deep learning inference application than in the deep learning training application evaluation phase. For example, the operation $f_{3}(x)$ from the deep learning inference application might correspond to the operation $g_{5}(g_{4}(g_{3}(x)))$ from the deep learning training application evaluation phase.
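As an illustration of operator fusion, the sketch below folds a per-channel scale into the weight of a linear operation, so that one fused operator plays the role of three separate ones. The operators `g3`, `g4`, `g5`, and `f3` and their random parameters are hypothetical. Real inference frameworks apply fusions of this kind in their own ways.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of three evaluation-phase operators g_3, g_4, g_5.
weight = rng.standard_normal((4, 8)).astype(np.float32)  # g_3: linear
scale = rng.standard_normal(4).astype(np.float32)        # g_4: per-channel scale
shift = rng.standard_normal(4).astype(np.float32)        # g_4: per-channel shift


def g3(x: np.ndarray) -> np.ndarray:
    return x @ weight.T


def g4(x: np.ndarray) -> np.ndarray:
    return x * scale + shift


def g5(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


# Fused inference operator f_3: the scale is folded into the weight offline.
fused_weight = weight * scale[:, None]


def f3(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ fused_weight.T + shift, 0.0)


x = rng.standard_normal((16, 8)).astype(np.float32)
reference = g5(g4(g3(x)))
fused = f3(x)

print("max abs difference:", float(np.max(np.abs(fused - reference))))
```

The two computations are mathematically identical, but the floating point operations are performed in a different order, so the results are only expected to agree up to rounding error. The intermediate outputs of `g3` and `g4` also no longer exist in the fused application, which limits what can be compared.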

Even when the same input $x = x^{\prime}$ is used, the outputs of the deep learning inference application $y$ and the deep learning training application evaluation phase $y^{\prime}$ can be different, i.e., $y \neq y^{\prime}$. Consequently, the performance of the deep learning inference application can be different from that of the deep learning training application evaluation phase. When the performance difference is too large to be acceptable, we need to debug the deep learning inference application to find out the root cause of the performance difference.

## The Incorrect Approach of Debugging Deep Learning Inference Applications

There is a very commonly used incorrect approach of debugging deep learning inference applications, which violates the first principles of evaluating deep learning inference applications we discussed in the previous section, even though sometimes it might just work. The incorrect approach is to compare the numerical difference between the intermediate and final outputs of the deep learning inference application and the deep learning training application evaluation phase, and then try to locate which operator or operators in the deep learning inference application are the major error source.

Usually, the developer has access to some of the intermediate and final outputs of the deep learning inference application and the deep learning training application evaluation phase, but not all of them.

For example, suppose the deep learning inference application has $n = 5$ operators, and the deep learning training application evaluation phase has $m = 8$ operators.

The deep learning inference application $y = f(x)$ can be described using mathematical notations as follows.

\begin{aligned} x_1 &= x \\ x_2 &= f_1(x_1, \theta_1) \\ x_3 &= f_2(x_2, \theta_2) \\ x_4 &= f_3(x_3, \theta_3) \\ x_5 &= f_4(x_4, \theta_4) \\ x_6 &= f_5(x_5, \theta_5) \\ y &= x_6 \\ \end{aligned}

The corresponding deep learning training application evaluation phase $y^{\prime} = g(x^{\prime})$ can be described using mathematical notations as follows.

\begin{aligned} x_1^{\prime} &= x^{\prime} \\ x_2^{\prime} &= g_1(x_1^{\prime}, \theta_1^{\prime}) \\ x_3^{\prime} &= g_2(x_2^{\prime}, \theta_2^{\prime}) \\ x_4^{\prime} &= g_3(x_3^{\prime}, \theta_3^{\prime}) \\ x_5^{\prime} &= g_4(x_4^{\prime}, \theta_4^{\prime}) \\ x_6^{\prime} &= g_5(x_5^{\prime}, \theta_5^{\prime}) \\ x_7^{\prime} &= g_6(x_6^{\prime}, \theta_6^{\prime}) \\ x_8^{\prime} &= g_7(x_7^{\prime}, \theta_7^{\prime}) \\ x_9^{\prime} &= g_8(x_8^{\prime}, \theta_8^{\prime}) \\ y^{\prime} &= x_9^{\prime} \\ \end{aligned}

The developer knows that the operator $f_1$ corresponds to the operators $g_1$ and $g_2$, the operator $f_2$ corresponds to the operators $g_3$ and $g_4$, the operator $f_3$ corresponds to the operator $g_5$, the operator $f_4$ corresponds to the operators $g_6$ and $g_7$, and the operator $f_5$ corresponds to the operator $g_8$. The developer has access to the outputs of the operators $f_1$, $f_2$, $f_3$, $f_4$, $f_5$ from the deep learning inference application, and the outputs of the operators $g_2$, $g_4$, $g_5$, $g_7$, $g_8$ from the deep learning training application evaluation phase, i.e., $x_2$, $x_3$, $x_4$, $x_5$, $x_6$ ($y$), $x_3^{\prime}$, $x_5^{\prime}$, $x_6^{\prime}$, $x_8^{\prime}$, $x_9^{\prime}$ ($y^{\prime}$). Of course, the developer also has access to the application inputs $x$ ($x_1$) and $x^{\prime}$ ($x_1^{\prime}$).

The deep learning inference application mathematical notations can be simplified as follows.

\begin{aligned} x_1 &= x \\ x_2 &= f_1(x_1, \theta_1) \\ x_3 &= f_2(x_2, \theta_2) \\ x_4 &= f_3(x_3, \theta_3) \\ x_5 &= f_4(x_4, \theta_4) \\ x_6 &= f_5(x_5, \theta_5) \\ y &= x_6 \\ \end{aligned}

The corresponding deep learning training application evaluation phase mathematical notations can be simplified as follows.

\begin{aligned} x_1^{\prime} &= x^{\prime} \\ x_3^{\prime} &= g_2(g_1(x_1^{\prime}, \theta_1^{\prime}), \theta_2^{\prime}) \\ x_5^{\prime} &= g_4(g_3(x_3^{\prime}, \theta_3^{\prime}), \theta_4^{\prime}) \\ x_6^{\prime} &= g_5(x_5^{\prime}, \theta_5^{\prime}) \\ x_8^{\prime} &= g_7(g_6(x_6^{\prime}, \theta_6^{\prime}), \theta_7^{\prime}) \\ x_9^{\prime} &= g_8(x_8^{\prime}, \theta_8^{\prime}) \\ y^{\prime} &= x_9^{\prime} \\ \end{aligned}

Ideally, the developer would like to see $x_1 = x_1^{\prime}$, $x_2 = x_3^{\prime}$, $x_3 = x_5^{\prime}$, $x_4 = x_6^{\prime}$, $x_5 = x_8^{\prime}$, and $x_6 = x_9^{\prime}$, which means the deep learning inference application and the deep learning training application evaluation phase are exactly the same. However, in practice, as we have mentioned in the previous sections, those values would not be the same. So the incorrect approach the developer usually takes is to compare the numerical difference between the intermediate and final outputs of the deep learning inference application and the deep learning training application evaluation phase, and attribute the problem to the operator or operators in the deep learning inference application that have the largest numerical difference.

In our case, given $x_1 = x_1^{\prime}$, the developer would compare the numerical difference between $x_2$ and $x_3^{\prime}$, $x_3$ and $x_5^{\prime}$, $x_4$ and $x_6^{\prime}$, $x_5$ and $x_8^{\prime}$, and $x_6$ and $x_9^{\prime}$. But the next problem is how to calculate the numerical difference. Is it the L1 norm or the L2 norm of the difference between the two outputs? Does the numerical difference have to be normalized? If so, how? Do the numerical difference metrics have to be the same for each of the output pairs? These are extremely difficult questions to answer. Even if we know the answers to them, what if all the numerical differences just look “small”? How do we know which operator or operators in the deep learning inference application are the major error source?
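The sketch below shows how the choice of difference metric alone can change the conclusion. The helper `difference_metrics` and the two synthetic output pairs are hypothetical and constructed only to make the point.

```python
import numpy as np


def difference_metrics(a: np.ndarray, b: np.ndarray) -> dict:
    """A few common, non-equivalent ways to summarize the difference a - b."""
    diff = (a - b).ravel()
    return {
        "l1": float(np.sum(np.abs(diff))),
        "l2": float(np.sqrt(np.sum(diff ** 2))),
        "max_abs": float(np.max(np.abs(diff))),
        "relative_l2": float(np.linalg.norm(diff) / np.linalg.norm(b.ravel())),
    }


# Pair A: 10,000 elements, each off by 0.001.
ref_a = np.ones(10_000)
out_a = ref_a + 1e-3

# Pair B: 10 elements, exactly one of them off by 1.0.
ref_b = np.ones(10)
out_b = ref_b.copy()
out_b[0] += 1.0

metrics_a = difference_metrics(out_a, ref_a)
metrics_b = difference_metrics(out_b, ref_b)
print(metrics_a)
print(metrics_b)
```

By the L1 norm, pair A looks roughly ten times worse than pair B. By the L2 norm, the maximum absolute difference, or the relative L2 norm, pair B looks worse. None of these numbers says which pair actually hurts the application performance.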

This approach would only work if the numerical error is significant enough and the developer has some domain knowledge about the operations. For example, suppose the L1 norm of the difference between $x_3$ and $x_5^{\prime}$ is $0.001$, $f_3$ and $g_5$ are both supposed to be a simple ReLU operation, and yet the L1 norm of the difference between $x_4$ and $x_6^{\prime}$ is $1000$. Because $x_4 = f_3(x_3, \theta_3)$ and $x_6^{\prime} = g_5(x_5^{\prime}, \theta_5^{\prime})$, the developer can be pretty sure that the operator $f_3$ is the major error source. However, if this is the case, this error should usually have been captured by the unit test of the operator $f_3$ in the deep learning inference application, and the developer should have fixed the bug before the deep learning inference application is deployed. In practice, the numerical error is usually not significant enough, and the developer usually does not have enough domain knowledge about the operations, so this approach does not always work and will confuse the developer.

## The Correct Approach of Debugging Deep Learning Inference Applications

The correct approach of debugging deep learning inference applications does not have to ask what the numerical difference between the deep learning inference application and the deep learning training application evaluation phase is and how to calculate the numerical difference. It also requires almost no domain knowledge about the operations from the developer.

The correct approach follows the first principles of evaluating deep learning inference applications we discussed in the previous section, which is to feed the intermediate outputs of the deep learning inference application to the deep learning training application evaluation phase and compare the performance using the evaluation metrics with the expected performance.

Following the example in the previous section, the developer will feed $x_4$ from the deep learning inference application into the deep learning training application evaluation phase in place of $x_6^{\prime}$, compute the application output, and use the evaluation metrics to compute the performance of the output. The performance difference between using $x_6^{\prime}$ and using $x_4$, $\Delta y$, can be described as follows.

\begin{aligned} \Delta y &= h(g_8(g_7(g_6(x_6^{\prime}, \theta_6^{\prime}), \theta_7^{\prime}), \theta_8^{\prime})) - h(g_8(g_7(g_6(x_4, \theta_6^{\prime}), \theta_7^{\prime}), \theta_8^{\prime})) \\ \end{aligned}

where $\Delta y$ is the performance difference of the application final outputs using $x_4$ and $x_6^{\prime}$ in the deep learning training application evaluation phase and the application evaluation metric $h$. If $\Delta y$ is large, the developer can be pretty sure that the major error source is from one or a few operators before the production of $x_4$ in the deep learning inference application. In our case, the culprit operators are among the operators $f_1$, $f_2$, and $f_3$. If $\Delta y$ is small, the developer can be pretty sure that the major error source is from one or a few operators after the production of $x_4$ in the deep learning inference application. In our case, the culprit operators are among the operators $f_4$ and $f_5$.

By repeating the above process, possibly via bisection (binary search), the developer can locate the problematic operator or operators accurately in the deep learning inference application.
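The procedure can be sketched in Python as follows. Everything in this block is hypothetical: the names `spliced_score`, `bisect_culprit`, and `alignment`, the toy operators, and the sign bug planted in $f_4$. The sketch also assumes that a higher metric value is better and that there is a single dominant error source, so that splicing at any point after it always degrades the metric.

```python
from typing import Callable, Optional, Sequence

import numpy as np

Op = Callable[[np.ndarray], np.ndarray]
Metric = Callable[[np.ndarray], float]


def spliced_score(k: int, x: np.ndarray, inference_ops: Sequence[Op],
                  evaluation_ops: Sequence[Op], alignment: Sequence[int],
                  metric: Metric) -> float:
    """Run the first k inference operators, finish with the evaluation phase."""
    tensor = x
    for op in inference_ops[:k]:
        tensor = op(tensor)
    # alignment[i] = number of evaluation operators covered by f_1 .. f_{i+1}.
    start = alignment[k - 1] if k > 0 else 0
    for op in evaluation_ops[start:]:
        tensor = op(tensor)
    return metric(tensor)


def bisect_culprit(x: np.ndarray, inference_ops: Sequence[Op],
                   evaluation_ops: Sequence[Op], alignment: Sequence[int],
                   metric: Metric, tolerance: float) -> Optional[int]:
    """Return the 1-based index of the first operator that degrades the metric."""
    def score(k: int) -> float:
        return spliced_score(k, x, inference_ops, evaluation_ops, alignment, metric)

    expected = score(0)  # pure evaluation phase, i.e. the expected performance
    n = len(inference_ops)
    if expected - score(n) <= tolerance:
        return None  # the full inference application is already acceptable
    good, bad = 0, n  # splicing after `good` operators is fine, after `bad` is not
    while bad - good > 1:
        mid = (good + bad) // 2
        if expected - score(mid) > tolerance:
            bad = mid
        else:
            good = mid
    return bad


# Toy evaluation phase with m = 8 operators.
evaluation_ops = [
    lambda t: t + 1.0, lambda t: t * 2.0,              # g_1, g_2
    lambda t: t - 0.5, lambda t: np.maximum(t, 0.0),   # g_3, g_4
    lambda t: t * 0.5,                                 # g_5
    lambda t: t + 0.25, np.tanh,                       # g_6, g_7
    lambda t: t * 3.0,                                 # g_8
]
# Toy inference application with n = 5 fused operators; f_4 has a sign bug.
inference_ops = [
    lambda t: (t + 1.0) * 2.0,             # f_1 = g_2(g_1(.))
    lambda t: np.maximum(t - 0.5, 0.0),    # f_2 = g_4(g_3(.))
    lambda t: t * 0.5,                     # f_3 = g_5(.)
    lambda t: np.tanh(-t + 0.25),          # f_4: should be tanh(t + 0.25)
    lambda t: t * 3.0,                     # f_5 = g_8(.)
]
alignment = [2, 4, 5, 7, 8]

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4))

# Labels are taken from the evaluation phase itself, so its accuracy is 1.0.
reference = x
for op in evaluation_ops:
    reference = op(reference)
labels = np.argmax(reference, axis=1)


def accuracy(outputs: np.ndarray) -> float:
    return float(np.mean(np.argmax(outputs, axis=1) == labels))


culprit = bisect_culprit(x, inference_ops, evaluation_ops, alignment,
                         accuracy, tolerance=0.05)
print("first problematic inference operator: f_%d" % culprit)
```

Note that no tensor-level difference is ever computed: every decision is made on the evaluation metric, and the number of evaluation runs grows only logarithmically with the number of inference operators. If several operators are faulty, the search still finds one of them, and the process can be repeated after that operator is fixed.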

## Conclusions

Numerical errors are evil. Implementing a working deep learning inference application requires a lot of sophisticated engineering and systematic testing. Unit testing sometimes will not help identify the problem at the system level. Having a software infrastructure that follows the first principles of evaluating deep learning inference applications and correctly implements the correct approach of debugging deep learning inference applications can significantly reduce the development and maintenance costs of deep learning inference applications.

Lei Mao

01-01-2024
