Numerical Errors In HPC and Deep Learning
Introduction
“Numerical errors are evil” is something I often say when working on the mathematical components of applications. In most computer programs, numbers are represented using a finite number of bits, so many values, especially floating point values, cannot be represented exactly. These numerical errors can accumulate and cause a program to produce unexpected results. What’s worse, we often cannot know whether certain numerical errors are acceptable for an application until we run the end-to-end application for verification.
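For a concrete illustration, here is a tiny Python snippet showing both the representation error and its accumulation:

```python
# Floating-point values cannot always be represented exactly, and the
# representation errors accumulate across operations.
print(0.1 + 0.2 == 0.3)     # False: 0.1 and 0.2 have no exact binary representation
print(f"{0.1 + 0.2:.17f}")  # 0.30000000000000004

# Accumulation: adding 0.1 ten thousand times does not give exactly 1000.
total = 0.0
for _ in range(10000):
    total += 0.1
print(total)  # close to, but not exactly, 1000.0
```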
High performance computing (HPC) applications and deep learning applications both perform enormous numbers of mathematical operations, yet they have very different sensitivities to numerical errors. In this blog post, I would like to discuss the numerical errors in HPC and deep learning applications.
High Performance Computing Application Numerical Errors
HPC applications are usually very sensitive to numerical errors. Most HPC applications are iterative algorithms, and numerical errors can accumulate across the iterations, causing the algorithm to diverge or to converge to a wrong solution. That is why HPC applications usually use double precision floating point numbers to reduce the numerical errors.
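The following sketch, using a made-up accumulation rather than a real HPC kernel, shows how single precision drifts where double precision stays close:

```python
import numpy as np

# Repeatedly adding a small increment, as an iterative algorithm might:
# single precision accumulates visible rounding error, double precision
# stays much closer to the exact answer.
n = 1_000_000
increment = 0.001  # the exact sum would be 1000.0

total32 = np.float32(0.0)
total64 = np.float64(0.0)
for _ in range(n):
    total32 += np.float32(increment)
    total64 += np.float64(increment)

print(total32)  # drifts visibly away from 1000.0
print(total64)  # much closer to 1000.0
```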
HPC applications usually use MPI to distribute the workload across multiple nodes. Determinism is sacrificed for parallelism: to achieve the best performance, the order of operations becomes non-deterministic, and because floating point addition is not associative, the order of operations changes the result. Using double precision floating point numbers can reduce the numerical errors introduced by such instability.
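As an illustration of the underlying cause (not an actual MPI program), summing the same values in two different orders in plain Python already produces different results, which is what happens when a parallel reduction combines partial sums in a different order on each run:

```python
import random

# Floating-point addition is not associative, so the summation order
# matters. A non-deterministic parallel reduction order therefore gives
# non-deterministic results.
random.seed(42)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

sum_forward = sum(values)
sum_backward = sum(reversed(values))

print(sum_forward == sum_backward)      # usually False
print(abs(sum_forward - sum_backward))  # a tiny but nonzero difference
```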
Neural Network Application Numerical Errors
Unlike HPC applications, neural network applications usually involve two pieces of software. One is used for neural network training to obtain the model parameters, such as PyTorch, and the other is used for neural network inference to serve the model, such as TensorRT. This discrepancy introduces numerical errors. Even if both pieces of software use the same IEEE-standard data type for storing the model parameters and the intermediate tensors, numerical errors still exist, because the two pieces of software can use different implementations of the same mathematical operations.
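The sketch below imitates this with two implementations of the same dot product; the actual kernels in PyTorch and TensorRT are of course far more sophisticated, but the effect is the same in kind:

```python
import numpy as np

# Two mathematically equivalent implementations of the same dot product,
# accumulating in different orders, can disagree in the low bits, just as
# a training framework and an inference framework can implement the same
# layer differently.
rng = np.random.default_rng(0)
a = rng.standard_normal(100_000).astype(np.float32)
b = rng.standard_normal(100_000).astype(np.float32)

dot_library = np.dot(a, b)  # vectorized library implementation

dot_naive = np.float32(0.0)  # naive sequential accumulation in float32
for x, y in zip(a, b):
    dot_naive += x * y

print(dot_library, dot_naive)  # typically differ slightly
```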
Even though neural network applications have one more key error source than HPC applications, namely the use of two pieces of software, numerical errors are usually not a problem for them. It is very common to see a neural network trained using single precision floating point numbers in the training software and then deployed using half precision floating point numbers in the inference software. The neural network is robust to numerical errors because it is trained with a lot of data, and it can learn to be even more robust to numerical errors with some additional tricks if necessary.
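Here is a minimal sketch of this workflow, with made-up weights standing in for a trained model:

```python
import numpy as np

# Pretend the float32 weights came from training, cast them to float16
# for deployment, and compare the outputs of one linear + ReLU layer.
rng = np.random.default_rng(0)
w = (rng.standard_normal((256, 128)) * 0.1).astype(np.float32)  # "trained" weights
x = rng.standard_normal(128).astype(np.float32)

y_fp32 = np.maximum(w @ x, 0.0)  # float32 linear layer + ReLU
y_fp16 = np.maximum(w.astype(np.float16) @ x.astype(np.float16), np.float16(0.0))

# The discrepancy is small but nonzero; a well-trained network usually
# tolerates it at the task level.
print(np.abs(y_fp32 - y_fp16.astype(np.float32)).max())
```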
Quiz
If a neural network is trained using very little data, say in an over-fitting experiment, without any additional tricks to make it more robust to numerical errors, how would the neural network perform on the same training data if it is deployed using an inference software stack that is different from the training software?
Although it really depends on the neural network architecture, and some neural network layers are more robust to noise than others, the neural network is usually not robust to numerical errors in this case. You can imagine that the errors propagate through the neural network layers and cause the neural network to produce unexpected results.
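The toy experiment below, using random weights rather than an actually over-fitted network, hints at how the discrepancy can compound with depth:

```python
import numpy as np

# The same input through a stack of random tanh layers, in float64
# (reference) versus float16 (deployment-like). The per-layer rounding
# errors often compound as the signal propagates deeper.
rng = np.random.default_rng(0)
x64 = rng.standard_normal(256)
x16 = x64.astype(np.float16)

for depth in range(1, 9):
    w = rng.standard_normal((256, 256)) / np.sqrt(256.0)
    x64 = np.tanh(w @ x64)
    x16 = np.tanh(w.astype(np.float16) @ x16)
    err = np.abs(x64 - x16.astype(np.float64)).max()
    print(f"layer {depth}: max abs discrepancy = {err:.3e}")
```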