Solving TensorFlow cuDNN Initialization Failure Problem
Introduction
I have run into the following error several times when using TensorFlow in a Docker container.
Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
The message suggests that cuDNN failed to initialize. However, the underlying cause is unknown to me.
Restarting the computer usually solves the problem. If we don't want to spend time restarting, there are some quicker solutions.
Solutions
Add the following code after the import statements in your TensorFlow Python script. cuDNN should then initialize without problems.
TensorFlow 1.x
config = tf.ConfigProto()
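The snippet above shows only the creation of the config object. Here is a minimal sketch of the remaining steps, assuming the intent is the allow-growth setting discussed in the "TensorFlow Allow Growth" section: enable `gpu_options.allow_growth` and pass the config to the session. The helper name `create_growth_session` is hypothetical, and the sketch assumes a TensorFlow 1.x installation.

```python
def create_growth_session():
    """Return a TF 1.x session whose GPU memory allocation grows on demand.

    Assumption: TensorFlow 1.x is installed. tf.ConfigProto and tf.Session
    are top-level names only in the 1.x API.
    """
    import tensorflow as tf

    config = tf.ConfigProto()
    # Start with a small allocation and extend it as the model needs more.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)
```

In a script, `sess = create_growth_session()` would go right after the imports, before any graph is run.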
TensorFlow-Keras 1.x
config = tf.ConfigProto()
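For Keras on TensorFlow 1.x, the session also has to be handed to the Keras backend. The following is a sketch under the same assumptions (allow growth is the intended setting, TensorFlow 1.x is installed); the helper name `create_keras_growth_session` is hypothetical.

```python
def create_keras_growth_session():
    """Create an allow-growth TF 1.x session and register it with Keras.

    Assumption: TensorFlow 1.x with the bundled tf.keras is installed.
    """
    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
    # Make Keras use this session instead of creating its own default one.
    tf.keras.backend.set_session(sess)
    return sess
```

The call should happen before any Keras model or layer is built, so that Keras does not create a default session first.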
TensorFlow 2.x
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
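The line above only lists the GPUs. Here is a minimal sketch of the step that presumably follows, assuming the goal is again to allow memory growth: call `tf.config.experimental.set_memory_growth` for each listed device. The helper name `enable_gpu_memory_growth` is hypothetical.

```python
def enable_gpu_memory_growth():
    """Enable on-demand GPU memory allocation for every visible GPU (TF 2.x).

    Assumption: TensorFlow 2.x is installed.
    """
    import tensorflow as tf

    gpu_devices = tf.config.experimental.list_physical_devices('GPU')
    for device in gpu_devices:
        # Must run before the GPU is initialized; TensorFlow raises a
        # RuntimeError if the device is already in use.
        tf.config.experimental.set_memory_growth(device, True)
    return gpu_devices
```

Call `enable_gpu_memory_growth()` once, right after the imports and before building any model or tensor.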
TensorFlow Allow Growth
By default, TensorFlow maps nearly all of the memory on every GPU visible to the process, regardless of the size of the model being run. That is also why, on a shared multi-GPU server, we need to specify the visible GPU devices so that our job does not collide with other users' jobs. With allow growth enabled, TensorFlow instead starts with a small allocation and extends it only as the model requires. It remains a mystery to me, however, why cuDNN sometimes fails to initialize when allow growth is not enabled.
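Restricting the visible GPU devices, as mentioned above, is commonly done through the `CUDA_VISIBLE_DEVICES` environment variable. Here is a small sketch, assuming it runs before TensorFlow initializes the GPU (setting it before the import is the safest choice); the helper name `restrict_visible_gpus` and the GPU index are illustrative only.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA libraries loaded afterwards."""
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Illustrative choice: use only the first GPU on the machine.
restrict_visible_gpus([0])
```

Inside the process, the selected devices are renumbered starting from 0, so a job restricted to physical GPU 2 still sees it as GPU 0.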