Lei Mao

Introduction

I have run into the following error several times when using TensorFlow in a Docker container.

Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

The message suggests that cuDNN failed to initialize. However, the underlying cause is unknown.

Restarting the computer usually solves the problem. However, if we don't want to spend time restarting, there are some quick solutions.

Solutions

Add the following code after the import statements in your TensorFlow Python script. cuDNN should then initialize without problems.

TensorFlow 1.x

# Allocate GPU memory on demand instead of reserving it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

TensorFlow-Keras 1.x

# Same option as above, but the session is registered with the Keras backend.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.keras.backend.set_session(tf.Session(config=config))

TensorFlow 2.x

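# Enable memory growth on every visible GPU.
# This must run before the GPUs are initialized, so keep it near the top of the script.
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)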

TensorFlow Allow Growth

By default, TensorFlow reserves nearly all of the memory on every GPU visible to the process, regardless of the size of the model you are running. That is also why we need to specify the visible GPU devices when running a model on a multi-GPU server, to avoid collisions with other users. To use only the memory the model requires, we allow the GPU memory allocation to grow on demand. However, it remains a mystery to me why cuDNN sometimes fails to initialize when memory growth is not enabled.
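
Both settings discussed above can also be controlled through environment variables, which is convenient when we do not want to touch the model code. The following is a minimal sketch, not a definitive recipe: the helper name configure_gpu_environment and the device index "0" are assumptions made for illustration, while CUDA_VISIBLE_DEVICES and TF_FORCE_GPU_ALLOW_GROWTH are the environment variables read by CUDA and TensorFlow respectively. The variables have to be set before TensorFlow is imported.

```python
import os


def configure_gpu_environment(visible_devices="0", allow_growth=True):
    """Set GPU-related environment variables.

    Hypothetical helper for illustration. Call it before importing
    TensorFlow, otherwise the settings may have no effect.
    """
    # Restrict the process to the listed GPU indices (comma-separated string).
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices
    # Ask TensorFlow to grow GPU memory on demand instead of reserving it all.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true" if allow_growth else "false"
    return {
        name: os.environ[name]
        for name in ("CUDA_VISIBLE_DEVICES", "TF_FORCE_GPU_ALLOW_GROWTH")
    }


# Assumed example: use only the first GPU and enable memory growth.
settings = configure_gpu_environment(visible_devices="0")

# Import TensorFlow only after the variables are set, for example:
# import tensorflow as tf
```

The same effect can be achieved without any Python code by exporting the two variables in the shell or passing them to the Docker container when it is started.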

References