Error: tensorflow/contrib/nccl/kernels/nccl_manager.cc:273 check failed: result==ncclSuccess (2 vs 0)system error


I am trying to run distributed TensorFlow code using the MirroredStrategy option along with the TensorFlow Estimator API, and I get the error mentioned in the title. I am using tensorflow-gpu 1.9.0. I am following link for distributed TensorFlow training.

I also get the following warning along with the error: You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in GDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have display driver installed).


There is 1 solution below


Chances are you've got the stub libraries for compilation, and your LD_LIBRARY_PATH doesn't include the path for the runtime libraries.

Check your library path for "/usr/local/cuda/lib64/stubs" or something similar. If it exists, you just need to put the correct location before it in your library path.
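One quick way to do that check is to split the path into one entry per line and look for anything containing "stubs". This is only a sketch: list_stub_entries is a helper name I made up, and it assumes your stub directory has "stubs" somewhere in its name.

```shell
# Hypothetical helper: print the LD_LIBRARY_PATH entries that look like
# stub directories, prefixed with their position in the search order.
list_stub_entries() {
    # $1: colon-separated path list (defaults to the current LD_LIBRARY_PATH)
    printf '%s\n' "${1:-${LD_LIBRARY_PATH:-}}" | tr ':' '\n' | grep -n 'stubs'
}

# grep exits non-zero when nothing matches, so report that case explicitly.
list_stub_entries || echo "no stub directories on LD_LIBRARY_PATH"
```

The number in front of each match is its position in the list. Directories are searched in that order, so a stub entry only matters if it comes before the directory that holds the real driver library.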

Depending on the driver version you have installed, you may be able to find the libnvidia-ml.so file under "/usr/lib/nvidia-384", with 384 replaced by the number that matches your NVIDIA driver version.
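If you are not sure which directory that is, you can search for the file. A rough sketch follows. find_nvml is a name I made up, and the default search roots are just the /usr/lib and /usr/lib64 locations mentioned in the warning, so pass other directories if your driver lives elsewhere.

```shell
# Hypothetical helper: list every libnvidia-ml.so* file under the given
# directories (default: /usr/lib and /usr/lib64).
find_nvml() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/lib /usr/lib64
    fi
    # A missing or unreadable directory is not an error for this check.
    find "$@" -name 'libnvidia-ml.so*' 2>/dev/null || true
}

find_nvml
```

A result under a directory with "stubs" in its name is the build-time stub. The one you want is the copy installed by the driver.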

You could add a line to your .bashrc file that looks something like the following:

export LD_LIBRARY_PATH=/usr/lib/nvidia-(Your driver number here):$LD_LIBRARY_PATH
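A slightly more defensive version of that line, as a sketch: it only changes the path if the directory exists, and it avoids leaving a trailing colon when LD_LIBRARY_PATH was empty (the dynamic loader treats an empty entry as the current directory). prepend_lib_dir and driver_dir are names I made up, and nvidia-384 is just the example number from above.

```shell
# Hypothetical helper: put a directory at the front of a colon-separated list.
prepend_lib_dir() {
    # $1: directory to search first; $2: existing list (may be empty)
    printf '%s\n' "$1${2:+:$2}"
}

driver_dir=/usr/lib/nvidia-384   # assumed example; use your own driver directory

if [ -d "$driver_dir" ]; then
    LD_LIBRARY_PATH=$(prepend_lib_dir "$driver_dir" "${LD_LIBRARY_PATH:-}")
    export LD_LIBRARY_PATH
fi
```

Because the driver directory goes first, it is searched before any stub directory that is already on the path.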