NCCL WARN Cuda failure 'invalid device function' and 'invalid device ordinal'


Environment:

- Framework: TensorFlow
- Framework version: 2.4.0
- Horovod version: 0.25.0
- MPI version: 4.0.0
- CUDA version: 11.0
- NCCL version: 2.8.3
- Python version: 3.6
- OS and version: Ubuntu 18.04
- GCC version: 7.5.0

Hi, I'm using Horovod and TensorFlow 2.4 to run parallel training on NVIDIA GeForce GTX 1080 Ti GPUs. I launch with the command "NCCL_DEBUG=WARN mpirun -n 2 python3 train.py", but it gives the following error.
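For context, train.py initializes Horovod and pins one GPU per process in the usual way, i.e. it indexes the visible GPU list by local rank. A minimal sketch of that logic in plain Python (pick_visible_gpu is an illustrative helper, not a Horovod API; the real code uses hvd.local_rank() and tf.config.set_visible_devices):

```python
def pick_visible_gpu(gpus, local_rank):
    """Return the GPU a given local rank should use.

    Mirrors the standard Horovod pinning pattern:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
    A local_rank that is out of range for the visible GPU list is one
    common way to end up with NCCL's 'invalid device ordinal' error.
    """
    if local_rank >= len(gpus):
        raise ValueError(
            f"local rank {local_rank} has no GPU (only {len(gpus)} visible)")
    return gpus[local_rank]


# With mpirun -n 2 on a 2-GPU machine, ranks 0 and 1 map cleanly:
gpus = ["GPU:0", "GPU:1"]
print(pick_visible_gpu(gpus, 0))  # GPU:0
print(pick_visible_gpu(gpus, 1))  # GPU:1
```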

Prefer tf.tensor_scatter_nd_update, which offers the same functionality with well-defined read-write semantics.
2023-07-26 17:30:24.238869: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-26 17:30:24.516433: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-26 17:30:24.728742: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-07-26 17:30:28.026432: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-26 17:30:28.277099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-26 17:30:28.470335: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8

administrator:3274:3298 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]

NCCL version 2.8.3+cuda10.0

administrator:3275:3301 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]

administrator:3275:3301 [1] enqueue.cc:231 NCCL WARN Cuda failure 'invalid device function'

administrator:3274:3298 [0] enqueue.cc:231 NCCL WARN Cuda failure 'invalid device function'

administrator:3274:3298 [0] misc/argcheck.cc:14 NCCL WARN AllReduce : sendbuff is not a valid pointer

administrator:3274:3298 [0] init.cc:956 NCCL WARN Cuda failure 'invalid device ordinal'

administrator:3275:3301 [1] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..0 range)

[administrator:03275] *** Process received signal ***

[administrator:03275] Signal: Segmentation fault (11)

[administrator:03275] Signal code: Address not mapped (1)

[administrator:03275] Failing at address: 0x1

[administrator:03275] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fa2f9ef1980]

[administrator:03275] [ 1] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x14cf2c)[0x7fa293addf2c]

[administrator:03275] [ 2] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x50)[0x7fa293abbbb0]

[administrator:03275] [ 3] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x1e8)[0x7fa293abd4d8]

[administrator:03275] [ 4] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x71)[0x7fa293a78dc1]

[administrator:03275] [ 5] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0xf1)[0x7fa293a79471]

[administrator:03275] [ 6] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xb29f6)[0x7fa293a439f6]

[administrator:03275] [ 7] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x1832eff)[0x7fa2c7208eff]

[administrator:03275] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fa2f9ee66db]

[administrator:03275] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fa2f9c0f61f]

[administrator:03275]

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun noticed that process rank 1 with PID 0 on node administrator exited on signal 11 (Segmentation fault).

I installed nccl 2.8.3-1+cuda11.0; running "dpkg -l | grep nccl" shows the libnccl2 and libnccl-dev packages.
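Since two CUDA toolkits are installed, it is also worth checking which libnccl the dynamic linker actually resolves at run time, e.g. via "ldconfig -p | grep nccl". A small sketch that parses that output (the sample line below only illustrates the ldconfig format; the real paths on the machine may differ):

```python
def nccl_libs(ldconfig_output: str):
    """Extract resolved library paths for libnccl entries
    from the output of `ldconfig -p`."""
    paths = []
    for line in ldconfig_output.splitlines():
        if "libnccl" in line and "=>" in line:
            paths.append(line.split("=>")[-1].strip())
    return paths


sample = "\tlibnccl.so.2 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnccl.so.2"
print(nccl_libs(sample))  # ['/usr/lib/x86_64-linux-gnu/libnccl.so.2']

# On the real machine (uncomment to run):
# import subprocess
# out = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
# print(nccl_libs(out))
```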

I also tested NCCL with the following steps:

  1. git clone https://github.com/NVIDIA/nccl-tests.git
  2. cd nccl-tests
  3. make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
  4. ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 6
  5. mpirun -np 6 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

Everything is OK!

I also followed the instructions in https://github.com//issues/1171 and confirmed there are sm_61 sections in the NCCL library (the compute capability of the NVIDIA GeForce GTX 1080 Ti is 6.1).
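That sm_61 check can also be scripted. A sketch, assuming cuobjdump-style output where each embedded cubin is listed with an "arch = sm_XX" line (the sample text stands in for running cuobjdump on the actual libnccl.so.2 path):

```python
import re


def embedded_arches(cuobjdump_output: str):
    """Collect the SM architectures mentioned in cuobjdump output."""
    return sorted(set(re.findall(r"sm_(\d+)", cuobjdump_output)))


sample = """\
Fatbin elf code:
arch = sm_61
Fatbin elf code:
arch = sm_70
"""
arches = embedded_arches(sample)
print(arches)          # ['61', '70']
print("61" in arches)  # True -> the GTX 1080 Ti (compute 6.1) is covered
```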

I also reinstalled Horovod 0.25.0 with "HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda-11.0 HOROVOD_NCCL_HOME=/usr/local/cuda-11.0 pip install --no-cache-dir horovod", but it still gives the same error.

What's wrong? Can anyone give me some help? Many thanks. By the way, I installed nccl 2.8.3+cuda11.0, so why does NCCL_DEBUG report "NCCL version 2.8.3+cuda10.0"? Although I have both CUDA 10.0 and CUDA 11.0 on my machine, I didn't install that NCCL build and cannot find NCCL in the CUDA 10.0 directory.
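On the version question: NCCL prints the version of whichever libnccl the process actually loaded, so the "+cuda10.0" suffix suggests (though does not prove) that a 2.8.3 build compiled against CUDA 10.0 is being picked up at run time rather than the one installed for CUDA 11.0. One way to see what loads is to call ncclGetVersion through ctypes; NCCL encodes 2.8.3 as the integer 2803 (major*1000 + minor*100 + patch for releases before 2.9, major*10000 + minor*100 + patch from 2.9 on). A hedged sketch:

```python
import ctypes


def decode_nccl_version(code: int) -> str:
    """Decode the integer returned by ncclGetVersion(), e.g. 2803 -> '2.8.3'.

    Releases before 2.9 use major*1000 + minor*100 + patch;
    2.9 and later use major*10000 + minor*100 + patch.
    """
    if code >= 10000:
        major, rest = divmod(code, 10000)
    else:
        major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


print(decode_nccl_version(2803))   # 2.8.3
print(decode_nccl_version(21003))  # 2.10.3

# Probe whichever library the loader actually picks (needs libnccl installed):
try:
    nccl = ctypes.CDLL("libnccl.so.2")
    version = ctypes.c_int()
    nccl.ncclGetVersion(ctypes.byref(version))
    print("loaded NCCL:", decode_nccl_version(version.value))
except OSError:
    print("libnccl.so.2 not found on this machine")
```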
