NCCL WARN Cuda failure 'invalid device function' and 'invalid device ordinal'


Environment:

- Framework: TensorFlow
- Framework version: 2.4.0
- Horovod version: 0.25.0
- MPI version: 4.0.0
- CUDA version: 11.0
- NCCL version: 2.8.3
- Python version: 3.6
- OS and version: Ubuntu 18.04
- GCC version: 7.5.0

Hi, I'm using Horovod and TensorFlow 2.4 to run parallel training on NVIDIA GeForce GTX 1080 Ti GPUs. I launch with the command "NCCL_DEBUG=WARN mpirun -n 2 python3 train.py", but it gives the following error.
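For context, train.py initializes Horovod and pins one GPU per process in the usual way, i.e. it indexes the visible GPU list by local rank. A minimal sketch of that logic in plain Python (pick_visible_gpu is an illustrative helper, not a Horovod API; the real code uses hvd.local_rank() and tf.config.set_visible_devices):

```python
def pick_visible_gpu(gpus, local_rank):
    """Return the GPU a given local rank should use.

    Mirrors the standard Horovod pinning pattern:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
    A local_rank that is out of range for the visible GPU list is one
    common way to end up with NCCL's 'invalid device ordinal' error.
    """
    if local_rank >= len(gpus):
        raise ValueError(
            f"local rank {local_rank} has no GPU (only {len(gpus)} visible)")
    return gpus[local_rank]


# With mpirun -n 2 on a 2-GPU machine, ranks 0 and 1 map cleanly:
gpus = ["GPU:0", "GPU:1"]
print(pick_visible_gpu(gpus, 0))  # GPU:0
print(pick_visible_gpu(gpus, 1))  # GPU:1
```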

Prefer tf.tensor_scatter_nd_update, which offers the same functionality with well-defined read-write semantics.
2023-07-26 17:30:24.238869: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-26 17:30:24.516433: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-26 17:30:24.728742: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-07-26 17:30:28.026432: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-26 17:30:28.277099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-26 17:30:28.470335: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8

administrator:3274:3298 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]

NCCL version 2.8.3+cuda10.0

administrator:3275:3301 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]

administrator:3275:3301 [1] enqueue.cc:231 NCCL WARN Cuda failure 'invalid device function'

administrator:3274:3298 [0] enqueue.cc:231 NCCL WARN Cuda failure 'invalid device function'

administrator:3274:3298 [0] misc/argcheck.cc:14 NCCL WARN AllReduce : sendbuff is not a valid pointer

administrator:3274:3298 [0] init.cc:956 NCCL WARN Cuda failure 'invalid device ordinal'

administrator:3275:3301 [1] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..0 range)

[administrator:03275] *** Process received signal ***

[administrator:03275] Signal: Segmentation fault (11)

[administrator:03275] Signal code: Address not mapped (1)

[administrator:03275] Failing at address: 0x1

[administrator:03275] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fa2f9ef1980]

[administrator:03275] [ 1] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x14cf2c)[0x7fa293addf2c]

[administrator:03275] [ 2] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x50)[0x7fa293abbbb0]

[administrator:03275] [ 3] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x1e8)[0x7fa293abd4d8]

[administrator:03275] [ 4] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x71)[0x7fa293a78dc1]

[administrator:03275] [ 5] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0xf1)[0x7fa293a79471]

[administrator:03275] [ 6] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xb29f6)[0x7fa293a439f6]

[administrator:03275] [ 7] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x1832eff)[0x7fa2c7208eff]

[administrator:03275] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fa2f9ee66db]

[administrator:03275] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fa2f9c0f61f]

[administrator:03275]

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun noticed that process rank 1 with PID 0 on node administrator exited on signal 11 (Segmentation fault).

I installed nccl 2.8.3-1+cuda11.0; running "dpkg -l | grep nccl" shows the libnccl2 and libnccl-dev packages.
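Since two CUDA toolkits are installed, it is also worth checking which libnccl the dynamic linker actually resolves at run time, e.g. via "ldconfig -p | grep nccl". A small sketch that parses that output (the sample line below only illustrates the ldconfig format; the real paths on the machine may differ):

```python
def nccl_libs(ldconfig_output: str):
    """Extract resolved library paths for libnccl entries
    from the output of `ldconfig -p`."""
    paths = []
    for line in ldconfig_output.splitlines():
        if "libnccl" in line and "=>" in line:
            paths.append(line.split("=>")[-1].strip())
    return paths


sample = "\tlibnccl.so.2 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnccl.so.2"
print(nccl_libs(sample))  # ['/usr/lib/x86_64-linux-gnu/libnccl.so.2']

# On the real machine (uncomment to run):
# import subprocess
# out = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
# print(nccl_libs(out))
```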

I also tested NCCL with the following steps:

  1. git clone https://github.com/NVIDIA/nccl-tests.git
  2. cd nccl-tests
  3. make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
  4. ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 6
  5. mpirun -np 6 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

Everything is OK!

I also followed the instructions in https://github.com//issues/1171 and confirmed there are sm_61 sections in the NCCL library (the compute capability of the NVIDIA GeForce GTX 1080 Ti is 6.1).
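That sm_61 check can also be scripted. A sketch, assuming cuobjdump-style output where each embedded cubin is listed with an "arch = sm_XX" line (the sample text stands in for running cuobjdump on the actual libnccl.so.2 path):

```python
import re


def embedded_arches(cuobjdump_output: str):
    """Collect the SM architectures mentioned in cuobjdump output."""
    return sorted(set(re.findall(r"sm_(\d+)", cuobjdump_output)))


sample = """\
Fatbin elf code:
arch = sm_61
Fatbin elf code:
arch = sm_70
"""
arches = embedded_arches(sample)
print(arches)          # ['61', '70']
print("61" in arches)  # True -> the GTX 1080 Ti (compute 6.1) is covered
```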

I also reinstalled Horovod 0.25.0 with "HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda-11.0 HOROVOD_NCCL_HOME=/usr/local/cuda-11.0 pip install --no-cache-dir horovod", but it still gives the same error.

What's wrong? Can anyone give me some help? Many thanks. By the way, I installed nccl 2.8.3+cuda11.0, so why does NCCL_DEBUG report "NCCL version 2.8.3+cuda10.0"? Although I have both CUDA 10.0 and CUDA 11.0 on my machine, I didn't install that NCCL build and cannot find NCCL in the CUDA 10.0 directory.
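On the version question: NCCL prints the version of whichever libnccl the process actually loaded, so the "+cuda10.0" suffix suggests (though does not prove) that a 2.8.3 build compiled against CUDA 10.0 is being picked up at run time rather than the one installed for CUDA 11.0. One way to see what loads is to call ncclGetVersion through ctypes; NCCL encodes 2.8.3 as the integer 2803 (major*1000 + minor*100 + patch for releases before 2.9, major*10000 + minor*100 + patch from 2.9 on). A hedged sketch:

```python
import ctypes


def decode_nccl_version(code: int) -> str:
    """Decode the integer returned by ncclGetVersion(), e.g. 2803 -> '2.8.3'.

    Releases before 2.9 use major*1000 + minor*100 + patch;
    2.9 and later use major*10000 + minor*100 + patch.
    """
    if code >= 10000:
        major, rest = divmod(code, 10000)
    else:
        major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


print(decode_nccl_version(2803))   # 2.8.3
print(decode_nccl_version(21003))  # 2.10.3

# Probe whichever library the loader actually picks (needs libnccl installed):
try:
    nccl = ctypes.CDLL("libnccl.so.2")
    version = ctypes.c_int()
    nccl.ncclGetVersion(ctypes.byref(version))
    print("loaded NCCL:", decode_nccl_version(version.value))
except OSError:
    print("libnccl.so.2 not found on this machine")
```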
