Environment:

- Framework: TensorFlow
- Framework version: 2.4.0
- Horovod version: 0.25.0
- MPI version: 4.0.0
- CUDA version: 11.0
- NCCL version: 2.8.3
- Python version: 3.6
- OS and version: Ubuntu 18.04
- GCC version: 7.5.0
Hi, I'm using Horovod and TensorFlow 2.4 to run parallel training on NVIDIA GeForce GTX 1080 Ti GPUs. I launch training with `NCCL_DEBUG=WARN mpirun -n 2 python3 train.py`, but it fails with the following error.
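For context, the Horovod documentation prescribes the following TensorFlow 2 initialization and per-process GPU pinning; a wrong mapping at this step is one possible source of an `invalid device ordinal` error like the one in the log below. This is only a sketch of the documented pattern (not my actual train.py), with the imports guarded so it degrades gracefully where Horovod is not installed:

```python
# Standard Horovod + TensorFlow 2 setup, per the Horovod docs.
# Imports are guarded so this sketch runs even where horovod or
# tensorflow is not installed.
try:
    import tensorflow as tf
    import horovod.tensorflow as hvd
except ImportError:
    status = "horovod/tensorflow not installed"
else:
    hvd.init()
    # Pin each process to exactly one GPU based on its local rank; with
    # mpirun -n 2 on one node, local ranks 0 and 1 map to GPUs 0 and 1.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
    status = f"rank {hvd.rank()} of {hvd.size()}"

print(status)
```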
Prefer tf.tensor_scatter_nd_update, which offers the same functionality with well-defined read-write semantics.
2023-07-26 17:30:24.238869: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-26 17:30:24.516433: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-26 17:30:24.728742: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-07-26 17:30:28.026432: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-07-26 17:30:28.277099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-07-26 17:30:28.470335: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
administrator:3274:3298 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
NCCL version 2.8.3+cuda10.0
administrator:3275:3301 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
administrator:3275:3301 [1] enqueue.cc:231 NCCL WARN Cuda failure 'invalid device function'
administrator:3274:3298 [0] enqueue.cc:231 NCCL WARN Cuda failure 'invalid device function'
administrator:3274:3298 [0] misc/argcheck.cc:14 NCCL WARN AllReduce : sendbuff is not a valid pointer
administrator:3274:3298 [0] init.cc:956 NCCL WARN Cuda failure 'invalid device ordinal'
administrator:3275:3301 [1] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..0 range)
[administrator:03275] *** Process received signal ***
[administrator:03275] Signal: Segmentation fault (11)
[administrator:03275] Signal code: Address not mapped (1)
[administrator:03275] Failing at address: 0x1
[administrator:03275] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fa2f9ef1980]
[administrator:03275] [ 1] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x14cf2c)[0x7fa293addf2c]
[administrator:03275] [ 2] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x50)[0x7fa293abbbb0]
[administrator:03275] [ 3] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x1e8)[0x7fa293abd4d8]
[administrator:03275] [ 4] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x71)[0x7fa293a78dc1]
[administrator:03275] [ 5] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0xf1)[0x7fa293a79471]
[administrator:03275] [ 6] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xb29f6)[0x7fa293a439f6]
[administrator:03275] [ 7] /home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x1832eff)[0x7fa2c7208eff]
[administrator:03275] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fa2f9ee66db]
[administrator:03275] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fa2f9c0f61f]
[administrator:03275]
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 1 with PID 0 on node administrator exited on signal 11 (Segmentation fault).
I installed nccl 2.8.3-1+cuda11.0; running `dpkg -l | grep nccl` shows the libnccl2 and libnccl-dev packages.
And I also tested the nccl with the following steps:
- git clone https://github.com/NVIDIA/nccl-tests.git
- cd nccl-tests
- make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
- ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 6
- mpirun -np 6 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
Everything runs fine.
I also followed the instructions in https://github.com//issues/1171 and confirmed the NCCL library contains sm_61 sections (the compute capability of the NVIDIA GeForce GTX 1080 Ti is 6.1).
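The sm_61 check above can be reproduced with `cuobjdump` from the CUDA toolkit. The library path below is only a guess for a typical Debian/Ubuntu libnccl2 package, and the script skips the check when the tool or file is missing:

```shell
#!/usr/bin/env bash
# List the GPU architectures compiled into libnccl; a 1080 Ti needs an
# sm_61 cubin (or compatible PTX). NCCL_LIB is an assumed path; adjust
# it to your system.
NCCL_LIB=/usr/lib/x86_64-linux-gnu/libnccl.so.2
echo "== architectures embedded in ${NCCL_LIB} =="
if command -v cuobjdump >/dev/null && [ -e "${NCCL_LIB}" ]; then
    cuobjdump "${NCCL_LIB}" | grep -E 'arch = sm_[0-9]+' | sort -u
else
    echo "cuobjdump or ${NCCL_LIB} not available on this machine"
fi
```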
I also reinstalled Horovod 0.25.0 with `HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda-11.0 HOROVOD_NCCL_HOME=/usr/local/cuda-11.0 pip install --no-cache-dir horovod`, but it still gives the same error.
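After a rebuild like this, it may be worth confirming that NCCL support was actually compiled in: `horovodrun --check-build` prints the frameworks and collective backends the installed Horovod was built with. A guarded sketch that degrades when `horovodrun` is absent:

```shell
#!/usr/bin/env bash
# Check which backends the installed Horovod was built with.
# `horovodrun --check-build` lists frameworks (TensorFlow, ...) and
# controllers/tensor operations (MPI, Gloo, NCCL); NCCL should show
# as enabled for GPU allreduce to use it.
if command -v horovodrun >/dev/null; then
    horovodrun --check-build
else
    echo "horovodrun not found on PATH"
fi
HVD_CHECK_DONE=1
```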
What's wrong? Can anyone help me? Many thanks. By the way, I installed nccl 2.8.3+cuda11.0, so why does NCCL_DEBUG report "NCCL version 2.8.3+cuda10.0"? I do have both CUDA 10.0 and CUDA 11.0 on my machine, but I never installed an NCCL build for CUDA 10.0 and cannot find NCCL under the CUDA 10.0 directory.
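One way to chase that 2.8.3+cuda10.0 banner is to list every libnccl copy on the machine and check which one the Horovod extension actually links against; a second NCCL build shipped inside the conda environment could shadow the system one. The extension path is taken from the stack trace above; the search paths are guesses for this setup:

```shell
#!/usr/bin/env bash
# Locate all libnccl copies and see which one Horovod's native extension
# resolves; a stale CUDA 10.0 build inside the conda env could explain
# the '2.8.3+cuda10.0' version banner.
echo "== libnccl copies =="
find /usr /usr/local /home/ASR/.conda -name 'libnccl.so*' 2>/dev/null || true

echo "== libnccl resolved by Horovod's extension =="
HVD_EXT=/home/ASR/.conda/envs/tensorflow2/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so
if [ -e "${HVD_EXT}" ]; then
    ldd "${HVD_EXT}" | grep nccl || echo "nccl statically linked or not listed"
else
    echo "${HVD_EXT} not present on this machine"
fi
```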