MPI in Docker: “for UD mlx5 connect on mlx5_0 failed: No such device”


The MPI error is below:

[1689646357.071467] [05af046533e9:124545:0]       ib_device.c:1466 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::1270:fdff:fe44:5170 sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_0 failed: No such device
[1689646357.072612] [05af046533e9:124545:0]      ucp_worker.c:2657 UCX  WARN  worker 0x55741a624b40: 1 pending operations were not flushed
Abort(138006287) on node 0 (rank 0 in comm 0): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(60)......: MPI_Init_thread(argc=0x7ffeb4a07608, argv=0x7ffeb4a07610, required=1, provided=0x7ffeb4a0760c) failed
MPII_Init_thread(232).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(722).........: 
MPIR_Comm_commit_internal(510): 
MPID_Comm_commit_pre_hook(158): 
MPIDI_UCX_init_world(288).....: 
initial_address_exchange(145).:  ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Address not valid)

The error occurs when I use MPI inside a Docker container. I wrote a hello-world C++ file, compiled it, and ran mpirun -np 2 ./hello.

There is 1 answer below.


Perhaps you haven't installed the driver for the InfiniBand/RoCE network interface card (NIC), or it isn't available inside the container.