I would like to use the InfiniBand network for communication between the Dask clients, the workers, and the scheduler. The client-to-worker path matters most to me, because I scatter some data directly to the workers (not necessarily with GPUs). I launch everything from the CLI: the scheduler on its own node, the workers on N nodes, and M clients. Adding --interface ib0 is apparently not enough, and adding --protocol ucx:// produces this error:
0: cpu-bind=MASK - node124, task 0 0 [21536]: mask 0xffffffffff set
0: distributed.scheduler - INFO - -----------------------------------------------
0: distributed.scheduler - INFO - -----------------------------------------------
0: distributed.scheduler - INFO - Clear task state
0: Traceback (most recent call last):
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 217, in main
0: loop.run_sync(run)
0: File ".conda/envs/Env/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
0: return future_cell[0].result()
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 213, in run
0: await scheduler
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/core.py", line 305, in _
0: await self.start()
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/scheduler.py", line 1477, in start
0: await self.listen(
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/core.py", line 430, in listen
0: listener = await listen(
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/comm/core.py", line 209, in _
0: await self.start()
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/comm/ucx.py", line 385, in start
0: init_once()
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/comm/ucx.py", line 56, in init_once
0: import ucp as _ucp
0: ModuleNotFoundError: No module named 'ucp'
0:
0: During handling of the above exception, another exception occurred:
0:
0: Traceback (most recent call last):
0: File ".conda/envs/Env/bin/dask-scheduler", line 11, in <module>
0: sys.exit(go())
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 226, in go
0: main()
0: File ".conda/envs/Env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
0: return self.main(*args, **kwargs)
0: File ".conda/envs/Env/lib/python3.8/site-packages/click/core.py", line 782, in main
0: rv = self.invoke(ctx)
0: File ".conda/envs/Env/python3.8/site-packages/click/core.py", line 1066, in invoke
0: return ctx.invoke(self.callback, **ctx.params)
0: File ".conda/envs/Env/python3.8/site-packages/click/core.py", line 610, in invoke
0: return callback(*args, **kwargs)
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 221, in main
0: logger.info("End scheduler at %r", scheduler.address)
0: File ".conda/envs/Env/lib/python3.8/site-packages/distributed/core.py", line 389, in address
0: raise ValueError("cannot get address of non-running Server")
0: ValueError: cannot get address of non-running Server
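The first traceback comes down to `ModuleNotFoundError: No module named 'ucp'`, so the second error looks like a consequence of the scheduler never starting. A minimal sketch for checking whether the module is visible in the same conda environment, using only the standard library (the helper name `has_module` is hypothetical, not part of Dask):

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # 'ucp' is the module that distributed/comm/ucx.py tries to import
    # in the traceback above; 'distributed' is checked for comparison.
    for mod in ("distributed", "ucp"):
        status = "found" if has_module(mod) else "missing"
        print(f"{mod}: {status}")
```

Running this on the scheduler node and on a worker node with the environment activated should show whether `ucp` is missing everywhere or only on some nodes.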
- Do I need to install UCX-Py on my system first, and is --protocol ucx:// then enough to take advantage of IB? If not, how can I enable IB, either directly or through UCX?
- Is there any other way to use InfiniBand for communication in Dask.distributed?
- If I use UCX over IB, do my clients also communicate over it, or only the workers and the scheduler?
Thank you :)