Using the UCX protocol with Dask Distributed

I would like to use the InfiniBand network to connect the Dask clients, workers, and scheduler (especially the clients and workers, not necessarily with GPUs, since I scatter some data directly to the workers). I launch the scheduler from the CLI on one separate node and the workers on N nodes, and I have M clients. Apparently adding --interface ib0 is not enough, and adding --protocol ucx:// generates this error:

0: cpu-bind=MASK - node124, task  0  0 [21536]: mask 0xffffffffff set
0: distributed.scheduler - INFO - -----------------------------------------------
0: distributed.scheduler - INFO - -----------------------------------------------
0: distributed.scheduler - INFO - Clear task state
0: Traceback (most recent call last):
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 217, in main
0:     loop.run_sync(run)
0:   File ".conda/envs/Env/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
0:     return future_cell[0].result()
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 213, in run
0:     await scheduler
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/core.py", line 305, in _
0:     await self.start()
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/scheduler.py", line 1477, in start
0:     await self.listen(
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/core.py", line 430, in listen
0:     listener = await listen(
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/comm/core.py", line 209, in _
0:     await self.start()
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/comm/ucx.py", line 385, in start
0:     init_once()
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/comm/ucx.py", line 56, in init_once
0:     import ucp as _ucp
0: ModuleNotFoundError: No module named 'ucp'
0: 
0: During handling of the above exception, another exception occurred:
0: 
0: Traceback (most recent call last):
0:   File ".conda/envs/Env/bin/dask-scheduler", line 11, in <module>
0:     sys.exit(go())
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 226, in go
0:     main()
0:   File ".conda/envs/Env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
0:     return self.main(*args, **kwargs)
0:   File ".conda/envs/Env/lib/python3.8/site-packages/click/core.py", line 782, in main
0:     rv = self.invoke(ctx)
0:   File ".conda/envs/Env/python3.8/site-packages/click/core.py", line 1066, in invoke
0:     return ctx.invoke(self.callback, **ctx.params)
0:   File ".conda/envs/Env/python3.8/site-packages/click/core.py", line 610, in invoke
0:     return callback(*args, **kwargs)
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 221, in main
0:     logger.info("End scheduler at %r", scheduler.address)
0:   File ".conda/envs/Env/lib/python3.8/site-packages/distributed/core.py", line 389, in address
0:     raise ValueError("cannot get address of non-running Server")
0: ValueError: cannot get address of non-running Server
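
As far as I can tell, the first exception in the traceback is the real failure: distributed/comm/ucx.py tries to `import ucp` and the module is not in my environment. The second exception looks like a follow-on from the scheduler never having started. A minimal sketch to check this from the same conda environment, using only the standard library (the helper name `module_available` is my own, not part of Dask or UCX-Py):

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be found.

    Hypothetical helper for illustration; it only looks the module up
    and does not import or initialise it.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "ucp" is the module name that the traceback reports as missing.
    if module_available("ucp"):
        print("ucp found in this environment")
    else:
        print("ucp not found in this environment")
```

This only confirms whether the module is visible to Python on that node. It says nothing about whether UCX itself would then pick the InfiniBand device.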

  • Do I need to install UCX-Py on my system first and then use --protocol ucx://, and is that enough to take advantage of IB? If not, how can I enable IB, or IB through UCX?
  • Is there any other way to use InfiniBand for communication in Dask.distributed?
  • If I use UCX and IB, does this imply that my clients also communicate over it, or only the workers and the scheduler?

Thank you :)
