Dask Gateway worker pods always get CrashLoopBackOff status

We have Dask Gateway 2023.9.0 installed on a Kubernetes cluster (EKS) with IPv6. When I tried to create a cluster, all worker pods went into CrashLoopBackOff status, and the logs showed the following:

/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py:266: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py:165: RuntimeWarning: Couldn't detect a suitable IP address for reaching 'dask-2884f65ecbc44103ac47e7c620232833.dask', defaulting to hostname: [Errno -5] No address associated with hostname
  warnings.warn(
2023-09-26 08:07:32,383 - distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.11/site-packages/toolz/functoolz.py", line 457, in memof
    return cache[k]
           ~~~~~^^^
KeyError: ('dask-2884f65ecbc44103ac47e7c620232833.dask', 80)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 161, in _get_ip
    sock.connect((host, port))
socket.gaierror: [Errno -5] No address associated with hostname

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dask/.local/bin/dask-worker", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 447, in main
    asyncio.run(run())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 397, in run
    nannies = [
              ^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/cli/dask_worker.py", line 398, in <listcomp>
    t(
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/nanny.py", line 281, in __init__
    host = get_ip(get_address_host(self.scheduler.address))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 185, in get_ip
    return _get_ip(host, port, family=socket.AF_INET)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/toolz/functoolz.py", line 461, in memof
    cache[k] = result = func(*args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dask/.local/lib/python3.11/site-packages/distributed/utils.py", line 170, in _get_ip
    addr_info = socket.getaddrinfo(
                ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known

The Dask scheduler logs:

2023-09-27 09:07:26,912 - distributed.scheduler - INFO - State start
2023-09-27 09:07:26,915 - distributed.scheduler - INFO - -----------------------------------------------
2023-09-27 09:07:26,917 - distributed.scheduler - INFO -   Scheduler at: tls://169.254.175.125:8786
2023-09-27 09:07:26,917 - distributed.scheduler - INFO -   dashboard at:  http://169.254.175.125:8787/status
2023-09-27 09:07:26,917 - distributed.preloading - INFO - Run preload setup: dask_gateway.scheduler_preload

I'm not sure, but it seems that the scheduler does not listen on IPv6, so the workers can't connect to it. If I'm right, how can I configure the Dask Gateway Helm chart to fix this?
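The traceback above shows `get_ip` calling `_get_ip` with `family=socket.AF_INET`, so one way to test this hypothesis might be to check which address families the scheduler hostname resolves to from inside a worker pod. The following is only a minimal diagnostic sketch using the standard library; `resolvable_families` is a hypothetical helper name, not part of Dask or Dask Gateway, and the hostname would need to be replaced with the scheduler service name from the logs.

```python
import socket
import sys


def resolvable_families(host, port=80):
    """Try to resolve `host` separately for IPv4 and IPv6.

    Returns a dict keyed by family name ('AF_INET', 'AF_INET6').
    Each value is a sorted list of addresses, or a string starting
    with 'unresolved:' if resolution failed for that family.
    """
    results = {}
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_DGRAM)
            # info[4] is the sockaddr tuple; its first element is the address.
            results[family.name] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            results[family.name] = f"unresolved: {exc}"
    return results


if __name__ == "__main__":
    # Hypothetical usage: pass the scheduler hostname as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    for name, result in resolvable_families(host).items():
        print(name, result)
```

If the hostname resolves only under `AF_INET6`, that would be consistent with the `AF_INET` lookup in the traceback failing with "Name or service not known".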
