Dask + SLURM over FTP mount (CurlFtpFS)


So I have a working Dask/SLURM cluster of 4 Raspberry Pis with a common NFS share, on which I can run Python jobs successfully.

However, I want to add some more ARM devices to my cluster that do not support NFS mounts (the kernel module is missing), so I wish to move to FUSE-based FTP mounts with CurlFtpFS.

I have set up the mounts successfully with an anonymous username and no password, and the common FTP share can be seen by all the nodes (just as before, when it was an NFS share).

I can still run SLURM jobs (since they do not use the share), but when I try to run a Dask job the master node times out, complaining that no worker nodes could be started.

I am not sure what exactly the problem is, since the share is open to anyone for read/write access (e.g. for logs and Dask queue intermediate files).

Any ideas how I can troubleshoot this?


There is 1 best solution below


I don't believe anyone has a cluster like yours! At a guess, filesystem access via FUSE, FTP, and the Pi is much slower than the OS expects, and you are seeing the effects of low-level timeouts; i.e., from Dask's point of view, file reads appear to be failing. Dask needs access to storage for configuration and sometimes temporary files. You would want to make sure that these locations are on local storage or turned off. However, if this is happening during import of modules, which you have on the shared drive by design, there may be no fixing it (Python loads many small files during import). Why not use rsync to move the files to the nodes?
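
To check the guess about slow or failing file access, one option is to time small-file operations on the mount and compare them with local storage. The following is only a rough sketch using the standard library: the function name `probe_small_files` and the `SHARED_DIR` environment variable are made up for illustration, and the file count and payload size are arbitrary.

```python
import os
import tempfile
import time
import uuid


def probe_small_files(directory, count=20, payload=b"x" * 1024):
    """Write, read back, and delete `count` small files in `directory`.

    Returns the average seconds per file. An OSError from any step is
    allowed to propagate, since a failure is itself a useful signal.
    """
    names = [
        os.path.join(directory, f"probe-{uuid.uuid4().hex}.tmp")
        for _ in range(count)
    ]
    start = time.perf_counter()
    try:
        for name in names:
            with open(name, "wb") as fh:
                fh.write(payload)
            with open(name, "rb") as fh:
                if fh.read() != payload:
                    raise OSError(f"read-back mismatch for {name}")
    finally:
        # Clean up whatever was created, even if a step failed.
        for name in names:
            if os.path.exists(name):
                os.remove(name)
    return (time.perf_counter() - start) / count


if __name__ == "__main__":
    # Assumption: SHARED_DIR points at the CurlFtpFS mount point.
    # It falls back to the local temp directory if unset.
    shared_dir = os.environ.get("SHARED_DIR", tempfile.gettempdir())
    local_dir = tempfile.gettempdir()
    for label, path in (("local", local_dir), ("shared", shared_dir)):
        try:
            avg = probe_small_files(path)
            print(f"{label:>6} {path}: {avg * 1000:.1f} ms per file")
        except OSError as exc:
            print(f"{label:>6} {path}: FAILED ({exc})")
```

Run it on a worker node with `SHARED_DIR` set to the mount point. If the shared figure is far larger than the local one, or the probe fails outright, that supports moving Dask's scratch space off the share, for example by pointing Dask's `temporary-directory` configuration value at a local path.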