Monitoring the progress of xarray's split-apply-combine using dask

For my thesis I'm trying to calculate the SLOM score (see https://link.springer.com/article/10.1007/s10115-005-0200-2). This score is purely spatial, and I am trying to calculate it for larger datasets (upwards of a year of data).

So far I have created a function that calculates SLOM scores per timestep and returns an xarray DataArray containing the SLOM values.
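
For context, here is a minimal sketch of the function's interface, assuming each group arrives as a Dataset for a single timestep; the body below is a placeholder, not my actual implementation:

import numpy as np
import xarray as xr

def slom_per_timeslice(ds):
    # Placeholder body: the real function computes the SLOM score for
    # every (lat, lon) cell of this single time slice.
    scores = np.zeros((ds.sizes["lat"], ds.sizes["lon"]))
    return xr.DataArray(
        scores,
        dims=("lat", "lon"),
        coords={"lat": ds["lat"], "lon": ds["lon"]},
    )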

My goal is to calculate the SLOM values for every timestep in the dataset.

Currently I am doing this using the groupby split-apply-combine strategy (http://xarray.pydata.org/en/stable/groupby.html):

grouped_by_time = xrDS.groupby("time")
xrDS["SLOM"] = grouped_by_time.apply(slom_per_timeslice)

In order to speed up the process I am trying to use the dask functionality built into xarray by loading my data as dask arrays:

xrDS = xr.open_dataset(data_path + file_name,
                       chunks={"lat": -1, "lon": -1, "time": "auto"})
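
To confirm the data is actually loaded lazily, the resulting chunk layout can be inspected; this check is my own addition, but Dataset.chunks is a standard xarray attribute:

print(xrDS.chunks)  # only the "time" dimension should be split into multiple chunks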

I think the apply function should work with dask arrays, according to the first sentence of http://xarray.pydata.org/en/stable/dask.html#using-dask-with-xarray.

Now my question is: how do I monitor the progress of the grouped_by_time.apply call? I tried using the dask ProgressBar:

from dask.diagnostics import ProgressBar

with ProgressBar():
    xrDS["SLOM"] = grouped_by_time.apply(slom_per_timeslice)

This gives the following output:

[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s

repeated hundreds of times.

So how do I properly see the progress of the overall calculation?

As a note, I am running small tests in a Jupyter notebook on my personal desktop and performing larger runs on my research group's machine, to which I only have SSH access. Both cases are single-machine, so I think dask's default multithreaded scheduler should suffice.

I have looked at the dask dashboard, but how would that work with only SSH access?

1 Answer
You can use client = distributed.Client() to set up a cluster that will work with the dashboard. See https://distributed.dask.org/en/latest/quickstart.html#setup-dask-distributed-the-easy-way