For my thesis I'm trying to calculate the SLOM score (see https://link.springer.com/article/10.1007/s10115-005-0200-2). This score is purely spatial, and I am trying to calculate it for larger datasets (upwards of a year of data).
So far I have created a function that calculates the SLOM score for a single timestep and returns an xarray DataArray containing the SLOM values. I now want to apply it to every timestep in the dataset.
Currently I am doing this with groupby, following the split-apply-combine strategy (http://xarray.pydata.org/en/stable/groupby.html):
grouped_by_time = xrDS.groupby("time")
xrDS["SLOM"] = grouped_by_time.apply(slom_per_timeslice)
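For reference, the pattern above can be reproduced end to end on a toy dataset. The SLOM computation itself is replaced by a hypothetical placeholder, and the groupby is done on a single variable rather than the whole Dataset; note that newer xarray versions call the groupby method `.map` rather than `.apply`:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for slom_per_timeslice: any function that takes a
# (lat, lon) slice and returns a same-shaped DataArray fits this pattern.
def slom_per_timeslice(da):
    return da - da.mean()  # placeholder, NOT the real SLOM formula

# Toy dataset with 4 timesteps on a 3x3 grid
xrDS = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(4, 3, 3))},
    coords={"time": pd.date_range("2000-01-01", periods=4)},
)

grouped_by_time = xrDS["temp"].groupby("time")
# .map is the current name for .apply on groupby objects
xrDS["SLOM"] = grouped_by_time.map(slom_per_timeslice)
```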
To speed this up I am trying to use the dask functionality built into xarray by loading my data as dask arrays:
xrDS = xr.open_dataset(data_path+file_name, chunks={"lat":-1,"lon":-1,"time": "auto"})
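To check that the chunking came out as intended, the same layout can be mimicked on an in-memory dataset with `.chunk()`, which takes the same argument as `chunks=` in `open_dataset` (the variable name `temp` and the sizes are just examples):

```python
import numpy as np
import pandas as pd
import xarray as xr

# In-memory stand-in for the file opened above
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(10, 4, 4))},
    coords={"time": pd.date_range("2000-01-01", periods=10)},
)
# -1 means "one chunk spanning the whole dimension"
ds = ds.chunk({"lat": -1, "lon": -1, "time": 2})

# Each variable is now a lazy dask array; chunks are per-dimension tuples
# in (time, lat, lon) order.
print(ds["temp"].chunks)  # ((2, 2, 2, 2, 2), (4,), (4,))
```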
According to the first sentence of http://xarray.pydata.org/en/stable/dask.html#using-dask-with-xarray, the apply function should work with dask arrays.
Now my question is: how do I monitor the progress of the groupby-apply step? I tried the dask progress bar:
from dask.diagnostics import ProgressBar

with ProgressBar():
    xrDS["SLOM"] = grouped_by_time.apply(slom_per_timeslice)
This gives the following output:
[########################################] | 100% Completed | 0.1s
[########################################] | 100% Completed | 0.1s
[########################################] | 100% Completed | 0.1s
...and so on, hundreds of times.
So how do I properly see the progress of the overall calculation?
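One way to get a single overall bar, assuming the per-timestep function only needs the data within its own chunk, is to keep the whole computation lazy with `xr.map_blocks` and compute once; a sketch on a toy dataset with a placeholder function (not the real SLOM):

```python
import numpy as np
import pandas as pd
import xarray as xr
from dask.diagnostics import ProgressBar

# Placeholder for the real SLOM computation: subtracts each timestep's
# spatial mean, so it still operates per timestep within a block.
def slom_per_timeslice(da):
    return da - da.mean(["lat", "lon"])

ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.random.rand(10, 4, 4))},
    coords={"time": pd.date_range("2000-01-01", periods=10)},
).chunk({"lat": -1, "lon": -1, "time": 2})

# map_blocks builds one lazy graph over all chunks instead of computing
# each group eagerly, so a single compute() shows a single progress bar.
slom = xr.map_blocks(slom_per_timeslice, ds["temp"], template=ds["temp"])

with ProgressBar():
    result = slom.compute()
```

If the function must see exactly one timestep per call, chunk with `{"time": 1}` so each block holds a single timestep.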
As a note, I am running small tests in a Jupyter notebook on my personal desktop, and performing larger runs on my research group's machine, to which I only have SSH access. Both are single-machine setups, so I think dask's default multithreaded scheduler should suffice.
I have looked at the dask dashboard, but how would that work with SSH access only?
You can use

from dask.distributed import Client

client = Client()

to set up a local cluster that works with the dashboard. See https://distributed.dask.org/en/latest/quickstart.html#setup-dask-distributed-the-easy-way
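With only SSH access, the usual approach (an assumption about your setup, not specific to xarray) is to forward the dashboard port, e.g. `ssh -L 8787:localhost:8787 user@remote`, and then open http://localhost:8787/status in a local browser. A minimal sketch:

```python
from dask.distributed import Client

# Start a local cluster on the machine running the computation
# (by default one worker process per core). This also serves the
# dashboard, normally on port 8787.
client = Client()

# client.dashboard_link holds the dashboard URL,
# e.g. 'http://127.0.0.1:8787/status'
```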