I was wondering whether there is a way to access and work with dimension labels when using the apply_ufunc
function of xarray and @guvectorize
of numba. From my understanding, having labeled dimensions is one of the advantages of xarray but I wasn't really able to use them within the functions.
In my case, I want to store valid values to fill NaN values along the dimension time
for a certain max. amount of days, so the most convenient way would be using the xarray function ffill for propagating the values forward. But I also want to use the timestamp labels of the dimension time
to keep track of the original date of the value that is being propagated. I was able doing that by iterating over each cell of my 3-d array (x,y,time), but performance-wise this approach is not working for my large dataset which is why I turned to apply_ufunc
.
Some dummy data:
values = np.random.rand(2,2,10)*10
values.ravel()[np.random.choice(values.size, 30, replace=False)] = np.nan
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
ds = xr.Dataset({"value_day": (["x", "y", "time"], values),},
coords={"lon": (["x", "y"], lon),"lat": (["x", "y"], lat), "time": pd.date_range("2018-01-01", periods=10),},)
<xarray.Dataset>
Dimensions: (x: 2, y: 2, time: 10)
Coordinates:
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-10
Dimensions without coordinates: x, y
Data variables:
value_day (x, y, time) float64 nan nan 7.261 nan ... nan nan 7.411 nan
I have a working function for propagating the values, but I don't know, if possible at all, how to access the time label and return it. I commented it where I would use the time label. It would also be fine having two functions (value propagating + time propagating) and use a wrapper afterwards.
@guvectorize(
"(float64[:], float64[:])",
"(n) -> (n)",
nopython=True,
)
def bookkeeping(ds_xy, out): #(ds_xy, ds_xy_time, out)
limit=5
#variables for first iteration
stored_value = ds_xy[0]
#stored_value_time = ds_xy_time[0]
ds_days_collect = []
#ds_days_time_collect = []
counter_stored_days = 0
#iterate over each date of timeseries
for value_current in ds_xy:
#for value_current,time_current in zip(ds_xy,ds_xy_time):
if np.isnan(value_current) == False:
#overwrite stored value and corresponding date if new value for pixel
stored_value = value_current
#stored_value_time = time_current
else:
#keep stored value for bookkeeping up to 3 days
if counter_stored_days > limit:
#use nan value of day
stored_value = value_current
#stored_value_time = time_current
counter_stored_days = 0
else:
counter_stored_days += 1
ds_days_collect.append(stored_value)
#ds_days_time_collect.append(stored_value_time)
out[:] = np.asarray(ds_days_collect) #,ds_days_time_collect
def bookkeeping_ufunc(ds):
return xr.apply_ufunc(
bookkeeping,
ds,
input_core_dims= [["time"]],
output_core_dims=[["time"]]
)
bookkeeping_ufunc(ds)