How can I calculate the percentile of every single value in an xarray dataset?


I have a dataset with one variable and the dimensions time, lat, and lon. The dataset looks like the following, and several grid cells contain NaN values:

 <xarray.Dataset>
Dimensions:    (time: 6300, latitude: 300, longitude: 360)
Coordinates:
  * latitude   (latitude) float64 49.62 49.88 50.12 50.38 ... 70.88 71.12 71.38
  * longitude  (longitude) float64 -9.875 -9.625 -9.375 ... 39.38 39.62 39.88
  * time       (time) datetime64[ns] 1950-06-01 1950-06-02 ... 2018-08-31
Data variables:
    precip (time, latitude, longitude) float32 dask.array<chunksize=(6300, 300, 360) 

I want to calculate the percentile of each value within the dataset. The desired xarray dataset would look like the following. Note that I want to calculate the percentile along the time axis: for each pixel, the percentile of every value should be computed from that pixel's own time series.

 <xarray.Dataset>
Dimensions:    (time: 6300, latitude: 300, longitude: 360)
Coordinates:
  * latitude   (latitude) float64 49.62 49.88 50.12 50.38 ... 70.88 71.12 71.38
  * longitude  (longitude) float64 -9.875 -9.625 -9.375 ... 39.38 39.62 39.88
  * time       (time) datetime64[ns] 1950-06-01 1950-06-02 ... 2018-08-31
Data variables:
    precip_percentile (time, latitude, longitude) float32 dask.array<chunksize=(6300, 300, 360)
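
To make the goal concrete, here is a minimal sketch of what "the percentile of each value" means for a single pixel's time series. The helper name `percentile_rank_1d` and the sample values are made up for illustration. It assumes NaNs should stay NaN and be left out of the ranking, and it uses `scipy.stats.rankdata` with average ranks for ties, which should give the same numbers as `stats.percentileofscore(..., kind='rank')` on the non-NaN values.

```python
import numpy as np
from scipy import stats


def percentile_rank_1d(series):
    """Percentile rank (0-100) of each value within its own series.

    Hypothetical helper: NaNs are excluded from the ranking and stay NaN.
    """
    series = np.asarray(series, dtype=float)
    out = np.full(series.shape, np.nan)
    valid = ~np.isnan(series)
    n_valid = valid.sum()
    if n_valid > 0:
        # method="average" gives tied values the mean of their ranks
        ranks = stats.rankdata(series[valid], method="average")
        out[valid] = ranks / n_valid * 100.0
    return out


# One pixel's (made-up) time series with a missing value and a tie
example = np.array([0.0, 2.5, np.nan, 1.0, 2.5])
ranks = percentile_rank_1d(example)
print(ranks)  # [ 25.   87.5   nan  50.   87.5]
```

The output has the same length as the input, which is the property I need along the time axis for every pixel.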

I did some exploring, and I am using the following code to calculate the percentile with xr.apply_ufunc:

import xarray as xr
from scipy import stats

def percentileofscore_weak(x):
    return stats.percentileofscore(x, x, kind='rank')

# Apply percentileofscore_weak along the time axis using apply_ufunc
percentiles = xr.apply_ufunc(
    percentileofscore_weak,
    mean_month,
    input_core_dims=[['time']],
    output_core_dims=[[]],
    dask='parallelized',  # Enable parallelization for large datasets
    dask_gufunc_kwargs={'allow_rechunk': False}
)

The above code produces a result like the following:

xarray.DataArray 'percentile' (lat: 300, lon: 360)

In other words, the time dimension has been dropped.

How can I calculate the percentile of each individual value from its grid cell's own time series, so that the result keeps the (time, latitude, longitude) shape of the desired dataset shown above?
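
One possible direction, offered as a sketch rather than a confirmed solution: `output_core_dims=[[]]` tells `xr.apply_ufunc` that the function returns a scalar per pixel, which would explain why time disappears. Declaring `time` as an output core dimension and calling a 1-D function once per pixel with `vectorize=True` should keep it. The helper `percentile_rank_1d` and the small synthetic array below are assumptions for illustration, not part of my real data.

```python
import numpy as np
import pandas as pd
import xarray as xr
from scipy import stats


def percentile_rank_1d(series):
    """Hypothetical helper: percentile rank (0-100) per value, NaNs stay NaN."""
    series = np.asarray(series, dtype=float)
    out = np.full(series.shape, np.nan)
    valid = ~np.isnan(series)
    n_valid = valid.sum()
    if n_valid > 0:
        ranks = stats.rankdata(series[valid], method="average")
        out[valid] = ranks / n_valid * 100.0
    return out


# Small synthetic stand-in for the real (6300, 300, 360) array
rng = np.random.default_rng(0)
data = rng.random((10, 3, 4)).astype("float32")
data[:, 0, 0] = np.nan  # one all-NaN pixel, like a masked grid cell
precip = xr.DataArray(
    data,
    dims=("time", "latitude", "longitude"),
    coords={"time": pd.date_range("1950-06-01", periods=10)},
    name="precip",
)

precip_percentile = xr.apply_ufunc(
    percentile_rank_1d,
    precip,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],  # keep time in the result
    vectorize=True,               # call the 1-D function once per pixel
    dask="parallelized",          # only takes effect for dask-backed input
    output_dtypes=[np.float64],
)
# apply_ufunc moves core dimensions to the end, so restore the original order
precip_percentile = precip_percentile.transpose(
    "time", "latitude", "longitude"
).rename("precip_percentile")

print(precip_percentile.dims)  # ('time', 'latitude', 'longitude')
```

With a dask-backed array, `time` has to sit in a single chunk for this to work, which matches the chunking shown above. Since `vectorize=True` loops over pixels in Python, rechunking along latitude and longitude (for example with `.chunk()`) may be needed before the work is actually spread across workers.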