Getting specific cell value from NetCDF file slow on first execution


I am accessing a netcdf file using the xarray python library. The specific file that I am using is publicly available.

So, the file has several variables, and for most of these variables the dimensions are: time: 4314, x: 700, y: 562. I am using the ET_500m variable, but the behaviour is similar for the other variables as well. The chunking is: 288, 36, 44.

I am retrieving a single cell and printing the value using the following code:

import xarray as xr
ds = xr.open_dataset('./dataset_greece.nc')
print(ds.ET_500m.values[0][0][0])

As I understand it, xarray should locate the chunk that contains the requested value on disk and read only that chunk. Since a chunk should be no bigger than a couple of MB, I would expect this to take a few seconds at most. Instead, it takes more than 2 minutes.

If, in the same script, I retrieve the value of another cell, even one located in a different chunk (e.g. print(ds.ET_500m.values[1000][500][500])), this second retrieval takes only a few milliseconds.

So my question is what exactly causes this overhead in the first retrieval?

EDIT: I just saw that in xarray open_dataset there is the optional parameter cache, which according to the manual:

If True, cache data loaded from the underlying datastore in memory as NumPy arrays when accessed to avoid reading from the underlying datastore multiple times. Defaults to True [...]

When I set this to False, subsequent fetches are as slow as the first one. But my question remains: why is this so slow when I am only accessing a single cell? I was expecting xarray to locate the chunk on disk directly and read only a couple of MB.

Best answer

Rather than selecting from the .values property, subset the array first:

print(ds.ET_500m[0, 0, 0].values)

The problem is that .values coerces the data to a numpy array, so you're loading all of the data and then subsetting the array. There's no way around this for xarray - numpy doesn't have any concept of lazy loading, so as soon as you call .values xarray has no option but to load (or compute) all of your data.
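To make the difference between the two orderings concrete, here is a toy model in plain Python. It is only a sketch and not xarray's actual implementation: the names FakeDisk, LazyVariable and elements_read are hypothetical, and a real netCDF read would fetch at least one whole chunk rather than a single element.

```python
# Toy model only: FakeDisk and LazyVariable are hypothetical illustration
# classes, not part of xarray or netCDF.

class FakeDisk:
    """Stands in for a file on disk and counts how many elements are read."""

    def __init__(self, shape):
        self.shape = shape
        self.elements_read = 0

    def read(self, t, x, y):
        self.elements_read += 1
        return t * 10000 + x * 100 + y  # deterministic dummy value


class LazyVariable:
    """Keeps only a reference to the disk until values are requested."""

    def __init__(self, disk, index=None):
        self.disk = disk
        self.index = index  # None means "the whole array"

    def __getitem__(self, index):
        # Indexing stays lazy: remember the selection, read nothing yet.
        return LazyVariable(self.disk, index)

    @property
    def values(self):
        if self.index is not None:
            # A selection was recorded, so only that element is read.
            return self.disk.read(*self.index)
        # No selection: every element has to be materialised in memory.
        nt, nx, ny = self.disk.shape
        return [[[self.disk.read(t, x, y) for y in range(ny)]
                 for x in range(nx)]
                for t in range(nt)]


disk = FakeDisk(shape=(20, 7, 5))
var = LazyVariable(disk)

# Pattern from the question: load everything, then index the result.
slow = var.values[0][0][0]
slow_reads = disk.elements_read

# Pattern from the answer: index first, then load.
disk.elements_read = 0
fast = var[0, 0, 0].values
fast_reads = disk.elements_read

print(slow == fast, slow_reads, fast_reads)
```

Both patterns return the same value, but the first touches every element of the array while the second touches only the one that was selected.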

If the data is a dask-backed array, you could use .data rather than .values to access the dask array and use positional indexing on it, e.g. ds.ET_500m.data[0, 0, 0]. But if the data is just a lazily loaded netCDF variable, .data will have the same load-everything pitfall described above.