What happened:
When I try to open 14,000 files in a list comprehension with xarray.open_rasterio, the loop never completes. The goal is to open all these GeoTIFF files, change the band dimension to a date dimension, stack by date, and save the result as a single .nc file. The list comprehension opens fewer than 400 files without any problem. This happens on two different Linux machines.
But when I create the list ahead of time and instead use a regular for loop to open each file and append the result to the list, it finishes on all 14,000 files in about a minute, as intended. Why is there a performance difference (or possible bug) when using a list comprehension? Is there something I'm missing about Python list comprehension performance in general?
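For comparison, the two loading patterns described above look roughly like this. It is only a sketch: the for-loop variant is not shown in this report, and `load_with_comprehension`, `load_with_loop`, and the `opener` argument are hypothetical names. In the real script `opener` would be `xr.open_rasterio`; the built-in `Path.read_bytes` is substituted here so the snippet runs without any raster data.

```python
import tempfile
from pathlib import Path


def load_with_comprehension(paths, opener):
    """Open every path inside a single list comprehension."""
    return [opener(path) for path in paths]


def load_with_loop(paths, opener):
    """Open every path in an explicit for loop, appending to a list created beforehand."""
    results = []
    for path in paths:
        results.append(opener(path))
    return results


if __name__ == "__main__":
    # Stand-in data: a few tiny temporary files instead of the CHIRPS GeoTIFFs.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            p = Path(tmp) / f"scene_{i}.bin"
            p.write_bytes(bytes([i]))
            paths.append(p)
        from_comprehension = load_with_comprehension(paths, Path.read_bytes)
        from_loop = load_with_loop(paths, Path.read_bytes)
        print(from_comprehension == from_loop)
```

Both forms build the same list in the same order, which is why the difference in behavior was unexpected.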
Also, I used the Linux command lsof (lsof -p lists the files opened by a given process) and found that, soon after the list comprehension started, about 446 files stayed open while the script never completed. The count fluctuated around 446 each time I re-ran lsof. So it seems that some files weren't being closed in a reasonable amount of time.
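The same kind of check can be made from inside the Python process. The following is a minimal sketch, assuming a Linux system where `/proc/self/fd` is available; `count_open_fds` is a hypothetical helper, not part of xarray or rasterio, and a temporary file stands in for a raster.

```python
import os
import tempfile


def count_open_fds(fd_dir="/proc/self/fd"):
    """Return how many file descriptors the current process holds (Linux only).

    The count may include the descriptor used to read fd_dir itself,
    so compare successive readings rather than trusting the absolute value.
    """
    return len(os.listdir(fd_dir))


if __name__ == "__main__":
    before = count_open_fds()
    handle = tempfile.TemporaryFile()  # stand-in for one opened raster file
    during = count_open_fds()
    handle.close()
    after = count_open_fds()
    # Report the change in descriptor count while the file was open and after closing it.
    print(during - before, after - before)
```

Calling `count_open_fds()` every few hundred iterations of the loading loop would show whether the number of open files keeps growing or levels off, as lsof suggested.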
What you expected to happen:
I thought it would take about a minute, since opening a single file takes about 41 milliseconds.
Minimal Complete Verifiable Example:
The data folder chirps-clipped can be downloaded here: https://ucsb.box.com/s/erqz20bgojhvpw2xpdbbcs17e131xxe4
import numpy as np
import xarray as xr
import rioxarray as rio
from pathlib import Path
from datetime import datetime
%matplotlib inline
all_scenes_f = Path('../rasters/chirps-clipped')
all_precip_paths = list(all_scenes_f.glob("*"))
# for some reason the fill value is not correct; this is the correct bad value to mask by
testf = all_precip_paths[0]
x = rio.open_rasterio(testf)
badvalue = np.unique(x.where(x != x._FillValue).sel(band=1))[0]
def chirps_path_date(path):
    # the file name is split on "." and fields 3-5 hold year, month, and day
    _, _, year, month, day, _ = path.name.split(".")
    day = day.split("-")[0]
    return datetime(int(year), int(month), int(day))
def open_chirps(path):
    data_array = rio.open_rasterio(path)  # passing chunks= would make this lazily executed
    data_array = data_array.sel(band=1).drop("band")  # gets rid of the old coordinate dimension since we need bands to have unique coord ids
    data_array["date"] = chirps_path_date(path)  # makes a new coordinate
    return data_array.expand_dims({"date": 1})  # makes this coordinate a dimension
### each data file is small and isn't tiled, so it is not a good idea to use chunking
# https://github.com/pydata/xarray/issues/2314
import rasterio
with rasterio.open(testf) as src:
    print(src.profile)
%timeit rio.open_rasterio(testf)
### This is where the file opening bug happens
daily_chirps_arrs = [xr.open_rasterio(path) for path in all_precip_paths]
Anything else we need to know?:
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 18:01:20)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.16.1
pandas: 1.1.4
numpy: 1.19.1
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.6
cfgrib: None
iris: None
bottleneck: None
dask: 2.27.0
distributed: 2.30.1
matplotlib: 3.3.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20200917
pip: 20.2.3
conda: None
pytest: None
IPython: 7.18.1
sphinx: None