Downloading CMIP6 data via pyesgf - problem with lazy loading in xarray


I am quite new to xarray and would greatly appreciate your help. I am trying to download CMIP6 ocean salinity data over Greenland, but the script below fails when saving the data to a NetCDF file. As I understand it, because I pass a chunk size when opening the dataset, xarray automatically uses dask in the background and loads the data lazily.
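To check that understanding, this is how the lazy loading can be verified (a minimal sketch; the URL is a placeholder, not a real ESGF endpoint):

import xarray as xr

# placeholder OPeNDAP URL, for illustration only
url = 'https://esgf.example/thredds/dodsC/so_Omon_MODEL_ssp585.nc'

# with chunks=..., open_dataset reads only the metadata and wraps each
# variable in a dask array; nothing is downloaded at this point
ds = xr.open_dataset(url, chunks={'time': 120})

# a dask-backed variable reports its chunk layout instead of holding values
print(ds['so'].chunks)

# data is transferred only when a computation is triggered, e.g.
# ds['so'].isel(time=0).load()

Here is the full script: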

from pyesgf.search import SearchConnection
import xarray as xr
from pathlib import Path

path_out = Path('L:/test')

conn = SearchConnection('https://esgf-data.dkrz.de/esg-search', distrib=True)  # returns results from fewer data nodes
#conn = SearchConnection('https://esgf-node.llnl.gov/esg-search', distrib=True)  # returns results from more data nodes, and also finds the variant label r1i1p1f1, but searching is much slower

ctx = conn.new_context(project='CMIP6', variant_label='r1i1p1f2', frequency='mon',
                       experiment_id='ssp585', variable='so',
                       from_timestamp="2020-12-30T23:23:59Z",
                       to_timestamp="2100-01-01T00:00:00Z",
                       latest=True)

results = ctx.search(ignore_facet_check=True, batch_size=250)

print('Hits:', ctx.hit_count)
print('table_id:', ctx.facet_counts['table_id'])
print('variables:', ctx.facet_counts['variable'])
print('Realms:', ctx.facet_counts['realm'])
print('Ensembles:', ctx.facet_counts['variant_label'])
print('Models:', ctx.facet_counts['source_id'])
print('grid_label:', ctx.facet_counts['grid_label'])
print('experiment_id:', ctx.facet_counts['experiment_id'])


for result in results:

    files = result.file_context().search(ignore_facet_check=True)

    for f in files:

        filename = f.filename

        # chunks=... keeps the variables as dask arrays, so data is only
        # fetched from the OPeNDAP server when a computation needs it
        ds = xr.open_dataset(f.opendap_url, chunks={'time': 120, 'lat': 50, 'lon': 50})
        da = ds['so']
        # depth mean; assumes the vertical dimension is named 'lev'
        da = da.mean(dim='lev')
        max_lat, min_lat = 66, 58
        max_lon, min_lon = 320, 306
        da = da.sel(time=slice('2020-01-01', '2100-01-01'))
        if 'lat' in da.coords:
            # select western Greenland; assumes longitudes run from -180 to 180,
            # so 306-320°E becomes min_lon-360 to max_lon-360 (i.e. -54 to -40)
            da = da.where((da.lat > min_lat) & (da.lat < max_lat) &
                          (da.lon > (min_lon - 360)) & (da.lon < (max_lon - 360)), drop=True)
        elif 'latitude' in da.coords:
            da = da.where((da.latitude > min_lat) & (da.latitude < max_lat) &
                          (da.longitude > (min_lon - 360)) & (da.longitude < (max_lon - 360)), drop=True)
        else:
            print('check coords of this file: '+filename)

        file_name_out = path_out / filename
        da.to_netcdf(file_name_out)

I have played around with different chunk sizes etc., but saving the file still requires too much memory. A similar script to the one above (using xr.open_mfdataset) worked for CMIP6 temperature data, but it seems to fail here because the ocean files are much larger.
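One variant I am considering is to defer the write with compute=False, so that dask streams the result to disk chunk by chunk instead of building the whole array in memory first (a sketch, condensed to a single file; the URL and output name are placeholders):

import xarray as xr
from pathlib import Path

path_out = Path('L:/test')
url = 'https://esgf.example/thredds/dodsC/so_Omon_MODEL_ssp585.nc'  # placeholder

ds = xr.open_dataset(url, chunks={'time': 120, 'lat': 50, 'lon': 50})
da = ds['so'].mean(dim='lev')

# compute=False returns a dask.delayed object instead of writing immediately;
# calling .compute() then performs the write chunk by chunk
write_job = da.to_netcdf(path_out / 'so_subset.nc', compute=False)
write_job.compute()

Would this help, or is something else making the script run out of memory?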
