Why is accessing values in xarray so slow?


I need to access a large amount of historical weather data from the ERA5 dataset (more than 1 million specific location points at specific timestamps).

I access it through a downloaded GRIB file and the xarray package. Unfortunately, the script runs quite slowly: every single point needs about 250 to 300 ms of processing. This is largely due to accessing and computing the daily values (130 to 200 ms).

Do you know why it takes so long? Can I optimize it somehow (besides parallelization)?

I already tried creating a deep copy of weatherdata_oneday, but this didn't change anything (it was neither faster nor slower).

My code is:

import xarray as xr
import pandas as pd
import zoneinfo
import datetime

KELVIN_TO_DEGREE_CELSIUS = -273.15

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib', 
                     backend_kwargs={'filter_by_keys': {'typeOfLevel':'surface', 'edition': 1}})

coords_times = pd.read_csv("coords_times.csv")


all_results = {}


def to_naive(timestamp_with_timezone):
    # convert a timezone-aware timestamp to a naive timestamp in UTC
    timestamp_in_utc = timestamp_with_timezone.astimezone(datetime.timezone.utc)
    return timestamp_in_utc.replace(tzinfo=None)


for index, row in coords_times.iterrows():
    loc = ds.sel(longitude=[row["lon"]], latitude=[row["lat"]], method="nearest")

    # some timestamp handling - just included for the sake of completeness
    timestamp_exact = datetime.datetime.fromisoformat(row["timestamp"])
    timestamp_start_hour = timestamp_exact.replace(minute=0, second=0, microsecond=0)
    timestamp_current_day_begin = timestamp_exact.replace(hour=0, minute=0, second=0, microsecond=0)
    timestamp_current_day_end = timestamp_current_day_begin + datetime.timedelta(hours = 23)

    weatherdata_oneday = loc.sel(time = slice(to_naive(timestamp_current_day_begin), to_naive(timestamp_current_day_end)))

    # critical code part
    # this takes between 130 and 200 ms
    this_day_results = {
            "day_temperature_min" : weatherdata_oneday.t2m.min().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_max" : weatherdata_oneday.t2m.max().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_mean" : weatherdata_oneday.t2m.mean().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_cloudcover_mean" : weatherdata_oneday.tcc.mean().values,
            "day_cloudcover_hours_low" : (weatherdata_oneday.tcc.values < 0.2).sum()
    }

    all_results[row["id"]] = this_day_results

Thank you very much!

There is 1 answer below.

Okay, I found the reason. Accessing values can be quite slow in xarray because the data is lazily loaded: each access triggers a read from the file on disk.

Adding the following line before any access is performed reduces the time required to create the resulting dict to practically zero, because it reads the whole dataset into memory once:

ds.load()
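The effect is easy to reproduce with a small self-contained sketch. Since the ERA5 GRIB file and the cfgrib setup are specific to my data, this uses a synthetic netCDF file instead (it assumes a netCDF backend such as scipy or netCDF4 is installed); the lazy-loading behaviour is the same for any file-backed dataset:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Build a toy hourly "t2m" field and round-trip it through a netCDF file,
# because only file-backed datasets are lazily loaded by xarray.
rng = np.random.default_rng(0)
times = pd.date_range("2020-01-01", periods=48, freq="h")
toy = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), 280.0 + rng.random((48, 4, 4)))},
    coords={"time": times,
            "latitude": np.linspace(45.0, 48.0, 4),
            "longitude": np.linspace(8.0, 11.0, 4)},
)
toy.to_netcdf("toy.nc")

ds = xr.open_dataset("toy.nc")   # lazy: nothing is read yet
ds.load()                        # eager: pull everything into RAM once

# From here on, selections and reductions operate on in-memory numpy
# arrays instead of triggering repeated reads from disk.
day = ds.sel(time=slice("2020-01-01", "2020-01-01T23:00"))
day_min_celsius = float(day.t2m.min()) - 273.15
print(day_min_celsius)
```

Note that `ds.load()` works in place (and also returns the dataset), so a separate assignment is not needed.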

So the resulting code is:


[...]

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib',
                     backend_kwargs={'filter_by_keys': {'typeOfLevel': 'surface', 'edition': 1}})

ds.load()

[...]
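Beyond `ds.load()`, it may also be possible to drop the per-row Python loop entirely: xarray supports pointwise ("vectorized") indexing when the indexers are `DataArray`s that share a common dimension. This is only an illustrative sketch with synthetic data, not part of the original answer:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the ERA5 data: one day of hourly "t2m" on a 5x5 grid.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), 280.0 + rng.random((24, 5, 5)))},
    coords={"time": pd.date_range("2020-01-01", periods=24, freq="h"),
            "latitude": np.linspace(45.0, 49.0, 5),
            "longitude": np.linspace(8.0, 12.0, 5)},
)

# Indexers that share the dimension name "points" are applied together,
# so one .sel call picks the nearest cell for every (lon, lat) pair at
# once instead of producing a full lon x lat grid per iteration.
lons = xr.DataArray([8.1, 11.9], dims="points")
lats = xr.DataArray([45.2, 48.7], dims="points")
per_point = ds.t2m.sel(longitude=lons, latitude=lats, method="nearest")
print(per_point.dims)  # ('time', 'points')
```

With all points selected in a single call, the daily min/max/mean reductions can then be computed per point with one grouped operation instead of one dict per row.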
