Why is accessing values in xarray so slow?


I need to access a large amount of historical weather data from the ERA5 dataset (more than 1 million specific location points at specific timestamps).

I access it through a downloaded GRIB file and the xarray package. Unfortunately, the script runs quite slowly: every single point needs about 250 to 300 ms of processing. This is largely due to accessing and computing the daily values (130 to 200 ms).

Do you know why it takes so long? Can I optimize it somehow (besides parallelization)?

I already tried creating a deep copy of weatherdata_oneday, but this didn't change anything (it was neither faster nor slower).

My code is:

import xarray as xr
import pandas as pd
import zoneinfo
import datetime

KELVIN_TO_DEGREE_CELSIUS = -273.15

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib', 
                     backend_kwargs={'filter_by_keys': {'typeOfLevel':'surface', 'edition': 1}})

coords_times = pd.read_csv("coords_times.csv")


all_results = {}


def to_naive(timestamp_with_timezone):
    # convert a timezone-aware timestamp to a naive timestamp in UTC
    timestamp_in_utc = timestamp_with_timezone.astimezone(datetime.timezone.utc)
    return timestamp_in_utc.replace(tzinfo=None)


for index, row in coords_times.iterrows():
    loc = ds.sel(longitude=[row["lon"]], latitude=[row["lat"]], method="nearest")

    # some timestamp handling - just included for the sake of completeness
    timestamp_exact = datetime.datetime.fromisoformat(row["timestamp"])
    timestamp_start_hour = timestamp_exact.replace(minute=0, second=0, microsecond=0)
    timestamp_current_day_begin = timestamp_exact.replace(hour=0, minute=0, second=0, microsecond=0)
    timestamp_current_day_end = timestamp_current_day_begin + datetime.timedelta(hours = 23)

    weatherdata_oneday = loc.sel(time = slice(to_naive(timestamp_current_day_begin), to_naive(timestamp_current_day_end)))

    # critical code part
    # this takes between 130 and 200 ms
    this_day_results = {
            "day_temperature_min" : weatherdata_oneday.t2m.min().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_max" : weatherdata_oneday.t2m.max().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_mean" : weatherdata_oneday.t2m.mean().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_cloudcover_mean" : weatherdata_oneday.tcc.mean().values,
            "day_cloudcover_hours_low" : (weatherdata_oneday.tcc.values < 0.2).sum()
    }

    all_results[row["id"]] = this_day_results

Thank you very much!

There is 1 answer below.

Okay, I found the reason. Accessing values can be quite slow in xarray because the data is lazily loaded: each access triggers a read from the file on disk.

Adding the following line before any access is performed reduces the time required to create the resulting dict to practically zero, because it reads the whole dataset into memory once:

ds.load()
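The effect is easy to reproduce with a small self-contained sketch. Since the ERA5 GRIB file and the cfgrib setup are specific to my data, this uses a synthetic netCDF file instead (it assumes a netCDF backend such as scipy or netCDF4 is installed); the lazy-loading behaviour is the same for any file-backed dataset:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Build a toy hourly "t2m" field and round-trip it through a netCDF file,
# because only file-backed datasets are lazily loaded by xarray.
rng = np.random.default_rng(0)
times = pd.date_range("2020-01-01", periods=48, freq="h")
toy = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), 280.0 + rng.random((48, 4, 4)))},
    coords={"time": times,
            "latitude": np.linspace(45.0, 48.0, 4),
            "longitude": np.linspace(8.0, 11.0, 4)},
)
toy.to_netcdf("toy.nc")

ds = xr.open_dataset("toy.nc")   # lazy: nothing is read yet
ds.load()                        # eager: pull everything into RAM once

# From here on, selections and reductions operate on in-memory numpy
# arrays instead of triggering repeated reads from disk.
day = ds.sel(time=slice("2020-01-01", "2020-01-01T23:00"))
day_min_celsius = float(day.t2m.min()) - 273.15
print(day_min_celsius)
```

Note that `ds.load()` works in place (and also returns the dataset), so a separate assignment is not needed.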

So the resulting code is:


[...]

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib',
                     backend_kwargs={'filter_by_keys': {'typeOfLevel': 'surface', 'edition': 1}})

ds.load()

[...]
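Beyond `ds.load()`, it may also be possible to drop the per-row Python loop entirely: xarray supports pointwise ("vectorized") indexing when the indexers are `DataArray`s that share a common dimension. This is only an illustrative sketch with synthetic data, not part of the original answer:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the ERA5 data: one day of hourly "t2m" on a 5x5 grid.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), 280.0 + rng.random((24, 5, 5)))},
    coords={"time": pd.date_range("2020-01-01", periods=24, freq="h"),
            "latitude": np.linspace(45.0, 49.0, 5),
            "longitude": np.linspace(8.0, 12.0, 5)},
)

# Indexers that share the dimension name "points" are applied together,
# so one .sel call picks the nearest cell for every (lon, lat) pair at
# once instead of producing a full lon x lat grid per iteration.
lons = xr.DataArray([8.1, 11.9], dims="points")
lats = xr.DataArray([45.2, 48.7], dims="points")
per_point = ds.t2m.sel(longitude=lons, latitude=lats, method="nearest")
print(per_point.dims)  # ('time', 'points')
```

With all points selected in a single call, the daily min/max/mean reductions can then be computed per point with one grouped operation instead of one dict per row.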
