xarray variable from list of differing length lists


I have a function that creates an xarray Dataset from various outputs of a model. One of the pieces of information I collect is a list of lists whose inner lists are not all the same length. This variable is called cids and shares the dimension repo_id with the other variables.

So far the following has always worked fine:

import numpy as np
import pandas as pd
import xarray as xr

datetime = pd.date_range('20010101', periods=100, freq='D')
obs = [xr.DataArray(np.random.rand(100), dims={'datetime': datetime}),xr.DataArray(np.random.rand(100), dims={'datetime':datetime}) ]
cids = [[1, 2, 3], [1, 2, 3, 4]]
keys = np.array([['A', 'A', 'B'], ['C', 'D', 'E']])
xr.Dataset({'obs': (['repo_id', 'datetime'], np.array(obs)), 'cig_id': ('repo_id', keys[:, 0]), 'repo': ('repo_id', keys[:, 2]), 'cids': ('repo_id', cids)},  coords={'repo_id': keys[:, 1], 'datetime': obs[0].datetime})

This yields the following results, as expected:

<xarray.Dataset>
Dimensions:   (datetime: 100, repo_id: 2)
Coordinates:
  * repo_id   (repo_id) <U1 'A' 'D'
  * datetime  (datetime) int64 0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 99
Data variables:
    obs       (repo_id, datetime) float64 0.9393 0.468 0.7168 ... 0.03513 0.8771
    cig_id    (repo_id) <U1 'A' 'C'
    repo      (repo_id) <U1 'B' 'E'
    cids      (repo_id) object [1, 2, 3] [1, 2, 3, 4]

However, I recently had a case where all the lists in my cids variable had the same length:

datetime = pd.date_range('20010101', periods=100, freq='D')
obs = [xr.DataArray(np.random.rand(100), dims={'datetime': datetime}),xr.DataArray(np.random.rand(100), dims={'datetime':datetime}) ]
# note that both elements of cids now have the same length
cids = [[1, 2, 3], [1, 2, 3]]
keys = np.array([['A', 'A', 'B'], ['C', 'D', 'E']])
xr.Dataset({'obs': (['repo_id', 'datetime'], np.array(obs)), 'cig_id': ('repo_id', keys[:, 0]), 'repo': ('repo_id', keys[:, 2]), 'cids': ('repo_id', cids)},  coords={'repo_id': keys[:, 1], 'datetime': obs[0].datetime})

Which produces the following error:

cids = [[1, 2, 3], [1, 2, 3]]
keys = np.array([['A', 'A', 'B'], ['C', 'D', 'E']])
xr.Dataset({'obs': (['repo_id', 'datetime'], np.array(obs)), 'cig_id': ('repo_id', keys[:, 0]), 'repo': ('repo_id', keys[:, 2]), 'cids': ('repo_id', cids)},  coords={'repo_id': keys[:, 1], 'datetime': obs[0].datetime})
Traceback (most recent call last):
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/variable.py", line 107, in as_variable
    obj = Variable(*obj)
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/variable.py", line 309, in __init__
    self._dims = self._parse_dimensions(dims)
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/variable.py", line 503, in _parse_dimensions
    "number of data dimensions, ndim=%s" % (dims, self.ndim)
ValueError: dimensions ('repo_id',) must have the same length as the number of data dimensions, ndim=2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-48-9a2b518ac4d3>", line 2, in <module>
    xr.Dataset({'obs': (['repo_id', 'datetime'], np.array(obs)), 'cig_id': ('repo_id', keys[:, 0]), 'repo': ('repo_id', keys[:, 2]), 'cids': ('repo_id', cids)},  coords={'repo_id': keys[:, 1], 'datetime': obs[0].datetime})
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/dataset.py", line 537, in __init__
    data_vars, coords, compat="broadcast_equals"
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/merge.py", line 467, in merge_data_and_coords
    objects, compat, join, explicit_coords=explicit_coords, indexes=indexes
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/merge.py", line 552, in merge_core
    collected = collect_variables_and_indexes(aligned)
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/merge.py", line 277, in collect_variables_and_indexes
    variable = as_variable(variable, name=name)
  File "/auto/anaconda3/envs/commod_staging/lib/python3.6/site-packages/xarray/core/variable.py", line 113, in as_variable
    "{} to Variable.".format(obj)
ValueError: Could not convert tuple of form (dims, data[, attrs, encoding]): ('repo_id', [[1, 2, 3], [1, 2, 3]]) to Variable.

Input would be appreciated; I'm not sure how best to handle this. It seems xarray is trying to be smart and treating cids as a two-dimensional array of shape (2, 3) rather than as a one-dimensional variable of length two along repo_id. Is this a bug?


There are 2 best solutions below

1
On

Currently the first example creates a variable cids whose elements are lists:

In [6]: datetime = pd.date_range('20010101', periods=100, freq='D')
   ...: obs = [xr.DataArray(np.random.rand(100), dims={'datetime': datetime}),xr.DataArray(np.random.rand(100), dims={'datetime':datetime}) ]
   ...: cids = [[1, 2, 3], [1, 2, 3, 4]]
   ...: keys = np.array([['A', 'A', 'B'], ['C', 'D', 'E']])
   ...: xr.Dataset({'obs': (['repo_id', 'datetime'], np.array(obs)), 'cig_id': ('repo_id', keys[:, 0]), 'repo': ('repo_id', keys[:, 2]), 'cids': ('repo_id', cids)},  coords={'repo_id': keys[:, 1], 'datetime': obs[0].datetime})
   ...:
Out[6]:
<xarray.Dataset>
Dimensions:   (datetime: 100, repo_id: 2)
Coordinates:
  * repo_id   (repo_id) <U1 'A' 'D'
  * datetime  (datetime) int64 0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 99
Data variables:
    obs       (repo_id, datetime) float64 0.4451 0.9134 ... 0.8266 0.07039
    cig_id    (repo_id) <U1 'A' 'C'
    repo      (repo_id) <U1 'B' 'E'
    cids      (repo_id) object [1, 2, 3] [1, 2, 3, 4]

In [9]: ds=_

In [11]: ds.cids
Out[11]:
<xarray.DataArray 'cids' (repo_id: 2)>
array([list([1, 2, 3]), list([1, 2, 3, 4])], dtype=object)  # <- here
Coordinates:
  * repo_id  (repo_id) <U1 'A' 'D'

Is that intentional? Generally you would want to store a single value at each position along a dimension, rather than a list.

I appreciate it's a confusing pair of cases: it is surprising that this works for lists of unequal length but not for lists of equal length. With equal lengths, the nested list converts to a two-dimensional array of shape (2, 3), so xarray expects a second dimension name and only 'repo_id' is given. With unequal lengths, no rectangular array can be formed, so the data becomes a one-dimensional object array whose elements are the lists themselves.
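The difference can be seen in NumPy alone, without xarray. The following is a minimal sketch of that array conversion, not a trace of xarray's internals; the variable names are made up for illustration. Note that on NumPy 1.24 and later, a ragged nested list raises a ValueError unless dtype=object is passed, so the ragged case below passes it explicitly.

```python
import numpy as np

equal = [[1, 2, 3], [1, 2, 3]]
ragged = [[1, 2, 3], [1, 2, 3, 4]]

# Equal-length inner lists form a rectangular 2-D array.
equal_arr = np.asarray(equal)
print(equal_arr.shape)   # (2, 3)

# Ragged inner lists cannot form a rectangle. With dtype=object they become
# a 1-D array whose elements are the Python lists themselves.
ragged_arr = np.array(ragged, dtype=object)
print(ragged_arr.shape)  # (2,)

# dtype=object on its own does not stop the 2-D interpretation
# when the inner lists have equal length.
equal_obj = np.array(equal, dtype=object)
print(equal_obj.shape)   # (2, 3)
```

So the one-dimensional result in the first example is a side effect of the lists being ragged, not something requested by naming the dimension 'repo_id'.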

The error message is unhelpful, but I'm not sure what I'd change in the functionality. Potentially it could raise an error on your first example, given that it's unlikely someone wants objects that are lists.
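If one value per position is acceptable, one option is to pad the lists into a rectangular array and give the variable a second dimension. This is only a sketch under assumptions: the dimension name 'cid_slot' is hypothetical, and padding with NaN turns the ids into floats, which may or may not suit your data.

```python
import numpy as np
import xarray as xr

cids = [[1, 2, 3], [1, 2, 3, 4]]

# Pad every inner list with NaN up to the longest length.
# The array is float so that NaN can mark the missing slots.
width = max(len(c) for c in cids)
padded = np.full((len(cids), width), np.nan)
for i, c in enumerate(cids):
    padded[i, :len(c)] = c

# 'cid_slot' is a hypothetical name for the new second dimension.
ds = xr.Dataset(
    {'cids': (('repo_id', 'cid_slot'), padded)},
    coords={'repo_id': ['A', 'D']},
)
print(ds['cids'].values)
```

This behaves the same whether or not the inner lists happen to have equal length, because the data is always two-dimensional.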

0
On

I suspect this may not be the most "xarrayonic" approach, but the following seems to provide me with a 'fix':

datetime = pd.date_range('20010101', periods=100, freq='D')
obs = [xr.DataArray(np.random.rand(100), dims={'datetime': datetime}),xr.DataArray(np.random.rand(100), dims={'datetime':datetime}) ]
# note that both elements of cids have the same length
# the fix: convert each inner list to a set
cids = [set(_e) for _e in [[1, 2, 3], [1, 2, 3]]]
# that's all
keys = np.array([['A', 'A', 'B'], ['C', 'D', 'E']])
xr.Dataset({'obs': (['repo_id', 'datetime'], np.array(obs)), 'cig_id': ('repo_id', keys[:, 0]), 'repo': ('repo_id', keys[:, 2]), 'cids': ('repo_id', cids)},  coords={'repo_id': keys[:, 1], 'datetime': obs[0].datetime})
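One caveat with sets is that they drop duplicate ids and do not preserve order. If the lists need to stay lists, a possible alternative is to build the one-dimensional object array explicitly and fill it element by element. The sketch below assumes that approach; the helper name as_object_column is made up here and is not part of xarray or NumPy.

```python
import numpy as np
import xarray as xr

def as_object_column(items):
    """Return a 1-D object array holding one Python object per element."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item  # element-wise assignment stores the list itself
    return out

# Equal-length lists, which would otherwise become a (2, 3) array.
cids = as_object_column([[1, 2, 3], [1, 2, 3]])

ds = xr.Dataset({'cids': ('repo_id', cids)}, coords={'repo_id': ['A', 'D']})
print(ds['cids'].shape, ds['cids'].dtype)
```

Because the array is already one-dimensional when it reaches xarray, the single dimension name 'repo_id' matches regardless of the lengths of the inner lists.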