Python: Is there a better way to work with ragged arrays than a list of arrays with dtype = object?


I am collecting time-series data that can be separated into "tasks" based on a particular target value, and the tasks can be numbered by their associated targets. However, the length of the data associated with each task varies, because a task may take more or less time to complete.

In MATLAB I currently separate this data by target number into a cell array, which is extremely convenient: the analysis of the time-series data is the same for each target's set of data, so I can run it with a simple for loop over the cells. To my knowledge, the closest equivalent in Python is a ragged array. However, while researching this question I found that implicit creation of ragged arrays has been deprecated, and that to build one you must now set dtype=object.
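For concreteness, here is roughly what that looks like in NumPy (a sketch with made-up data; newer NumPy versions raise an error rather than a warning if dtype=object is omitted):

```python
import numpy as np

# Three "tasks" of different lengths (made-up data).
task_a = np.sin(np.linspace(0, 1, 50))
task_b = np.sin(np.linspace(0, 1, 80))
task_c = np.sin(np.linspace(0, 1, 30))

# Without dtype=object, NumPy now refuses to build the ragged array.
ragged = np.array([task_a, task_b, task_c], dtype=object)

# Looping works much like looping over a MATLAB cell array.
for task in ragged:
    print(task.shape, task.mean())
```

I have a few questions surrounding this scenario: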

  1. Does setting dtype=object for the ragged array come with any inherent limitations on how one will access the data within the array?

  2. Is there a more convenient way to save these ragged arrays as NumPy files than reducing them from 3D to 2D and saving the associated indices in a separate file? That would be fairly inconvenient, as I have thousands of files I would like to save as ragged arrays.

  3. Related to 2, is saving the data as a .npz file any different in practice with regard to saving an associated index? More specifically, since a .npz archive is technically a separate .npy file for each array, would I be able to unpack the ragged arrays automatically, assuming that the data for each target is stored in the same way in every file? (See the sketch after this list.)

  4. Most importantly, is a ragged array really the best equivalent setup for my task, or does the deprecation warning push towards dtype=object because manipulating data this way has become redundant and Python 3 has a better method for dealing with stacked arrays of varying size?
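For reference, the two saving routes questions 2 and 3 refer to look roughly like this (a sketch with hypothetical file names; the allow_pickle requirement on load is the one firm constraint):

```python
import numpy as np

ragged = np.array([np.arange(5), np.arange(8)], dtype=object)

# Route 1: save the object array itself. Object arrays are pickled
# under the hood, so loading them back requires allow_pickle=True.
np.save("tasks.npy", ragged)
loaded = np.load("tasks.npy", allow_pickle=True)

# Route 2: store each task as its own plain array inside one .npz
# archive (technically one member .npy file per array), avoiding
# pickling entirely. Positional arguments are named arr_0, arr_1, ...
np.savez("tasks.npz", *list(ragged))
with np.load("tasks.npz") as data:
    tasks = [data[key] for key in data.files]
```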

There is 1 answer below.


I have decided to move forward with a known solution to my problem, and it is adapting well. I organize each separate set of data into its own array and then store the arrays in sequence in a list, just as I would with cells in MATLAB. To make this saveable, I record the index at which each new set of data begins as I separate the data out. By this I mean that (a code sketch follows these steps):

  1. I identify the location of the next separate set of data.
  2. I copy the data up until that index value into an array that is appended to a list.
  3. I store the index value that was the start of the next separate set of data.
  4. I delete that information from a copy of my original array.
  5. I repeat steps 1-4 until only one uniquely labelled sequence of data is left and append that final set. There is no further index to record, so the list of indices has one fewer entry than the list of arrays.
  6. When saving data, I take the original array and save it in a .npz file with the unpacking indices.
  7. When I want to reload the data into its separate arrays for analysis, I can 'pack' and 'unpack' it between its two forms: a single NumPy array and a list of NumPy arrays.
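A minimal sketch of that round trip (the file name and the use of np.diff/np.split are my own illustration of the steps above, replacing the copy-and-delete loop with one-shot change-point detection):

```python
import numpy as np

# Made-up session: samples labelled by the target/task they belong to.
data = np.random.rand(16)
targets = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2])

# Steps 1-5: every position where the target label changes marks the
# start of the next set of data, so there is one fewer index than tasks.
indices = np.flatnonzero(np.diff(targets)) + 1

# Step 6: save the original flat array together with the unpacking indices.
np.savez("session.npz", data=data, indices=indices)

# Step 7: 'unpack' the single array back into a list of per-task arrays.
with np.load("session.npz") as f:
    tasks = np.split(f["data"], f["indices"])

# 'Packing' is just the reverse: np.concatenate(tasks) rebuilds the flat array.
```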

This solution is working quite well. I hope this helps someone in the future.