I am pretty new to working with TensorFlow and to coding in general, so I'm sorry if this question seems trivial or was answered somewhere else in a way that I didn't recognize as a solution to my problem.
TL;DR:
I can't figure out how to properly transform 100 CSVs (each one a training or test sequence) containing 13 columns and num_timesteps rows into the required dimensions for the LSTM input, (100, num_timesteps, 13). The target for each CSV is available through the folder structure it is saved in.
What I am working on
The project I'm working on is about multivariate time series classification. The training data consists of about 100 pandas DataFrames of sensor data, saved as CSVs (I would like to switch to pickles), each containing 13 features/time series. The columns/time series within a DataFrame always have the same length, but the DataFrames themselves vary in length. Since LSTMs seem to support time series of different lengths (1, 2) across batches, I would like to try it with varying lengths and only add zero padding if the results are not good. The CSVs are saved in different folders, indicating the 5 targets.
So, simplified, my data looks something like this:
For target1/example_t1_1.csv with length n and target 1 it would be:
| time | featureA | featureB | ... | featureK |
|---|---|---|---|---|
| t1 | a(t1) | b(t1) | ... | K(t1) |
| t2 | a(t2) | b(t2) | ... | K(t2) |
| ... | ... | ... | ... | ... |
| tn | a(tn) | b(tn) | ... | K(tn) |
and for target2/example_t2_1.csv with length j and target 2:
| time | featureA | featureB | ... | featureK |
|---|---|---|---|---|
| t1 | a(t1) | b(t1) | ... | K(t1) |
| t2 | a(t2) | b(t2) | ... | K(t2) |
| ... | ... | ... | ... | ... |
| tj | a(tj) | b(tj) | ... | K(tj) |
and so on.
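To make the folder layout concrete, here is a rough sketch of how I would gather the file paths and derive the targets from the folder names. The names are placeholders: I'm assuming the folders are literally called target1 … target5, data_root stands for wherever the CSVs live, and each CSV has a time column plus the 13 feature columns.

```python
from pathlib import Path

import pandas as pd

# Placeholder for wherever the CSVs actually live:
# data_root/target1/*.csv, data_root/target2/*.csv, ...
data_root = Path("data_root")

csv_paths = sorted(data_root.glob("target*/*.csv"))

# Derive the label from the parent folder name, e.g. "target3" -> 2 (zero-based)
labels = [int(path.parent.name.replace("target", "")) - 1 for path in csv_paths]

# Each CSV becomes a (num_timesteps, 13) array; lengths may differ between files
sequences = [pd.read_csv(path).drop(columns=["time"]).to_numpy() for path in csv_paths]
```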
What my goal is
My goal would be to mainly use tf.data to build a pipeline that reads my data directly into a tf.data.Dataset, on which I then perform the other operations like train-test split, scaling, etc. At least it seems to me that this is the way you are supposed to do it.
Or at least I would like a more elegant and convenient way of handling multivariate time series data, and of labeling the data through the folder location, than what I do now. Also, since I currently work with a 3D np.array, I can't figure out how to read in data of different lengths without zero padding.
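For illustration, this is roughly the kind of pipeline I have in mind, based on the tf.data docs I've read so far. It is only a sketch that reuses the csv_paths and labels lists from the snippet above; from_generator with an unspecified time dimension is just one pattern I came across, not necessarily the right one:

```python
import numpy as np
import pandas as pd
import tensorflow as tf

def sequence_generator(paths, targets):
    # Yield one (sequence, label) pair per CSV file
    for path, target in zip(paths, targets):
        features = pd.read_csv(path).drop(columns=["time"])
        yield features.to_numpy(dtype=np.float32), target

# shape=(None, 13): the time dimension is left unspecified so sequences
# of different lengths can live in the same dataset
ds = tf.data.Dataset.from_generator(
    lambda: sequence_generator(csv_paths, labels),
    output_signature=(
        tf.TensorSpec(shape=(None, 13), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)
```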
My problem
I only find resources on how to deal with tf.data for a single dataframe where every row is basically one individual "training instance" (3, 4, 5), while in my case I don't want to read just one dataframe into the dataset but multiple ones, where the rows of each dataframe also need to stay associated with each other. I just can't wrap my head around whether and how this is possible. In my understanding, my goal should be to create some kind of dataset with the dimensions [num_of_different_dfs, len_of_df (=None), num_of_features (13)]. With this video (6) I at least managed to get the targets into a tf dataset, but I didn't manage to do the same with the actual data.
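If I understand the docs correctly, something like padded_batch together with a Masking layer would then give batches of shape (batch_size, None, 13), padding only up to the longest sequence within each batch. Again, this is just a sketch of my current understanding; the batch size, layer sizes, and mask value 0.0 are arbitrary assumptions:

```python
import tensorflow as tf

# Within one batch all sequences must share a length, so padded_batch pads
# each batch up to its longest sequence; Masking makes the LSTM skip the
# padded timesteps (assuming an all-zero timestep never occurs in real data)
batched = ds.padded_batch(
    batch_size=16,
    padded_shapes=([None, 13], []),
)

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 13)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(batched, epochs=10)
```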
I hope my question and the explanation of my thoughts were understandable. I would be really happy if someone could provide some more information on how you would correctly handle multivariate time series data, or guide me towards the right resources. Thank you for taking the time to read all of this!
How I read in the data at the moment
Until now I read in all the data "by hand": I wrote two functions that search for the CSV with the shortest time series (8 samples) and read every CSV in truncated to that length, concatenated into one dataframe per target. (I know, throwing away data is a big no-no, but at the time I didn't know better.) So I ended up with 5 dataframes containing the time series data for each target. Knowing the number of individual time series, a function created a target vector for each target.

I later concatenated the 5 dataframes as NumPy arrays and did the same for the target vectors, leaving me with one "X" 2D array of size (8*100, 13) and a "y" target vector of size (100, 1). I then turned X into a 3D array of size (100, 8, 13). Since I kept my testing data in a different folder, on which I performed the same steps, I did not train-test split X and y. I would probably not want to do it like that again :D In the end I fit a MinMax scaler from sklearn on X and then applied it to X_test.

All of this kind of works, but it does not seem to be a suitable approach for multivariate time series, especially of different lengths. I hope this was understandable and you can see that this is quite a roundabout way.
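For reference, this is roughly what my current "by hand" approach boils down to (variable names are placeholders, reusing the sequences and labels lists from the first sketch):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Truncate every sequence to the shortest one (8 timesteps in my case) and stack
min_len = min(len(seq) for seq in sequences)
X = np.stack([seq[:min_len] for seq in sequences])   # (100, 8, 13)
y = np.array(labels).reshape(-1, 1)                  # (100, 1)

# Fit the scaler on the flattened training data, then reuse it on the test set
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X.reshape(-1, X.shape[-1])).reshape(X.shape)
# X_test_scaled = scaler.transform(X_test.reshape(-1, 13)).reshape(X_test.shape)
```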