I have a list of CSV files containing accelerometer data (sampled at 100 Hz) from several subjects. I first read all CSV files into a list called "subjects". Here is a short snippet showing what each dataset in the "subjects" list looks like:
test = subjects[0]
print(test.head())
print(test.info())
          x         y         z  label
0  0.000964 -0.001134  0.006626      0
1  0.001184 -0.001213  0.009387      0
2  0.000443 -0.001731  0.008007      0
3 -0.000256 -0.000379  0.006897      0
4  0.000328  0.000040  0.005098      0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1229597 entries, 0 to 1229596
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   x       1229597 non-null  float64
 1   y       1229597 non-null  float64
 2   z       1229597 non-null  float64
 3   label   1229597 non-null  int64
dtypes: float64(3), int64(1)
memory usage: 37.5 MB
None
Now I want to build a classification model (e.g., a random forest) to predict the label (range: 0-3).
At this point, I am not sure how to split my data into train and test sets. Since this is time-series data, I don't think I can use scikit-learn's classic train_test_split function: shuffling the rows would put nearly identical, temporally adjacent samples into both sets.
So how should I make the split for this task?
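Just to illustrate what I mean, this is the standard call I would normally use (here on the single DataFrame "test" from above), but which I suspect leaks information in my case:

from sklearn.model_selection import train_test_split

# Row-level shuffle: consecutive 100 Hz samples are almost identical,
# so near-duplicates of training rows would end up in the test set
X_train, X_test, y_train, y_test = train_test_split(
    test[["x", "y", "z"]], test["label"],
    test_size=0.2, shuffle=True
)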
My first thought was instead to split at the subject level: use the first 70% of the subjects as training data, the next 10% as validation data, and the remaining 20% as test data:
# 41 subjects in total
train_data = subjects[:29]         # ~70% of subjects (29/41)
validation_data = subjects[29:33]  # ~10% of subjects (4/41)
test_data = subjects[33:]          # ~20% of subjects (8/41)
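If that is a sensible split, I assume the next step would be to flatten each list of DataFrames into a single feature matrix and label vector, roughly like this (column names taken from the data above):

import pandas as pd

# Stack the per-subject DataFrames of each split into one frame
train_df = pd.concat(train_data, ignore_index=True)
val_df = pd.concat(validation_data, ignore_index=True)
test_df = pd.concat(test_data, ignore_index=True)

X_train, y_train = train_df[["x", "y", "z"]], train_df["label"]
X_val, y_val = val_df[["x", "y", "z"]], val_df["label"]
X_test, y_test = test_df[["x", "y", "z"]], test_df["label"]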
I am not sure whether this is the correct approach, and I don't know how to proceed from here to build the classification model.
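To be concrete, the final step I have in mind is roughly this (assuming the X_train/y_train and X_val/y_val from the sketch above; the hyperparameters are just placeholders):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_val, y_val))  # mean accuracy on the held-out validation subjects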
Thanks in advance!