Human Activity Recognition - Train-Test-Split and Modeling


I have a set of CSV files that contain accelerometer data (sampled at 100 Hz) from several subjects. I first read all CSV files into a list called "subjects". Here is a short snippet showing what each dataset in the "subjects" list looks like:

test = subjects[0]
print(test.head())
print(test.info()) 

          x         y         z  label
0  0.000964 -0.001134  0.006626      0
1  0.001184 -0.001213  0.009387      0
2  0.000443 -0.001731  0.008007      0
3 -0.000256 -0.000379  0.006897      0
4  0.000328  0.000040  0.005098      0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1229597 entries, 0 to 1229596
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype  
---  ------  --------------    -----  
 0   x       1229597 non-null  float64
 1   y       1229597 non-null  float64
 2   z       1229597 non-null  float64
 3   label   1229597 non-null  int64  
dtypes: float64(3), int64(1)
memory usage: 37.5 MB
None

Now I want to build a classification model (e.g. a Random Forest) to predict the label (values 0-3).

At this point, I am not sure how to split my data into train and test datasets. Because this is time series data, I don't think I can use scikit-learn's standard train_test_split function, which shuffles rows at random by default.

So how should I split the data for this task?
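
For reference, scikit-learn also ships group-aware splitters that keep all rows of one subject on the same side of the split. The following is only a minimal sketch with synthetic data; the `groups` array (one subject ID per row) and the sizes used are assumptions for illustration, not part of my actual dataset:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 10 subjects, 50 rows each, 3 accelerometer axes.
n_subjects, rows_per_subject = 10, 50
X = rng.normal(size=(n_subjects * rows_per_subject, 3))
y = rng.integers(0, 4, size=n_subjects * rows_per_subject)

# Hypothetical subject ID for every row; rows of one subject share one ID.
groups = np.repeat(np.arange(n_subjects), rows_per_subject)

# Hold out about 20% of the subjects (not 20% of the rows) for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_subjects = set(groups[train_idx])
test_subjects = set(groups[test_idx])

# No subject should appear on both sides of the split.
print("overlap between train and test subjects:", train_subjects & test_subjects)
```

The point of splitting on `groups` is that no subject contributes rows to both sets, so the test score reflects performance on people the model has never seen.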

My first thought was to split by subject: roughly the first 70% of the subjects as training data, the next 10% as validation data, and the remaining 20% as test data:

# 41 subjects in total
train_data = subjects[:29]
validation_data = subjects[29:33]
test_data = subjects[33:]
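
To make the idea concrete, here is a rough sketch of how such a subject-wise split could feed a Random Forest. Everything beyond the slicing above is an assumption: the data is synthetic, the helper names (`make_fake_subject`, `window_features`, `build_set`) are hypothetical, and the 2-second non-overlapping windows with mean/std features are just one simple choice, not a recommendation:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

WINDOW = 200  # assumption: 2-second windows at 100 Hz, no overlap


def make_fake_subject(rng, n_rows=2000):
    """Synthetic stand-in for one entry of `subjects` (columns x, y, z, label)."""
    label = np.repeat(rng.integers(0, 4, size=n_rows // WINDOW), WINDOW)
    xyz = rng.normal(size=(n_rows, 3)) + label[:, None] * 0.5
    df = pd.DataFrame(xyz, columns=["x", "y", "z"])
    df["label"] = label
    return df


def window_features(df, window=WINDOW):
    """Cut one subject's recording into fixed windows; return (features, labels)."""
    n_windows = len(df) // window
    trimmed = df.iloc[: n_windows * window]  # drop the incomplete last window
    xyz = trimmed[["x", "y", "z"]].to_numpy().reshape(n_windows, window, 3)
    labels = trimmed["label"].to_numpy().reshape(n_windows, window)
    # Six features per window: mean and standard deviation of each axis.
    feats = np.hstack([xyz.mean(axis=1), xyz.std(axis=1)])
    # Use the most frequent label inside each window as its target.
    y = np.array([np.bincount(row).argmax() for row in labels])
    return feats, y


def build_set(subject_list):
    """Window every subject separately, then stack the results."""
    parts = [window_features(df) for df in subject_list]
    X = np.vstack([p[0] for p in parts])
    y = np.concatenate([p[1] for p in parts])
    return X, y


rng = np.random.default_rng(0)
subjects = [make_fake_subject(rng) for _ in range(41)]

# Same subject-wise slicing as in the question.
train_data, validation_data, test_data = subjects[:29], subjects[29:33], subjects[33:]
X_train, y_train = build_set(train_data)
X_val, y_val = build_set(validation_data)
X_test, y_test = build_set(test_data)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)
print("validation accuracy:", accuracy_score(y_val, val_pred))
```

Windowing each subject separately before stacking means no window mixes samples from two different people, and the test subjects stay untouched until the final evaluation.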

I am not sure whether this is the correct approach, and I don't know how to proceed from here to build a classification model.

Thanks in advance!
