I would appreciate your comments/help on a strategy I am applying in one of my analyses. In short, my case is:
1) My data are biological in origin, collected over a period of 120 s from a subject who received, in random order, one of three possible stimuli (response labels 1 to 3), one stimulus per second (trial). The sampling frequency is 256 Hz and there are 61 different sensors (input variables). So my dataset has 120x256 rows and 62 columns (1 response label + 61 input variables);
2) My goal is to identify whether there is an underlying pattern for each stimulus. To test this hypothesis I would like to use deep learning neural networks, but not in the conventional sense of predicting the stimulus from a single observation/row.
3) My approach is to shuffle the whole dataset by row (to avoid any time bias), divide it into training and validation sets (50/50), and then run the deep learning algorithm (a sketch of this procedure follows below). The split does not segregate the 120 trial events, so the training and validation sets will both contain rows from the same trials (but never the same row). If there is a dominant pattern per stimulus, the validation confusion matrix error should be low. If there is a dominant pattern per trial, the validation confusion matrix error should be high. So the validation confusion matrix error is my indicator of the presence of a hidden pattern per stimulus;
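A minimal sketch of what I mean, using scikit-learn (the MLPClassifier is just a stand-in for the deep network I would actually use, and the column layout, with the label in column 0, is only an assumption for illustration):

```python
# Sketch of the shuffle / 50-50 split / classify procedure described in point 3.
# Assumption: `data` has shape (120*256, 62), stimulus label (1-3) in column 0,
# the 61 sensor channels in the remaining columns. Placeholder data used here.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
data = rng.random((120 * 256, 62))                # placeholder for the real recordings
data[:, 0] = rng.integers(1, 4, size=len(data))   # placeholder stimulus labels 1-3

y = data[:, 0].astype(int)   # response label
X = data[:, 1:]              # 61 sensor channels

# Shuffle rows and split 50/50; stratify so each stimulus is equally represented.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, shuffle=True, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
print("Validation accuracy:", accuracy_score(y_val, y_pred))   # chance is ~1/3
print(confusion_matrix(y_val, y_pred))
```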
I would appreciate any input you could provide me regarding the validity of my logic. I would like to emphasize that I am not trying to predict the stimulus based on row inputs.
Thanks.
Yes, you can use classification performance in the cross-validation set that exceeds chance to argue that there is a pattern or relationship within the exemplars for each class. The argument will be stronger if similar performance is found on a separate, never-before-seen test set.
If a deep neural network, SVM, or any other classifier can classify better than chance, it implies that:

1) the input variables contain information that is relevant to the class;
2) the relationship between the input variables and the class is one the classifier is able to learn; and
3) the classifier did not merely over-learn the training set, i.e. what it learned generalizes to the cross-validation set.

So, if classification performance exceeds chance, then the above 3 conditions are true. If it does not, then one or more of the conditions could be false: the training variables might not contain any information that is helpful for predicting the class; or the variables are predictive, but the relationship between them and the class is too complicated for the classifier to learn; or the classifier over-learned, and the CV set performance is at chance level or worse.
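To make "exceeds chance" concrete, one option is a label-permutation test: refit the classifier many times on shuffled labels and see where the real cross-validated score falls in that null distribution. A minimal sketch with scikit-learn (placeholder data and a simple logistic-regression stand-in for your network, both my assumptions):

```python
# Sketch of a permutation test for above-chance classification performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)
X = rng.random((120 * 256, 61))       # placeholder sensor data
y = rng.integers(1, 4, size=len(X))   # placeholder stimulus labels 1-3

clf = LogisticRegression(max_iter=1000)
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, cv=5, n_permutations=100, scoring="accuracy", n_jobs=-1)

print(f"Cross-validated accuracy: {score:.3f} (chance is about {1/3:.3f})")
print(f"Permutation p-value: {p_value:.4f}")   # small p-value -> above chance
```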
Here is a paper (open-access) that used similar logic to argue that fMRI activity contains information about images that a person is looking at:
Natural Scene Categories Revealed in Distributed Patterns of Activity in the Human Brain
NOTE: Depending on the classifier used (especially DNNs, less so decision trees), this will only tell you IF there is a pattern; it will not tell you WHAT that pattern is.
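If you do want a rough idea of WHAT the pattern is, a tree-based classifier can at least rank the sensors by how much they contribute to separating the stimuli. A minimal sketch (the random forest and the placeholder data are my own choices, not something from your setup):

```python
# Sketch: ranking sensors by importance with a tree-based classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((120 * 256, 61))       # placeholder; the 61-channel recordings in practice
y = rng.integers(1, 4, size=len(X))   # placeholder stimulus labels 1-3

forest = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
forest.fit(X, y)

# Sensors sorted from most to least important for separating the three stimuli.
ranking = np.argsort(forest.feature_importances_)[::-1]
print("Most informative sensor indices:", ranking[:10])
```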