I am working on a binary classification problem with large text datasets that will be used for data matching. The data is imbalanced, but I am using a method to fix this issue.
I want to try some classifiers with sklearn on small subsets of this dataset. Is there a way in sklearn to divide the dataset into N subsets while maintaining the class proportions, so that I can then split each subset into training/testing sets and fit a classifier independently on each one?
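I think sklearn’s StratifiedKFold might be what you are looking for. With n_splits=N, the test indices of the N folds give you N disjoint subsets, each keeping approximately the class proportions of the original dataset. You can then split each subset into training/testing sets with train_test_split, passing stratify so the proportions are preserved there as well.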
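Here is a minimal sketch of that idea. The synthetic data from make_classification, the choice of LogisticRegression, the 25% test size, and the variable names are all assumptions for illustration; swap in your own features, labels, and classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic imbalanced data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)

n_subsets = 5  # the "N" from the question
skf = StratifiedKFold(n_splits=n_subsets, shuffle=True, random_state=0)

scores, subset_sizes, minority_ratios = [], [], []

# Only the test indices of each fold are used: they form N disjoint,
# stratified subsets of the full dataset.
for _, subset_idx in skf.split(X, y):
    X_sub, y_sub = X[subset_idx], y[subset_idx]

    # Stratified train/test split inside the subset.
    X_train, X_test, y_train, y_test = train_test_split(
        X_sub, y_sub, test_size=0.25, stratify=y_sub, random_state=0
    )

    # Fit an independent classifier on this subset only.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    scores.append(clf.score(X_test, y_test))
    subset_sizes.append(len(subset_idx))
    minority_ratios.append(y_sub.mean())

print("accuracy per subset:", scores)
print("minority-class share per subset:", minority_ratios)
```

The share of the minority class printed for each subset should be close to the share in the full dataset. For an imbalanced problem you would probably want a metric other than accuracy, but that is a separate choice from how the subsets are built.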