Which technique should be applied to split a large text dataset for data matching?


I am working on a binary classification problem with large text datasets that are used for data matching. The data is imbalanced, but I am already handling that with a separate method.

I want to try some classifiers from sklearn on small subsets of this dataset. Is there a way in sklearn to divide the dataset into N subsets while maintaining the class proportions, so that I can then split each subset into training/testing sets and fit a classifier independently on each one?


2 Answers


I think sklearn’s StratifiedKFold might be what you are looking for. It will maintain the class proportions from the original dataset.
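For instance, a quick sanity check on dummy imbalanced data (a sketch; the X and y below are made-up stand-ins for your own data) shows that every fold keeps roughly the original class ratio:

# Sketch: verify that StratifiedKFold preserves the class ratio in every fold
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(1000, 10)          # dummy feature matrix
y = np.array([0] * 900 + [1] * 100)   # imbalanced dummy labels (90% / 10%)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Fraction of the minority class in this fold's test indices (~0.10 every time)
    print(f"Fold {fold}: class-1 ratio = {y[test_idx].mean():.2f}")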


StratifiedKFold is a class in sklearn.model_selection that may do the job. Suppose your data is stored in X (features) and y (target).

The splitter's split() method yields N pairs of index arrays: in each pair, the test indices cover one fold (roughly 1/N of the data) and the train indices cover the remaining N-1 folds. As you can see from the code below, the splitter returns indices rather than the split data itself.

# Import module
from sklearn.model_selection import StratifiedKFold

# Set N
N = 5

# Initialize a splitter that will divide data into N groups
kf = StratifiedKFold(n_splits=N)

# Collect the (train_indices, test_indices) pair for each of the N folds
idx_splits = []
for train_idx, test_idx in kf.split(X, y):
    idx_splits.append((train_idx, test_idx))

# Training portion of the third fold
X[idx_splits[2][0]]
y[idx_splits[2][0]]

# Test portion of the third fold
X[idx_splits[2][1]]
y[idx_splits[2][1]]
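
If you want the full workflow from the question (N class-balanced subsets, each with its own train/test split and its own classifier), one way is to treat each test fold above as one subset and split it again with train_test_split. A rough sketch, assuming X and y are NumPy arrays and using LogisticRegression only as a placeholder classifier:

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression  # placeholder classifier

N = 5
skf = StratifiedKFold(n_splits=N, shuffle=True, random_state=0)

models = []
for _, subset_idx in skf.split(X, y):
    # Each test fold is one of the N stratified subsets (~1/N of the data)
    X_sub, y_sub = X[subset_idx], y[subset_idx]

    # Split this subset into train/test, again preserving class proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X_sub, y_sub, test_size=0.2, stratify=y_sub, random_state=0
    )

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Subset accuracy:", clf.score(X_test, y_test))
    models.append(clf)

Using the test folds as the subsets keeps them disjoint, so each classifier is fitted on data that no other classifier sees.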