Stratified k-fold cross-validation for token-level labeled NER dataset

I have a dataset for Named Entity Recognition (NER). It is labeled at the token level and can be thought of as a list of [tokens, labels] pairs:

data = [[["I", "was", "in", "New", "York"], ["O","O","O","B-LOC","I-LOC"]],
        [["Einstein", "is", "a", "physicist"], ["B-PER","O","O","O"]], ...] 

I am trying to implement stratified k-fold cross-validation so that each fold is representative of the overall distribution of entity labels in the dataset. For example, in each fold the train and test splits should contain roughly the same ratio of entity classes (PER, ORG, and so on). sklearn.model_selection.StratifiedKFold does not handle this directly, because it expects a single class label per sample, whereas here each sample (a sentence) carries a whole sequence of labels. I've thought about flattening the data to individual tokens and applying StratifiedKFold, but that is not the right approach, because entire sequences must remain intact (i.e., a sentence must not be split across different folds).

Is there an effective way to create stratified k-fold splits for this type of dataset in Python?
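
To make the requirement concrete, here is a minimal sketch of one possible direction, not an established solution: count the entities in each sentence (one per B- tag, assuming BIO tagging as in the example above), then greedily assign whole sentences to folds so that each entity type is spread as evenly as possible. The function names (entity_counts, stratified_sentence_folds, train_test_splits) and the toy_data sentences beyond the first two are made up for illustration, and a greedy heuristic like this only approximates stratification.

```python
from collections import Counter


def entity_counts(tags):
    """Count entities per type in one sentence by counting B- tags (assumes BIO tagging)."""
    return Counter(t[2:] for t in tags if t.startswith("B-"))


def stratified_sentence_folds(data, k):
    """Greedily assign whole sentences to k folds, balancing entity-type counts.

    Returns a list of k lists of sentence indices. Hypothetical helper, a heuristic only.
    """
    counts = [entity_counts(tags) for _, tags in data]
    total = Counter()
    for c in counts:
        total.update(c)

    folds = [[] for _ in range(k)]
    fold_counts = [Counter() for _ in range(k)]

    def rarity(i):
        # Frequency of the rarest entity type in sentence i (inf if it has no entities).
        return min((total[label] for label in counts[i]), default=float("inf"))

    # Place sentences containing rare entity types first; they are the hardest to balance.
    order = sorted(range(len(data)), key=lambda i: (rarity(i), -sum(counts[i].values())))

    for i in order:
        def cost(f):
            # Share of each relevant entity type already in fold f; ties go to the smaller fold.
            share = sum(fold_counts[f][label] / total[label] for label in counts[i])
            return (share, len(folds[f]))

        best = min(range(k), key=cost)
        folds[best].append(i)
        fold_counts[best].update(counts[i])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for j, test_idx in enumerate(folds):
        train_idx = [i for f, fold in enumerate(folds) if f != j for i in fold]
        yield train_idx, test_idx


# Toy data: the first two sentences are from the question, the last two are invented.
toy_data = [
    [["I", "was", "in", "New", "York"], ["O", "O", "O", "B-LOC", "I-LOC"]],
    [["Einstein", "is", "a", "physicist"], ["B-PER", "O", "O", "O"]],
    [["Paris", "is", "nice"], ["B-LOC", "O", "O"]],
    [["Curie", "won"], ["B-PER", "O"]],
]

for train_idx, test_idx in train_test_splits(stratified_sentence_folds(toy_data, k=2)):
    print("train:", train_idx, "test:", test_idx)
```

Because sentences are assigned as units, no sequence is ever split across folds. Another option that may be worth checking is sklearn.model_selection.StratifiedGroupKFold on the flattened tokens with the sentence index passed as the group, although it stratifies on token labels rather than on entity counts.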
