Stratified k-fold cross-validation for token-level labeled NER dataset

Asked by Oguzhan on 05 October 2023 at 14:44

I have a dataset used for Named Entity Recognition (NER). It is labeled at the token level and can be thought of as a list of lists, where each item pairs a token list with its tag list:

data = [[["I", "was", "in", "New", "York"], ["O","O","O","B-LOC","I-LOC"]],
        [["Einstein", "is", "a", "physicist"], ["B-PER","O","O","O"]], ...]

I am trying to implement stratified k-fold cross-validation so that each fold is representative of the overall distribution of entity labels in the dataset. For example, within each fold, the train and test splits should have roughly the same ratio of entity classes, such as PER, ORG, and so on. The well-known sklearn.model_selection.StratifiedKFold cannot handle this directly, because it expects a single class label per sample, whereas each sentence here carries a whole sequence of labels. I've thought about flattening the data to individual tokens and applying StratifiedKFold, but this is not the right approach, because entire sequences must remain intact (i.e., not split across different folds).

Is there an effective way to create stratified k-fold splits for this type of dataset in Python?
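
To make the requirement concrete, here is a minimal sketch of one possible direction: a greedy assignment that keeps each sentence whole and tries to balance entity-type counts across folds. It is not an existing library routine. The helper names `entity_counts` and `stratified_sentence_folds` are hypothetical, the code assumes BIO tags where every entity starts with a `B-` tag, and it only approximates stratification rather than guaranteeing it.

```python
from collections import Counter

def entity_counts(tags):
    """Count entities per type in one sentence (assumes each entity starts with a B- tag)."""
    return Counter(tag[2:] for tag in tags if tag.startswith("B-"))

def stratified_sentence_folds(data, k=5):
    """Greedily assign whole sentences to k folds, balancing entity-type counts.

    Returns a list of k lists of sentence indices; each list can serve as the
    test split of one fold, with the remaining indices as the train split.
    """
    counts = [entity_counts(tags) for _, tags in data]
    totals = Counter()
    for c in counts:
        totals.update(c)

    folds = [[] for _ in range(k)]
    fold_counts = [Counter() for _ in range(k)]

    # Place entity-rich sentences first, since they are the hardest to balance.
    order = sorted(range(len(data)), key=lambda i: -sum(counts[i].values()))
    for i in order:
        if counts[i]:
            # The globally rarest entity type in this sentence drives the choice:
            # pick the fold that currently has the fewest entities of that type,
            # breaking ties by the number of sentences already in the fold.
            rare = min(counts[i], key=lambda t: totals[t])
            best = min(range(k), key=lambda f: (fold_counts[f][rare], len(folds[f])))
        else:
            # Sentences without entities simply go to the smallest fold.
            best = min(range(k), key=lambda f: len(folds[f]))
        folds[best].append(i)
        fold_counts[best].update(counts[i])
    return folds

# Usage with the two example sentences from above (k=2 only for illustration).
data = [[["I", "was", "in", "New", "York"], ["O", "O", "O", "B-LOC", "I-LOC"]],
        [["Einstein", "is", "a", "physicist"], ["B-PER", "O", "O", "O"]]]
for test_idx in stratified_sentence_folds(data, k=2):
    train_idx = [i for i in range(len(data)) if i not in test_idx]
    print("train:", train_idx, "test:", test_idx)
```

Because the assignment is greedy, the per-fold label ratios are only approximately equal, and very rare entity types (fewer than k occurrences) cannot appear in every fold.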
