Unlabeled instances in doccano and spaCy: do they offer any value?

I am using doccano for sequence labelling and spaCy for further modelling. Some of the sentences I label do not contain any of the labels I am interested in, so they remain "unlabeled", i.e. they have no tags:

{"id": 79, "data": "This powerful charm would protect him until he became of age, or no longer called his aunt's house home.", "label": []}
{"id": 82, "data": "He began attending Hogwarts School of Witchcraft and Wizardry in 1991.", "label": []}
{"id": 85, "data": "He later became the youngest Quidditch Seeker in over a century and eventually the captain of the Gryffindor House Quidditch Team in his sixth year, winning two Quidditch Cups.", "label": []}

I want to train spaCy to recognise character names in all their variations.

Now the questions:

  • Is there any value in including unlabeled instances when training a spaCy model?
  • If there is, should I treat this data as an "imbalanced dataset" and act accordingly (boosting, SMOTE, over-sampling, etc.)?
  • What are the best practices in cases like this?
There is 1 solution below

Yes, you need to include some examples where nothing is labelled so that the model can learn what not to label. For example, if every capitalized word in your sample sentences is labelled, the model might learn to label all capitalized words.
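
As a rough illustration of keeping those negative examples, here is a minimal sketch that converts doccano export lines into `(text, {"entities": [...]})` pairs. It uses only the standard library. The first record is copied from the question; the second record, the `CHARACTER` tag, and the assumption that doccano stores each span as `[start, end, tag]` are made up for the example and may not match your export.

```python
import json

# First line is copied from the question. The second is a hypothetical labelled
# record, assuming each span is stored as [start, end, tag].
EXPORT_LINES = [
    '{"id": 82, "data": "He began attending Hogwarts School of Witchcraft and Wizardry in 1991.", "label": []}',
    '{"id": 1, "data": "Harry Potter was born in 1980.", "label": [[0, 12, "CHARACTER"]]}',
]


def to_training_example(jsonl_line):
    """Turn one export line into a (text, annotations) pair."""
    record = json.loads(jsonl_line)
    entities = [(start, end, tag) for start, end, tag in record["label"]]
    # An empty list is kept on purpose: it marks the sentence as a negative example.
    return record["data"], {"entities": entities}


train_data = [to_training_example(line) for line in EXPORT_LINES]
negatives = [text for text, annotations in train_data if not annotations["entities"]]
print(f"{len(train_data)} examples, {len(negatives)} without entities")
```

Pairs of this shape can then be handed to spaCy, for instance through `Example.from_dict` together with a `Doc` in spaCy v3. The point of the sketch is only that records with `"label": []` are converted rather than filtered out.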