How to do NER with spaCy using a custom dataset and custom tags

I have a set of movie reviews. I want to use spaCy to extract entities such as actor, director, author, and air date from them. However, spaCy only provides a generic PERSON tag.

What code do I need to make spaCy find my own entities, such as ACTOR and DIRECTOR, in my custom texts?

There is 1 best solution below

You don't need any special code to use new NER labels. By default, when you train a model, the labels are inferred from the training data. This is covered in the spaCy course.

Do note that you have to train a model, and can't just modify the existing NER model by adding labels to it.

Also note that things like Actor and Director are getting into Semantic Role Labelling, which is like NER but a harder problem for a computer. The spaCy course also goes over why this is difficult.


Assuming your data is in CoNLL format and split into train/dev, the complete flow to train a model is:

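# Convert both files to spaCy's binary .spacy format; the output directory is a positional argument
spacy convert train.conll corpus
spacy convert dev.conll corpus
# Generate a default training config for a pipeline with only an NER component
spacy init config -p ner ner.cfg
# Train, pointing the config at the converted data; the model is written to my-ner-model
spacy train ner.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy -o my-ner-model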

It doesn't matter what your labels are when you do this. The labels in the pretrained pipelines come from the data those pipelines were trained on; they are not hardcoded in spaCy's NER model.
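Since the label set comes straight from the training data, it can help to list the labels in a file before training, for example to catch ACTOR and Actor being treated as two different labels. Below is a minimal sketch in plain Python. It assumes a token-per-line layout with the IOB tag in the last column; the function name count_labels and the sample text are made up for illustration.

```python
from collections import Counter


def count_labels(conll_text: str) -> Counter:
    """Count entities per label in token-per-line IOB text (tag in the last column)."""
    counts = Counter()
    for line in conll_text.splitlines():
        line = line.strip()
        # Skip blank sentence separators and document markers.
        if not line or line.startswith("-DOCSTART-"):
            continue
        tag = line.split()[-1]
        # Count each entity once, at its first (B-) token.
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
    return counts


# Hypothetical sample in the assumed layout.
sample = """\
Tom B-ACTOR
Hanks I-ACTOR
stars O
in O
a O
film O
by O
Steven B-DIRECTOR
Spielberg I-DIRECTOR
. O
"""

label_counts = count_labels(sample)
print(sorted(label_counts))
```

Whatever labels show up here are the ones a model trained on the file would learn, so inconsistent spellings are worth fixing in the data first.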

Note that your data is probably not in CoNLL format, in which case you need to convert it; see the training data docs. Also, the config generated above uses the default settings, which are optimized for efficiency. Depending on your needs, you might want the accuracy settings instead (the --optimize accuracy option of spacy init config), or a transformer-based pipeline.
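As a rough illustration of such a conversion, here is a sketch that turns texts annotated with character offsets into token-per-line IOB text. Everything in it is an assumption for the example: the (start, end, label) annotation format, the names to_iob_lines and to_conll, and the simple regex tokenizer, which is not spaCy's tokenizer. Check the training data docs for the exact layout the converter expects.

```python
import re

# Simplified tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_iob_lines(text, spans):
    """Convert one text plus (start, end, label) character spans to 'token TAG' lines."""
    lines = []
    previous = None  # index of the span the previous token fell into
    for match in TOKEN_RE.finditer(text):
        current = None
        for i, (start, end, _label) in enumerate(spans):
            # A token belongs to a span only if it lies fully inside it.
            if start <= match.start() and match.end() <= end:
                current = i
                break
        if current is None:
            tag = "O"
        else:
            # First token of an entity gets B-, following tokens get I-.
            prefix = "I" if current == previous else "B"
            tag = f"{prefix}-{spans[current][2]}"
        lines.append(f"{match.group()} {tag}")
        previous = current
    return lines


def to_conll(examples):
    """Join annotated texts into one document, with a blank line between texts."""
    blocks = ["\n".join(to_iob_lines(text, spans)) for text, spans in examples]
    return "\n\n".join(blocks) + "\n"


# Hypothetical annotated review sentence.
examples = [
    (
        "Tom Hanks stars in a film directed by Steven Spielberg.",
        [(0, 9, "ACTOR"), (38, 54, "DIRECTOR")],
    ),
]

conll_text = to_conll(examples)
print(conll_text)
```

The resulting string can be written to train.conll and dev.conll. Tokens that only partly overlap an annotated span are tagged O here, so offsets that do not line up with token boundaries are worth checking by hand.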