Train non-English Stanford NER models


I'm seeing several posts about training the Stanford NER for other languages.

For example: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486

However, the Stanford CRFClassifier uses some language-dependent features (such as part-of-speech tags).

Can we really train non-English models using the same Jar file? https://nlp.stanford.edu/software/crf-faq.html


There are 2 best solutions below

0
On

I agree with the previous comment that the NER classification model is language independent.

If training data is the issue, I can suggest this link, which collects a large number of labeled datasets for different languages.

If you would like to try another model, I suggest EstNLTK. It is a library for Estonian, but it can also fit language-independent NER models (documentation). Also, here you can find an example of how to train an NER model using spaCy.

I hope it helps. Good luck!

0
On

Training an NER classifier is language independent. You have to provide high-quality training data and create meaningful features. The point is that not all features are equally useful for every language. Capitalization, for instance, is a good indicator of a named entity in English. But in German all nouns are capitalized, which makes this feature less useful.
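To make the capitalization point concrete, here is a small sketch that measures how often a capitalized, non-sentence-initial token is actually an entity. The two sentences and their labels are toy data made up for illustration, and `capitalization_precision` is a hypothetical helper, not part of Stanford NER.

```python
def capitalization_precision(tagged_tokens):
    """Fraction of capitalized, non-sentence-initial tokens labeled as entities.

    tagged_tokens: list of (token, label) pairs, where "O" means "not an entity".
    """
    # Skip position 0: the first word of a sentence is capitalized in both languages.
    capitalized = [
        (token, label)
        for index, (token, label) in enumerate(tagged_tokens)
        if index > 0 and token[0].isupper()
    ]
    if not capitalized:
        return 0.0
    hits = sum(1 for _, label in capitalized if label != "O")
    return hits / len(capitalized)


# Toy sentences (illustrative only, not real corpus data).
english = [
    ("Yesterday", "O"), ("Angela", "PER"), ("Merkel", "PER"), ("visited", "O"),
    ("a", "O"), ("school", "O"), ("in", "O"), ("Berlin", "LOC"),
]
german = [
    ("Gestern", "O"), ("besuchte", "O"), ("Angela", "PER"), ("Merkel", "PER"),
    ("eine", "O"), ("Schule", "O"), ("in", "O"), ("Berlin", "LOC"),
]

print(capitalization_precision(english))  # 1.0
print(capitalization_precision(german))   # 0.75
```

In the German sentence the common noun "Schule" is capitalized too, so capitalization alone already produces a false positive on this tiny example.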

In Stanford NER you can decide which features the classifier uses, so you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags for your target language.

I hope this clarifies some things.
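As a rough sketch of what that feature selection looks like, the snippet below generates a properties file in the style of the example in the Stanford CRF FAQ linked in the question. The helper `build_ner_properties` and the file names are hypothetical. The flag names follow the FAQ example, plus `useTags` and a `tag` column for POS tags, which I am assuming from the Stanford NER flag set, so check them against the version of the jar you use.

```python
def build_ner_properties(train_file, model_file, use_pos_tags=False):
    """Return the text of a Stanford NER properties file (sketch, not exhaustive)."""
    # Column layout of the tab-separated training file. With POS tags you
    # supply them yourself as an extra column (assumed key name: "tag").
    columns = "word=0,tag=1,answer=2" if use_pos_tags else "word=0,answer=1"
    props = {
        "trainFile": train_file,
        "serializeTo": model_file,
        "map": columns,
        # Language-neutral features taken from the FAQ example.
        "useClassFeature": "true",
        "useWord": "true",
        "useNGrams": "true",
        "noMidNGrams": "true",
        "maxNGramLeng": "6",
        "usePrev": "true",
        "useNext": "true",
        "useSequences": "true",
        "usePrevSequences": "true",
        "maxLeft": "1",
        # Word-shape features depend on capitalization; consider how useful
        # they are for your language.
        "useTypeSeqs": "true",
        "useTypeSeqs2": "true",
        "useTypeySequences": "true",
        "wordShape": "chris2useLC",
        # POS tag features: off unless you provide your own tags.
        "useTags": "true" if use_pos_tags else "false",
    }
    return "\n".join(f"{key} = {value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical file names for a German model.
    text = build_ner_properties("german-train.tsv", "german-ner.ser.gz")
    with open("german.prop", "w", encoding="utf-8") as handle:
        handle.write(text)
```

Training then uses the same jar as for English, along the lines of `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop german.prop`.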