I have downloaded Conll 2003 corpus ("eng.train"). I want to use it to extract entity using python crfsuite training. But I don't know how to load this file for training.
I found this example, but it is not for English.
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
Also in future I would like to train new entities other than POS or location. How can I add those.
Also please suggest how to handle multiple words.
You can use ConllCorpusReader.
Here a general impelemantation:
ConllCorpusReader('file path', 'file name', columntypes=['','',''])
Here a list of column types which you can use:
'WORDS', 'POS', 'TREE', 'CHUNK', 'NE', 'SRL', 'IGNORE'
Example: