How do I use Conll 2003 corpus in python crfsuite

1.9k Views Asked by user2550098 At 10 August 2017 at 17:19

I have downloaded the CoNLL 2003 corpus ("eng.train"). I want to use it to train an entity extractor with python-crfsuite, but I don't know how to load this file for training.

I found this example, but it is not for English.

import nltk

# Spanish CoNLL 2002 data bundled with NLTK
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

Also, in the future I would like to train new entity types other than POS or location. How can I add those?

Also, please suggest how to handle entities that span multiple words.


There is 1 solution below

Olzhas Aldabergenov On 10 December 2018 at 15:47

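You can use NLTK's ConllCorpusReader.

Here is the general form: ConllCorpusReader('root directory', 'file name', columntypes=['', '', ''])

The first argument is the directory that contains the corpus files, the second is the file name (or a list of file names) inside it, and columntypes names each column of the file from left to right.

Here is the list of column types you can use (they are passed as lowercase strings): 'words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore'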
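To make the column types concrete, here is a minimal standard-library sketch of the idea: each whitespace-separated column of a line is given a name, and 'ignore' columns are dropped. It is only an illustration and not how NLTK implements the reader. The helper name `read_conll_columns` and the sample lines are assumptions made up for this example; the sample only imitates the four-column word / POS / chunk / NE layout of CoNLL 2003.

```python
from typing import Dict, List

# Illustrative lines in the four-column CoNLL-2003 layout:
# word, POS tag, syntactic chunk tag, named-entity tag.
SAMPLE = """\
-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def read_conll_columns(text: str, columntypes: List[str]) -> List[List[Dict[str, str]]]:
    """Hypothetical helper: split blank-line-separated sentences and name each column."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Document boundary marker, not a real token.
            continue
        fields = line.split()
        # Pair each column with its declared type and drop 'ignore' columns.
        row = {name: value for name, value in zip(columntypes, fields) if name != "ignore"}
        current.append(row)
    if current:
        sentences.append(current)
    return sentences


sents = read_conll_columns(SAMPLE, ["words", "pos", "ignore", "ne"])
print(len(sents))
print(sents[0][0])
```

With this column list the third column (the chunk tag) is skipped and the fourth is kept under the name 'ne'.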

Example:

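from nltk.corpus.reader import ConllCorpusReader

# 'CoNLL-2003' is the directory that holds eng.train and eng.testa.
# CoNLL 2003 files have four columns: word, POS tag, chunk tag, NE tag.
# Ignoring the third column and labelling the fourth as 'chunk' makes
# train.iob_sents() return (word, pos, NE tag) triples.
train = ConllCorpusReader('CoNLL-2003', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
test = ConllCorpusReader('CoNLL-2003', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])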
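Once the sentences are available as (word, pos, tag) triples, which is the shape `iob_sents()` returns, they still have to be turned into per-token features and label sequences for python-crfsuite. The following is a rough sketch under stated assumptions, not a definitive recipe: the feature set is deliberately tiny, the helper names (`token_features`, `sent2features`, `sent2labels`, `group_entities`) are made up here, and the sample sentence, including the custom `PRODUCT` label, is invented for illustration. It also shows the usual way multi-word entities are handled: the first token gets a `B-` tag, the following tokens get `I-` tags, and a new entity type is simply a new label string in the training data.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (word, pos, tag)

# Invented sample sentence; PRODUCT stands in for a custom entity type.
sent: List[Token] = [
    ("Peter", "NNP", "B-PER"), ("Blackburn", "NNP", "I-PER"),
    ("visited", "VBD", "O"),
    ("New", "NNP", "B-LOC"), ("York", "NNP", "I-LOC"),
    ("with", "IN", "O"),
    ("Acme", "NNP", "B-PRODUCT"), ("Rocket", "NNP", "I-PRODUCT"),
]


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Small illustrative feature set for token i (an assumption, not a tuned design)."""
    word, pos, _ = sent[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "pos": pos,
    }
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1][0].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        feats["+1:word.lower"] = sent[i + 1][0].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sent2features(sent: List[Token]) -> List[Dict[str, object]]:
    return [token_features(sent, i) for i in range(len(sent))]


def sent2labels(sent: List[Token]) -> List[str]:
    return [tag for _, _, tag in sent]


def group_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tags of one type into (text, type) spans."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type = None
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition("-")
        # A B- tag, or an I- tag of a different type, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and etype != current_type)
        if tag == "O" or starts_new:
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag != "O":
            current_words.append(word)
            current_type = etype
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


X = sent2features(sent)
y = sent2labels(sent)
entities = group_entities([w for w, _, _ in sent], y)
print(entities)
```

Each (X, y) pair is one training sequence; in python-crfsuite such a pair is what you would pass to `Trainer.append(xseq, yseq)` before calling `train()`. `group_entities` is meant for the predicted tags at inference time, to join multi-word entities back into single spans.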
