I am playing around with NLTK right now. I am trying to create various classifiers with NLTK for named entity recognition, so that I can compare their results. Creating n-gram taggers was easy, but I have run into some issues creating a ClassifierBasedTagger for the Naive Bayes and Decision Tree classifiers.
My data is in the CoNLL IOB format. After reading it I convert each token into a tuple that looks like this: ((word, POS-tag), entity).
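For example, one converted sentence looks like this (made-up words):

[(('Germany', 'NNP'), 'B-LOC'), (('won', 'VBD'), 'O'), (('.', '.'), 'O')]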
I have created the following class that creates the classifiers:
from nltk.chunk import ChunkParserI, conlltags2tree
from nltk.tag.sequential import UnigramTagger, BigramTagger, TrigramTagger

class ClassifierChunker(ChunkParserI):
    def __init__(self, trainSents, tagger, **kwargs):
        # n-gram taggers have no feature detector, so only keep one
        # around for classifier-based taggers
        if not isinstance(tagger, (UnigramTagger, BigramTagger, TrigramTagger)):
            self.featureDetector = tagger.feature_detector
        self.tagger = tagger

    def parse(self, sentence):
        chunks = self.tagger.tag(sentence)
        iobTriplets = [(word, pos, entity) for ((word, pos), entity) in chunks]
        return conlltags2tree(iobTriplets)

    def evaluate2(self, testSents):
        return self.evaluate([conlltags2tree([(word, pos, entity)
                                              for (word, pos), entity in iobs])
                              for iobs in testSents])
This is how I call it:
from nltk.classify import NaiveBayesClassifier
from nltk.tag.sequential import ClassifierBasedTagger

# Naive Bayes
naiveBayes = NaiveBayesClassifier.train
naiveBayesTagger = ClassifierBasedTagger(train=completeTaggedSentencesTrain,
                                         feature_detector=features,
                                         classifier_builder=naiveBayes)
nerChunkerNaiveBayes = ClassifierChunker(completeTaggedSentencesTrain, naiveBayesTagger)
evalNaiveBayes = nerChunkerNaiveBayes.evaluate2(completeTaggedSentencesTest)
print(evalNaiveBayes)
My problem is with the line naiveBayes = NaiveBayesClassifier.train. I know I am supposed to pass the train function a labeled featureset, but I am not exactly sure what that means. The documentation says the following:
:param labeled_featuresets: A list of classified featuresets,
    i.e., a list of tuples (featureset, label).
Would the featureset be the word and the label the entity?
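If I understand the docs correctly, such a list would look something like this (made-up values, featuresets shortened):

from nltk.classify import NaiveBayesClassifier

labeled_featuresets = [
    ({'word': 'Germany', 'pos': 'NNP'}, 'B-LOC'),
    ({'word': 'won', 'pos': 'VBD'}, 'O'),
]
classifier = NaiveBayesClassifier.train(labeled_featuresets)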
After encountering this problem I did some research and found nltk-trainer. There the classifier_builder is created inside the args.py file, more specifically in the inner function trainf of make_classifier_builder. However, I have no idea where the variable train_feats comes from; I can't find it being assigned anywhere. Maybe it has something to do with my limited understanding of inner functions.
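For what it's worth, this is my current reading of that code as a minimal sketch (simplified, not nltk-trainer's actual args.py): trainf is an inner function that gets returned, and train_feats is simply its parameter, filled in later by whoever calls the returned builder.

def make_classifier_builder(train_class):
    # simplified sketch of my understanding, not the real nltk-trainer code
    def trainf(train_feats):
        # train_feats only gets a value when the returned function is
        # called later, e.g. classifier_builder(labeled_featuresets)
        # inside ClassifierBasedTagger's training routine
        return train_class.train(train_feats)
    return trainf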
I would really appreciate it if someone could point me in the right direction.
Edit: I have just read in the NLTK 3 Cookbook that the feature_detector function returns a featureset (p. 143). So am I supposed to use that function in some way?
My current feature detector looks like the following and is taken from that book:
def prev_next_pos_iob(tokens, index, history):
    word, pos = tokens[index]

    if index == 0:
        prevword, prevpos, previob = ('<START>',) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]

    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',) * 2
    else:
        nextword, nextpos = tokens[index + 1]

    feats = {
        'word': word,
        'pos': pos,
        'nextword': nextword,
        'nextpos': nextpos,
        'prevword': prevword,
        'prevpos': prevpos,
        'previob': previob
    }
    return feats
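For reference, here is what it returns when I call it by hand on a made-up sentence:

tokens = [('Angela', 'NNP'), ('Merkel', 'NNP'), ('spoke', 'VBD')]
history = ['B-PER']  # IOB tags predicted so far
print(prev_next_pos_iob(tokens, 1, history))
# {'word': 'Merkel', 'pos': 'NNP', 'nextword': 'spoke', 'nextpos': 'VBD',
#  'prevword': 'Angela', 'prevpos': 'NNP', 'previob': 'B-PER'}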