How can I add frequency in nltk naivebayes classifier?


I'm learning the Naive Bayes classifier using NLTK.

In section 1.3, "Document Classification", of the NLTK book (http://www.nltk.org/book/ch06.html), there is a featureset example:

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

So each featureset has the form ({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg').

But I want to change the feature values from 'contains(waste)': False to 'contains(waste)': 2. I think that form ('contains(waste)': 2) describes a document better, because it captures the frequency of each word. Then a featureset would look like ({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg').
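For reference, a frequency-valued version of document_features could be sketched like this (a minimal sketch using collections.Counter for the counting; the function name document_features_freq is illustrative, not from the NLTK book):

```python
from collections import Counter

def document_features_freq(document, word_features):
    # Count every token, then record the raw frequency of each
    # tracked word (0 if absent) instead of a True/False flag.
    counts = Counter(w.lower() for w in document)
    return {'contains({})'.format(word): counts[word] for word in word_features}
```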

But I'm worried that 'contains(waste)': 2 and 'contains(waste)': 1 are treated as totally different features by the Naive Bayes classifier. If so, it can't capture the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.

To the program, the difference between 'contains(waste)': 1 and 'contains(waste)': 2 might be no different from the difference between two unrelated features like 'contains(lot)': 1 and 'contains(waste)': 1.

Can nltk.NaiveBayesClassifier take word frequency into account?

This is the code I used:

def split_and_count_word(data):
    #belongs_to : Main
    #Role : make featuresets from Korean words using konlpy.
    #Parameter : dictionary data (dict of contents, e.g. {'politic': {'parliament': [content, content]}, ...})
    #Return : list of featuresets ([({'word': True, ...}, 'politic')] == featureset + category)

    featuresets = []
    twitter = konlpy.tag.Twitter()  #Korean word splitter

    for big_cat in data:

        for small_cat in data[big_cat]:
            #save the category name needed in the featuresets
            category = str(big_cat[0:3]) + '/' + str(small_cat)
            count = 0
            print(small_cat)

            for one_news in data[big_cat][small_cat]:
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                #one_news is a list in a list, so open it!
                doc = one_news
                #split words using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])  #delete useless sentences
                #keep only the split words longer than one character
                list_of_up_two_word = [word for word in list_of_splited_word if len(word) > 1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                #save
                featuresets.append((dict_of_featuresets, category))

    return featuresets


def make_featuresets(data):
    #belongs_to : split_and_count_word
    #Role : make a featureset
    #Parameter : list list_of_up_two_word (e.g. ['비누', '떨어', '지다'])
    #Return : dictionary {word : True for word in data}

    #PROBLEM :(
    #cannot consider the frequency of a word
    return {word : True for word in data}

def naive_train(featuresets):
    #belongs_to : Main
    #Role : Learning by naive bayes rule
    #Parameter : list featuresets ([({'word': True, ...}, 'pol/pal')])
    #Return : object classifier(nltk naivebayesclassifier object),
    #         list test_set(the featuresets that are randomly selected)

    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)

    return classifier,test_set

featuresets = split_and_count_word(data)
classifier,test_set = naive_train(featuresets)

There is 1 answer below.


NLTK's Naive Bayes classifier treats feature values as logically distinct. Values are not limited to True and False, but they are never treated as quantities: if you have the features f=2 and f=3, they count as two distinct values. The only way to add a notion of quantity to such a model is to sort the counts into "buckets" like f="one" (1), f="few" (2-5), f="several" (6-10), f="many" (11 or more), for example. (Note: if you go this route, there are algorithms for choosing good value ranges for the buckets.) Even then, the model does not "know" that "few" lies between "one" and "several". You'll need a different machine learning tool to handle quantity directly.
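A minimal sketch of the bucketing idea, as a drop-in replacement for your make_featuresets (using collections.Counter for the counting; the bucket boundaries and labels are illustrative, not tuned):

```python
from collections import Counter

def bucket(count):
    # Map a raw count to a coarse, discrete label.
    # The boundaries here are illustrative, not tuned.
    if count == 1:
        return 'one'
    if count <= 5:
        return 'few'
    if count <= 10:
        return 'several'
    return 'many'

def make_featuresets(words):
    # One feature per distinct word, valued by its frequency bucket.
    counts = Counter(words)
    return {'count({})'.format(word): bucket(c) for word, c in counts.items()}
```

The resulting dictionaries can be passed to nltk.NaiveBayesClassifier.train exactly as before; the classifier simply sees 'one', 'few', etc. as distinct labels.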