Text classification using bag of words


I'm facing a machine learning problem. Basically, I'm trying to classify some text into categories (labels), so this is a supervised classification task. I have training data with texts and their corresponding labels. Using a bag-of-words approach, I've managed to transform each text into a list of its most occurring words, as in this image: bag of words

As you can see, the lists have different sizes (because some of the input texts are very short...).

So now, I have a training data frame with these lists of words and their corresponding labels. However, I'm quite confused about how to proceed next to implement my machine learning algorithm. How should I transform these lists so that I can feed them to a classifier?

I've looked at one-hot encoding, but the problems here are:

  • the lists have different sizes, and the words appear in no particular order within each list
  • how to encode one list so that it also accounts for the possible 0s coming from words that only appear in another list

---> example

INPUT:

L1 = ['cat','dog','home','house']

L2 = ['fish','cat','dog']

OUTPUT:

Vector1 = [1,1,1,1,0]
Vector2 = [1,1,0,0,1]

Also, just from this example, I imagine that even if I did that, the resulting vectors could become very large (one dimension per word in the whole vocabulary).
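To make the example concrete, here is a rough, untested sketch of what I have in mind, using scikit-learn's MultiLabelBinarizer to build the shared vocabulary and the fixed-length 0/1 vectors (I'm not sure this is the right tool for it):

from sklearn.preprocessing import MultiLabelBinarizer

L1 = ['cat', 'dog', 'home', 'house']
L2 = ['fish', 'cat', 'dog']

# Build one binary column per word in the combined vocabulary,
# so both lists end up as vectors of the same length
mlb = MultiLabelBinarizer()
X = mlb.fit_transform([L1, L2])

print(mlb.classes_)  # the shared vocabulary, e.g. ['cat' 'dog' 'fish' 'home' 'house']
print(X)             # one fixed-length 0/1 vector per word list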

I hope this makes sense; I'm quite new to machine learning. I'm not even sure the bag-of-words step I've built is really helping, so don't hesitate to tell me if you think I'm going in the wrong direction.

I'm using pandas and scikit-learn, and this is the first time I've been confronted with a text classification problem.

Thanks for your help.


There is 1 answer below.


I would suggest using NLTK, specifically nltk.classify.naivebayes. Take a look at the example here: http://www.nltk.org/book_1ed/ch06.html. You will need to build a feature extractor. I would do something like the following (untested) code:

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Map every word to a boolean feature, e.g. {'cat': True, 'dog': True}
    return {word.lower(): True for word in words}

# One (feature dict, label) pair per training text
train_data = [(word_feats(L1), 'label1'), (word_feats(L2), 'label2')]

classifier = NaiveBayesClassifier.train(train_data)

# The classifier expects the same feature-dict format at prediction time
test_words = ["foo"]

print(classifier.classify(word_feats(test_words)))
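Since you are already using pandas and scikit-learn, you could also stay entirely within scikit-learn and let CountVectorizer build the bag-of-words vectors for you from the raw texts. Here is a rough, untested sketch; the DataFrame name df and the column names 'text' and 'label' are placeholders for whatever your data actually uses:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# df is assumed to be your training DataFrame with 'text' and 'label' columns
texts = df['text']
labels = df['label']

# CountVectorizer builds the vocabulary and the (sparse) word-count vectors,
# so you don't have to pad or align the word lists by hand
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["some new text to classify"]))

The pipeline handles the variable-length problem you describe: every text is projected onto the same vocabulary learned during fit, so all vectors have the same size regardless of how short a text is.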