Machine Learning Text Classification technique

484 Views Asked by At

I have large number(say 3000)key words.These need to be classified into seven fixed categories.Each category is having training data(sample keywords).I need to come with a algorithm, when a new keyword is passed to that,it should predict to which category this key word belongs to.

I am not aware of which text classification technique need to applied for this; do we have any tools that can be used?

4

There are 4 best solutions below

1
On

This comes under linear classification. You can use naive-bayes classifier for this. Most of the ml frameworks will have an implementation for naive-bayes. ex: mahout

0
On

Yes, I would also suggest to use Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms. Random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit, such as Weka, Rapidminer, etc. these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.

0
On

You could use the Word2Vec Word Cosine distance between descriptions of each your category and keywords in the dataset and then simple match each keyword to a category with the closest distance

Alternatively, you could create a training dataset from already matched to category, keywords and use any ML classifier, for example, based on artificial neural networks by using vectors of keywords Cosine distances to each category as an input to your model. But it could require a big quantity of data for training to reach good accuracy. For example, the MNIST dataset contains 70000 of the samples and it allowed me reach 99,62% model's cross validation accuracy with a simple CNN, for another dataset with only 2000 samples I was able reached only about 90% accuracy

0
On

There are many classification algorithms. Your example looks to be a text classification problems - some good classifiers to try out would be SVM and naive bayes. For SVM, liblinear and libshorttext classifiers are good options (and have been used in many industrial applcitions):

liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ libshorttext:https://www.csie.ntu.edu.tw/~cjlin/libshorttext/

They are also included with ML tools such as scikit-learna and WEKA.