How do you find the most discriminant terms in binary document classification?

I want to use feature selection to find the terms in a document that are most useful for a binary classification task.

I've been looking around. This chapter covers mutual information and the chi-squared test:
http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html

MATLAB has a number of functions as well:
http://www.mathworks.com/help/toolbox/stats/brj0qbu.html
Feature Selection in MATLAB
Of the above, relieff and rankfeatures look promising.

I do not know if my data follows a normal distribution. Any thoughts on which technique performs the best? Are there any newer methods you would suggest? The focus is to increase classification accuracy.

Thank you!

There is 1 answer below

Since the answer depends heavily on the nature of your data, I'd suggest trying several options and using a hold-out set to verify each one. The easiest path is probably to experiment in Weka or RapidMiner. Both offer a wide range of feature selection methods, so you will likely come across several others along the way.

Having said that, I have found Mutual Information/Infogain to be useful on a large variety of problems.
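
To make the mutual information idea concrete, here is a minimal sketch in plain Python, assuming each document is a list of tokens and each label is 0 or 1. It follows the term-presence formulation from the IR book chapter linked in the question. The function names and the toy corpus are made up for illustration, and a real experiment would score terms on training data only and check the selected terms against a hold-out set.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class membership.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each row: (joint count, term marginal, class marginal)
    for joint, term_marg, class_marg in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if joint > 0:  # a zero joint count contributes nothing
            mi += (joint / n) * math.log2(n * joint / (term_marg * class_marg))
    return mi


def rank_terms(docs, labels):
    """Return (term, score) pairs sorted by decreasing mutual information."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()  # document frequencies per class
    for tokens, label in zip(docs, labels):
        (pos_df if label == 1 else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        n11, n10 = pos_df[term], neg_df[term]
        scores[term] = mutual_information(n11, n10, n_pos - n11, n_neg - n10)
    # Ties are broken alphabetically so the ordering is deterministic
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpus: label 1 = spam, label 0 = not spam
docs = [
    ["cheap", "pills", "now"],
    ["cheap", "offer", "now"],
    ["meeting", "notes", "now"],
    ["project", "meeting", "agenda"],
]
labels = [1, 1, 0, 0]

ranked = rank_terms(docs, labels)
for term, score in ranked[:3]:
    print(f"{term}: {score:.3f}")
```

On this toy corpus, "cheap" and "meeting" each separate the two classes perfectly, so they share the top score, while "now" appears in both classes and ranks lower. Note that the score rewards terms that indicate either class, which is usually what you want for a binary task.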