I want to use feature selection to find the terms in a document that are most useful for a binary classification task.
I've been looking around:
This mentions Mutual Information and the chi-squared test metric
http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html
MATLAB has a number of functions as well:
http://www.mathworks.com/help/toolbox/stats/brj0qbu.html
Feature Selection in MATLAB
Of the above, relieff and rankfeatures look promising.
I do not know if my data follows a normal distribution. Any thoughts on which technique performs the best? Are there any newer methods you would suggest? The focus is to increase classification accuracy.
Thank you!
Since the answer is highly dependent on the nature of your data, I'd suggest playing with several options, possibly using a hold-out set for verification. The easiest path would probably be to use Weka or RapidMiner for experimenting. Choosing from the plethora of options provided by them, you'll probably get acquainted with several other methods.
Having said that, I have found Mutual Information/Infogain to be useful on a large variety of problems.