How do you find the most discriminant terms in binary document classification?

246 Views Asked by Sau At 29 July 2025 at 17:49

I want to use feature selection to find the terms in a document that are most useful for a binary classification task.

Here is what I've found so far. This chapter covers mutual information and the chi-squared test as feature selection metrics:
http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html

MATLAB has a number of functions as well (Feature Selection in MATLAB):
http://www.mathworks.com/help/toolbox/stats/brj0qbu.html
Of the above, relieff and rankfeatures look promising.

I do not know whether my data follows a normal distribution. Any thoughts on which technique performs best? Are there any newer methods you would suggest? The goal is to increase classification accuracy.

Thank you!


etov On 23 November 2011 at 07:14

Since the answer depends heavily on the nature of your data, I'd suggest trying several feature selection methods and comparing them on a hold-out set. The easiest path is probably to experiment in Weka or RapidMiner. Both offer a plethora of options, and working through them will also acquaint you with several other methods.

Having said that, I have found mutual information (information gain) to be useful on a wide variety of problems.
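
To make that concrete, here is a minimal sketch of mutual information scoring for a binary task, following the presence/absence formulation in the IR book chapter linked in the question. It is plain Python rather than Weka or RapidMiner, and the function name mutual_information, the toy documents, and the labels are all made up for illustration. It assumes each document is already tokenized into a list of terms and that labels are 0 or 1.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Score each term by the mutual information (in bits) between
    its presence in a document and the binary class label (0 or 1)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos

    # Document frequency of each term, overall and in the positive class.
    df = Counter()
    df_pos = Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            if y == 1:
                df_pos[term] += 1

    scores = {}
    for term, n_present in df.items():
        n_absent = n - n_present
        n11 = df_pos[term]        # term present, class 1
        n10 = n_present - n11     # term present, class 0
        n01 = n_pos - n11         # term absent,  class 1
        n00 = n_neg - n10         # term absent,  class 0
        cells = (
            (n11, n_present, n_pos),
            (n10, n_present, n_neg),
            (n01, n_absent, n_pos),
            (n00, n_absent, n_neg),
        )
        mi = 0.0
        for n_tc, n_t, n_c in cells:
            if n_tc > 0:  # empty cells contribute nothing to the sum
                mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
        scores[term] = mi
    return scores


# Toy data, purely illustrative.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "notes", "now"],
    ["project", "meeting", "agenda"],
]
labels = [1, 1, 0, 0]

scores = mutual_information(docs, labels)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms that appear only in one class (here "cheap", "buy", "meeting") end up at the top of ranked, while a term spread evenly across both classes ("now") scores zero. On real data you would keep the top k terms and check the effect of k on a hold-out set, as suggested above.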
