MLR package: generateFilterValuesData chi.squared and information.gain

I am experimenting with the mlr package and would like to get chi-squared and information-gain values.

library(mlr)
library(FSelector)
library(mlbench)  # provides the PimaIndiansDiabetes data set

data(PimaIndiansDiabetes)
# Random 60% of the rows for training
indi <- sample(1:nrow(PimaIndiansDiabetes), 0.6 * nrow(PimaIndiansDiabetes))
train <- PimaIndiansDiabetes[indi, ]

trainTask <- makeClassifTask(data = train, target = "diabetes", positive = "pos")

# Feature importance
im_feat <- generateFilterValuesData(trainTask, method = c("information.gain", "chi.squared"))
plotFilterValues(im_feat)
im_feat

The output shows 0 for both information.gain and chi.squared for the variables triceps and pressure, and I am not sure what to conclude from that. Does it mean I should leave these two variables out when setting up a model (e.g. a random forest)?

When I use

tbl <- table(train$triceps, train$diabetes)
chisq.test(tbl)

it gives me 60.473 for chi-squared. Why is it not 0? What's the difference between chisq.test and the chi.squared method from mlr?

There is 1 best solution below

Regarding your first question: a value of 0 generally indicates that, according to the particular evaluation method you applied, the feature on its own is not predictive of the target variable. That does not necessarily hold for a particular type of model, which may still get something out of the feature (for example in combination with other features), so a score of 0 alone is usually not a good reason to remove it. Apart from that, many models, random forests among them, perform feature selection internally, so this kind of preprocessing is generally unnecessary unless you have so many features that, for example, a random forest takes too long to build.
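To illustrate why a univariate score of 0 need not carry over to a model, here is a minimal sketch on made-up data (not the Pima data; the names d, x1, x2 and y are hypothetical). Each feature alone has a chi-squared statistic of 0 against the target, yet the two together determine it completely. It uses base R's chisq.test rather than the mlr filter, so treat it as an illustration of the idea only.

```r
# Balanced XOR data: each (x1, x2) combination appears 25 times
d <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
d <- d[rep(seq_len(nrow(d)), each = 25), ]
d$y <- factor(xor(d$x1 == 1, d$x2 == 1))

# Univariate chi-squared statistic of each feature against y
chi_x1 <- unname(chisq.test(table(d$x1, d$y), correct = FALSE)$statistic)
chi_x2 <- unname(chisq.test(table(d$x2, d$y), correct = FALSE)$statistic)

# Statistic when both features are considered jointly
chi_both <- unname(chisq.test(table(interaction(d$x1, d$x2), d$y))$statistic)

print(c(x1 = chi_x1, x2 = chi_x2, both = chi_both))  # 0, 0, 100
```

A filter that scores features one at a time would rank x1 and x2 as useless here, while a tree-based model that can split on both would not.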

The chi.squared filter in mlr and chisq.test are based on different implementations; I'm not sure why they don't return the same result.
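One possible contributor to the gap, offered as an assumption rather than something confirmed here: chisq.test on table(train$triceps, train$diabetes) treats every distinct numeric value as its own category, whereas, as far as I know, FSelector's chi.squared (which the mlr filter wraps) discretizes numeric features first and reports Cramér's V, which lies between 0 and 1, instead of the raw statistic. The sketch below uses synthetic data and a simple equal-width cut(), which is not the discretization FSelector uses, just to show how much these two choices can change the number. The helper cramers_v is hypothetical.

```r
set.seed(1)
n <- 400
y <- factor(sample(c("neg", "pos"), n, replace = TRUE))
x <- round(rnorm(n, mean = 20, sd = 10))  # numeric feature generated independently of y

# 1) Raw table: one row per distinct value of x, hence many degrees of freedom
raw <- suppressWarnings(chisq.test(table(x, y)))

# 2) Same test after cutting x into 3 equal-width bins (illustrative choice only)
binned_tbl <- table(cut(x, breaks = 3), y)
binned <- suppressWarnings(chisq.test(binned_tbl))

# 3) Cramer's V: chi-squared rescaled by sample size and table dimensions
cramers_v <- function(tbl) {
  chi <- suppressWarnings(chisq.test(tbl, correct = FALSE))$statistic
  unname(sqrt(chi / (sum(tbl) * (min(dim(tbl)) - 1))))
}
v <- cramers_v(binned_tbl)

print(c(raw_stat = unname(raw$statistic), raw_df = unname(raw$parameter)))
print(c(binned_stat = unname(binned$statistic), binned_df = unname(binned$parameter)))
print(v)
```

If the discretization step ends up putting all values of a feature into a single bin, the table has only one row and the score is 0, which would be consistent with the zeros for triceps and pressure, but that is a guess that would need checking against the FSelector source.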