How to jointly use makeFeatSelWrapper and resample function in mlr


I'm fitting classification models for binary problems with the mlr package in R. For each model, I perform cross-validation with embedded feature selection using the selectFeatures function, and I want to retrieve the mean AUC over the test sets as well as the predictions. To do so, following the advice I received earlier (Get predictions on test sets in MLR), I use the makeFeatSelWrapper function in combination with the resample function. The goal seems to be achieved, but the results are strange. With logistic regression as the classifier, I get an AUC of 0.5, which suggests that no variable was selected. This is unexpected, since I get an AUC of 0.9824432 with the same classifier using the method from the linked question. With a neural network as the classifier, I get an error message:

Error in sum(x) : invalid 'type' (list) of argument

What is wrong?

Here is the sample code:

# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################

install.packages("mlbench")
library(mlbench)
data(BreastCancer)

# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable
p <- mlbench.waveform(1000)

# convert the list into a data frame
dataset <- as.data.frame(p)

# drop the third class to keep two classes
dataset2 = subset(dataset, classes != 3)

# 2. Perform cross validation with embedded feature selection using logistic regression
#######################################################################################  

library(BBmisc)
library(nnet)
library(mlr)

# Choice of data
mCT <- makeClassifTask(data = dataset2, target = "classes")

# Choice of algorithm i.e. logistic regression
mL <- makeLearner("classif.logreg", predict.type = "prob")

# Choice of cross-validation for the outer folds
outer = makeResampleDesc("CV", iters = 10, stratify = TRUE)

# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA, alpha = 0.001)

# Choice of hold-out sampling between training and test within each fold
inner = makeResampleDesc("Holdout", stratify = TRUE)

lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult, measures = list(mlr::auc, mlr::acc, mlr::brier), models = TRUE)
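
# For reference, a minimal sketch of how the mean AUC over the outer test sets and
# the test-set predictions can then be inspected, using standard fields of the
# ResampleResult object r returned above:
r$aggr                        # aggregated measures, e.g. auc.test.mean
r$measures.test               # per-fold performance on the outer test sets
head(as.data.frame(r$pred))   # pooled predictions on the outer test sets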

# 3. Perform cross validation with embedded feature selection using neural network
##################################################################################

library(BBmisc)
library(nnet)
library(mlr)

# Choice of data
mCT <- makeClassifTask(data = dataset2, target = "classes")

# Choice of algorithm i.e. neural network
mL <- makeLearner("classif.nnet", predict.type = "prob")

# Choice of cross-validation for the outer folds
outer = makeResampleDesc("CV", iters = 10, stratify = TRUE)

# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA, alpha = 0.001)

# Choice of hold-out sampling between training and test within each fold
inner = makeResampleDesc("Holdout", stratify = TRUE)

lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult, measures = list(mlr::auc, mlr::acc, mlr::brier), models = TRUE)
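
# To check directly whether any features were selected in each outer fold (rather
# than inferring it from an AUC of 0.5), the extract slot filled by
# extract = getFeatSelResult above can be inspected; a minimal sketch:
lapply(r$extract, function(fs) fs$x)   # selected feature names per outer fold
sapply(r$extract, function(fs) fs$y)   # inner-resampling performance of each selected feature set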

Best answer:

If you run the logistic regression part of your code a couple of times, you should also get the Error in sum(x) : invalid 'type' (list) of argument error. However, I find it strange that fixing a particular seed (e.g., set.seed(1)) before resampling does not make the error reproducibly appear or disappear.

The error occurs in internal mlr code that prints the output of the feature selection to the console. A very simple workaround is to avoid printing that output by passing show.info = FALSE to makeFeatSelWrapper (see the code below). While this removes the error, whatever caused it may have other consequences, although it is possible the error only affects the printing code.
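
If you prefer not to change the wrapper call, a global alternative is mlr's configureMlr(show.info = FALSE), which disables this kind of informational console output for the whole session; whether it sidesteps the error in exactly the same way has not been verified here, so treat the following as a sketch:

library(mlr)
# turn off mlr's informational console output for the whole session
configureMlr(show.info = FALSE)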

When running your code, I only get AUCs above 0.90. Below is your logistic regression code, slightly reorganized and with the workaround applied. I have also added a droplevels() call on dataset2 to remove the now-empty level 3 from the target factor, though this is not related to the workaround.

library(mlbench)
library(mlr)
data(BreastCancer)

p <- mlbench.waveform(1000)
dataset <- as.data.frame(p)
dataset2 = subset(dataset, classes != 3)
dataset2 <- droplevels(dataset2)

mCT <- makeClassifTask(data = dataset2, target = "classes")
ctrl = makeFeatSelControlSequential(method = "sffs", maxit = NA, alpha = 0.001)
mL <- makeLearner("classif.logreg", predict.type = "prob")
inner = makeResampleDesc("Holdout", stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl, show.info = FALSE)
# uncomment this for the error to appear again; you might need to run the code a couple of times to see it
# lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
outer = makeResampleDesc("CV", iters = 10, stratify = TRUE)
r = resample(lrn, mCT, outer, extract = getFeatSelResult, measures = list(mlr::auc, mlr::acc, mlr::brier), models = TRUE)
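
As a quick sanity check on the droplevels() step: mlbench.waveform() encodes the classes as a factor with levels 1, 2 and 3, and subset() alone leaves the now-empty level 3 in place, which droplevels() then removes. A minimal check, assuming the code above has been run:

levels(subset(dataset, classes != 3)$classes)   # still "1" "2" "3" after subset()
levels(dataset2$classes)                        # "1" "2" after droplevels()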

Edit: I've reported an issue and created a pull request with a fix.