I have a large training data set, data.trn, of 260,000+ observations on 50+ variables. The dependent variable loan_status has two classes, "paid off" and "default", with an imbalance of about 5:1. I want to use the information.gain command from the FSelector package to reduce the features to the most meaningful ones. However, I am afraid that this filtering method, left as it is, will be biased towards the majority class, leading to a misleading assessment of the features.

To avoid this, I figured an sapply-based procedure could mitigate the issue by averaging several information gain tests over 10 different balanced cross-validation folds. Each fold would contain all the minority class observations, paired each time with a different, equally sized sample of majority class observations. The problem is that I am a beginner in R and not yet adept at building such structures on my own, so I hope someone here can kindly show me how it can be done, because I still cannot get my head around the task. So far I have only run the basic information gain test and do not know how to build the balanced CV version of it:

info_gain <- FSelector::information.gain(loan_status ~ ., data.trn)


I would recommend one of these two strategies:

  1. Sample a subset of the majority class, down to a size more in line with the smaller classes. Repeat this multiple times, each time recording the important features. Then see whether some features are consistently among the most important ones across all subsets.

  2. Resample the smaller classes to synthetically inflate their sample sizes. Essentially, estimate their covariance, draw random samples from that, and fit the model on this data (removing the synthetic samples before estimating performance). So in a sense you're just borrowing synthetic data to stabilize the model-fitting procedure.
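
A minimal sketch of strategy 2 for purely numeric features, on made-up data: estimate the minority class's mean and covariance with `colMeans()` and `cov()`, then draw synthetic rows from a multivariate normal with `MASS::mvrnorm()` (`MASS` ships with R; the sizes here are arbitrary):

```r
library(MASS)  ## for mvrnorm()

set.seed(1)

## stand-in minority-class data: 20 rows, 2 numeric features
minority <- matrix(rnorm(40), ncol = 2,
                   dimnames = list(NULL, c("x1", "x2")))

## draw 100 synthetic rows from a normal with the estimated moments
synthetic <- mvrnorm(100, mu = colMeans(minority), Sigma = cov(minority))

## the inflated minority class: 20 real + 100 synthetic rows
inflated <- rbind(minority, synthetic)
nrow(inflated)  ## 120
```

Note this assumes the features are roughly multivariate normal; for mixed or categorical features a different generator would be needed.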

The first one is perhaps the less complicated of the two.

Here's a simple demonstration on approach 1:


library(ggplot2) ## provides the `mpg` dataset

## Using the `mpg` dataset, pretending the 'drv' column is of particular interest to us.
##
## 'drv' is a column with three levels, that are not very balanced:
##
## table( mpg$drv )
##   4   f   r
## 103 106  25

## Let's sub-sample 25 of each class (the size of the smallest one, per the table above)
n.per.class  <- 25

## let's do the sampling 10 times
n.times <- 10

library(foreach) ## for parallel work
library(doMC)
registerDoMC()

unique.classes <- unique( mpg$drv ) ## or just use levels( mpg$drv ) if you have a factor

variable.importances <- foreach( i=1:n.times ) %dopar% {

    j <- sapply(
        unique.classes,
        function(cl.name) {
            sample( which( mpg$drv == cl.name ), size=n.per.class )
        },
        simplify=FALSE
    )

    ## 'j' is now a named list; we can easily turn it into a vector with unlist:
    sub.data <- mpg[ unlist(j), ]

    ## table( sub.data$drv )
    ##  4  f  r
    ## 25 25 25
    ##
    ## 25 of each!


    ## Run the information gain test from the question on the balanced
    ## subset. FSelector returns a data.frame with one row per predictor
    ## and an 'attr_importance' column; return it as a named vector:
    ig <- FSelector::information.gain( drv ~ ., sub.data )
    setNames( ig$attr_importance, rownames(ig) )

}

## variable.importances is now a list with the result of each
## iteration. If each element is a named numeric vector, for example,
## Reduce() can collect them into a matrix with one row per iteration
## (the names carry over as column names; otherwise set
## colnames( matrix.of.variable.importances ) yourself):

matrix.of.variable.importances <- Reduce( rbind, variable.importances )
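
To get the averaged ranking the question asks for, take colMeans() of that matrix. Here is a self-contained toy version of the whole procedure in base R — the made-up data set and the hand-rolled info_gain() below are only stand-ins for illustration (in practice you would call FSelector::information.gain() as in the question):

```r
## Hand-rolled information gain: entropy of the class labels minus the
## weighted mean entropy within each level of a discrete feature.
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

info_gain <- function(feature, class) {
  cond <- sum(sapply(split(class, feature), function(cl) {
    length(cl) / length(class) * entropy(cl)
  }))
  entropy(class) - cond
}

## made-up imbalanced data, 5:1 as in the question
set.seed(1)
toy <- data.frame(
  loan_status = rep(c("paid off", "default"), c(500, 100)),
  grade       = sample(c("A", "B", "C"), 600, replace = TRUE),
  term        = sample(c("36m", "60m"),  600, replace = TRUE)
)

n.minority <- sum(toy$loan_status == "default")
n.times    <- 10

gains <- t(sapply(1:n.times, function(i) {
  ## all minority rows plus an equal-sized sample of majority rows
  idx <- c(which(toy$loan_status == "default"),
           sample(which(toy$loan_status == "paid off"), n.minority))
  sub <- toy[idx, ]
  sapply(sub[, c("grade", "term")], info_gain, class = sub$loan_status)
}))

## mean information gain per feature across the 10 balanced folds
colMeans(gains)
```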

If you are interested in approach 2, I would recommend looking into the caret package, which makes this kind of resampling easy, though I don't know whether it has support for your particular method.
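
For instance, caret's upSample() resamples the smaller classes with replacement in one call (assuming caret is installed; the toy data below is made up, and note this yields duplicated rows rather than truly synthetic ones):

```r
library(caret)

set.seed(1)
toy <- data.frame(
  loan_status = factor(rep(c("paid off", "default"), c(50, 10))),
  x           = rnorm(60)
)

## upSample() resamples every class up to the size of the largest one;
## the class column comes back named "Class" by default
balanced <- upSample(x = toy["x"], y = toy$loan_status)
table(balanced$Class)  ## 50 of each
```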