I have a large training data set, data.trn, of 260,000+ observations on 50+ variables. The dependent variable, loan_status, has two classes, "paid off" and "default", with an imbalance of about 5:1. I want to use the information.gain command from the FSelector package to reduce the features to the most meaningful ones. However, I am afraid that this filter method, applied as is, will be biased towards the majority class and give a misleading assessment of the features.
To avoid this, I figured an sapply-based procedure could mitigate the issue by averaging several information gain tests over 10 different balanced cross-validation folds. I imagined each fold could be built by taking all the minority-class observations and pairing them with a different, equally sized set of observations from the majority class. The problem is that I am a beginner in R and not yet adept at creating such structures on my own, so I hope someone here can show me how it can be done. So far I have only run the basic information gain test and do not know how to make the desired balanced CV version of it:
info_gain <- FSelector::information.gain(loan_status ~ ., data.trn)
I would recommend one of these 2 strategies:
1. Sample a subset of the majority class, down to a number that is more in line with the minority class. Repeat this multiple times, and each time record the important features. Then see whether there are features that are consistently among the most important ones across all subsets.
2. Resample the minority class to get synthetically inflated sample numbers. Essentially, estimate its covariance, draw random samples from that, and fit the model on this data (removing the synthetic samples before estimating performance). So in a sense you're just borrowing synthetic data to stabilize the model fitting procedure.
The first one is perhaps the less complicated.
Here's a simple demonstration of approach 1:
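A minimal sketch of that idea, assuming FSelector (and the Java setup it depends on) is installed. The simulated data frame toy and its columns x1 to x3 are made-up stand-ins for data.trn, and balanced_info_gain is a hypothetical helper name, not part of any package. Each repetition keeps every minority-class row, draws an equally sized random subset of the majority class, and runs information.gain on that balanced subset; the scores are then averaged per feature.

```r
library(FSelector)

set.seed(42)

# Simulated stand-in for data.trn (hypothetical columns), roughly 5:1 imbalance
n <- 1200
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$loan_status <- factor(ifelse(toy$x1 + rnorm(n) > 1.4, "default", "paid off"))

# Hypothetical helper: mean information gain over n_rep balanced subsamples
balanced_info_gain <- function(data, target = "loan_status", n_rep = 10) {
  counts   <- table(data[[target]])
  minority <- names(counts)[which.min(counts)]
  min_idx  <- which(data[[target]] == minority)
  maj_idx  <- which(data[[target]] != minority)
  form     <- as.formula(paste(target, "~ ."))

  gains <- sapply(seq_len(n_rep), function(i) {
    # all minority rows plus an equally sized random draw from the majority
    idx <- c(min_idx, sample(maj_idx, length(min_idx)))
    ig  <- FSelector::information.gain(form, data[idx, ])
    setNames(ig$attr_importance, rownames(ig))
  })

  # rows of `gains` are features, columns are repetitions
  sort(rowMeans(gains), decreasing = TRUE)
}

res <- balanced_info_gain(toy)
print(res)
```

On the real data the call would be balanced_info_gain(data.trn). Note that the majority subsets here are independent random draws, so they can overlap between repetitions; with a 5:1 imbalance only about five fully disjoint majority chunks exist, so ten strictly non-overlapping folds are not possible anyway.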
If you are interested in approach 2, I would recommend looking into the caret package, which does this easily, though I don't know if it has support for your particular method.
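For a feel of what approach 2 does under the hood, here is a rough sketch that does not use caret: it estimates the mean and covariance of the minority class and draws synthetic rows from a multivariate normal with MASS::mvrnorm. The data frame minority and the helper synthesize are hypothetical, and the sketch assumes all features are numeric and roughly Gaussian, which will not hold for every column in a real loan data set.

```r
library(MASS)  # recommended package shipped with R; provides mvrnorm

set.seed(1)

# Hypothetical numeric features of the minority class
minority <- data.frame(x1 = rnorm(50, mean = 2), x2 = rnorm(50, mean = -1))

# Hypothetical helper: draw n_new (> 1) synthetic rows from a multivariate
# normal fitted to the columns of x
synthesize <- function(x, n_new) {
  x <- as.matrix(x)
  draws <- MASS::mvrnorm(n_new, mu = colMeans(x), Sigma = cov(x))
  as.data.frame(draws)
}

synthetic <- synthesize(minority, 200)

# Real minority rows plus synthetic ones, for model fitting only;
# drop the synthetic rows again before estimating performance
augmented <- rbind(minority, synthetic)
```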