This is my first data mining project. I am using SAS Enterprise miner to train and test a classifier.
I have 3 files at my disposal,
- Training file : 85 input variables and 1 target variable, with 5800+ observations
- Prediction file : 85 input variables with 4000 observations
- Verification file : 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us if we are doing a good job or not.
My problem is that the dataset is unbalanced (95% of 0s and 5% of 1s for the target variable in the training file). So naturally, I tried to re-sample the model using the "sampling node" as described in the following link
Here are the 2 approaches I used, they give slightly different results. But here is the general unsatisfactory result I am getting:
- Without resampling : The model predicts less than ten solicited individuals (target variable = 1) over 4000 observations
- With the resampling : The model predicts about 1500 solicited individuals over 4000 observations.
I am looking for 100 to 200 solicited individuals to have a model that would be considered acceptable.
Why do you think our predictions are way off this way, and how can we remedy to this situation?
Here is a screen shot of both models
There are some Technics to deal with unbalanced data. One that I remember many years ago was this approach: