Questions about handling imbalanced dataset classification


I am trying to predict the number of members who will discontinue their membership. The whole dataset is about 12 million rows with about 40 columns. A member's status can be “Continue”, “Voluntary Discontinue” or “Involuntary Discontinue”. The dataset is highly imbalanced: 98% of members are “Continue”, and about 1% each are “Voluntary Discontinue” and “Involuntary Discontinue”. To reduce dimensionality, I ran a correlation analysis and selected only the 15 features with the highest correlation for modelling.

Below are the problems I am facing:

  1. My colleague used multinomial regression. However, he did not apply a threshold to convert probabilities into class labels. Instead, he summed the predicted probabilities over all members to estimate the number of members who will Voluntary Discontinue or Involuntary Discontinue. I am not sure about this approach because I don’t quite understand what the sum of individual probabilities means compared with using a threshold. Is this approach correct, given that we are interested in the total number of people? Also, how do we measure model performance with this approach?

  2. I am treating this as a classification problem. Since the dataset is imbalanced and we are interested in the people who will discontinue, I have tried SMOTE and undersampling. However, with logistic regression, a decision tree and a neural network, precision is still very low for the classes “Voluntary Discontinue” and “Involuntary Discontinue”. Are there other ways to increase the precision for the minority classes?

  3. I tried to run a random forest. However, it failed due to memory limitations. Any suggestions for handling a dataset this large?
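
To make item 1 concrete, here is a minimal sketch of how I understand the summed-probability approach, next to a hard class assignment. It is not my colleague's actual code: the probabilities are synthetic, the function names are made up for illustration, and the spread estimate assumes members behave independently, which may not hold in practice.

```python
import math
import random

CLASSES = ["Continue", "Voluntary Discontinue", "Involuntary Discontinue"]

def expected_counts(prob_rows):
    """Sum each class's probability over all members (expected count per class)."""
    return [sum(row[k] for row in prob_rows) for k in range(len(CLASSES))]

def count_std(prob_rows, k):
    """Std. dev. of the class-k count, ASSUMING independent members.

    A sum of independent Bernoulli(p_i) draws has variance sum(p_i * (1 - p_i)).
    """
    return math.sqrt(sum(row[k] * (1.0 - row[k]) for row in prob_rows))

def argmax_counts(prob_rows):
    """Count members per class after assigning each to its most probable class."""
    counts = [0] * len(CLASSES)
    for row in prob_rows:
        counts[row.index(max(row))] += 1
    return counts

# Synthetic stand-in for model output (hypothetical numbers, not real data):
# every member has a small probability of each kind of discontinuation.
rng = random.Random(0)
prob_rows = []
for _ in range(10_000):
    p_vol = rng.uniform(0.0, 0.02)
    p_invol = rng.uniform(0.0, 0.02)
    prob_rows.append([1.0 - p_vol - p_invol, p_vol, p_invol])

expected = expected_counts(prob_rows)
hard = argmax_counts(prob_rows)
for k, name in enumerate(CLASSES):
    spread = count_std(prob_rows, k)
    print(f"{name}: expected={expected[k]:.1f} (+/- {spread:.1f}), argmax={hard[k]}")
```

In this toy setup no member is ever assigned to a minority class by the hard rule, while the summed probabilities still give a non-zero expected count for each minority class. If that reading is right, one possible check would be to compare the summed probabilities with the actual counts on held-out data, but I am not sure whether that is an adequate performance measure.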
