I currently have an imbalanced dataset, as shown in the diagram below:
I then set the 'is_unbalance' parameter to True when training the LightGBM model. The diagrams below show how I use this parameter.
Example of using the scikit-learn API:
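Roughly, the usage looks like this (a simplified sketch with placeholder data, not my exact code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

# Placeholder imbalanced data standing in for the dataset shown above
rng = np.random.default_rng(42)
X = rng.random((10000, 20))
y = (rng.random(10000) < 0.05).astype(int)  # roughly 5% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# is_unbalance=True asks LightGBM to reweight the two classes automatically
clf = LGBMClassifier(is_unbalance=True, random_state=42)
clf.fit(X_train, y_train)
```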
My questions are:

- Is the way I apply the 'is_unbalance' parameter correct?
- How do I use 'scale_pos_weight' instead of 'is_unbalance'?
- Or should I balance the dataset using SMOTE techniques such as SMOTE-ENN or SMOTE+Tomek?
Thanks!
This answer might be good for your question about is_unbalance: Use of 'is_unbalance' parameter in Lightgbm
You're not necessarily using is_unbalance incorrectly, but scale_pos_weight will give you finer control over the weights of the minority and majority classes.
There is a good explanation of how to use scale_pos_weight at this link: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets
Basically, scale_pos_weight lets you set a configurable weight for the minority (positive) class; a common starting point is the ratio of negative to positive samples in the training data. A good discussion about this topic is here: https://discuss.xgboost.ai/t/how-does-scale-pos-weight-affect-probabilities/1790/4.
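As a rough sketch (assuming the scikit-learn API and a binary 0/1 target, with placeholder data in place of your own training split), that could look like this:

```python
import numpy as np
from lightgbm import LGBMClassifier

# Placeholder imbalanced training data; substitute your own X_train / y_train
rng = np.random.default_rng(42)
X_train = rng.random((1000, 10))
y_train = (rng.random(1000) < 0.05).astype(int)  # roughly 5% positives

# Usual starting point from the links above: #negatives / #positives
n_pos = y_train.sum()
n_neg = len(y_train) - n_pos

clf = LGBMClassifier(
    scale_pos_weight=n_neg / n_pos,  # do not combine with is_unbalance=True
    random_state=42,
)
clf.fit(X_train, y_train)
```

From there you can also treat scale_pos_weight as a hyperparameter to tune rather than fixing it at that ratio.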
About SMOTE, I can't give you a theoretical argument, but in my experience, every time I tried to use SMOTE to improve a model's performance, it failed.
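If you still want to try it, here is a minimal sketch using imbalanced-learn's SMOTEENN (assuming you'd use that library); the important part is to resample only the training split, never the test set:

```python
import numpy as np
from imblearn.combine import SMOTEENN
from lightgbm import LGBMClassifier

# Placeholder imbalanced data; substitute your own training split
rng = np.random.default_rng(0)
X_train = rng.random((1000, 10))
y_train = (rng.random(1000) < 0.05).astype(int)

# Resample the training data only; evaluate on the untouched test set
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

clf = LGBMClassifier(random_state=42)
clf.fit(X_res, y_res)
```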
A better approach might be to decide carefully which metric to optimize. Better metrics for unbalanced problems are the F1-score and recall; in general, AUC and accuracy are a bad choice. The micro-, macro- and weighted-averaged versions of these metrics are also good objectives when searching for hyperparameters.
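For example (a sketch with placeholder data, assuming scikit-learn's built-in scorers), you could use macro-averaged F1 as the search objective and report per-class metrics on the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score, classification_report
from lightgbm import LGBMClassifier

# Placeholder imbalanced data; substitute your own
rng = np.random.default_rng(0)
X = rng.random((2000, 10))
y = (rng.random(2000) < 0.05).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Optimize an imbalance-aware metric during the hyperparameter search
search = RandomizedSearchCV(
    LGBMClassifier(random_state=42),
    param_distributions={"num_leaves": [15, 31, 63],
                         "scale_pos_weight": [1, 5, 10, 20]},
    scoring="f1_macro",  # or "f1_weighted" / "recall"
    n_iter=8,
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

# Report per-class precision/recall/F1 instead of plain accuracy
y_pred = search.best_estimator_.predict(X_test)
print(f1_score(y_test, y_pred, average="weighted"))
print(classification_report(y_test, y_pred))
```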
Machine Learning Mastery provides a good explanation and example code for micro, macro and weighted metrics: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/