I have an imbalanced dataset, as shown in the diagram below: Distribution of target feature

I then set the 'is_unbalance' parameter to True when training the LightGBM model. The diagrams below show how I use this parameter.

Example of using the native API: Example of using native API

Example of using the scikit-learn API: Example of using scikit-learn API

My questions are:

  1. Am I using the is_unbalance parameter correctly?
  2. How do I use scale_pos_weight instead of is_unbalance?
  3. Or should I balance the dataset with a SMOTE-based technique such as SMOTE-ENN or SMOTE-Tomek?

Thanks!

There is 1 best solution below

This answer might help with your question about is_unbalance: Use of 'is_unbalance' parameter in Lightgbm

You're not necessarily using is_unbalance incorrectly, but scale_pos_weight gives you finer control over the weights of the minority and majority classes.

This link has a good explanation of how to use scale_pos_weight: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets

Basically, scale_pos_weight lets you set an explicit, configurable weight for the positive class of the target variable, which is usually the minority class. A good discussion of how this affects the predicted probabilities is here: https://discuss.xgboost.ai/t/how-does-scale-pos-weight-affect-probabilities/1790/4.
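To make question 2 concrete, here is a minimal sketch. A common starting value is (number of negative samples) / (number of positive samples), which is the heuristic suggested in the XGBoost documentation; treat it as a starting point to tune, not a rule. The helper names (positive_class_weight, build_params) and the toy labels are made up for illustration, and the LightGBM calls are left as comments because they need your own X_train. Also note that the LightGBM documentation says scale_pos_weight and is_unbalance should not be set at the same time.

```python
from collections import Counter


def positive_class_weight(y):
    """Return n_negative / n_positive for binary labels encoded as 0/1."""
    counts = Counter(int(label) for label in y)
    n_neg, n_pos = counts[0], counts[1]
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos


def build_params(y):
    """Build LightGBM parameters that use scale_pos_weight (hypothetical helper)."""
    return {
        "objective": "binary",
        # Explicit weight on the positive class; do NOT also set is_unbalance.
        "scale_pos_weight": positive_class_weight(y),
    }


# Toy labels: 90 negatives, 10 positives (assumed data, for illustration only).
y_train = [0] * 90 + [1] * 10
params = build_params(y_train)
print(params)

# With your own feature matrix X_train, the parameter is passed like this:
#
#   import lightgbm as lgb
#   # Native API
#   booster = lgb.train(params, lgb.Dataset(X_train, label=y_train))
#   # scikit-learn API
#   clf = lgb.LGBMClassifier(scale_pos_weight=params["scale_pos_weight"])
#   clf.fit(X_train, y_train)
```

From there you can search over values around the ratio (for example half and double) and compare them on a validation set with the metric you actually care about.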

As for SMOTE, I can't offer a theoretical argument, but in my experience every time I tried to use it to improve a model's performance, it failed.

A better approach might be to decide carefully which metric to optimize. Better metrics for imbalanced problems are F1-score and recall. In general, AUC and accuracy will be poor choices. The micro-averaged and weighted metrics are also good objectives to use when searching for hyperparameters.
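To illustrate why the choice of metric matters, here is a small pure-Python sketch that scores a model which always predicts the majority class. The data and the function names are invented for the example; in practice scikit-learn's f1_score with average="macro", "micro" or "weighted" computes the same averages.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def class_counts(y_true, y_pred, label):
    """Count true positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_report(y_true, y_pred):
    """Per-class F1 plus macro, weighted and micro averages."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {lab: class_counts(y_true, y_pred, lab) for lab in labels}
    per_class = {lab: f1_from_counts(*c) for lab, c in counts.items()}
    support = {lab: sum(1 for t in y_true if t == lab) for lab in labels}
    total_tp = sum(c[0] for c in counts.values())
    total_fp = sum(c[1] for c in counts.values())
    total_fn = sum(c[2] for c in counts.values())
    return {
        "per_class": per_class,
        # Macro: unweighted mean, so the minority class counts as much as the majority.
        "macro": sum(per_class.values()) / len(labels),
        # Weighted: mean weighted by class frequency in y_true.
        "weighted": sum(per_class[lab] * support[lab] for lab in labels) / len(y_true),
        # Micro: F1 computed from the pooled counts of all classes.
        "micro": f1_from_counts(total_tp, total_fp, total_fn),
    }


# Toy data: 90 negatives, 10 positives, and a model that always predicts 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = f1_report(y_true, y_pred)
print("accuracy:", accuracy)
print("F1 report:", report)
```

On this toy data accuracy looks fine while the F1 and recall of the minority class are zero. One caveat worth checking on your own data: for single-label problems the micro-averaged F1 equals accuracy, and the weighted average is dominated by the majority class, so macro F1 or the minority-class F1 tends to be the more sensitive signal.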

Machine Learning Mastery provides a good explanation, with implementation code, of micro, macro and weighted metrics: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/