I'm having a lot of trouble implementing LightGBM on an extremely imbalanced dataset (using R).
Indeed, I'm dealing with a binary classification problem, and the distribution of the target variable is about 1:800
(approximately 110,000 observations in class 0 and 140 in class 1).
I have nearly 300 variables (which are summaries of dynamic variables over 12 months) and a couple of categorical variables.
In everything that follows, my evaluation metric is the F1-score, and the training metric I use is the binary log-loss.
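To make that concrete, this is roughly how I score a set of predictions (just a sketch: the 0.5 threshold and the object names are placeholders):

```r
# F1-score from predicted probabilities (placeholder threshold of 0.5)
f1_score <- function(y_true, prob, threshold = 0.5) {
  pred      <- as.integer(prob > threshold)
  tp        <- sum(pred == 1 & y_true == 1)
  fp        <- sum(pred == 1 & y_true == 0)
  fn        <- sum(pred == 0 & y_true == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}
```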
I have tried 2 approaches: one with resampling techniques, one without.
Method of the 1st approach
First, I decided to label-encode my categorical variables (because ADASYN does not accept categorical variables as input).
I have tried different combinations of SMOTE/ADASYN and NearMiss/RandomUnderSampler to resample my training set.
I standardize my numerical variables
I train my model on the training set and predict on my validation set (without specifying the scale_pos_weight parameter for the positive class in lgb.train).
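Roughly, that pipeline looks like this (only a sketch: smotefamily::SMOTE stands in for the resamplers listed above, the names train / valid / target / cat_var_* are placeholders, and the hyperparameters are arbitrary):

```r
library(lightgbm)
library(smotefamily)

# Label-encode the categorical variables (placeholder column names),
# using the levels observed on the training set
cat_cols <- c("cat_var_1", "cat_var_2")
for (col in cat_cols) {
  levs <- levels(as.factor(train[[col]]))
  train[[col]] <- as.integer(factor(train[[col]], levels = levs))
  valid[[col]] <- as.integer(factor(valid[[col]], levels = levs))
}

# Resample the training set (SMOTE shown here as one example)
X_train <- train[, setdiff(names(train), "target")]
res     <- SMOTE(X_train, train$target, K = 5)
X_res   <- res$data[, setdiff(names(res$data), "class")]
y_res   <- as.numeric(res$data$class)

# Standardize (applied to all columns here for brevity)
X_res   <- scale(X_res)
X_valid <- scale(valid[, setdiff(names(valid), "target")])

# Train without scale_pos_weight
dtrain <- lgb.Dataset(data = as.matrix(X_res), label = y_res)
params <- list(objective = "binary", metric = "binary_logloss", learning_rate = 0.05)
model  <- lgb.train(params = params, data = dtrain, nrounds = 500)

prob_valid <- predict(model, as.matrix(X_valid))
```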
I obtain some very bad results:
On the train set: F1-score = 0.5
On the test set: F1-score = 0.04
Method of the 2nd approach
Same as the first one, but without applying resampling techniques to my training set.
I only set scale_pos_weight = count(negative) / count(positive), which is roughly 800 in my case.
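In code, the only real change compared to the first approach is the parameter list (again a sketch: X_train_scaled / y_train / X_valid are placeholders, and 800 is the rounded negative/positive ratio from my data):

```r
# Same preparation as before, but the training set is NOT resampled
dtrain <- lgb.Dataset(data = as.matrix(X_train_scaled), label = y_train)

params <- list(
  objective        = "binary",
  metric           = "binary_logloss",
  learning_rate    = 0.05,
  # weight the rare positive class: count(negative) / count(positive) ~ 110,000 / 140
  scale_pos_weight = 800
)

model      <- lgb.train(params = params, data = dtrain, nrounds = 500)
prob_valid <- predict(model, as.matrix(X_valid))
```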
I have tried to tune the parameters, but I feel like I'm missing something, since the F1-score on the validation set is still around 0.02.
Do you have any ideas on how I could improve my model?
Thanks a lot in advance for your help!