LightGBM fails to predict on validation set (R)


I'm having big trouble implementing LightGBM on an extremely imbalanced dataset (using R).

Indeed, I'm dealing with a binary classification problem, and the distribution of the target variable is about 1:800.

(Approx.: class 0: 110,000; class 1: 140)
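To see where the ~1:800 figure comes from (this same ratio is what gets fed to scale_pos_weight later), the computation is just a division; the counts below are the approximate ones from the question:

```python
# Approximate class counts from the question.
n_neg = 110_000  # class 0 (majority)
n_pos = 140      # class 1 (minority)

imbalance_ratio = n_neg / n_pos
print(round(imbalance_ratio))  # 786, i.e. roughly the stated 1:800
```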

I have nearly 300 variables (which are summaries of dynamic variables over 12 months) and a couple of categorical variables.

In all that follows, my evaluation metric is the F1-score, and the metric I optimize during training is the binary log-loss.
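F1 is a sensible choice here, since plain accuracy is meaningless at this imbalance. A quick pure-Python illustration (the confusion counts are hypothetical, chosen to match the question's class sizes, for a degenerate classifier that always predicts the majority class):

```python
# Hypothetical confusion counts: a model that always predicts class 0
# on a validation set with 110,000 negatives and 140 positives.
tp, fp, fn, tn = 0, 0, 140, 110_000

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
print(accuracy, f1)  # accuracy is ~0.999 while F1 is 0.0
```

This is why a high accuracy would say nothing here, while F1 directly penalizes missing the rare positive class.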

I have tried 2 approaches: one with resampling techniques, one without.

Method of 1st approach

  1. First, I have decided to label-encode my categorical variables (because ADASYN does not take categorical variables as input)

  2. I have tried different combinations of SMOTE/ADASYN and NearMiss/RandomUnderSampler to resample my training set

  3. I standardize my numerical variables

  4. I train my model on the training set and predict on my validation set (without specifying the scale_pos_weight parameter for the positive class in lgb.train)

  5. I obtain some very bad results:
    On the training set: F1-score = 0.5
    On the validation set: F1-score = 0.04
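Step 1 above (label encoding so the resamplers can accept the data) can be sketched without any library; the category values below are made up, not from the question's data:

```python
# Minimal label encoding, mimicking what sklearn's LabelEncoder does:
# map each distinct category (in sorted order) to an integer code.
def label_encode(values):
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["red", "blue", "red", "green"])
print(codes)  # [2, 0, 2, 1]
```

Worth noting: the integer encoding is mainly needed for the resamplers. LightGBM itself can consume integer-coded categorical features natively via its categorical_feature parameter, which usually works better than treating the codes as ordinary numerics.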

Method of 2nd approach

Same as the first one, but without using resampling techniques on my training set.
I only set scale_pos_weight = count(negative)/count(positive) ≈ 800 in my case.
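In LightGBM's parameter set (Python API shown; the R package accepts the same parameter names), that choice looks roughly like the sketch below. Only scale_pos_weight comes from the question; the other entries are illustrative assumptions:

```python
# Hypothetical lgb.train parameter setup for the second approach.
n_neg, n_pos = 110_000, 140  # approximate class counts from the question

params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "scale_pos_weight": n_neg / n_pos,  # ~786; the question rounds this to 800
}
# booster = lgb.train(params, dtrain, valid_sets=[dvalid])  # dtrain/dvalid: lgb.Dataset
```

Note that scale_pos_weight and is_unbalance are alternatives in LightGBM; setting both at once is not supported.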

I have tried to tune the parameters, but I feel like I'm missing something, since the F1-score on the validation set is still around 0.02.
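One thing worth checking before more parameter tuning: with a positive-class weight near 800, the predicted probabilities are shifted upward, so the default 0.5 cutoff is rarely the F1-optimal one. A small sketch of sweeping the decision threshold on validation scores (the scores and labels here are synthetic stand-ins for real model output):

```python
def f1_at(threshold, scores, labels):
    """F1 when predicting class 1 for every score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(scores, labels):
    """Pick the candidate threshold maximizing F1 on held-out data."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))

scores = [0.10, 0.20, 0.80, 0.90]  # synthetic validation probabilities
labels = [0, 0, 1, 1]
t = best_threshold(scores, labels)
print(t, f1_at(t, scores, labels))  # 0.8 1.0
```

The threshold should be chosen on a validation fold separate from the one used to report the final F1; otherwise the cutoff itself overfits.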

Do you have any idea on how I could improve my model?

Thanks a lot in advance for your help!
