Dealing with highly imbalanced datasets using TensorFlow Dataset and Keras Tuner


I have a highly imbalanced dataset (3% Yes, 87% No) of text documents, each with a title and an abstract feature. I have transformed these documents into entities with padded batches. Now I am trying to train a deep learning model on this dataset. In TensorFlow, Keras has the class_weight argument to deal with class imbalance; however, I am searching for the best hyperparameters using the keras-tuner library, and its hyperparameter tuners do not seem to have such an option. Therefore, I am looking for other ways of dealing with class imbalance.

Is there an option to use class weights in keras-tuner? To add, I am already using the precision@recall metric. I could also try a data resampling method, such as imblearn.over_sampling.SMOTE, but as this Kaggle post mentions:

It appears that SMOTE does not help improve the results. However, it makes the network learn faster. Moreover, there is one big problem: this method is not compatible with larger datasets. You have to apply SMOTE on embedded sentences, which takes way too much memory.


There are 2 best solutions below


You could change the evaluation metric to an F-beta score. It is a weighted F-score: with beta greater than 1, recall on the minority class counts more than precision.
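To make the weighting concrete, here is a minimal sketch of the F-beta computation in plain Python. The function name `fbeta` and the toy labels are made up for illustration; in practice you would more likely use a library implementation such as scikit-learn's `fbeta_score`.

```python
def fbeta(y_true, y_pred, beta=2.0, positive=1):
    """F-beta score for the positive class (hypothetical helper, for illustration)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # beta > 1 puts more weight on recall, beta < 1 on precision.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels: 4 positives, 6 negatives (assumed data, not from the question).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

f1 = fbeta(y_true, y_pred, beta=1.0)
f2 = fbeta(y_true, y_pred, beta=2.0)
print(f1, f2)
```

Here precision (2/3) is higher than recall (1/2), so F2 comes out lower than F1: the missed positives are penalized more heavily.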

Alternatively, if the dataset is large enough, you can try undersampling the majority class.
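A rough sketch of random undersampling, using only the standard library, could look like this. The helper name `undersample` and the toy documents are assumptions for illustration. It keeps as many examples of each class as the rarest class has, so with a 3% minority most of the majority class is thrown away, which is why the dataset needs to be large.

```python
import random


def undersample(examples, labels, seed=0):
    """Randomly drop examples so every class is as large as the rarest one.

    Hypothetical helper for illustration; returns shuffled (example, label) pairs.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    # Size of the rarest class decides how many examples each class keeps.
    n_keep = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_keep):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced


# Toy data: 100 documents, 3 of them positive (assumed, mirrors the 3% in the question).
docs = [f"doc{i}" for i in range(100)]
labels = [1 if i < 3 else 0 for i in range(100)]

balanced = undersample(docs, labels)
print(len(balanced))
```

Apply this to the training split only, and leave the validation and test splits at their original class ratio so the reported metrics reflect the real distribution.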


If you are looking for other methods to deal with imbalanced data, you may consider generating synthetic minority-class samples using SMOTE or ADASYN, both of which are available in the imbalanced-learn (imblearn) package. This often helps. I see you have already considered this as an option to explore.