Unbalanced Dataset: Oversampling vs Repeat


I am facing a Machine Learning Task on a highly unbalanced dataset.

Since the smallest class has a tiny number of examples (roughly 200, versus 200,000 in the largest class), I need to perform oversampling. (To be more precise, I would oversample the smaller classes and undersample the bigger ones toward an intermediate number of examples, but that is out of the scope of this question.)

Now, I have two options to do that:

1) Randomly sample (with replacement) examples from the smallest class

2) Repeat the examples from the smallest class n times
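The two options above can be sketched in a few lines of NumPy. This is a toy illustration, not your actual data: `minority` and `target_size` are made-up names standing in for the small class and the desired size after oversampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class feature matrix: 5 examples, 3 features.
minority = rng.normal(size=(5, 3))
target_size = 20  # desired number of minority examples after oversampling

# Option 1: random sampling with replacement -- each original example
# appears a random (multinomially distributed) number of times.
idx = rng.integers(0, len(minority), size=target_size)
oversampled_random = minority[idx]

# Option 2: deterministic repetition -- tile the class n times, so every
# example appears exactly the same number of times.
n = target_size // len(minority)
oversampled_repeat = np.tile(minority, (n, 1))

print(oversampled_random.shape)  # (20, 3)
print(oversampled_repeat.shape)  # (20, 3)
```

Note that option 2 only changes the effective class weight, while option 1 also injects sampling noise into the class proportions from epoch to epoch.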

Any advice on which is the best way?

Thanks in advance.

1 Answer

As Mohammed Athar mentioned, you can simply try both of your approaches and see which works better.

Additionally, you could randomly split your "large" class into (large_class / small_class) partitions. Then, for each partition, train a classifier on all the data from the small class plus that one partition of the large class, so every classifier sees a roughly balanced training set.

At the end, you can combine all your classifiers, e.g. via bagging/boosting or by feeding their outputs into another model such as a neural network.
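A minimal sketch of this split-and-ensemble idea, under assumed toy data (the class sizes, feature count, and the majority-vote combiner are all illustrative choices, not part of the original answer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical data: 200 minority vs 2000 majority examples, 4 features,
# with the two classes centred at different means.
X_small = rng.normal(loc=1.0, size=(200, 4))
X_large = rng.normal(loc=0.0, size=(2000, 4))

# Split the majority class into (large_class / small_class) partitions.
n_splits = len(X_large) // len(X_small)
perm = rng.permutation(len(X_large))
chunks = np.array_split(perm, n_splits)

models = []
for chunk in chunks:
    # Each classifier sees ALL minority data plus one balanced slice
    # of the majority class.
    X = np.vstack([X_small, X_large[chunk]])
    y = np.concatenate([np.ones(len(X_small)), np.zeros(len(chunk))])
    models.append(LogisticRegression().fit(X, y))

def predict(X_new):
    # Combine the classifiers by simple majority vote (bagging-style);
    # boosting or a stacked meta-model would work here too.
    votes = np.mean([m.predict(X_new) for m in models], axis=0)
    return (votes >= 0.5).astype(int)

print(predict(X_small[:5]))
```

This scheme (training one model per balanced subset and aggregating them) is essentially what the imbalanced-learning literature calls EasyEnsemble-style undersampling with bagging.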