Unbalanced Dataset: Oversampling vs Repeat


I am facing a Machine Learning Task on a highly unbalanced dataset.

Since the smallest class has a tiny number of examples (roughly 200, versus 200,000 in the largest class), I need to perform oversampling. (To be more precise, I would oversample the smaller classes and undersample the bigger ones toward an intermediate number of examples, but that is out of the scope of this question.)

Now, I have two options to do that:

1) Randomly sample (with replacement) examples from the smallest class

2) Repeat the examples from the smallest class n times
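The two options above can be sketched in a few lines of NumPy. This is a toy illustration, not your actual data: `minority` and `target_size` are made-up names standing in for the small class and the desired size after oversampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class feature matrix: 5 examples, 3 features.
minority = rng.normal(size=(5, 3))
target_size = 20  # desired number of minority examples after oversampling

# Option 1: random sampling with replacement -- each original example
# appears a random (multinomially distributed) number of times.
idx = rng.integers(0, len(minority), size=target_size)
oversampled_random = minority[idx]

# Option 2: deterministic repetition -- tile the class n times, so every
# example appears exactly the same number of times.
n = target_size // len(minority)
oversampled_repeat = np.tile(minority, (n, 1))

print(oversampled_random.shape)  # (20, 3)
print(oversampled_repeat.shape)  # (20, 3)
```

Note that option 2 only changes the effective class weight, while option 1 also injects sampling noise into the class proportions from epoch to epoch.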

Any advice on which is the best way?

Thanks in advance.

1 Answer

As Mohammed Athar mentioned, you can simply try both of your approaches and see which works better.

Additionally, you could randomly split your "large" class into (large_class / small_class) partitions. Then, for each partition, train a classifier on all the data from the small class plus that one partition of the large class, so every classifier sees a roughly balanced training set.

At the end, you can combine all your classifiers, e.g. via bagging/boosting or by feeding their outputs into another model such as a neural network.
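A minimal sketch of this split-and-ensemble idea, under assumed toy data (the class sizes, feature count, and the majority-vote combiner are all illustrative choices, not part of the original answer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical data: 200 minority vs 2000 majority examples, 4 features,
# with the two classes centred at different means.
X_small = rng.normal(loc=1.0, size=(200, 4))
X_large = rng.normal(loc=0.0, size=(2000, 4))

# Split the majority class into (large_class / small_class) partitions.
n_splits = len(X_large) // len(X_small)
perm = rng.permutation(len(X_large))
chunks = np.array_split(perm, n_splits)

models = []
for chunk in chunks:
    # Each classifier sees ALL minority data plus one balanced slice
    # of the majority class.
    X = np.vstack([X_small, X_large[chunk]])
    y = np.concatenate([np.ones(len(X_small)), np.zeros(len(chunk))])
    models.append(LogisticRegression().fit(X, y))

def predict(X_new):
    # Combine the classifiers by simple majority vote (bagging-style);
    # boosting or a stacked meta-model would work here too.
    votes = np.mean([m.predict(X_new) for m in models], axis=0)
    return (votes >= 0.5).astype(int)

print(predict(X_small[:5]))
```

This scheme (training one model per balanced subset and aggregating them) is essentially what the imbalanced-learning literature calls EasyEnsemble-style undersampling with bagging.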