The definition of an unbalanced sample

As we know, an unbalanced sample causes problems and takes extra effort to handle.

While working through this, I got confused about the definition. Say I have a training dataset of 200 cats, 200 dogs, and 400 stones. If I classify into 3 classes, I should use 200 cats, 200 dogs, and 200 stones. But how should I allocate the samples if I only classify into 2 classes, pets and stones?

Should I still go with 400 pets (200 cats and 200 dogs) and 400 stones, so that the pet and stone classes have the same number of samples?

Or should I go with 400 pets (200 cats and 200 dogs) and 200 stones, so that every inner class (cat, dog, stone) has the same chance of being seen? After all, cats and dogs are essentially different.

There is 1 best solution below

I think it is task dependent. If you are going to classify your samples into two classes (pets and stones), then you should use all 400 pet images (cats and dogs) and all 400 stone samples. However, if you have three classes (cats, dogs, and stones), then you need to limit the number of stone samples to 200 for every training epoch.

Why? In the two-class case (pets vs. stones), each label (pet and stone) contributes 400 samples to the weight updates in every epoch. So after training finishes, the model will be able to recognize both classes equally well.

In the three-class case (cats, dogs, and stones), the cat and dog classes each contribute 200 samples to the weight updates per epoch, while the stone class contributes 400. So the model will have a higher chance of outputting the stone class than the cat or dog class.
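To make the counting argument concrete, here is a minimal sketch that tallies the per-class sample counts under both labelings. The `TO_COARSE` mapping and the helper names are made up for illustration; only the 200/200/400 split comes from the question.

```python
from collections import Counter

# Fine-grained labels from the question: 200 cats, 200 dogs, 400 stones.
fine_labels = ["cat"] * 200 + ["dog"] * 200 + ["stone"] * 400

# Hypothetical mapping from the fine-grained labels to the two-class task.
TO_COARSE = {"cat": "pet", "dog": "pet", "stone": "stone"}


def class_counts(labels):
    """Number of samples per class, i.e. how often each class is seen per epoch."""
    return dict(Counter(labels))


def imbalance_ratio(labels):
    """Largest class size divided by smallest class size (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


coarse_labels = [TO_COARSE[label] for label in fine_labels]

three_class = class_counts(fine_labels)    # stones outnumber cats and dogs
two_class = class_counts(coarse_labels)    # pets and stones are equal

print(three_class, imbalance_ratio(fine_labels))
print(two_class, imbalance_ratio(coarse_labels))
```

The same 800 images are unbalanced for the three-class task but balanced for the two-class task, which is why the answer depends on which labels you are actually predicting.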

So, in summary, you should make the number of samples the same for all the classes you are actually predicting.

PS: in the three-class case, if you randomly draw a new subset of 200 stone samples from the 400 in each epoch, your model won't end up biased toward the stone class. However, it may generalize better on the stone class than on the other two, because it has seen more unique samples of that class.
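A rough sketch of that per-epoch random undersampling, using only the standard library. The function name `undersample_epoch`, the integer stand-ins for images, and the choice of 5 epochs are assumptions for illustration, not part of any particular framework.

```python
import random
from collections import Counter


def undersample_epoch(samples, labels, rng):
    """Return a shuffled, class-balanced list of (sample, label) pairs for one epoch.

    Every class is randomly cut down to the size of the smallest class.
    """
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    n_per_class = min(len(items) for items in by_class.values())

    epoch = []
    for label, items in by_class.items():
        # rng.sample draws without replacement, so there are no duplicates within an epoch.
        epoch.extend((sample, label) for sample in rng.sample(items, n_per_class))
    rng.shuffle(epoch)
    return epoch


labels = ["cat"] * 200 + ["dog"] * 200 + ["stone"] * 400
samples = list(range(len(labels)))  # integer IDs standing in for images
rng = random.Random(0)              # fixed seed so the sketch is reproducible

seen_stones = set()
for _ in range(5):  # 5 epochs, chosen arbitrarily
    epoch = undersample_epoch(samples, labels, rng)
    seen_stones.update(sample for sample, label in epoch if label == "stone")

per_class = Counter(label for _, label in epoch)
print(per_class)         # each class appears 200 times in an epoch
print(len(seen_stones))  # distinct stone samples seen across all epochs
```

Each epoch is balanced at 200 samples per class, but because the stone subset is redrawn every time, the model sees more than 200 distinct stones over the course of training, whereas it only ever sees the same 200 cats and 200 dogs.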