How do I balance a training dataset that has a very high number of samples for one class?

I have been working on sentiment analysis prediction using the Rotten Tomatoes movie reviews dataset. The dataset has 5 classes {0,1,2,3,4}, where 0 is very negative and 4 is very positive. The dataset is highly imbalanced:

total samples = 156061

'0': 7072 (4.5%), '1': 27273 (17.4%), '2': 79583 (50.9%), '3': 32927 (21%), '4': 9206 (5.8%)

As you can see, class 2 accounts for almost 50% of the samples, while classes 0 and 4 together contribute only ~10% of the training set.

So there is a very strong bias toward class 2, which reduces classification accuracy for classes 0 and 4.

What can I do to balance the dataset? One solution would be to undersample down to 7072 samples for each class, but that reduces the dataset drastically! How can I balance the classes without hurting overall classification accuracy?
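For reference, the undersampling option mentioned above can be sketched as follows. This is a minimal illustration with toy labels that mimic the class proportions (the real features and labels would be substituted in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the real labels: same class proportions, smaller scale.
y = np.repeat([0, 1, 2, 3, 4], [707, 2727, 7958, 3292, 920])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Random undersampling: keep n_min samples per class,
# where n_min is the size of the rarest class.
n_min = np.bincount(y).min()
keep = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
    for c in np.unique(y)
])
rng.shuffle(keep)

X_bal, y_bal = X[keep], y[keep]
print(np.bincount(y_bal))  # every class now has exactly n_min samples
```

As the question notes, this discards most of the majority-class data, which is exactly the trade-off the answers below try to avoid.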

There are 2 best solutions below


You should not balance the dataset; you should train the classifier in a balanced manner. Nearly all existing classifiers can be trained with some cost-sensitive objective. For example, SVMs let you "weight" your samples: simply weight samples of the smaller classes more. Similarly, Naive Bayes has class priors - change them! Random forests, neural networks, logistic regression: they all let you "weight" samples in some way, and this is the core technique for getting more balanced results.
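A minimal sketch of this idea in scikit-learn, using inverse-frequency class weights with a linear SVM (the toy labels below only mimic the question's class proportions; the real `X`, `y` would be substituted in):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Toy imbalanced labels with one weakly informative feature.
y = np.repeat([0, 1, 2, 3, 4], [70, 272, 795, 329, 92])
X = y[:, None] + rng.normal(scale=2.0, size=(len(y), 1))

# Inverse-frequency weights: rare classes count more in the loss.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), np.round(weights, 2))))

# Most scikit-learn estimators accept this directly via class_weight.
clf = LinearSVC(class_weight="balanced", dual=False).fit(X, y)
```

The weight printed for class 2 comes out well below 1 and the weights for classes 0 and 4 well above 1, so misclassifying a rare-class sample costs the optimizer correspondingly more.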


For classification problems, you can try the class_weight='balanced' option in your estimator, such as LogisticRegression, SVC, etc. For example:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression