I have been working on sentiment analysis using the Rotten Tomatoes movie reviews dataset. The dataset has 5 classes {0,1,2,3,4}, where 0 is very negative and 4 is very positive. The dataset is highly unbalanced:
total samples = 156061
'0': 7072 (4.5%),
'1': 27273 (17.4%),
'2': 79583 (50.9%),
'3': 32927 (21.1%),
'4': 9206 (5.8%)
As you can see, class 2 alone accounts for almost 50% of the samples, while classes 0 and 4 together contribute only ~10% of the training set. So there is a very strong bias toward class 2, which reduces classification accuracy for classes 0 and 4.
What can I do to balance the dataset? One solution would be to take an equal number of samples per class by downsampling every class to 7072 samples, but that shrinks the dataset drastically! How can I balance the dataset without hurting overall classification accuracy?
You should not balance the dataset; you should train the classifier in a balanced manner. Nearly all existing classifiers can be trained with some cost-sensitive objective. For example, SVMs let you "weight" your samples: simply weight samples of the smaller classes more. Similarly, Naive Bayes has class priors; change them! Random forests, neural networks, logistic regression, they all let you "weight" samples in some way. It is the core technique for getting more balanced results.
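As a concrete sketch of the idea above, here is how class weighting and class priors look in scikit-learn. The data here is synthetic (Gaussian blobs with an imbalance ratio similar to the one in the question); the scaled-down class counts are illustrative, not taken from the actual dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Scaled-down version of the imbalance described in the question.
counts = {0: 707, 1: 2727, 2: 7958, 3: 3292, 4: 920}
X = np.vstack([rng.normal(loc=3.0 * c, scale=1.0, size=(n, 5))
               for c, n in counts.items()])
y = np.concatenate([np.full(n, c) for c, n in counts.items()])

# class_weight='balanced' reweights each class by
# n_samples / (n_classes * n_samples_in_class), so mistakes on the
# rare classes (0 and 4) cost more during training.
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# The same keyword works for SVMs (and RandomForestClassifier, etc.):
svm = LinearSVC(class_weight='balanced').fit(X, y)

# For Naive Bayes, override the empirical class priors directly,
# e.g. with a uniform prior over the 5 classes:
nb = GaussianNB(priors=np.full(5, 0.2)).fit(X, y)
```

The `class_weight` argument also accepts an explicit dict (e.g. `{0: 10.0, 4: 10.0, ...}`) if you want finer control than the `'balanced'` heuristic.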