I have been working on sentiment analysis using the Rotten Tomatoes movie reviews dataset. The dataset has 5 classes {0,1,2,3,4}, where 0 is very negative and 4 is very positive. The dataset is highly unbalanced:
total samples = 156061
'0': 7072 (4.5%),
'1': 27273 (17.4%),
'2': 79583 (50.9%),
'3': 32927 (21.1%),
'4': 9206 (5.8%)
As you can see, class 2 alone accounts for almost 50% of the samples, while classes 0 and 4 together contribute only ~10% of the training set. So there is a very strong bias toward class 2, which reduces classification accuracy for classes 0 and 4.
What can I do to balance the dataset? One solution would be to take an equal number of samples per class by downsampling every class to 7072 samples, but that shrinks the dataset drastically! How can I balance the dataset without hurting overall classification accuracy?
You should not balance the dataset; you should train the classifier in a balanced manner. Nearly all existing classifiers can be trained with some cost-sensitive objective. For example, SVMs let you "weight" your samples: simply weight samples of the smaller classes more. Similarly, Naive Bayes has class priors; change them! Random forests, neural networks, logistic regression, they all let you "weight" samples in some way. It is the core technique for getting more balanced results.
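As a concrete sketch of the idea above, here is how class weighting and class priors look in scikit-learn. The data here is synthetic (Gaussian blobs with an imbalance ratio similar to the one in the question); the scaled-down class counts are illustrative, not taken from the actual dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Scaled-down version of the imbalance described in the question.
counts = {0: 707, 1: 2727, 2: 7958, 3: 3292, 4: 920}
X = np.vstack([rng.normal(loc=3.0 * c, scale=1.0, size=(n, 5))
               for c, n in counts.items()])
y = np.concatenate([np.full(n, c) for c, n in counts.items()])

# class_weight='balanced' reweights each class by
# n_samples / (n_classes * n_samples_in_class), so mistakes on the
# rare classes (0 and 4) cost more during training.
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# The same keyword works for SVMs (and RandomForestClassifier, etc.):
svm = LinearSVC(class_weight='balanced').fit(X, y)

# For Naive Bayes, override the empirical class priors directly,
# e.g. with a uniform prior over the 5 classes:
nb = GaussianNB(priors=np.full(5, 0.2)).fit(X, y)
```

The `class_weight` argument also accepts an explicit dict (e.g. `{0: 10.0, 4: 10.0, ...}`) if you want finer control than the `'balanced'` heuristic.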