I'm trying to run a text classification model on some text data (tweets) using sklearn and Python. I have hand-coded nearly 1.5k cases; however, the data is imbalanced.
Cases are coded for themes. One of the codes is essentially 'no theme', and it makes up the majority of cases.
To be precise, the data has:
964 no theme tweets
183 theme A
171 theme B
120 theme C
110 theme D
98 theme E
Unfortunately, my models (both SVM and Logistic Regression) consistently produce false positives for the 'no theme' class, which suggests the imbalanced data is the problem.
I looked into advice on imbalanced data but couldn't find a satisfactory answer.
Is there a good way to deal with imbalanced data in multi-class classification problems? What about when the imbalance comes largely from an 'other'/null category?
I've seen people suggest over-sampling the data. Isn't this highly likely to overfit and artificially inflate accuracy, since you would be trying to predict a case from an identical duplicate?
I've also seen people suggest SMOTE. Can SMOTE be used for text classification?
Any other advice in general for how to proceed?
You can use weights => Many classifiers, including SVM and Logistic Regression in scikit-learn, allow you to assign different weights to different classes.
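For example, here is a minimal sketch using a TF-IDF + Logistic Regression pipeline with `class_weight='balanced'`. The `tweets` and `labels` lists are toy placeholders standing in for your hand-coded data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder data -- substitute your ~1.5k hand-coded tweets and labels.
tweets = ["just an ordinary tweet", "a tweet about theme A",
          "another ordinary tweet", "more talk about theme A"]
labels = ["no_theme", "theme_A", "no_theme", "theme_A"]

# class_weight='balanced' weights each class inversely to its frequency,
# so mistakes on the rare themes cost more than mistakes on 'no theme'.
# For the SVM, LinearSVC accepts the same class_weight argument.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(tweets, labels)
```

If 'balanced' over-corrects, you can also pass an explicit dict, e.g. `class_weight={"no_theme": 1, "theme_A": 5}`, and tune the weights by cross-validation.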
You can choose an appropriate model => Decision trees and random forests, for example, can cope comparatively well with imbalanced data.
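A rough sketch of the same idea with a random forest, reusing the placeholder `tweets`/`labels` from the sketch above. Treat it as a baseline to compare against; forests can be slow on high-dimensional sparse TF-IDF features:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# 'balanced_subsample' recomputes the class weights on each tree's
# bootstrap sample, combining weighting with the ensemble's resampling.
forest = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(
        n_estimators=300,
        class_weight="balanced_subsample",
        random_state=0,
    ),
)
forest.fit(tweets, labels)  # same placeholder data as the previous sketch
```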
You can use better evaluation metrics => plain accuracy is misleading here, since always predicting 'no theme' already scores about 59% (964 of 1646 cases); look at per-class precision, recall, F1 score, and ROC AUC instead.
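For instance, a small sketch of evaluating with per-class metrics on out-of-fold predictions, reusing `model` and the placeholder data from the first sketch:

```python
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: every tweet is predicted by a model that never
# saw it during training, so the numbers below are not inflated.
preds = cross_val_predict(model, tweets, labels, cv=2)  # cv=2 only because the toy data is tiny; use cv=5 on real data

# Per-class precision/recall/F1 shows whether the rare themes are actually
# being recovered -- something overall accuracy hides on imbalanced data.
print(classification_report(labels, preds))
print("macro F1:", f1_score(labels, preds, average="macro"))
```

Macro-averaged F1 weights each theme equally regardless of its size, which is usually what you want when the 'no theme' class dominates.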
Try researching these first; if you struggle, I'll help you with the implementation.