Why isn't my classifier predicting any positive classes?


I'm doing sentiment analysis on tweets containing the word "Trump". I manually labeled the first 200 tweets; here are the first 12 observations:

Date    SentimentText   Sentiment
Mon Nov 28 23:24:12 +0000 2016  "@HillaryClinton Go ahead with your hypocritical recount. It's fun to watch Trump squirm."  0
Mon Nov 28 23:39:06 +0000 2016  @SenSchumer & @SenGillibrand - Demand Trump rescind Steve Bannon's appointment. @MoveOn 0
Mon Nov 28 23:30:34 +0000 2016  Democrats Demand Trump's Tax Returns And An Investigation Into His Conflicts Of Interest via @politicususa  0
Mon Nov 28 23:54:43 +0000 2016  "Oh my god, how has this only been one day?" -@SaraMurray on covering a day on the Trump Trail #girlsonthebus @gupolitics   0
Mon Nov 28 23:18:16 +0000 2016  People are mad at GiGi for impersonating Melania Trump, saying "it's rude to bully and immigrant" OH?! THE FUCKING IRONY    0
Mon Nov 28 23:50:10 +0000 2016  @dosdelimas @FoxNews mt @resnikoff For those who don't understand why Trump would lie about voter fraud ..  0
Mon Nov 28 23:29:29 +0000 2016  @tanveerali Yo! Do you mind if I steal your awesome electoral map (giving credit where credit is due)?  1
Mon Nov 28 23:19:39 +0000 2016  "Historic," as in lower 1/3 of all EV results in American History   1
Mon Nov 28 23:41:40 +0000 2016  i thought this was gonna say trump before i opened it   0
Mon Nov 28 23:13:31 +0000 2016  Hold on wait, I voted for trump is the new racial slur now? im dead 1
Mon Nov 28 23:22:01 +0000 2016  O.K., well, if a mass of stuff was then taught, it was set up for. #SubhumanCheeto #NMP 0
Mon Nov 28 23:44:13 +0000 2016  Woman goes on racist, pro-Trump tirade in Michaels store over $1 bag  Trumpmerica ladies & gents.   0

I labeled the tweets based on whether the user supports Trump or posted something positive about him (1 = positive, 0 = negative). This is the code I'm using so far:

import numpy as np
import pandas as pd
from sklearn import linear_model, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split



# Logistic regression handles sparse input directly, so no densifying step is needed;
# GaussianNB is the only estimator here that requires a dense array.
logistic_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                         ('tfidf', TfidfTransformer()),
                         ('clf', linear_model.LogisticRegression(penalty='l2', solver='lbfgs',
                                                                 max_iter=1000)),
                        ])

gnb_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    # GaussianNB needs a dense array, so convert the sparse tf-idf output
                    ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
                    ('clf', naive_bayes.GaussianNB()),
                   ])


df = pd.read_excel('trump_labeled.xlsx')

# Labels (y) and raw tweet text (X)
y = df['Sentiment']
X = df['SentimentText']

# Hold out 25% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)



log_clf = logistic_clf.fit(X_train, y_train)
gnb_clf = gnb_clf.fit(X_train, y_train)

log_predicted = logistic_clf.predict(X_test) # predict labels for test data with logistic regression classifier
gnb_predicted = gnb_clf.predict(X_test) # predict labels for test data with naive bayes classifier

# PRINT SOME RESULTS FOR THE DATASETS PART
print("\nDATASET RESULTS")
print('\nLogistic Regression Results:\n\tNegative tweets: %.2f\n\tPositive tweets: %.2f' %(np.mean(log_predicted == 0), np.mean(log_predicted == 1)))
print('\tAccuracy: %.2f'% (np.mean(log_predicted == y_test)))
print('\tPositive Precision: %.2f' %(precision_score(y_test, log_predicted,pos_label=1)))
print('\tPositive Recall: %.2f' %(recall_score(y_test, log_predicted,pos_label=1)))
print('\tPositive F-measure: %.2f' %(f1_score(y_test, log_predicted,pos_label=1)))
print('\tNegative Precision: %.2f' %(precision_score(y_test, log_predicted,pos_label=0)))
print('\tNegative Recall: %.2f' %(recall_score(y_test, log_predicted,pos_label=0)))
print('\tNegative F-measure: %.2f' %(f1_score(y_test, log_predicted,pos_label=0)))

This generated the following results:

DATASET RESULTS

Logistic Regression Results:
        Negative tweets: 1.00
        Positive tweets: 0.00
        Accuracy: 0.72
        Positive Precision: 0.00
        Positive Recall: 0.00
        Positive F-measure: 0.00
        Negative Precision: 0.72
        Negative Recall: 1.00
        Negative F-measure: 0.84
C:\Users\My\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
C:\Users\My\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.

My classifiers (logistic regression and naive Bayes) fail to predict any positive labels (positive = 1), which makes my evaluation metrics ill-defined. Of the 200 tweets, 43 were positive, yet the classifiers label every tweet negative. How can I fix this? Note that I still have not preprocessed my data, so I still need to convert URLs to a URL token, normalize whitespace, etc. Is it because I haven't preprocessed my tweets yet? Or is it perhaps how I manually labeled the tweets, since some of them were difficult to call positive or negative? I tried searching for pre-labeled tweets about Trump and had no luck.
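
A quick sanity check, sketched with the current sklearn.model_selection API: confirm the imbalance, then stratify the split so train and test keep the same positive rate. Stratifying alone will not make the model predict positives, but it keeps the held-out metrics representative:

from sklearn.model_selection import train_test_split

print(y.value_counts())  # should show the 157/43 negative/positive imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42,
    stratify=y)  # keep the ~21% positive rate in both splits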

Also, I noticed that adding L2 regularization and using the LBFGS solver for my logistic regression does nothing to change my accuracy. Is that normal?
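
One way to probe this, sketched with sklearn's GridSearchCV (the C grid below is an arbitrary assumption): sweep the inverse regularization strength and score on positive-class F1. If the best score stays at zero across the whole grid, the problem is the data rather than the regularization or the solver:

from sklearn.model_selection import GridSearchCV

# 'clf__C' addresses the C parameter of the 'clf' step in the pipeline above
param_grid = {'clf__C': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(logistic_clf, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)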

There are two answers below.

Answer 1:

My guess is that your dataset is too small for this problem and these classifiers. To exacerbate the situation, the dataset is strongly imbalanced. For these classifiers and this choice of loss (logistic regression uses cross-entropy), the least penalization comes from always predicting the dominant class.

My suggestion would be to get more data (label more tweets, in this case).
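
Until more labels are available, one partial mitigation, sketched below with the question's pipeline settings, is to reweight the loss with class_weight='balanced' (a standard LogisticRegression option) so errors on the rare positive class cost as much as errors on the dominant negative class:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

balanced_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000,
                               class_weight='balanced')),  # upweight the rare positives
])
balanced_clf.fit(X_train, y_train)
predicted = balanced_clf.predict(X_test)
print('fraction predicted positive: %.2f' % (predicted == 1).mean())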

Answer 2:

I think you have two things going on here. First, as noted above, your corpus of training tweets is too small, especially when you consider that most sentiment-analysis approaches vectorize the words used. These vectors are very large and very sparse. If the number of labeled training tweets is not on par with the number of words you are training against, the solution is ill-posed.

The good news is that there are a few decent training sets out there for Twitter (at least in English) that are already labeled. Several are referenced at Sentiment140.

The bad news is that it is a pretty bad idea to attempt sentiment analysis without cleaning up the text significantly. This means parsing each tweet into words (after removing links, emoticons, hashtags, etc.), eliminating stop words, and then deciding from there whether you want to hash your data or vectorize it in some way. Until you do those things, the very few words in your training set will be lost in the noise.
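
As a rough illustration of that cleanup step (the regex patterns and the tiny stop-word list below are assumptions, not a definitive tokenizer):

import re

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'is', 'it', 'to', 'of', 'for'}

def clean_tweet(text):
    text = re.sub(r'https?://\S+', ' URL ', text)   # replace links with a token
    text = re.sub(r'@\w+', ' MENTION ', text)       # replace @user mentions with a token
    text = re.sub(r'#(\w+)', r'\1', text)           # keep hashtag text, drop the '#'
    text = re.sub(r'[^a-zA-Z ]', ' ', text)         # drop punctuation, digits, emoticons
    words = text.lower().split()
    return ' '.join(w for w in words if w not in STOP_WORDS)

print(clean_tweet("Woman goes on racist, pro-Trump tirade https://t.co/x"))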