I'm doing sentiment analysis on tweets containing the word "Trump". I manually labeled the first 200 tweets; here are the first few observations:
Date SentimentText Sentiment
Mon Nov 28 23:24:12 +0000 2016 "@HillaryClinton Go ahead with your hypocritical recount. It's fun to watch Trump squirm." 0
Mon Nov 28 23:39:06 +0000 2016 @SenSchumer & @SenGillibrand - Demand Trump rescind Steve Bannon's appointment. @MoveOn 0
Mon Nov 28 23:30:34 +0000 2016 Democrats Demand Trump's Tax Returns And An Investigation Into His Conflicts Of Interest via @politicususa 0
Mon Nov 28 23:54:43 +0000 2016 "Oh my god, how has this only been one day?" -@SaraMurray on covering a day on the Trump Trail #girlsonthebus @gupolitics 0
Mon Nov 28 23:18:16 +0000 2016 People are mad at GiGi for impersonating Melania Trump, saying "it's rude to bully and immigrant" OH?! THE FUCKING IRONY 0
Mon Nov 28 23:50:10 +0000 2016 @dosdelimas @FoxNews mt @resnikoff For those who don't understand why Trump would lie about voter fraud .. 0
Mon Nov 28 23:29:29 +0000 2016 @tanveerali Yo! Do you mind if I steal your awesome electoral map (giving credit where credit is due)? 1
Mon Nov 28 23:19:39 +0000 2016 "Historic," as in lower 1/3 of all EV results in American History 1
Mon Nov 28 23:41:40 +0000 2016 i thought this was gonna say trump before i opened it 0
Mon Nov 28 23:13:31 +0000 2016 Hold on wait, I voted for trump is the new racial slur now? im dead 1
Mon Nov 28 23:22:01 +0000 2016 O.K., well, if a mass of stuff was then taught, it was set up for. #SubhumanCheeto #NMP 0
Mon Nov 28 23:44:13 +0000 2016 Woman goes on racist, pro-Trump tirade in Michaels store over $1 bag Trumpmerica ladies & gents. 0
I labeled the tweets based on whether the user supports Trump or posted something positive about him. This is the code I'm using so far:
import numpy as np
import pandas as pd
from sklearn import linear_model, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
# Logistic regression pipeline: counts (unigrams + bigrams) -> tf-idf -> classifier.
# LogisticRegression handles sparse input, so no densifying step is needed here.
logistic_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', linear_model.LogisticRegression(penalty='l2', solver='lbfgs',
                                            max_iter=1000, multi_class='ovr',
                                            warm_start=True)),
])
# Naive Bayes pipeline: GaussianNB requires a dense array, so densify after tf-idf
gnb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
    ('clf', naive_bayes.GaussianNB()),
])
# Load the hand-labeled tweets
df = pd.read_excel('trump_labeled.xlsx')
y = df['Sentiment']       # labels: 0 = negative, 1 = positive
X = df['SentimentText']   # raw tweet text
# Hold out 25% of the data for evaluation (a single train/test split, not cross-validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# The pipelines also accept pandas Series directly; converting to arrays is optional
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)
logistic_clf.fit(X_train, y_train)
gnb_clf.fit(X_train, y_train)
log_predicted = logistic_clf.predict(X_test)  # predicted labels for the test data, logistic regression
gnb_predicted = gnb_clf.predict(X_test)       # predicted labels for the test data, naive Bayes
# Report prediction shares and evaluation metrics on the test set
print("\nDATASET RESULTS")
print('\nLogistic Regression Results:\n\tNegative tweets: %.2f\n\tPositive tweets: %.2f' %(np.mean(log_predicted == 0), np.mean(log_predicted == 1)))
print('\tAccuracy: %.2f'% (np.mean(log_predicted == y_test)))
print('\tPositive Precision: %.2f' %(precision_score(y_test, log_predicted,pos_label=1)))
print('\tPositive Recall: %.2f' %(recall_score(y_test, log_predicted,pos_label=1)))
print('\tPositive F-measure: %.2f' %(f1_score(y_test, log_predicted,pos_label=1)))
print('\tNegative Precision: %.2f' %(precision_score(y_test, log_predicted,pos_label=0)))
print('\tNegative Recall: %.2f' %(recall_score(y_test, log_predicted,pos_label=0)))
print('\tNegative F-measure: %.2f' %(f1_score(y_test, log_predicted,pos_label=0)))
This generated the following results:
DATASET RESULTS
Logistic Regression Results:
Negative tweets: 1.00
Positive tweets: 0.00
Accuracy: 0.72
Positive Precision: 0.00
Positive Recall: 0.00
Positive F-measure: 0.00
Negative Precision: 0.72
Negative Recall: 1.00
Negative F-measure: 0.84
C:\Users\My\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
C:\Users\My\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
My classifiers (logistic regression and naive Bayes) fail to classify any of the positive labels, where positive = 1, which makes my evaluation metrics ill-defined. Of the 200 tweets, 43 were positive, yet my classifier labels all 200 of them as negative. How can I fix this?

Note that I still have not preprocessed my data, so I still need to replace URLs with a token, normalize whitespace, and so on. Is it because I haven't preprocessed my tweets yet? Or is it the way I manually labeled the tweets, since some of them were difficult to call positive or negative? I tried searching for pre-labeled tweets on Trump and had no luck.
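For reference, the preprocessing I have in mind would be a rough regex pass along these lines (just a sketch; the token strings and rules are placeholders I made up, nothing final):

import re

def preprocess_tweet(text):
    """Rough tweet normalizer (placeholder rules)."""
    text = re.sub(r'http\S+|www\.\S+', ' URL ', text)  # collapse links to a URL token
    text = re.sub(r'@\w+', ' USER ', text)             # collapse @mentions to a USER token
    text = re.sub(r'#(\w+)', r' \1 ', text)            # keep hashtag words, drop the '#'
    text = re.sub(r'\s+', ' ', text).strip()           # normalize runs of whitespace
    return text.lower()

X = df['SentimentText'].apply(preprocess_tweet)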
Also, I noticed that the L2 regularization and the lbfgs solver in my logistic regression do nothing to change my accuracy. Is that normal?
My guess is that your dataset is too small for this problem and these classifiers. To make matters worse, the dataset is strongly imbalanced: with only 43 of 200 tweets positive, a model that always predicts negative already reaches roughly 157/200 ≈ 0.79 accuracy overall, which squares with the 0.72 you see on the test split. For these classifiers and this choice of loss (the logistic regression is optimizing cross-entropy), predicting the dominant class everywhere is the cheapest way to drive the loss down, so that is exactly what gets learned. It also explains why the L2 regularization and the lbfgs solver change nothing: the degenerate all-negative solution is optimal either way.
My suggestion would be to get more data (label more tweets, in this case). Until more labels are available, you can also nudge the classifier away from the all-negative solution, as sketched below.
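A minimal sketch of that idea, assuming the X and y from your question: class_weight='balanced' reweights the loss inversely to class frequency, and stratify=y keeps the positive rate comparable across the train and test splits. Both are standard scikit-learn options, not something specific to your problem, and with this few positives they are no guarantee; they just remove the incentive to collapse onto the majority class.

from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older versions
from sklearn.pipeline import Pipeline

# Stratify so the ~20% positive rate is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Same pipeline shape as in the question, but class_weight='balanced'
# upweights errors on the rare positive class, so the all-negative
# solution is no longer the cheapest way to minimize the loss
weighted_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', linear_model.LogisticRegression(penalty='l2', solver='lbfgs',
                                            max_iter=1000,
                                            class_weight='balanced')),
])
weighted_clf.fit(X_train, y_train)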