I'm new to Twitter Sentimental Analysis with twitteR, and used the positive.txt and negative.txt from Hu and Liu. I was so glad that everything ran smoothly but the scores for over 1000 tweets all turned out to be neutral (score = 0)? I can't figure out what went wrong, any help is greatly appreciated!
setup_twitter_oauth(consumer_key, consumer_secret, token, token_secret)
#Get tweets about "House of Cards", due to the limitation, we'll set n=1500
netflix.tweets<- searchTwitter("#HouseofCards",n=1500)
tweet=netflix.tweets[[1]]
tweet$getScreenName()
tweet$getText()
netflix.text=laply(netflix.tweets,function(t)t$getText())
head(netflix.text)
write(netflix.text, "HouseofCards_Tweets.txt", ncolumns = 1)
#loaded the positive and negative.txt from Hu and Liu
positive <- scan("/users/xxx/desktop/positive_words.txt", what = character(), comment.char = ";")
negative <- scan("/users/xxx/desktop/negative_words.txt", what = character(), comment.char = ";")
#add positive words
pos.words =c(positive,"miss","Congratulations","approve","watching","enlightening","killing","solid")
scoredsentiment <- function(hoc.vec, pos.word, neagtive)
{
clean <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "",hoc.vec)
clean <- gsub("^\\s+|\\s+$", "", clean)
clean <- gsub("[[:punct:]]", "", clean)
clean <- gsub("[^[:graph:]]", "", clean)
clean <- gsub("[[:cntrl:]]", "", clean)
clean <- gsub("@\\w+", "", clean)
clean <- gsub("\\d+", "", clean)
clean <- tolower(clean)
hoc.list <- strsplit(clean, "")
hoc=unlist(hoc.list)
pos.matches = match(hoc, pos.words)
scoredpositive <- sapply(hoc.list, function(x) sum(!is.na(match(pos.matches, positive))))
scorednegative <- sapply(hoc.list, function(x) sum(!is.na(match(x, negative))))
hoc.df <- data.frame(score = scoredpositive - scorednegative, message = hoc.vec, stringsAsFactors = F)
return (hoc.df)
}
twitter_scores <- scoredsentiment(netflix.text, scoredpositive, scorednegative)
print(twitter_scores)
write.csv(twitter_scores, file=paste('twitter_scores.csv'), row.names=TRUE)
#draw a graph to show the final outcome
hist(twitter_scores$score)
qplot(twitter_scores$score)
Everything works, but the score for each tweet is the same (score =0)
From your code, I don't think that the simple match will work. You need to use some form of fuzzy matching scheme. With match, you need the exact word repeated which will not happen a lot and further, you are matching a single word to a string of words.