Pre-processing before running sentiment analysis


Sentiment analysis helps us gauge the sentiment of tweets. However, many of the tweets we get from the API may not really be classifiable into any sentiment.

Does anyone know of any API or literature that covers pre-processing a tweet before running a classifier over it (e.g. removing #, removing @names, etc.)?

Also, what topics, APIs, or literature can I look up if I want to determine whether it makes sense to run sentiment analysis on a tweet at all (say, as a movie review) before I run a sentiment analyzer over it?


There are 3 best solutions below

1
On

Maybe you should read:

(Then, in Python, call tweet = re.sub(old_pattern, new_pattern, tweet) once for each modification you want to perform.)
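The re.sub approach above can be sketched as a small list of substitutions applied in order. This is only an illustration: the patterns and the helper name clean_tweet are assumptions, not taken from the answer, and they deliberately leave punctuation and emoticons untouched.

```python
import re

# Hypothetical patterns for illustration; adjust them to your own data.
SUBSTITUTIONS = [
    (re.compile(r"https?://\S+"), ""),  # drop URLs
    (re.compile(r"@\w+"), ""),          # drop @mentions
    (re.compile(r"#(\w+)"), r"\1"),     # keep the hashtag word, drop the '#'
    (re.compile(r"\s+"), " "),          # collapse runs of whitespace
]


def clean_tweet(tweet):
    """Apply each (pattern, replacement) pair in order, then trim the ends."""
    for pattern, replacement in SUBSTITUTIONS:
        tweet = pattern.sub(replacement, tweet)
    return tweet.strip()


print(clean_tweet("@bob loved #Inception :) http://t.co/abc"))
```

The order matters here: URLs are removed first so that an "@" or "#" inside a link is not rewritten separately.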

0
On

Actually, you'd better do the dirty work yourself. Regular expressions make it easy to remove #, @, or URLs. Punctuation marks and emojis, however, are quite important for sentiment analysis, so don't strip them blindly. I recommend using the part-of-speech tagger from the CMU NLP group (http://www.cs.cmu.edu/~ark/TweetNLP/) to tag these characters.

For basic features like bag of words and tf-idf scores, I like to use scikit-learn (http://scikit-learn.org/stable/). For single-word sentiment, you can use Stanford NLP sentiment analysis (http://nlp.stanford.edu/sentiment/).
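To make the bag-of-words and tf-idf features concrete, here is a minimal hand-rolled sketch using only the standard library. It is not scikit-learn's implementation (scikit-learn applies smoothing and normalization by default, so its numbers will differ); the function names and the whitespace tokenizer are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


tweets = ["great movie", "bad movie"]
for tweet, scores in zip(tweets, tf_idf(tweets)):
    print(tweet, scores)
```

A term that appears in every document (here "movie") gets weight zero, which is why tf-idf tends to push down words that carry no distinguishing signal.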

0
On

I am using the TextBlob library to classify my dataset.

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Features:
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration

Get it now:

$ pip install -U textblob

$ python -m textblob.download_corpora

Reference: https://textblob.readthedocs.org/en/dev/

*** I cannot share my results yet because this is part of my thesis, which I am still working on.