I am having trouble calculating the average sentiment of each row in a relatively big dataset (N = 36140).
My dataset contains review data from an app on the Google Play Store (each row represents one review), and I would like to calculate the sentiment of each review using the sentiment_by() function.
The problem is that this function takes a very long time to run.
Here is the link to my dataset in .csv format:
https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing
I have tried using this code:
library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment = sentiment_by(e_data$review)
Then I get the following warning message (after I cancel the process once 10+ minutes have passed):
Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory. It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.
I have also tried using the get_sentences() function with the following code, but sentiment_by() still takes a long time to run:
library(magrittr)  # for the %>% pipe
e_sentences = e_data$review %>%
  get_sentences()
e_sentiment = sentiment_by(e_sentences)
I have been working with Google Play Store review datasets and using the sentiment_by() function for the past month, and it calculated sentiment very quickly... the calculations only started taking this long yesterday.
Is there a way to quickly calculate sentiment for each row of a big dataset?
The algorithm used in `sentiment` appears to be O(N^2) once you get above 500 or so individual reviews, which is why it suddenly takes a lot longer when you up the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way? I glanced through the help file (`?sentiment`) and it doesn't seem to do anything which depends on pairs of reviews, so that's a bit odd. Running the reviews through in smaller batches produces effectively the same output, which means the sentimentr package has a bug in it causing it to be unnecessarily slow.
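To see the scaling for yourself, here is a rough sketch (my own, with illustrative slice sizes, not a formal benchmark) that times sentiment_by() on increasingly large slices of the review vector:

library(sentimentr)

# Time sentiment_by() on larger and larger slices of the reviews and watch
# how the elapsed time grows; the sizes below are illustrative.
sizes <- c(250, 500, 1000, 2000, 4000)
elapsed <- sapply(sizes, function(n) {
  sents <- get_sentences(e_data$review[seq_len(n)])
  system.time(sentiment_by(sents))[["elapsed"]]
})
data.frame(n_reviews = sizes, seconds = elapsed)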
One solution is just to batch the reviews. This will break the 'by' functionality in `sentiment_by`, but I think you should be able to group them yourself before you send them in (or after, as it doesn't seem to matter); a sketch is below. It takes about 45 seconds on my machine (and should be O(N) for bigger datasets).
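Here is one way to set the batching up; a minimal sketch under my own assumptions (the chunk size of 200 is arbitrary), reusing e_data from the question:

library(sentimentr)

# Split the reviews into fixed-size chunks, score each chunk separately,
# then stack the per-review results back together. batch_size = 200 is an
# assumption; anything in the low hundreds keeps each call cheap.
batch_size <- 200
batch_id <- (seq_along(e_data$review) - 1) %/% batch_size

chunk_scores <- lapply(split(e_data$review, batch_id), function(chunk) {
  sentiment_by(get_sentences(chunk))
})

e_sentiment <- do.call(rbind, chunk_scores)
# element_id restarts inside each chunk, but the row order still matches
# the order of e_data$review.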