I have a pd.DataFrame with about 10,000 rows, each containing a text for which I have to sum up all occurrences of words contained in a lexicon (the lexicon also has about 10,000 entries).
I have written code that works, but it takes quite a long time on my hardware (around 6-8 minutes), and I strongly suspect there is a better way to do what I want.
The main culprit is the count_sentiments() function:
import re

import pandas as pd


def prepare_data(data: pd.DataFrame, lexicon: pd.DataFrame):
    """Calculate the needed features and write them to the provided dataframe."""
    # Filter the lexicon to create two lists of words
    positiveWords = lexicon[lexicon['sentiment'] > 0]['term'].astype(str).tolist()
    negativeWords = lexicon[lexicon['sentiment'] < 0]['term'].astype(str).tolist()

    # Create columns for our features 'pos_count', 'neg_count', 'contains_no',
    # 'pron_count', 'contains_exclam', 'token_log' (only the two sentiment counts
    # are shown here). apply() maps count_sentiments over every entry of the
    # pd.Series 'review'.
    # This takes around 2-3 minutes on my hardware
    data['pos_count'] = data['review'].apply(count_sentiments, args=(positiveWords,))
    # This takes around 4-5 minutes on my hardware
    data['neg_count'] = data['review'].apply(count_sentiments, args=(negativeWords,))
    return data


def count_sentiments(document, words):
    """Count all positive/negative sentiment word occurrences in the document."""
    # Join all lexicon words into one alternation pattern and count every match
    sentimentSum = len(re.findall(r'\b(?:' + '|'.join(words) + r')\b', document))
    return sentimentSum
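For context, this is roughly how the two frames look and how I call the function. The sample rows below are made up purely to illustrate the column names ('review', 'term', 'sentiment'); the real frames each have around 10,000 rows:

# Illustrative data only - the real 'data' and 'lexicon' frames have ~10k rows each
data = pd.DataFrame({
    'review': [
        "great product, works perfectly",
        "terrible quality, broke after one day",
    ]
})
lexicon = pd.DataFrame({
    'term': ['great', 'perfectly', 'terrible', 'broke'],
    'sentiment': [1, 1, -1, -1],
})

data = prepare_data(data, lexicon)
print(data[['pos_count', 'neg_count']])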
Any ideas will be appreciated, thanks in advance!