After researching here on Stack Overflow, I came up with the code below to count the relative frequency of words in one of the columns of my dataframe:
import unidecode
import nltk
from collections import Counter

# Normalize accents and strip punctuation
df['objeto'] = df['objeto'].apply(unidecode.unidecode)
df['objeto'] = df['objeto'].str.replace(r'[^\w\s]', '', regex=True)

# Portuguese stop words plus some domain-specific terms
stop_words = nltk.corpus.stopwords.words('portuguese')
stop_words.extend(['12', 'termo', 'aquisicao', 'vinte', 'demandas'])

# Count every word that is not a stop word
counter = Counter()
for word in " ".join(df['objeto']).lower().split():
    if word not in stop_words:
        counter[word] += 1

print(counter.most_common(10))
for word, count in counter.most_common(100):
    print(word, count)
The problem is that the code is taking approximately 30 seconds to execute. What did I do wrong? Is there any way to optimize and improve my code? I intend to create a function like this to do it on other dataframes.
I'm a beginner with pandas and use it sparingly. I did some research here on Stack Overflow. Thank you.
It helps if you provide some sort of runnable example:
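For example, a small self-contained dataframe to test against (the 'objeto' strings here are invented placeholders, not the asker's real data):

import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe
df = pd.DataFrame({
    'objeto': [
        'Termo de aquisicao de material de escritorio',
        'Contrato para vinte demandas de manutencao',
    ]
})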
The main issue here is not using pandas to do the counting.
pandas has .value_counts(). In this case, you want to get all the words into a single column, which you can do with .explode(). You can .mask() the words that are .isin(stop_words) to drop them, then call .value_counts().
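Putting that together, a minimal sketch (assuming df and stop_words are already defined as in the question):

# One lowercase word per row: split each string, then explode the lists
words = (
    df['objeto']
    .str.lower()
    .str.split()
    .explode()
)

# .mask() turns stop words into NaN, which .value_counts() skips by default
counts = words.mask(words.isin(stop_words)).value_counts()

print(counts.head(10))

This keeps the counting vectorized inside pandas; the original per-word Python loop, which scans the stop-word list on every iteration, is the likely source of most of the 30 seconds.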