Taking too long to count word frequency in pandas dataframe

After researching here on StackOverflow, I came up with the code below to count the relative frequency of words in one of the columns of my dataframe:

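from collections import Counter

import nltk
import unidecode

df['objeto'] = df['objeto'].apply(unidecode.unidecode)
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['objeto'] = df['objeto'].str.replace(r'[^\w\s]', '', regex=True)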

stop_words = nltk.corpus.stopwords.words('portuguese')
stop_words.extend(['12', 'termo', 'aquisicao', 'vinte', 'demandas'])

counter = Counter()

for word in " ".join(df['objeto']).lower().split():
    if word not in stop_words:
        counter[word] += 1

print(counter.most_common(10))

for word, count in counter.most_common(100):
    print(word, count)

The problem is that the code is taking approximately 30 seconds to execute. What did I do wrong? Is there any way to optimize and improve my code? I intend to create a function like this to do it on other dataframes.

I'm a beginner in pandas and only use it occasionally. Thank you.

Answer by jqurious

It helps if you provide some sort of runnable example:

import pandas as pd

df = pd.DataFrame(dict(
    id = ['a', 'b', 'c', 'd'],
    objeto = ['Foo bar', 'hello Hi FOO', 'Yes hi Hello', 'Pythons PaNdas yeS']
))

stop_words = ['foo', 'bar']

The main issue here is that the counting is done in a Python loop rather than with pandas.

pandas has .value_counts()
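
As a quick illustration of what .value_counts() returns, here is a minimal sketch on a toy Series (the data is made up for this example and is not from the question):

```python
import pandas as pd

# Toy data, made up for illustration.
words = pd.Series(["hi", "hello", "hi", "yes"])

# value_counts() returns a Series indexed by the unique values,
# sorted by count in descending order.
counts = words.value_counts()
print(counts)
```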

In this case, you want to get all the words into a single column, which you can do with .explode():

df['objeto'].str.casefold().str.split().explode()
0        foo
0        bar
1      hello
1         hi
1        foo
2        yes
2         hi
2      hello
3    pythons
3     pandas
3        yes
Name: objeto, dtype: object

You can use .mask() to blank out the words that are .isin(stop_words) (they become NaN, which .value_counts() ignores by default), then call .value_counts():

df['objeto'].str.casefold().str.split().explode().mask(lambda word: word.isin(stop_words)).value_counts()
objeto
hello      2
hi         2
yes        2
pythons    1
pandas     1
Name: count, dtype: int64
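
Since you mention wanting a function to reuse on other dataframes, here is one way the steps above could be wrapped up. This is only a sketch: the name word_frequencies and its parameters are hypothetical, and it filters with boolean indexing instead of .mask(), which should give the same counts. Passing normalize=True makes .value_counts() return relative frequencies rather than raw counts.

```python
import pandas as pd


def word_frequencies(texts, stop_words=(), normalize=False):
    """Count words in a Series of strings, skipping stop words.

    word_frequencies is a hypothetical helper name, not part of pandas.
    """
    # One row per word: lowercase, split on whitespace, then flatten the lists.
    words = texts.str.casefold().str.split().explode()
    # Drop missing values (from empty or NaN rows) and stop words.
    words = words[words.notna() & ~words.isin(set(stop_words))]
    # normalize=True returns relative frequencies instead of raw counts.
    return words.value_counts(normalize=normalize)


df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "objeto": ["Foo bar", "hello Hi FOO", "Yes hi Hello", "Pythons PaNdas yeS"],
})

counts = word_frequencies(df["objeto"], stop_words=["foo", "bar"])
relative = word_frequencies(df["objeto"], stop_words=["foo", "bar"], normalize=True)
print(counts.head(10))
```

For the top 100 words you can take counts.head(100), because the result is already sorted by count.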