Stemming words within a column

I need to stem the words in the Words column of this dataframe:

         D           Words
0        2020-06-19  excellent
1        2020-06-19  make
2        2020-06-19  many
3        2020-06-19  game
4        2020-06-19  play
...      ...         ...
3042607  2020-07-28  praised
3042608  2020-07-28  playing
3042609  2020-07-28  made
3042610  2020-07-28  terms
3042611  2020-07-28  bad

I have tried to use PorterStemmer as follows:

from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer() 
for w in df.Words: 
    print(w, " : ", ps.stem(w)) 

but this does not give me the output I want. I need to keep the date (D) information, so at the end I should have a similar dataset with the stemmed words in the Words column, something like this:

         D           Words
0        2020-06-19  excellent
1        2020-06-19  make
2        2020-06-19  many
3        2020-06-19  game
4        2020-06-19  play
...      ...         ...
3042607  2020-07-28  praise
3042608  2020-07-28  play
3042609  2020-07-28  make
3042610  2020-07-28  terms
3042611  2020-07-28  bad

Any tips will be welcomed.

Best answer

When I run your code

ps = PorterStemmer() 
for w in df.Words: 
    print(w, " : ", ps.stem(w)) 

it prints each word : stem pair correctly (at least according to the PorterStemmer).

If you want the stem as a column in your dataframe, create a new column by applying the ps.stem function over the whole Words column, like this:

df['stem'] = df.Words.apply(ps.stem)

This gives your dataframe the following form:

    D           Words     stem
0   2020-06-19  excellent excel
1   2020-06-19  make      make
2   2020-06-19  many      mani
3   2020-06-19  game      game
4   2020-06-19  play      play

and so now you can use the stem column for any further analysis without dropping the rest of the data.
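
For reference, here is a minimal self-contained sketch of the same idea. The small dataframe is only a stand-in for yours (same column names, first five rows), and the stem_of dictionary is a name I made up for illustration. With about 3 million rows, many words are likely repeated, so stemming each distinct word once and mapping the result back may be faster than calling ps.stem on every row, though I have not timed it on your data.

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Toy stand-in for the dataframe in the question (assumed column names D and Words).
df = pd.DataFrame({
    "D": ["2020-06-19"] * 5,
    "Words": ["excellent", "make", "many", "game", "play"],
})

ps = PorterStemmer()

# Stem each distinct word only once, then look the stem up for every row.
stem_of = {w: ps.stem(w) for w in df["Words"].unique()}
df["stem"] = df["Words"].map(stem_of)

print(df)
```

If you would rather keep only D and Words, as in your desired output, assign the result back to the existing column instead: df["Words"] = df["Words"].map(stem_of).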