I need to apply stemming to the Words column of the following dataframe:
D Words
0 2020-06-19 excellent
1 2020-06-19 make
2 2020-06-19 many
3 2020-06-19 game
4 2020-06-19 play
... ... ...
3042607 2020-07-28 praised
3042608 2020-07-28 playing
3042609 2020-07-28 made
3042610 2020-07-28 terms
3042611 2020-07-28 bad
I have tried to use PorterStemmer to do it as follows:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
for w in df.Words:
    print(w, " : ", ps.stem(w))
but I do not get the desired output (stemmed words). I need to keep the date (D) information, so at the end I should have a similar dataset, but with the words in the Words column replaced by their stems, something like this:
D Words
0 2020-06-19 excellent
1 2020-06-19 make
2 2020-06-19 many
3 2020-06-19 game
4 2020-06-19 play
... ... ...
3042607 2020-07-28 praise
3042608 2020-07-28 play
3042609 2020-07-28 make
3042610 2020-07-28 terms
3042611 2020-07-28 bad
Any tips will be welcomed.
When I run your code, it prints the word : stem structure correctly (according to the PorterStemmer, at least).

If you want to have the stem as a column in your dataframe, you'll need to create a new column by applying the ps.stem function over the whole Words column. Your dataframe then has a stem column next to D and Words, so you can use the stem column for any further analysis without dropping the rest of the data.
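The step described above can be sketched roughly as follows. This is a minimal example, not the original snippet: the four-row dataframe is a made-up stand-in for the real data, and only the column names D and Words and the new column name stem are taken from the question and answer.

```python
import pandas as pd
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical stand-in for the question's dataframe (same column names).
df = pd.DataFrame({
    "D": ["2020-06-19", "2020-06-19", "2020-07-28", "2020-07-28"],
    "Words": ["game", "play", "playing", "bad"],
})

# Apply the stemmer to every entry of Words and store the result in a
# new column, leaving D and Words untouched.
df["stem"] = df["Words"].apply(ps.stem)

print(df)
```

If you would rather overwrite the original column, assigning the result back to df["Words"] should work the same way. Keep in mind that Porter stemming is a rule-based suffix stripper, so irregular forms such as made are not mapped to make; that kind of result would probably need a lemmatizer instead. With around three million rows, it may also be faster to stem each distinct word once and map the results back, though that is an optimization I have not measured here.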