Stemming words with R

I'm having difficulty understanding how word stemming works in R.

In my example, I created the following corpus object:

a <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))

So a contains:

a[[1]]$content

[1] "device so much more funand  unlike most android torrent download clients"

The first word in this string is "device". I then created my term-document matrix:

b <- TermDocumentMatrix(a, control = list(stemming = TRUE)) 

and got this as output:

dimnames(b)$Terms
[1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"      "much"     "torrent" 
[10] "unlik"

What I'd like to know is why I lost the "e" in "device" and "unlike" but did not lose it in "more".

How can I prevent this from happening for these words and for others?

Thanks.

There are 2 best solutions below

Best answer

Another option is to use the MorphAdorner lemmatizer at Northwestern University. This answer has the code for the lemmatize(...) function.

library(tm)
# lemmatize() is not part of tm; it is the function defined in the linked answer.
a     <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))
words <- Terms(TermDocumentMatrix(a))
lemmatize(words)
#    android    clients     device   download     funand       more       most       much    torrent     unlike 
#  "android"   "client"   "device" "download"   "funand"     "more"     "most"     "much"  "torrent"   "unlike" 

As you can see, it removes the "s" from "clients" but not the "e" from "device".

I'm assuming you are using the tm and SnowballC packages.

Stemming in these packages uses the Porter stemming algorithm (for English).
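As for why "device" and "unlike" lose their final "e" while "more" keeps it: in Porter's original algorithm this comes down to a single rule (step 5a), which drops a trailing "e" only when the rest of the word is "long enough" by the algorithm's own measure. What follows is a rough sketch of that one rule in base R, not the code SnowballC actually runs. The function names are made up for illustration, and the sketch ignores every other step of the algorithm (none of them apply to these three words).

```r
# Classify each letter of a lower-case word as consonant (TRUE) or vowel (FALSE).
# Porter's definition: "y" counts as a vowel only when preceded by a consonant.
is_consonant <- function(word) {
  chars <- strsplit(word, "")[[1]]
  cons <- logical(length(chars))
  for (i in seq_along(chars)) {
    if (chars[i] %in% c("a", "e", "i", "o", "u")) {
      cons[i] <- FALSE
    } else if (chars[i] == "y") {
      cons[i] <- (i == 1) || !cons[i - 1]
    } else {
      cons[i] <- TRUE
    }
  }
  cons
}

# Porter's measure m: the number of vowel-run -> consonant-run transitions.
porter_measure <- function(stem) {
  cons <- is_consonant(stem)
  sum(!cons[-length(cons)] & cons[-1])
}

# Porter's "*o" condition: the stem ends consonant-vowel-consonant,
# and the last consonant is not w, x or y.
ends_cvc <- function(stem) {
  n <- nchar(stem)
  if (n < 3) return(FALSE)
  cons <- is_consonant(stem)
  last <- substr(stem, n, n)
  cons[n - 2] && !cons[n - 1] && cons[n] && !(last %in% c("w", "x", "y"))
}

# Step 5a only: drop a final "e" if m > 1, or if m == 1 and the stem
# does not end in a short consonant-vowel-consonant pattern.
drop_final_e <- function(word) {
  if (!grepl("e$", word)) return(word)
  stem <- substr(word, 1, nchar(word) - 1)
  m <- porter_measure(stem)
  if (m > 1 || (m == 1 && !ends_cvc(stem))) stem else word
}

vapply(c("device", "unlike", "more"), drop_final_e, character(1))
#  device  unlike    more
# "devic" "unlik"  "more"
```

Here "devic" and "unlik" both have measure 2, so the "e" goes. "mor" has measure 1 and ends in consonant-vowel-consonant, so the "e" stays. This behaviour is part of the algorithm's design, so it cannot be switched off for individual words through the stemmer itself.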

If you want to play around with stemming algorithms, you can run:

getStemLanguages()

and try the others. The only other English stemmer built in is:

wordStem(words, language = "english")

which, for your data, returns the same result:

 [1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"     "much"     "torrent" 
[10] "unlik"
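
If the goal is readable terms rather than truncated stems, one possible approach is to map each stem back to a full word after stemming. The sketch below assumes tm's stemCompletion() with the original words as the dictionary. The variable names are illustrative, and type = "shortest" is just one of the available completion heuristics, chosen here as an assumption about what is wanted.

```r
library(tm)
library(SnowballC)

# Words from the question's example sentence.
words <- c("android", "clients", "device", "download", "funand",
           "more", "most", "much", "torrent", "unlike")

# Stem first, as TermDocumentMatrix(..., stemming = TRUE) would.
stems <- wordStem(words, language = "english")

# Map each stem back to the shortest dictionary word that starts with it.
completed <- stemCompletion(stems, dictionary = words, type = "shortest")
completed
```

Note that completion can only return words present in the dictionary, so with this dictionary "client" is completed to "clients" rather than to a singular form. A lemmatizer, as in the other answer, is the better fit if true base forms are needed.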