Stemming words with R

Question

328 Views Asked by Tomer At 27 June 2025 at 22:03

I'm having difficulty understanding how word stemming works in R.

In my example, I created the following corpus object:

a <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))

So a is:

a[[1]]$content

[1] "device so much more funand  unlike most android torrent download clients"

The first word in this string is "device". I then created my term-document matrix:

b <- TermDocumentMatrix(a, control = list(stemming = TRUE))

and got this as output:

dimnames(b)$Terms
 [1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"     "much"     "torrent" 
[10] "unlik"

What I'd like to know is why I lost the "e" in "device" and "unlike" but did not lose it in "more".

How can I prevent this from happening for these words and for others?

Thanks.

There are 2 best solutions below

jeremycg On 26 August 2015 at 21:48

I'm assuming you are using the tm and SnowballC packages.

For English text, stemming in these packages uses the Porter stemming algorithm.

If you want to play around with other stemming algorithms, you can list the available ones with:

getStemLanguages()

and try the others. The only other built-in English stemmer is "english":

wordStem(words, language = "english")

which, for your data, returns the same result:

 [1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"     "much"     "torrent" 
[10] "unlik"
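
As for why "more" keeps its "e": as far as I understand the original Porter algorithm, its last step drops a trailing "e" when the remaining stem has a measure greater than 1 (roughly, more than one vowel-consonant sequence), as with "devic" and "unlik". For a stem of measure 1, the "e" is kept when the stem ends consonant-vowel-consonant, as "mor" does. A minimal sketch to check this on the three words from the question, assuming SnowballC is installed (the sample vector and variable names are only for illustration):

```r
library(SnowballC)

# Three words from the question; "porter" is wordStem()'s default language.
words <- c("device", "unlike", "more")
stems <- wordStem(words, language = "porter")

# Show each word next to its stem so dropped endings are easy to spot.
print(setNames(stems, words))

# List only the words whose stem differs from the original form.
changed <- words[stems != words]
print(changed)
```

Here "device" and "unlike" should show up as changed, while "more" should not.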

**jlhoward** · Accepted Answer

Another option is to use the MorphAdorner lemmatizer at Northwestern University. This answer has the code for the lemmatize(...) function.

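library(tm)
# Build the term list without stemming so the terms keep their original form
a     <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))
words <- Terms(TermDocumentMatrix(a))
# lemmatize() is not part of tm; it is the function defined in the answer mentioned above
lemmatize(words)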
#    android    clients     device   download     funand       more       most       much    torrent     unlike 
#  "android"   "client"   "device" "download"   "funand"     "more"     "most"     "much"  "torrent"   "unlike"

As you can see, it removes the "s" from "clients" but not the "e" from "device".
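
If only a handful of words matter, another way to keep them intact is to skip stemming for just those words. A minimal sketch, assuming SnowballC is installed; stem_except() and the keep list are hypothetical names made up for illustration, not part of tm or SnowballC:

```r
library(SnowballC)

# Hypothetical helper: stem every word except those listed in `keep`.
stem_except <- function(words, keep = character(0), language = "english") {
  protected <- words %in% keep          # TRUE for words that must stay as-is
  out <- words
  out[!protected] <- wordStem(words[!protected], language = language)
  out
}

words  <- c("device", "unlike", "more", "clients")
result <- stem_except(words, keep = c("device", "unlike"))
print(result)
```

The trade-off is that a protected word is no longer merged with its inflected forms: "device" stays as it is, while "devices" would still be stemmed to "devic".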
