Extracting all possible ngrams in R tm document term matrix


I am using the "tm" package in R to create a document-term matrix, with "RWeka" supplying the tokenizer that extracts trigrams, as shown in the code below:

library(tm)
library(RWeka)

myCorpus <- VCorpus(VectorSource(reddata$Tweet))

# Tokenizer that emits trigrams only (min = max = 3)
TriTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm <- DocumentTermMatrix(myCorpus, control = list(tokenize = TriTok))

The problem is that RWeka seemingly just walks through the words in order and cuts after every third word, so the trigrams never overlap. For example, the sentence

"On hot summer days I enjoy eating ice cream"

would be split into

"On hot summer"    "days I enjoy"    "eating ice cream"

But an overlapping trigram such as

"hot summer days"

would be ignored. Is there a way to get RWeka to include all trigrams (a window that advances one word at a time), or is there another option?
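To make the expected output concrete, here is a minimal base-R sketch of the sliding window I have in mind. The helper name `all_ngrams` is my own invention for illustration and is not part of tm or RWeka; it also assumes words are separated by whitespace only, with no punctuation handling.

```r
# Hypothetical helper (not from tm or RWeka): slide a window of n words
# over the sentence, advancing one word at a time.
all_ngrams <- function(sentence, n = 3) {
  # Split on runs of whitespace (assumption: no punctuation handling)
  words <- strsplit(trimws(sentence), "\\s+")[[1]]
  # Sentences shorter than n words contain no n-grams
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

all_ngrams("On hot summer days I enjoy eating ice cream")
```

For the nine-word example sentence this yields seven trigrams, including "hot summer days", rather than the three non-overlapping ones listed above.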

Thanks in advance!
