I am using the "tm" package in R to create a document-term matrix, and "RWeka" to extract trigrams, as shown in the code below:
library(tm)
library(RWeka)

myCorpus <- VCorpus(VectorSource(reddata$Tweet))
# Tokenizer function that extracts trigrams (n-grams with min = max = 3)
TriTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- DocumentTermMatrix(myCorpus, control = list(tokenize = TriTok))
The problem is that RWeka seemingly just goes through the words in order and starts a new trigram after every third word, so the trigrams never overlap. For example, the sentence:
On hot summer days I enjoy eating ice cream
would be split into
"On hot summer" "days I enjoy" "eating ice cream"
But a phrase such as
"hot summer days"
would be ignored. Is there a way to get RWeka to include all overlapping trigrams, or is there another option?
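To make the expected output concrete, here is a minimal base R sketch of the sliding window I have in mind. Note that all_trigrams is a hypothetical helper written only for illustration; it is not part of tm or RWeka, and it assumes words are separated by single spaces.

```r
# Hypothetical helper (not part of tm or RWeka): returns every
# overlapping trigram in a sentence whose words are separated by
# single spaces.
all_trigrams <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  n <- length(words)
  # Fewer than three words means there are no trigrams at all.
  if (n < 3) return(character(0))
  # Slide a three-word window forward one word at a time.
  vapply(seq_len(n - 2),
         function(i) paste(words[i:(i + 2)], collapse = " "),
         character(1))
}

trigrams <- all_trigrams("On hot summer days I enjoy eating ice cream")
print(trigrams)
```

For the nine-word example sentence this yields seven trigrams, including "hot summer days", which is the kind of result I would like the tokenizer to produce.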
Thanks in advance!