Extracting all possible ngrams in R tm document term matrix

1.2k Views Asked by Sebastian At 07 June 2025 at 06:46

I am using the "tm" package in R to build a document-term matrix, with an "RWeka" tokenizer to extract trigrams, as shown in the code below:

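library(tm)
library(RWeka)

# Build a corpus with one document per tweet
myCorpus <- VCorpus(VectorSource(reddata$Tweet))

# Tokenizer that should emit trigrams only (min = max = 3)
TriTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- DocumentTermMatrix(myCorpus, control = list(tokenize = TriTok))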

The problem is that RWeka seems to walk through the words and cut after every third one, producing non-overlapping chunks instead of all trigrams. For example, the sentence

 On hot summer days I enjoy eating ice cream

would be split into

"On hot summer"    "days I enjoy"    "eating ice cream"

But an overlapping phrase such as

"hot summer days"

would be ignored. Is there a way to get RWeka to include all overlapping trigrams, or is there another option?
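
To make the expected result concrete, here is a minimal base R sketch of the sliding-window behaviour I am after. The helper name all_ngrams is made up for illustration and is not part of tm or RWeka. It assumes that words are separated by whitespace and does no other cleaning.

```r
# Hypothetical helper (not part of tm or RWeka): returns every overlapping
# n-gram of a text by sliding a window of n words forward one word at a time.
all_ngrams <- function(text, n = 3) {
  # Flatten to a single string, then split on runs of whitespace
  words <- strsplit(trimws(paste(text, collapse = " ")), "\\s+")[[1]]
  words <- words[nzchar(words)]

  # Fewer than n words means there is no n-gram to return
  if (length(words) < n) return(character(0))

  # One n-gram per possible start position
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- all_ngrams("On hot summer days I enjoy eating ice cream")
print(trigrams)
```

For the nine-word example sentence this gives seven trigrams, with "hot summer days" among them. A wrapper such as function(x) all_ngrams(as.character(x), 3) might work in place of TriTok in the control list, but that relies on my assumption about how tm hands each document to the tokenizer, and I have not verified it.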

Thanks in advance!
