text = c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')
I want to extract 1-gram tokens for most words, and 2-gram tokens for words such as extremely, no, and not.
For example, when I get the tokens they should be as below: the, nurse, was, extremely helpful, she, truly, gem, helping, no issue, not bad
These are the terms that should show up in the term document matrix.
Thank you for the help!!
Here is a possible solution (assuming you want to avoid splitting not only after c("extremely", "no", "not"), but also after words similar to them). The package qdapDictionaries has dictionaries for amplification.words (like "extremely"), negation.words (like "no" and "not"), and more.

Here is an example of how to split on a space except when the space follows a word in a predefined vector (here we define the vector using amplification.words, negation.words, and deamplification.words from qdapDictionaries). You can change the definition of no_split_words if you want to use a more customized list of words.

performing split
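
A minimal sketch of the split step, written in base R as a word-gluing loop rather than a regex split. The helper name split_except_after is hypothetical, and the fallback to the asker's three words when qdapDictionaries is not installed is my own assumption. Note that the dictionaries contain many more words than the three in the question (for instance, I believe "truly" is among the amplification words), so the tokens may differ from the asker's list depending on which vector you use.

```r
text <- c('the nurse was extremely helpful', 'she was truly a gem',
          'helping', 'no issue', 'not bad')

# Words after which the following space should NOT be split on.
# Assumption: fall back to the asker's three words if the package is missing.
no_split_words <- if (requireNamespace("qdapDictionaries", quietly = TRUE)) {
  c(qdapDictionaries::amplification.words,
    qdapDictionaries::negation.words,
    qdapDictionaries::deamplification.words)
} else {
  c("extremely", "no", "not")
}

# Hypothetical helper: split each string on single spaces, then glue a word
# onto the previous token whenever the previous word is in no_split_words.
split_except_after <- function(x, no_split_words) {
  lapply(strsplit(x, " ", fixed = TRUE), function(words) {
    if (length(words) == 0) return(character(0))
    # TRUE where the preceding word is protected, so the space before is kept
    glue <- c(FALSE, head(tolower(words), -1) %in% no_split_words)
    # every non-glued word starts a new token
    group <- cumsum(!glue)
    unname(vapply(split(words, group), paste, character(1), collapse = " "))
  })
}

tokens <- split_except_after(text, no_split_words)
tokens
```

Passing c("extremely", "no", "not") as the second argument restricts the 2-gram behaviour to exactly the words in the question.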
creating dtm with tidytext (assumes above code chunk was already run)
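
A rough sketch of the dtm step. So that the chunk runs on its own, it restates a compact version of the split and uses only the asker's three words (an assumption, swap in your own no_split_words). The counting is done in base R, and the cast uses tidytext::cast_dtm, which also needs the tm package installed, so that last part is guarded. The names term_df and counts are just illustrative.

```r
text <- c('the nurse was extremely helpful', 'she was truly a gem',
          'helping', 'no issue', 'not bad')
no_split_words <- c("extremely", "no", "not")  # assumption: asker's words only

# Compact version of the split: glue a word to the previous token when the
# previous word is in no_split_words.
tokens <- lapply(strsplit(text, " ", fixed = TRUE), function(words) {
  glue <- c(FALSE, head(words, -1) %in% no_split_words)
  unname(vapply(split(words, cumsum(!glue)), paste, character(1), collapse = " "))
})

# One row per token occurrence, tagged with the document it came from
term_df <- data.frame(
  doc_id = rep(seq_along(tokens), lengths(tokens)),
  term   = unlist(tokens),
  n      = 1L,
  stringsAsFactors = FALSE
)

# Term counts per document (one row per document/term pair)
counts <- aggregate(n ~ doc_id + term, data = term_df, FUN = sum)

# Cast to a DocumentTermMatrix if tidytext and tm are available
if (requireNamespace("tidytext", quietly = TRUE) &&
    requireNamespace("tm", quietly = TRUE)) {
  dtm <- tidytext::cast_dtm(counts, doc_id, term, n)
  print(dtm)
}
```

The terms of the resulting matrix include "extremely helpful", "no issue", and "not bad" as single entries, alongside the ordinary 1-gram terms.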