TextMining in R - Extracting 2 gram for only few terms and 1 gram for rest

Question

363 Views Asked by MysticRenge At 07 June 2025 at 06:56

text = c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')

I want to extract 1-gram tokens for most words, and 2-gram tokens for words such as extremely, no, and not.

For example, the tokens I get should be: the, nurse, was, extremely helpful, she, truly, gem, helping, no issue, not bad

These are the terms that should appear in the term-document matrix.

Thank you for the help!!

**Adam Spannbauer** · Accepted Answer

Here is one possible solution. It assumes you want to keep not only c("extremely", "no", "not") attached to the following word, but also words similar to them. The qdapDictionaries package has word lists such as amplification.words (like "extremely"), negation.words (like "no" and "not"), and more.

The example below splits on a space except when the space follows a word in a predefined vector. Here the vector is built from amplification.words, negation.words, and deamplification.words from qdapDictionaries. Change the definition of no_split_words if you want a more customized list of words.

performing split

library(stringr)
library(qdapDictionaries)

text <-  c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')

# define the words after which we don't want to split on a space
no_split_words <- c(amplification.words, negation.words, deamplification.words)
# collapse the words into the form "word1|word2|...|wordn"
regex_or       <- paste(no_split_words, collapse="|")
# split on whitespace unless the preceding text ends with one of no_split_words
# (paste0 rather than paste, so no stray spaces end up inside the pattern)
split_regex    <- regex(paste0("(?<!", regex_or, ")\\s"))

# perform split
str_split(text, split_regex)

#output
[[1]]
[1] "the"               "nurse"             "was"               "extremely helpful"

[[2]]
[1] "she"     "was"     "truly a" "gem"    

[[3]]
[1] "helping"

[[4]]
[1] "no issue"

[[5]]
[1] "not bad"
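
If you only want the three words from the question kept with the following word, a hand-picked vector can replace the dictionaries. The sketch below is one way to do that, not the only one: the names keep_with_next and split_pattern are made up for illustration, and the added \\b is an assumption on my part, so that a word that merely ends in "no" (the extra "casino issue" string) is still split normally.

```r
library(stringr)

text <- c("the nurse was extremely helpful", "she was truly a gem",
          "helping", "no issue", "not bad", "casino issue")

# hypothetical hand-picked list: words that should stay glued to the next word
keep_with_next <- c("extremely", "no", "not")

# split on whitespace unless it directly follows one of the listed whole words;
# \\b anchors the start of the word so "casino" does not count as "no"
split_pattern <- paste0("(?<!\\b(?:",
                        paste(keep_with_next, collapse = "|"),
                        "))\\s")

tokens <- str_split(text, regex(split_pattern))
tokens
```

With this list "truly a" is no longer kept together, because "truly" is not in keep_with_next, which is closer to the token list asked for in the question.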

creating dtm with `tidytext`

(assumes above code chunk was already run)

library(tidytext)
library(dplyr)

doc_df <- tibble(text = text) %>%   # tibble() replaces the deprecated data_frame()
  mutate(doc_id = row_number())

# cast_dtm() builds a DocumentTermMatrix (the class used by the tm package)
# value = 1 for every token gives a binary dtm
# use term frequency, tf-idf, etc. as value for a non-binary dtm
tm_dtm <- doc_df %>% 
  unnest_tokens(tokens, text, token="regex", pattern=split_regex) %>% 
  mutate(value = 1) %>%  
  cast_dtm(doc_id, tokens, value)

# can coerce to matrix if desired
matrix_dtm <- as.matrix(tm_dtm)
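
For a non-binary dtm, one option is to count the tokens per document before casting. This is a minimal sketch under a few assumptions: the split pattern uses a small hand-picked word list rather than the qdapDictionaries vectors, the example text is made up so that one token repeats, and tf_dtm / tf_matrix are illustrative names. It also assumes the tm package is installed, since cast_dtm() needs it.

```r
library(dplyr)
library(tidytext)

# made-up example text where "no issue" occurs twice in the first document
text <- c("no issue no issue", "not bad", "no issue")

# hypothetical hand-picked pattern: do not split after extremely / no / not
split_pattern <- "(?<!\\b(?:extremely|no|not))\\s"

tf_dtm <- tibble(doc_id = seq_along(text), text = text) %>%
  unnest_tokens(tokens, text, token = "regex", pattern = split_pattern) %>%
  count(doc_id, tokens, name = "value") %>%   # term frequency per document
  cast_dtm(doc_id, tokens, value)

# coerce to a plain matrix to inspect the counts
tf_matrix <- as.matrix(tf_dtm)
tf_matrix
```

Note that unnest_tokens() lowercases tokens by default; pass to_lower = FALSE if case matters for your terms.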
