I know this has been asked multiple times, for example in:
Finding 2 & 3 word Phrases Using R TM Package
However, none of those solutions work with my data. The result is always single words (unigrams), no matter which n I choose (2, 3 or 4) for the n-grams.
Does anybody know why? I suspect the encoding is the reason.
Edit: here is a small part of the data.
comments <- c("Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into problem_70918\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into tm-247\n",
"Merge branch 'php5.3-upgrade-sprint6-7' of git.internal.net:/git/pn-project/LegacyCodebase into release2012.08\n",
"Merge remote-tracking branch 'dmann1/p71148-s3-callplan_mapping' into lcst-operational-changes\n",
"Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into TASK-360148\n",
"Merge remote-tracking branch 'grockett/rpr-pre' into rpr-lite\n"
)
library(tm)  # also attaches NLP, which provides ngrams() and words()

cleanCorpus <- function(vector){
  corpus <- Corpus(VectorSource(vector), readerControl = list(language = "en_US"))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, tolower)
  #corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  #corpus <- tm_map(corpus, PlainTextDocument)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  return(corpus)
}
# this function is provided by a team member (in the link I posted above)
test <- function(keywords_doc){
  BigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  # creating of document matrix
  keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
  # remove sparse terms
  keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.99)
  # Frequency of the words appearing
  keyword.freq <- rowSums(as.matrix(keywords_naremoval))
  subsetkeyword.freq <- subset(keyword.freq, keyword.freq >= 20)
  frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)
  # Sorting of the words
  frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
  frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
  frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]
  # Printing of the words
  # wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2"))
  return(frequentKeywordDF)
}
corpus <- cleanCorpus(comments)
t <- test(corpus)
> head(t)
             term freq
added       added    6
html         html    6
tracking tracking    6
common     common    4
emails     emails    4
template template    4
Thanks,
I haven't found the reason either, but if you are only interested in the counts, regardless of which documents the bigrams occurred in, you could get them with this pipeline instead:
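As a minimal sketch of such a pipeline (not necessarily the one intended above), the following counts n-grams over all documents using base R only, so it does not depend on how `TermDocumentMatrix` handles the `tokenize` option. The helper name `count_ngrams` and the vector `sample_comments` are made up for illustration, and the cleaning step (lowercase, keep letters only) is an assumption that only roughly mirrors `cleanCorpus`; stop words are not removed here.

```r
# Hypothetical helper: count n-grams across all documents with base R only.
count_ngrams <- function(texts, n = 2) {
  # Lowercase, then turn every character that is not a letter or
  # whitespace into a space (assumed cleaning step, adjust as needed).
  clean <- gsub("[^a-z[:space:]]", " ", tolower(texts))
  tokens <- strsplit(trimws(clean), "[[:space:]]+")

  grams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    # Documents shorter than n words contribute no n-grams.
    if (length(w) < n) return(character(0))
    # Slide a window of width n over the words of one document.
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))

  # Tabulate and sort by descending frequency.
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(term = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Two of the commit messages from the question, used as sample input.
sample_comments <- c(
  "Merge branch 'master' of git.internal.net:/git/live/LegacyCodebase into tm-247\n",
  "Merge remote-tracking branch 'grockett/rpr-pre' into rpr-lite\n"
)

head(count_ngrams(sample_comments, n = 2))
```

Calling `count_ngrams(comments, n = 3)` on the full vector would give trigram counts in the same way. Because n-grams are built per document, no phrase spans two commit messages.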