I had some text-mining code in R using the tm package that was working well. Then I updated R, along with the tm and RWeka packages. Now the code does not work, and I am not sure why.
My original guide for the code came from: https://gist.github.com/benmarwick/6127413
Neither the linked code nor my code (below) gives the desired results at this point. When my code executed successfully (under previous versions of the packages), it provided n-grams that contained a specific key word. It also provided an ordered list of words according to their distance from the key word within the set of n-grams.
There are two specific problems:
- One tm feature that generates an error each time (and that may be causing the second problem) is PlainTextDocument. That line of code is:
eventdocs <- tm_map(eventdocs, PlainTextDocument)
The next line of code is:
eventdtm <- DocumentTermMatrix(eventdocs)
When trying to create the document-term matrix (eventdtm), the code gives the error:
Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus), : 'i, j' invalid
I have updated everything, including Java, and the error still arises.
I commented out the PlainTextDocument line, because the text I am using is already in .txt format and I found some people saying this step was not necessary. When I do this, the document-term matrix is formed (or seems to be formed accurately). But I would like to resolve this error, because I previously encountered problems when that line did not execute.
- But, regardless of this, there seems to be a problem in the formation of the n-grams. The first block is the most suspect to me. I am not sure the NGramTokenizer is doing what it should.
That code is:
span <- 4
span1 <- 1 + span * 2
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = span1, max = span1))
dtmevents <- TermDocumentMatrix(eventdocs, control = list(tokenize = ngramTokenizer))
#find ngrams that have the key word of interest
word <- "keyword"
subset_ngrams <- dtmevents$dimnames$Terms[grep(word, dtmevents$dimnames$Terms)]
subset_ngrams <- subset_ngrams[sapply(subset_ngrams, function(i) {
  tmp <- unlist(strsplit(i, split=" "))
  tmp <- tmp[length(tmp) - span]
  tmp} == word)]
allwords <- paste(subset_ngrams, collapse = " ")
uniques <- unique(unlist(strsplit(allwords, split=" ")))
The uniques set of words contains only the key word of interest; all of the other high-frequency collocates are missing, so I know the code is not working. Any help or leads would be appreciated. It took a long time to get things working originally, and since the updates I have been out of action. Thank you.
It's a tm package version issue. You need to install version 0.6-2. Solution:
require(devtools)
install_version("tm", version = "0.6-2", repos = "http://cran.r-project.org")
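To tell whether the tokenizer or the filtering step is at fault after changing versions, the filter from the question can be run on hand-built n-grams in base R, without tm or RWeka. This is only a rough sketch: the toy sentence and the names text, tokens, starts, ngrams and centre are assumptions made for illustration, and the hand-built n-grams are a stand-in for NGramTokenizer output, not a reproduction of it.

```r
# Toy input; span, span1 and word mirror the variables in the question.
text  <- "a b c d keyword e f g h i"
span  <- 4
span1 <- 1 + span * 2
word  <- "keyword"

# Split into tokens and build every contiguous run of span1 tokens
# (a hand-built stand-in for what the tokenizer is expected to return).
tokens <- unlist(strsplit(text, split = " "))
starts <- seq_len(length(tokens) - span1 + 1)
ngrams <- vapply(
  starts,
  function(s) paste(tokens[s:(s + span1 - 1)], collapse = " "),
  character(1)
)

# Keep n-grams that mention the key word at all ...
subset_ngrams <- ngrams[grep(word, ngrams, fixed = TRUE)]

# ... then keep only those where the key word is the centre token,
# using the same index as the question: length(tmp) - span.
centre <- vapply(
  strsplit(subset_ngrams, split = " "),
  function(p) p[length(p) - span],
  character(1)
)
subset_ngrams <- subset_ngrams[centre == word]

# Collect the distinct words found in the surviving n-grams.
uniques <- unique(unlist(strsplit(paste(subset_ngrams, collapse = " "), split = " ")))

print(subset_ngrams)
print(uniques)
```

If uniques contains the surrounding words here but only the key word in the real run, the filter is probably fine and the terms in dtmevents$dimnames$Terms are likely not 9-word n-grams; printing the first few of them should show what the tokenizer actually produced.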