Finding n-grams in tm does not work following package updates

I had some text-mining code in R using the tm package that was working well. Then I updated R along with the tm and RWeka packages. Now the code does not work, and I am not sure why.

My original guide for the code came from: https://gist.github.com/benmarwick/6127413

Neither that code (linked above) nor my code (below) gives the desired results at this point. When my code executed successfully (under previous versions of the packages), it provided the n-grams that contained a specific key word. It also provided a list of words ordered by their distance from the key word within that set of n-grams.

There are two specific problems:

  1. One tm feature that generates an error each time (and may be causing the second problem) is PlainTextDocument. That line of code is:

eventdocs <- tm_map(eventdocs, PlainTextDocument)

The next line of code is:

eventdtm <- DocumentTermMatrix(eventdocs)   

When trying to create the document-term matrix (eventdtm), the code gives the error:

Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus), : 'i, j' invalid

I have updated everything, including Java, and the error still arises.

I commented out the PlainTextDocument line, since the text I am using is already in .txt format and some people have said this step is not necessary. When I do this, the document-term matrix is formed (or seems to be formed correctly). But I would like to resolve this error, because I previously encountered problems when that line did not execute.

  2. Regardless of this, there seems to be a problem in the formation of the n-grams. The first block is the most suspect to me. I am not sure the NGramTokenizer is doing what it should.

That code is:

span <- 4 
span1 <- 1 + span * 2 
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = span1, max = span1))
dtmevents <- TermDocumentMatrix(eventdocs, control = list(tokenize = ngramTokenizer))

#find ngrams that have the key word of interest
word <- "keyword"
subset_ngrams <- dtmevents$dimnames$Terms[grep(word, dtmevents$dimnames$Terms)]

# keep only the n-grams whose middle token is the key word
subset_ngrams <- subset_ngrams[sapply(subset_ngrams, function(i) {
  tmp <- unlist(strsplit(i, split = " "))
  tmp[length(tmp) - span] == word
})]

allwords <- paste(subset_ngrams, collapse = " ")
uniques <- unique(unlist(strsplit(allwords, split=" ")))

The resulting uniques vector contains only the key word of interest; all of the other high-frequency collocates are missing, so I know the code is not working. Any help or leads would be appreciated. It took a long time to get things working originally, and since the updates I have been out of action. Thank you.
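To separate the filtering logic from the tokenizer, the same steps can be exercised in base R, without tm or RWeka. The following is only a minimal sketch: the helper make_ngrams and the toy sentence are invented for illustration and are not part of the original code, and span is reduced to 1 so the output stays readable.

```r
# Hypothetical helper (not from tm or RWeka): all n-grams of width n
# from a character vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(s) paste(words[s:(s + n - 1)], collapse = " "),
         character(1))
}

span  <- 1              # words kept on each side of the key word (4 in the question)
span1 <- 1 + span * 2   # n-gram width
word  <- "keyword"

# Toy text, invented for this example only
text  <- "the cat saw keyword near the mat and keyword ran"
words <- strsplit(text, " ", fixed = TRUE)[[1]]
ngrams <- make_ngrams(words, span1)

# Keep only the n-grams whose middle token is the key word
tokens <- strsplit(ngrams, " ", fixed = TRUE)
middle <- vapply(tokens, function(tok) tok[length(tok) - span], character(1))
subset_ngrams <- ngrams[middle == word]

# Unique words that occur in those n-grams
uniques <- unique(unlist(strsplit(subset_ngrams, " ", fixed = TRUE)))
print(subset_ngrams)
print(uniques)
```

If this behaves as expected while the tm version returns only the key word, one explanation consistent with that symptom is that the terms in dtmevents are single words rather than n-grams: for a one-word term, tmp[length(tmp) - span] is a negative index that returns the word itself, so the filter passes and uniques collapses to the key word. Inspecting head(dtmevents$dimnames$Terms) would show whether the custom tokenizer is being applied.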

There is 1 best solution below

This is a tm package version issue. You need to install version 0.6-2. Solutions:

  1. In code (faster):

require(devtools)
install_version("tm", version = "0.6-2", repos = "http://cran.r-project.org")

  2. If that doesn't work, download the package and install it manually.
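Both routes can be sketched together as below. This is a rough outline rather than a tested recipe: needs_install and install_target_tm are hypothetical helper names, and the tarball argument stands for wherever a downloaded tm 0.6-2 source archive was saved. Nothing is installed until install_target_tm() is called by hand.

```r
# Target version taken from the answer above
target <- "0.6-2"

# Hypothetical helper: TRUE when the installed version differs from the
# target, or when the package is not installed at all (signalled by NA).
needs_install <- function(installed, target) {
  is.na(installed) || package_version(installed) != package_version(target)
}

# Look up the installed tm version without attaching the package
installed_tm <- if (requireNamespace("tm", quietly = TRUE)) {
  as.character(utils::packageVersion("tm"))
} else {
  NA_character_
}

# Hypothetical wrapper; call it manually, since it changes the package library.
# With no argument it uses devtools (option 1); with a path to a downloaded
# source tarball it installs from that file (option 2).
install_target_tm <- function(tarball = NULL) {
  if (is.null(tarball)) {
    devtools::install_version("tm", version = target,
                              repos = "http://cran.r-project.org")
  } else {
    install.packages(tarball, repos = NULL, type = "source")
  }
}

if (needs_install(installed_tm, target)) {
  message("tm ", installed_tm, " is installed; run install_target_tm() to get ", target)
}
```

After installing, restarting R and checking packageVersion("tm") confirms which version is actually loaded.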