I'm trying to plot bigrams from a sample of free-text comments about meetings held during the last month. I'm using the following method (from the RWeka package):
library(tm)
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
dtm <- TermDocumentMatrix(modif.corpus.irri.aff(MyComments),
control = list(tokenize = BigramTokenizer))
where modif.corpus.irri.aff() is my function that converts the comments to a corpus (it also stems the documents, by the way).
To display the bar plot, the end of the code is this:
dm <- as.matrix(t(dtm))
v <- apply(dm,2,sum)
v <- sort(v, decreasing = TRUE)
v_top <- sort(v[1:nb.terms])
barplot(v_top, horiz=TRUE, cex.names = 0.5,
las = 1, col=grey.colors(10), main="title",
names.arg = names(v_top))
This works quite well, but I want to display "pair occurrences" rather than "bigram occurrences", because I want to count the ideas expressed, not the exact bigrams.
Just an example to be sure:
I want to merge the bar for "long meeting_" with the one for "meeting_ long", because they express the same idea: meetings were too long.
Is there a control parameter in NGramTokenizer that handles this distinction? Or something else to add?
I think the tokenizer behaves as expected: "long meeting_" and "meeting_ long" are different tokens. To get what you want, you can post-process the n-grams (with min = 2 and max = 3 you also get trigrams) and merge those that contain the same words in a different order.
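A minimal sketch of that post-processing step, using base R only. The counts below are made up and stand in for the vector v from the question; canonical_key is a hypothetical helper, not something from tm or RWeka. The idea is to sort the words inside each n-gram so that both orderings map to the same key, then sum the counts per key.

```r
# Made-up counts standing in for `v` (n-gram frequencies, named by n-gram).
v <- c("long meeting_" = 4, "meeting_ long" = 3, "good room" = 2)

# Hypothetical helper: sort the words of an n-gram so word order no longer matters.
canonical_key <- function(term) {
  words <- strsplit(term, " ", fixed = TRUE)[[1]]
  paste(sort(words), collapse = " ")
}

keys <- vapply(names(v), canonical_key, character(1), USE.NAMES = FALSE)

# Sum the counts of all n-grams that share the same key.
v_merged <- vapply(split(v, keys), sum, numeric(1))
v_merged <- sort(v_merged, decreasing = TRUE)
```

The merged vector can then replace v in the barplot code. Note that the resulting label is the alphabetically sorted form of the n-gram, which may not match either original word order.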
Or you can write your own tokenizer, which is fairly simple: after splitting the text into bigrams or trigrams, it applies the same check and merges the n-grams that contain the same words.
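A rough sketch of such a tokenizer, written in base R instead of RWeka and limited to bigrams for brevity. pair_tokenizer is a hypothetical name, and it assumes the document can be coerced to text with as.character() and that words are separated by whitespace; adapt it to however your corpus stores its content.

```r
# Hypothetical order-insensitive bigram tokenizer (base R only).
pair_tokenizer <- function(x) {
  # Assumption: the document coerces to plain text via as.character().
  text <- paste(as.character(x), collapse = " ")
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))

  first <- head(words, -1)   # word i
  second <- tail(words, -1)  # word i + 1

  # Put the alphabetically smaller word first, so "meeting_ long"
  # and "long meeting_" produce the same token.
  paste(pmin(first, second), pmax(first, second))
}

tokens <- pair_tokenizer("long meeting_ long")
```

Passing it as control = list(tokenize = pair_tokenizer) should then count both orderings under a single term, provided the assumptions above hold for your corpus.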