I am trying to work out the frequency of phrases made up of one to eight words. I have been reading about text mining for phrases here and elsewhere, and found that n-gram tokenization looks like the best way to go.
However, when I copy and paste text from a .txt file, it comes up with an "unexpected symbol" error for multiple lines. Is it possible to use the readLines
function in place of X in the ngram_Tokenizer code? E.g.:
Bigram_Tokenizer<-function(X(readLines(file.choose())(Ngram_tokenizer(X(readLines(file.choose(),WekaControl(min=#,max=#)
in the example given by tomkauffman on GitHub Gist (1)?
When I copy the readLines printout instead, it comes up with 'unexpected [ in ['. Do I need to include the same text in both "X" entries?
Thank you, Ben M.
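
One way this could be arranged, as a minimal sketch rather than a definitive answer: read the file into a variable once with readLines, then pass that variable as the tokenizer's argument, so the text never has to be pasted into the code. Pasting the readLines printout fails because the [1] index markers are part of the console display, not R code. The sketch assumes the RWeka package (which needs Java) is installed, where the functions are spelled NGramTokenizer and Weka_control. The temporary file, its sample sentences, and the helper name phrase_tokenizer are made up for illustration.

```r
# Sketch only: assumes the RWeka package and a working Java install.
library(RWeka)

# Stand-in for the real .txt file. In an interactive session this could be
# replaced by: path <- file.choose()
path <- tempfile(fileext = ".txt")
writeLines(c("the cat sat on the mat.", "the cat ran away."), path)

# Read the file once into a character vector (one element per line),
# then join the lines into a single string.
lines <- readLines(path)
text <- paste(lines, collapse = " ")

# Hypothetical helper name. min and max are the shortest and longest
# phrase lengths, counted in words.
phrase_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 8))

# The variable holding the text is passed in once; X is only the argument name.
phrases <- phrase_tokenizer(text)

# Count how often each phrase occurs, most frequent first.
phrase_counts <- sort(table(phrases), decreasing = TRUE)
head(phrase_counts)
```

Because the helper takes the text as its argument, the text appears only once, at the point where the function is called, rather than in both "X" positions.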