Remove the most and least frequent terms from a Term Document Matrix in R

I'm reading a Korean text file and trying to remove the most frequent terms (stopwords) and the least frequent terms from a Term Document Matrix generated in R. With the code below I can build the TDM, but it contains weights for every term in the documents. Is there a way to remove the very common and very rare terms so that the TDM only keeps the more meaningful ones? Thanks

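library(ktm)     # tokenizer() for Korean text
library(readr)   # read_csv(), locale()
library(udpipe)  # document_term_frequencies(), document_term_matrix()
library(tm)      # as.DocumentTermMatrix(), weightTfIdf

# Read the CSV as UTF-8 with Korean date names
old <- read_csv(file = "Past-Korean1.csv",
                locale = locale(date_names = "ko", encoding = "UTF-8"))
# Tokenize the Description column
q <- tokenizer(old$Description, token = "tag")
# Count terms per document, then build the matrix with TF-IDF weights
y_ko <- document_term_frequencies(q[, c("text_id", "word")])
tdm_ko <- document_term_matrix(y_ko)
tdm_ko <- as.DocumentTermMatrix(tdm_ko, weighting = weightTfIdf)
train1_ko <- as.matrix(tdm_ko)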
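
For illustration, the kind of filtering being asked about can be sketched in base R: count in how many documents each term appears and keep only the terms that fall between a lower and an upper bound. This is a minimal sketch on a toy count matrix, not the actual data; the helper name `filter_terms_by_docfreq` and the thresholds `min_docs` and `max_prop` are hypothetical and do not come from ktm, udpipe, or tm.

```r
# Hypothetical helper (not part of any package): keep terms whose
# document frequency lies between min_docs and max_prop * number of documents.
filter_terms_by_docfreq <- function(m, min_docs = 2, max_prop = 0.8) {
  # Number of documents (rows) in which each term (column) occurs
  doc_freq <- colSums(m > 0)
  keep <- doc_freq >= min_docs & doc_freq <= max_prop * nrow(m)
  m[, keep, drop = FALSE]
}

# Toy document-by-term count matrix (made-up values, 5 documents, 4 terms)
counts <- matrix(
  c(3, 1, 0, 0,
    2, 0, 1, 0,
    4, 1, 0, 0,
    1, 0, 0, 2,
    5, 1, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("common", "mid", "mid2", "rare"))
)

# "common" occurs in every document and "rare" in only one, so both are dropped
filtered <- filter_terms_by_docfreq(counts, min_docs = 2, max_prop = 0.8)
print(colnames(filtered))
```

On the real data, a filter like this would presumably be applied to the raw counts before the TF-IDF weighting, since the weighting changes the values but not which cells are non-zero. For the rare-term side alone, tm's `removeSparseTerms()` may also be worth a look.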