I'm reading a Korean text file and trying to remove the most frequent terms (stopwords) and the least frequent terms from a term-document matrix (TDM) generated in R. With the code below I can build the TDM, but it contains weights for every term in the documents. Is there a way to remove those terms so that the TDM keeps only the more meaningful ones? Thanks.
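library(ktm)     # tokenizer() for Korean text
library(readr)   # read_csv(), locale()
library(udpipe)  # assumed source of document_term_frequencies(), document_term_matrix()
library(tm)      # as.DocumentTermMatrix(), weightTfIdf
# Read the Korean CSV as UTF-8
old <- read_csv(file = "Past-Korean1.csv",
                locale = locale(date_names = "ko", encoding = "UTF-8"))
# Tokenize the Description column
q <- tokenizer(old$Description, token = "tag")
# Count term frequencies per document and build a sparse document-term matrix
y_ko <- document_term_frequencies(q[, c("text_id", "word")])
tdm_ko <- document_term_matrix(y_ko)
# Convert to a tm DocumentTermMatrix with TF-IDF weighting, then to a dense matrix
tdm_ko <- as.DocumentTermMatrix(tdm_ko, weighting = weightTfIdf)
train1_ko <- as.matrix(tdm_ko)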
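To make the goal concrete, here is a minimal sketch of the kind of filtering I'm after. It uses base R only and a toy count matrix. The term names, the counts, and the thresholds `min_docs` and `max_share` are made up for illustration and are not taken from my data. The idea is to drop terms by document frequency before applying any weighting.

```r
# Toy document-term count matrix (rows = documents, columns = terms).
# Terms and counts are hypothetical, for illustration only.
m <- matrix(
  c(3, 1, 0, 2,
    2, 0, 0, 1,
    4, 2, 1, 0,
    1, 0, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("common", "mid", "rare", "useful"))
)

# Document frequency: number of documents in which each term occurs.
doc_freq <- colSums(m > 0)

# Hypothetical thresholds: drop terms seen in fewer than 2 documents
# or in more than 75% of all documents.
min_docs <- 2
max_share <- 0.75
keep <- doc_freq >= min_docs & doc_freq / nrow(m) <= max_share

# Keep only the columns (terms) that pass both thresholds.
m_filtered <- m[, keep, drop = FALSE]
print(colnames(m_filtered))
```

On this toy input, "common" (present in every document) and "rare" (present in one document) are dropped, which is the behaviour I would like to get on the real TDM.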