With the following code I try to compute the tf-idf of each term across all the documents I have in a CSV file (200,000 docs). I then want to write a one-column CSV that contains each term with its tf-idf, in non-decreasing order. I tried it on a small sample and I think it works, but with the big CSV RStudio always crashes. Any ideas?
library(tm)
# read the text that was converted to CSV
myfile3 <- "tweetsc.csv"
x <- read.csv(myfile3, header = FALSE)
#make data frame
x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
# build a corpus from the data frame
dd <- Corpus(DataframeSource(x))
# calculate tf-idf with the tm package
xx <- as.matrix(DocumentTermMatrix(dd, control = list(weighting = weightTfIdf)))
# sum each term's column and sort the totals in non-decreasing order
freq = data.frame(sort(colSums(as.matrix(xx)), decreasing=FALSE))
write.csv2(freq, "important_tweets.csv")
Do not coerce the TDM to a matrix. That will most likely cause an integer overflow issue with so many documents. The tm package uses the slam package to represent the TDMs/DTMs. slam has functions for doing row-wise or column-wise operations without having to coerce to a dense matrix.

One thing to note: you mention you want to calculate "each term with its tfidf...", but the tf-idf is specific to each term in each document. Summing the tf-idf may not really be a meaningful measure, because it obscures the weight of the term in a given document.
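As a minimal sketch of that approach, the column sums can be taken directly on the sparse matrix with slam::col_sums(). The three-sentence corpus below is made-up data standing in for the 200,000 tweets, and the column names term, tfidf_sum, doc and tfidf are illustrative choices, not anything required by tm. The last step shows one possible way to keep the per-document weights instead of summing them, by reading the triplet fields (i, j, v) of the sparse matrix.

```r
library(tm)
library(slam)

# Toy corpus standing in for the real tweets (hypothetical data)
docs <- c("the cat sat on the mat",
          "the dog chased the cat",
          "birds sing at dawn")
corpus <- VCorpus(VectorSource(docs))

# The DTM stays a sparse simple_triplet_matrix; as.matrix() is never called
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Column-wise sums computed on the sparse representation
term_scores <- sort(slam::col_sums(dtm), decreasing = FALSE)

# One row per term, in non-decreasing order of summed tf-idf
freq <- data.frame(term = names(term_scores),
                   tfidf_sum = unname(term_scores))
write.csv2(freq, "important_tweets.csv", row.names = FALSE)

# Alternative: keep the weight of each term in each document (long format),
# using only the non-zero entries stored in the sparse matrix
per_doc <- data.frame(doc   = Docs(dtm)[dtm$i],
                      term  = Terms(dtm)[dtm$j],
                      tfidf = dtm$v)
```

With the real data, the corpus would be built from the CSV as in the question; only the as.matrix() call needs to go. The per_doc table has one row per non-zero entry, so it grows with the number of distinct terms per tweet rather than with documents times vocabulary.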