I have a dataset of more than 2,300,000 observations. One variable holds descriptions (text), and some entries contain quite long sentences. With that many observations, imagine the number of words involved. I want to obtain an output (a data frame) with all the words of this variable, sorted from most to least frequent. However, I don't want to take into account some words such as "and", "street", "the", etc.
I tried two pieces of code:
descri1tm <- df$cdescription %>%
  # Transform into a corpus #
  VectorSource() %>%
  Corpus() %>%
  # Clean the corpus #
  tm_map(content_transformer(tolower)) %>%  # lowercase
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%       # numbers
  tm_map(removePunctuation) %>%   # punctuation
  tm_map(removeWords, c(stopwords("spanish"), "cale", "barrio", "y",
                        "al", "en", "la", "el", "entre", "del")) %>%  # words we don't care about
  # Transform into a matrix #
  TermDocumentMatrix() %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(name = row.names(.)) %>%
  arrange(desc(`1`))
# Creating the data frame #
tidytext <- tibble(line = 1:nrow(df), Description = df$cdescription)
# Frequency analysis #
tidytext <- tidytext %>%
  unnest_tokens(word, Description) %>%
  anti_join(stop_words) %>%  # tidytext's stop_words list is English-only
  count(word, sort = TRUE)
head(tidytext, 10)
I don't think this first approach is efficient enough: R ran for 24 hours with no result. So I tried this one (found here):
allwords <- df %>% stringr::str_glue_data("{rownames(.)} cdescription: {cdescription}")
# function to count words in a string #
countwords <- function(strings) {
  # remove extra spaces between words
  wr <- gsub(pattern = " {2,}", replacement = " ", x = strings)
  # remove line breaks
  wn <- gsub(pattern = "\n", replacement = " ", x = wr)
  # remove punctuation
  ws <- gsub(pattern = "[[:punct:]]", replacement = "", x = wn)
  # split into words
  wsp <- strsplit(ws, " ")
  # sort words in a table
  wst <- data.frame(sort(table(wsp, exclude = ""), decreasing = TRUE))
  wst
}
all_words <- countwords(allwords)
With this one there are two problems: there is no way to exclude certain words, and I keep getting the following error message:
Error in table(wsp, exclude = "") : all arguments must have the same length
Does anyone have an idea? Please be kind, it's my very first time with such a dataset, and data science is not my specialty at all!
If the text is stored in a data frame column, then to get word frequencies with Spanish stopwords removed, you only need the third sequence of your first block of code (the unnest_tokens / anti_join / count pipeline).
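A minimal sketch of that pipeline, assuming your column is called cdescription as in your code, and that the tidytext, dplyr, and stopwords packages are installed (the toy df and the custom word list here are placeholders for your own):

```r
library(dplyr)
library(tidytext)

# toy stand-in for your 2.3M-row data frame; replace with your real df
df <- tibble(cdescription = c("La casa en el barrio y la calle",
                              "Casa grande en la calle del barrio"))

# Spanish stopwords (snowball list) plus your own custom words
my_stopwords <- get_stopwords(language = "es") %>%
  bind_rows(tibble(word = c("cale", "barrio"), lexicon = "custom"))

word_freq <- df %>%
  unnest_tokens(word, cdescription) %>%      # one row per word, lowercased by default
  anti_join(my_stopwords, by = "word") %>%   # drop stopwords and custom words
  count(word, sort = TRUE)                   # frequency table, most frequent first
```

This avoids building a term-document matrix entirely, which is what makes the tm route so heavy on a corpus this size.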
If this is still too much for your system's RAM, slice the source data frame into smaller chunks, run the word count on each chunk, then append the resulting counts and sum them per word.
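A sketch of that chunked strategy under the same assumptions (cdescription as the column name; the tiny chunk size and toy data are illustrative only — on the real data something like 100,000 rows per chunk would be more sensible):

```r
library(dplyr)
library(tidytext)

# toy data frame; replace with your real df
df <- tibble(cdescription = rep(c("la casa grande", "casa en el barrio"), 5))

chunk_size <- 4                                    # e.g. 100000 on the real data
chunk_ids  <- ceiling(seq_len(nrow(df)) / chunk_size)

word_freq <- split(df, chunk_ids) %>%              # slice df into chunks
  lapply(function(chunk) {
    chunk %>%
      unnest_tokens(word, cdescription) %>%
      anti_join(get_stopwords(language = "es"), by = "word") %>%
      count(word)                                  # per-chunk word counts
  }) %>%
  bind_rows() %>%                                  # append the chunk results...
  group_by(word) %>%
  summarise(n = sum(n)) %>%                        # ...and sum counts per word
  arrange(desc(n))
```

Only one chunk's tokens are ever in memory at a time, so peak RAM use depends on the chunk size rather than on the full 2.3M rows.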