tm to tidytext conversion


I am trying to learn tidytext. I can follow the examples on the tidytext website as long as I use the packaged datasets (janeaustenr, for example). However, most of my data are text files in a tm corpus. I can reproduce the tm-to-tidytext conversion example for sentiment analysis (ap_sentiments) from the tidytext website, but I am having trouble understanding how tidytext data are structured. For example, the Austen novels are identified by a "book" column in the janeaustenr package. What is the equivalent of that "book" column for my tm data? Here is the specific example for my data:

cname <- file.path(".", "greencomments", "all")
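For context, here is a minimal sketch of the kind of tm preprocessing that turns a folder of text files into a term-document matrix. The post does not show the actual steps, so the cleaning calls below are assumptions, and the temporary folder and the file names `comment_a.txt` and `comment_b.txt` are hypothetical stand-ins for the real `greencomments` folder.

```r
library(tm)        # corpus handling
library(tidytext)  # tidy() method for term-document matrices

# Hypothetical stand-in for the real folder: two small files in a temp directory
cname <- file.path(tempdir(), "greencomments_demo")
dir.create(cname, showWarnings = FALSE)
writeLines("Clean air is better and cheaper.", file.path(cname, "comment_a.txt"))
writeLines("Pollution destroys clean water.", file.path(cname, "comment_b.txt"))

# Read every file in the folder as one document
docs <- VCorpus(DirSource(cname))

# Typical cleaning steps (assumed, not taken from the original post)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(docs)

# One row per term-document pair; `document` holds the document IDs
practice <- tidy(tdm)
practice
```

With `DirSource()`, the document IDs default to the file names, so the `document` column of the tidied result identifies which file each term came from.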

I can then use tidytext successfully after running the tm preprocessing:

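library(dplyr)     # %>% and inner_join()
library(tidytext)  # tidy() and get_sentiments()

# tdm is the term-document matrix from the tm preprocessing
practice <- tidy(tdm)
practice
# keep only terms found in the Bing lexicon
partysentiments <- practice %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments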

# A tibble: 170 x 4
   term    document count sentiment
   <chr>   <chr>    <dbl> <chr>
 1 benefit 1         1.00 positive
 2 best    1         2.00 positive
 3 better  1         7.00 positive
 4 cheaper 1         1.00 positive
 5 clean   1        24.0  positive
 7 clear   1         1.00 positive
 8 concern 1         2.00 negative
 9 cure    1         1.00 positive
10 destroy 1         3.00 negative
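A joined table like this can be summarised per document in the same way the tidytext site summarises `ap_sentiments`. The following is only a sketch: the `practice` rows are hand-built, hypothetical data in the shape that `tidy(tdm)` returns, and the document names are made up.

```r
library(dplyr)
library(tidytext)

# Hypothetical rows in the shape tidy(tdm) returns; document names are made up
practice <- tibble(
  term     = c("clean", "better", "destroy", "clean", "concern", "water"),
  document = c("comment_a.txt", "comment_a.txt", "comment_a.txt",
               "comment_b.txt", "comment_b.txt", "comment_b.txt"),
  count    = c(3, 1, 1, 1, 2, 4)
)

# Keep only terms that appear in the Bing lexicon
partysentiments <- practice %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))

# Sum the word counts per document and sentiment
sentiment_by_doc <- partysentiments %>%
  count(document, sentiment, wt = count)
sentiment_by_doc
```

The `wt = count` argument matters here: each row already carries a per-document total from tm, so the rows are summed rather than tallied.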

However, I can't reproduce the simple ggplot2 word-frequency plots from the tidytext examples. Because my data are not arranged with a "book" column in the data frame, the example code (and therefore much of the tidytext functionality) won't work.

Here is an example of the issue. This works fine:

practice %>%
  count(term, sort = TRUE)

# A tibble: 989 x 2
  term        n
  <chr>   <int>
1 activ       3
2 air         3
3 altern      3

But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? My corpus consists of text files in folders. I have tried substituting "document" for "book" in the example code, and it doesn't work. Maybe I need to rename the column? Apologies in advance; I am not a programmer.
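One possible reading, offered as a sketch rather than a confirmed answer: in the output of `tidy(tdm)`, `document` appears to play the role of `book` and `term` the role of `word`. The difference is that tm has already counted each word per document, whereas the janeaustenr examples have one row per word occurrence. Under that assumption, the example code would need `wt = count` in addition to the renamed columns. The `practice` rows and document names below are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical rows in the shape tidy(tdm) returns; document names are made up
practice <- tibble(
  term     = c("clean", "air", "clean", "water"),
  document = c("comment_a.txt", "comment_a.txt", "comment_b.txt", "comment_b.txt"),
  count    = c(3, 2, 1, 4)
)

# Rename so the columns match the janeaustenr-style examples
tidy_comments <- practice %>%
  rename(book = document, word = term)

# Rows are already per-document totals, so weight by `count` instead of tallying rows
word_counts <- tidy_comments %>%
  count(book, word, wt = count, sort = TRUE)

# Word-frequency bars, one panel per document
p <- ggplot(word_counts, aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ book, scales = "free") +
  labs(x = NULL, y = "count")
p
```

Without `wt = count`, `count(term)` reports how many documents contain each term, not how often the term occurs, which may explain the small values of `n` in the output above.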
