how to combine/collapse rows with similar words and sum up their values?

51 Views Asked by At

I did text mining using Quanteda package. Using kwic feature, I generated this output that identifies keywords in my dictionary and identifies which keywords matched within each key. The data looks like this:

docname     keyword        my_dict          
   <chr>       <chr>          <chr>            
 1 avan-21.pdf sustainable    transition       
 2 avan-21.pdf electricity    low_carbon_energy
 3 avan-21.pdf electricity    low_carbon_energy
 4 avan-21.pdf renewable      low_carbon_energy
 5 avan-21.pdf electricity    low_carbon_energy
 6 avan-21.pdf wind           low_carbon_energy
 7 avan-21.pdf wind           low_carbon_energy
 8 avan-21.pdf solar          low_carbon_energy
 9 avan-21.pdf emissions      emissions        
10 avan-21.pdf emissions-free emissions        
11 avan-21.pdf sustainable    transition       
12 avan-21.pdf renewable      low_carbon_energy
13 avan-21.pdf wind           low_carbon_energy
14 avan-21.pdf solar          low_carbon_energy
15 avan-21.pdf biomass        low_carbon_energy
16 avan-21.pdf sustainability transition       
17 avan-21.pdf sustainability transition       
18 avan-21.pdf sustainability transition       
19 avan-21.pdf sustainability transition       
20 avan-21.pdf sustainability transition  

I filtered this data by dictionary keys (my_dict) to create sub-categories like this:

climate_change <- kwic2filter %>% 
  filter(my_dict == "climate_change") %>%
  select(docname, keyword) %>% 
  group_by(docname, keyword) %>% 
  count(keyword, sort = TRUE) %>% 
  arrange(keyword, desc(n))%>%
  write.csv("energy-output/climate_change.csv")

Results look like this:

 X     docname          keyword   n
1   1 enel-22.pdf              1.5  97
2   2 enel-21.pdf              1.5  66
3   3  nrg-21.pdf              1.5   7
4   4  nrg-22.pdf              1.5   4
5   5  nee-21.pdf              1.5   2
6   6 avan-22.pdf              1.5   1
7   7  nee-22.pdf              1.5   1
8   8  nrg-21.pdf       1.5 degree   2
9   9  nrg-22.pdf       1.5 degree   2
10 10  nrg-21.pdf      1.5 degrees   3
11 11  nee-21.pdf      1.5 degrees   1
12 12  nee-22.pdf      1.5 degrees   1
13 13  nee-21.pdf       1.5-degree   1
14 14 enel-22.pdf             1.50   1
15 15 enel-22.pdf          1.52030   1
16 16 enel-22.pdf          1.52040   1
17 17 enel-22.pdf             1.53   1
18 18 enel-21.pdf            1.558   1
19 19  nee-21.pdf             1.56   2
20 20 enel-22.pdf             1.56   1
21 21 enel-21.pdf            1.565   1
22 22 enel-22.pdf             1.58   1
23 23 enel-21.pdf            1.580   1
24 24 enel-22.pdf            1.5is   1
25 25 enel-21.pdf          CLIMATE   1
26 26 enel-22.pdf          CLIMATE   1
27 27  nrg-21.pdf          CLIMATE   1
28 28 enel-22.pdf             IPCC  14
29 29 avan-21.pdf             IPCC   8
30 30 avan-22.pdf             IPCC   8
31 31 enel-21.pdf             IPCC   8
32 32  nee-21.pdf             IPCC   2
33 33 enel-22.pdf           UNFCCC   2
34 34 enel-22.pdf          climate 553
35 35 enel-21.pdf          climate 421
36 36  nee-21.pdf          climate 128
37 37  nee-22.pdf          climate 111
38 38  nrg-22.pdf          climate  54
39 39 avan-22.pdf          climate  49
40 40  nrg-21.pdf          climate  45
41 41 avan-21.pdf          climate  29
42 42  nee-21.pdf         climate-   2
43 43  nrg-21.pdf         climate-   2
44 44 enel-21.pdf         climate-   1
45 45  nee-22.pdf         climate-   1
46 46  nrg-22.pdf         climate-   1
47 47 enel-22.pdf    climate-aware   1
48 48  nee-21.pdf   climate-change   3
49 49  nee-22.pdf   climate-change   2
50 50 enel-22.pdf climate-changing   3

I'd like to arrange this data by docname and combine values for keywords 1.5, 1.5 degree, and 1.5 degrees and remove/ exclude other rows that have 1.55, 1.56 and other numbers. Similarly, I'd like to combine all rows and their values that begin with climate or climate-. Using tidyr::pivot_wider or some other function, I'd like the final data to look like this:

keyword avan-21.pdf avan-22.pdf enel-21.pdf enel-22.pdf nee-21.pdf
1.5 0       0       66      99      2
climate 53      71      669     967     1

My ultimate aim is to calculate term frequency for each dictionary category.

1

There are 1 best solutions below

0
r2evans On BEST ANSWER

I'm not sure about your actual expected values (I can't find 53, for instance), but perhaps this will give you enough to finesse the last few steps.

library(dplyr)
library(tidyr) # pivot_wider
climate_change %>%
  mutate(keyword = sub("[- ]+degrees?$", "", keyword)) %>%
  filter(keyword %in% c("1.5", "climate")) %>%
  summarize(n = sum(n), .by = c(docname, keyword)) %>%
  pivot_wider(id_cols = keyword, names_from = docname, values_from = n)
# # A tibble: 2 × 9
#   keyword `enel-22.pdf` `enel-21.pdf` `nrg-21.pdf` `nrg-22.pdf` `nee-21.pdf` `avan-22.pdf` `nee-22.pdf` `avan-21.pdf`
#   <chr>           <int>         <int>        <int>        <int>        <int>         <int>        <int>         <int>
# 1 1.5                97            66           12            6            4             1            2            NA
# 2 climate           553           421           45           54          128            49          111            29

If you want to combine all keywords that start with climate including climate-changing, then adjust the sub inside the mutate to the follow (remaining lines unchanged):

climate_change %>%
  mutate(keyword = sub("([- ]+degrees?|-[a-z]*)$", "", keyword)) %>%
  filter(keyword %in% c("1.5", "climate")) %>%
  summarize(n = sum(n), .by = c(docname, keyword)) %>%
  pivot_wider(id_cols = keyword, names_from = docname, values_from = n)
# # A tibble: 2 × 9
#   keyword `enel-22.pdf` `enel-21.pdf` `nrg-21.pdf` `nrg-22.pdf` `nee-21.pdf` `avan-22.pdf` `nee-22.pdf` `avan-21.pdf`
#   <chr>           <int>         <int>        <int>        <int>        <int>         <int>        <int>         <int>
# 1 1.5                97            66           12            6            4             1            2            NA
# 2 climate           557           422           47           55          133            49          114            29

Data

climate_change <- structure(list(X = 1:50, docname = c("enel-22.pdf", "enel-21.pdf", "nrg-21.pdf", "nrg-22.pdf", "nee-21.pdf", "avan-22.pdf", "nee-22.pdf", "nrg-21.pdf", "nrg-22.pdf", "nrg-21.pdf", "nee-21.pdf", "nee-22.pdf", "nee-21.pdf", "enel-22.pdf", "enel-22.pdf", "enel-22.pdf", "enel-22.pdf", "enel-21.pdf", "nee-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "enel-21.pdf", "enel-22.pdf", "nrg-21.pdf", "enel-22.pdf", "avan-21.pdf", "avan-22.pdf", "enel-21.pdf", "nee-21.pdf",  "enel-22.pdf", "enel-22.pdf", "enel-21.pdf", "nee-21.pdf", "nee-22.pdf", "nrg-22.pdf", "avan-22.pdf", "nrg-21.pdf", "avan-21.pdf", "nee-21.pdf", "nrg-21.pdf", "enel-21.pdf", "nee-22.pdf", "nrg-22.pdf", "enel-22.pdf", "nee-21.pdf", "nee-22.pdf", "enel-22.pdf"), keyword = c("1.5", "1.5", "1.5", "1.5", "1.5", "1.5", "1.5", "1.5 degree", "1.5 degree", "1.5 degrees", "1.5 degrees", "1.5 degrees", "1.5-degree", "1.50", "1.52030", "1.52040", "1.53", "1.558", "1.56", "1.56", "1.565", "1.58", "1.580", "1.5is",  "CLIMATE", "CLIMATE", "CLIMATE", "IPCC", "IPCC", "IPCC", "IPCC", "IPCC", "UNFCCC", "climate", "climate", "climate", "climate", "climate", "climate", "climate", "climate", "climate-", "climate-", "climate-", "climate-", "climate-", "climate-aware", "climate-change", "climate-change", "climate-changing"), n = c(97L, 66L, 7L, 4L, 2L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 14L, 8L, 8L, 8L, 2L, 2L, 553L, 421L, 128L, 111L, 54L, 49L, 45L, 29L, 2L, 2L, 1L,  1L, 1L, 1L, 3L, 2L, 3L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50"))