I did some text mining with the quanteda package. Using its kwic() function, I generated output that lists each keyword found from my dictionary together with the dictionary key it matched. The data looks like this:
docname keyword my_dict
<chr> <chr> <chr>
1 avan-21.pdf sustainable transition
2 avan-21.pdf electricity low_carbon_energy
3 avan-21.pdf electricity low_carbon_energy
4 avan-21.pdf renewable low_carbon_energy
5 avan-21.pdf electricity low_carbon_energy
6 avan-21.pdf wind low_carbon_energy
7 avan-21.pdf wind low_carbon_energy
8 avan-21.pdf solar low_carbon_energy
9 avan-21.pdf emissions emissions
10 avan-21.pdf emissions-free emissions
11 avan-21.pdf sustainable transition
12 avan-21.pdf renewable low_carbon_energy
13 avan-21.pdf wind low_carbon_energy
14 avan-21.pdf solar low_carbon_energy
15 avan-21.pdf biomass low_carbon_energy
16 avan-21.pdf sustainability transition
17 avan-21.pdf sustainability transition
18 avan-21.pdf sustainability transition
19 avan-21.pdf sustainability transition
20 avan-21.pdf sustainability transition
I filtered this data by dictionary keys (my_dict) to create sub-categories like this:
climate_change <- kwic2filter %>%
  filter(my_dict == "climate_change") %>%
  select(docname, keyword) %>%
  group_by(docname, keyword) %>%
  count(keyword, sort = TRUE) %>%
  arrange(keyword, desc(n)) %>%
  write.csv("energy-output/climate_change.csv")
The results look like this:
X docname keyword n
1 1 enel-22.pdf 1.5 97
2 2 enel-21.pdf 1.5 66
3 3 nrg-21.pdf 1.5 7
4 4 nrg-22.pdf 1.5 4
5 5 nee-21.pdf 1.5 2
6 6 avan-22.pdf 1.5 1
7 7 nee-22.pdf 1.5 1
8 8 nrg-21.pdf 1.5 degree 2
9 9 nrg-22.pdf 1.5 degree 2
10 10 nrg-21.pdf 1.5 degrees 3
11 11 nee-21.pdf 1.5 degrees 1
12 12 nee-22.pdf 1.5 degrees 1
13 13 nee-21.pdf 1.5-degree 1
14 14 enel-22.pdf 1.50 1
15 15 enel-22.pdf 1.52030 1
16 16 enel-22.pdf 1.52040 1
17 17 enel-22.pdf 1.53 1
18 18 enel-21.pdf 1.558 1
19 19 nee-21.pdf 1.56 2
20 20 enel-22.pdf 1.56 1
21 21 enel-21.pdf 1.565 1
22 22 enel-22.pdf 1.58 1
23 23 enel-21.pdf 1.580 1
24 24 enel-22.pdf 1.5is 1
25 25 enel-21.pdf CLIMATE 1
26 26 enel-22.pdf CLIMATE 1
27 27 nrg-21.pdf CLIMATE 1
28 28 enel-22.pdf IPCC 14
29 29 avan-21.pdf IPCC 8
30 30 avan-22.pdf IPCC 8
31 31 enel-21.pdf IPCC 8
32 32 nee-21.pdf IPCC 2
33 33 enel-22.pdf UNFCCC 2
34 34 enel-22.pdf climate 553
35 35 enel-21.pdf climate 421
36 36 nee-21.pdf climate 128
37 37 nee-22.pdf climate 111
38 38 nrg-22.pdf climate 54
39 39 avan-22.pdf climate 49
40 40 nrg-21.pdf climate 45
41 41 avan-21.pdf climate 29
42 42 nee-21.pdf climate- 2
43 43 nrg-21.pdf climate- 2
44 44 enel-21.pdf climate- 1
45 45 nee-22.pdf climate- 1
46 46 nrg-22.pdf climate- 1
47 47 enel-22.pdf climate-aware 1
48 48 nee-21.pdf climate-change 3
49 49 nee-22.pdf climate-change 2
50 50 enel-22.pdf climate-changing 3
I'd like to arrange this data by docname, combine the values for the keywords 1.5, 1.5 degree, and 1.5 degrees, and exclude the other rows such as 1.55, 1.56, and similar numbers. Similarly, I'd like to combine all rows (and their values) whose keyword begins with climate or climate-. Using tidyr::pivot_wider or some other function, I'd like the final data to look like this:
keyword avan-21.pdf avan-22.pdf enel-21.pdf enel-22.pdf nee-21.pdf
1.5 0 0 66 99 2
climate 53 71 669 967 1
My ultimate aim is to calculate term frequency for each dictionary category.
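The reshaping described above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive solution: it assumes dplyr and tidyr are installed, `counts` is a hypothetical data frame built from a handful of rows copied from the counted output above, and the regular expressions are illustrative guesses at which keywords should be kept and merged.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample: a few rows copied from the counted output above.
counts <- data.frame(
  docname = c("enel-22.pdf", "enel-21.pdf", "nrg-21.pdf", "nrg-21.pdf",
              "nrg-21.pdf", "enel-22.pdf", "enel-22.pdf", "nrg-21.pdf",
              "nrg-21.pdf", "enel-22.pdf"),
  keyword = c("1.5", "1.5", "1.5", "1.5 degree", "1.5 degrees",
              "1.56", "climate", "climate", "climate-", "climate-aware"),
  n = c(97, 66, 7, 2, 3, 1, 553, 45, 2, 1)
)

wide <- counts %>%
  # Keep a keyword starting with "1.5" only if it is exactly "1.5" or
  # "1.5" followed by "degree"/"degrees"; this drops "1.56", "1.52030", etc.
  filter(!grepl("^1\\.5", keyword) |
           grepl("^1\\.5([ -]degrees?)?$", keyword)) %>%
  mutate(
    # Fold the surviving "1.5 degree(s)" variants into "1.5".
    keyword = sub("^1\\.5.*$", "1.5", keyword),
    # Fold "climate-" into "climate". Assumption: compounds such as
    # "climate-aware" stay separate here; a broader pattern such as
    # "^climate.*$" would merge them as well.
    keyword = sub("^climate-?$", "climate", keyword)
  ) %>%
  # Sum the counts of the merged keywords within each document.
  group_by(docname, keyword) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  # One row per keyword, one column per document, 0 where a keyword is absent.
  pivot_wider(names_from = docname, values_from = n, values_fill = 0)

print(wide)
```

With these sample rows, the 1.5 entry for nrg-21.pdf is 12 (7 + 2 + 3) and the climate entry is 47 (45 + 2). The filter runs before the mutate so that look-alike numbers are removed before anything is renamed.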
I'm not sure about your actual expected values (I can't find 53, for instance), but perhaps this will give you enough to finesse the last few steps.

If you want to combine all keywords that start with climate, including climate-changing, then adjust the sub() inside the mutate() to the following (remaining lines unchanged):

Data