I am following this tutorial to create a document-feature matrix with features defined by my dictionary. What I get is a two-column output: the document ID and the combined frequency of all the features in my dictionary.
library(lubridate)
library(quanteda)
## subset data
item7_corpus_subset <- item_7_corpus |>
  filter(year(filing_date) == year_data) |>
  head(100) ## for testing only; comment out once the code works

## tokenize
item7_tokens <- tokens(item7_corpus_subset,
                       what = "word",
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_numbers = TRUE,
                       remove_url = TRUE) |>
  tokens_ngrams(n = 1:3)

## count words from dictionary
item7_doc_dict <- item7_tokens |>
  dfm(tolower = TRUE) |>
  dfm_lookup(dictionary = cyber_dict, levels = 1:3)

print(item7_doc_dict)
## Document-feature matrix of: 100 documents, 1 feature (94.00% sparse) and 13 docvars.
## features
## docs cyber_dict
## 1000015_10K_1999_0000912057-00-014793.txt 0
## 1000112_10K_1999_0000930661-00-000704.txt 0
## 1000181_10K_1999_0001000181-00-000001.txt 0
## 1000227_10K_1999_0000950116-00-000643.txt 0
## 1000228_10K_1999_0000889812-00-001326.txt 0
## 1000230_10K_1999_0001005150-00-000103.txt 0
## [ reached max_ndoc ... 94 more documents ]
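For context on why the lookup collapses everything into one column: dfm_lookup() and tokens_lookup() return one feature per dictionary *key*, not per value, so a dictionary with a single top-level key yields a single column no matter how many keywords it contains. Here is a minimal sketch of that behaviour, using toy texts and a hypothetical two-entry dictionary (not the real item 7 corpus or cyber_dict):

```r
library(quanteda)

## toy texts and dictionaries -- hypothetical stand-ins for the real data
txt <- c(doc1 = "a cyber attack caused a data breach",
         doc2 = "routine filing with no incidents")
toks <- tokens(txt)

## one top-level key: all matches are collapsed into a single feature
flat_dict <- dictionary(list(cyber_dict = c("cyber attack", "data breach")))
dfm(tokens_lookup(toks, flat_dict))
## one feature, "cyber_dict", counting 2 for doc1

## one key per keyword: one feature per keyword
keyed_dict <- dictionary(list(cyber_attack = "cyber attack",
                              data_breach  = "data breach"))
dfm(tokens_lookup(toks, keyed_dict))
## two features, one count per keyword
```

Note that tokens_lookup() matches the multi-word values ("cyber attack", "data breach") directly on the tokens, so no tokens_ngrams() step is needed in this sketch.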
I want to see the frequency of each keyword rather than the total frequency of all my keywords combined. I am trying to emulate the tutorial, which produced this per-key output:
dfmat_irish_lg <- dfm_lookup(dfmat_irish, dictionary = dict_lg, levels = 1)
print(dfmat_irish_lg)
## Document-feature matrix of: 14 documents, 9 features (19.84% sparse) and 6 docvars.
## features
## docs CULTURE ECONOMY ENVIRONMENT GROUPS INSTITUTIONS LAW_AND_ORDER RURAL URBAN VALUES
## Lenihan, Brian (FF) 9 583 21 0 93 11 9 0 19
## Bruton, Richard (FG) 35 201 5 0 95 14 0 0 14
## Burton, Joan (LAB) 33 400 6 3 84 6 2 3 6
## Morgan, Arthur (SF) 56 427 10 0 63 22 2 1 18
## Cowen, Brian (FF) 16 416 24 0 63 4 8 1 13
## Kenny, Enda (FG) 26 211 8 1 53 18 0 2 8
## [ reached max_ndoc ... 8 more documents ]
There were three mistakes:

- tokens_ngrams() is used before dictionary analysis
- with exclusive = FALSE, all other words are included

Your code should be