Searching for specific words in Corpus with R (tm package)

I have a Corpus (tm package) containing a collection of 1,300 different text documents [Content: documents: 1,300].

My goal is to count how often the words from a specific word list occur in each of those documents. For example, if my word list contains the words "january", "february", "march", ..., I want to know how often each document mentions these words.

Example: 
Text 1: I like going on holiday in january and not in february.
Text 2: I went on a holiday in march.
Text 3: I like going on vacation.

The result should look like this:

Text 1: 2 
Text 2: 1
Text 3: 0

I tried the following code:

library(quanteda)
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 

dtm <- dfm(toks)

dict1 <- dictionary(list(c("january", "february", "march")))

dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
tail(dict_dtm2)  

This code was proposed in a different thread; however, it does not work on my data and throws an error saying it is only applicable to text or corpus objects.

How can I search for my word list using my existing Corpus from the tm package in R?

Best answer

To make your quanteda code work, you first have to convert your tm VCorpus object x to a quanteda corpus and fix a few other minor issues:

  • dictionary() expects a named list.
  • The English stemmer returns "januari" and "februari" instead of "january" and "february", so the dictionary patterns have to match the stemmed forms.

library(tm)
library(quanteda)

## prepare reprex, create tm VCorpus:
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))
class(x)
#> [1] "VCorpus" "Corpus"

### tm VCorpus object to Quanteda corpus:
x <- corpus(x)
class(x)
#> [1] "corpus"    "character"

### continue with tokenization and stemming
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 
dtm <- dfm(toks)

# dictionary() takes a named list, i.e. list(months = c(..))
# and "january", "february" are stemmed to "januari", "februari"
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
dict_dtm2
#> Document-feature matrix of: 3 documents, 2 features (16.67% sparse) and 7 docvars.
#>        features
#> docs    months _unmatched
#>   text1      2         10
#>   text2      1          7
#>   text3      0          6
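
If stemming is not actually needed for your analysis, a simpler variant is possible. The following is only a sketch on the same three example documents: it skips tokens_wordstem(), so the dictionary can hold the plain month names, and then uses convert() to turn the lookup result into a data frame with one row per document. The object names dict_months, res and res_df are made up for this illustration.

```r
library(tm)
library(quanteda)

docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")

# tm VCorpus converted to a quanteda corpus, as in the code above
x <- corpus(VCorpus(VectorSource(docs)))

# no stemming here, so dictionary entries can be the unmodified words
dtm <- dfm(tokens(x, remove_punct = TRUE))
dict_months <- dictionary(list(months = c("january", "february", "march")))
res <- dfm_lookup(dtm, dict_months)

# one row per document, one column per dictionary key
res_df <- convert(res, to = "data.frame")
res_df
```

The months column of res_df then holds the per-document counts, which should be easier to merge with other document metadata than the dfm itself.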

Created on 2023-09-02 with reprex v2.0.2
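
Since the question asks about staying within tm, it may also be possible to avoid the conversion altogether. The sketch below assumes that the dictionary control option of DocumentTermMatrix() restricts the matrix to the listed terms, and that removePunctuation = TRUE is enough to detach the trailing full stop from "february."; I have only reasoned this through on the three example documents, not on a 1,300-document corpus. The names wordlist, dtm_tm and counts are made up for this illustration.

```r
library(tm)

docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))

wordlist <- c("january", "february", "march")

# keep only the word-list terms as columns; lower-case and strip
# punctuation so that "february." is counted as "february"
dtm_tm <- DocumentTermMatrix(
  x,
  control = list(tolower = TRUE,
                 removePunctuation = TRUE,
                 dictionary = wordlist)
)

# total number of word-list hits per document
counts <- rowSums(as.matrix(dtm_tm))
counts
```

Note that as.matrix() creates a dense matrix; with only a handful of dictionary terms that should stay small even for a larger corpus.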