Wrong output when counting words in multiple texts

I have two datasets. The first contains 500 different entities with some measured variables. The second contains 500 texts, each belonging to one entity in the first dataset. I want to search these texts for 3 keywords and count the total number of keyword occurrences in each text.

Here is some made-up data to work with: keywords is a vector, texts is a list containing the texts (my real data is a list, but I'm not sure my example list is built correctly), and df is the data frame with the variables for my entities:

keywords <- c("ab", "cd", "ef")
texts <- list("ab is ef when ef is ef",
              "something something nothing",
              "cd is cd is ab is ab and ef")
var1 <- c("area1", "area2", "area3")
var2 <- c("15", "5", "23")
df <- data.frame(var1, var2)
colnames(df) <- c("location", "temperature")

The right answer here is that keywords occur 4 times in the first text, 0 times in the second, and 5 times in the third. However, when I try the following it gives the wrong output:

df$count <- 0 # Store the results
# counting for all keywords
for(w in keywords){
  df$count <- df$count + grepl(w, texts, ignore.case = TRUE)
  print(w)
}

df$count

Any tips on what I can do? Preferably with some example code?

Thanks in advance

Best answer

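Your texts is a list. Is there a reason for that? A character vector is simpler here. The wrong counts come from grepl(), though: it only returns TRUE or FALSE per text, so your loop adds at most 1 per keyword and ends up counting how many different keywords appear, not how often they occur.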
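To illustrate the difference between testing for a match and counting matches, here is a minimal base R sketch on the sample data from the question. The names texts_list, presence and occurrences are only illustrative, and it assumes every list element is a single string.

```r
keywords <- c("ab", "cd", "ef")
texts_list <- list("ab is ef when ef is ef",
                   "something something nothing",
                   "cd is cd is ab is ab and ef")

# unlist() flattens a list of single strings into a character vector
texts <- unlist(texts_list)

# grepl() only says whether a keyword is present, so each keyword adds at most 1
presence <- Reduce(`+`, lapply(keywords, function(w) {
  grepl(w, texts, ignore.case = TRUE)
}))
presence
# [1] 2 0 3

# gregexpr() locates every match, regmatches() extracts them,
# and lengths() counts the extracted matches per text
pattern <- paste(keywords, collapse = "|")
occurrences <- lengths(regmatches(texts,
                                  gregexpr(pattern, texts, ignore.case = TRUE)))
occurrences
# [1] 4 0 5
```

The first result is what a grepl() loop produces; the second is the count the question asks for.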

Counting can also be simpler. Try the stringr package: its str_count() function counts the matches of a pattern in each string. Then you can do

library(stringr)

keywords <- c("ab", "cd", "ef")
texts <- c("ab is ef when ef is ef",
           "something something nothing",
           "cd is cd is ab is ab and ef")

str_count(texts, "ab|cd|ef")

[1] 4 0 5

If you can't hard-code the pattern as above, you can build it from the keywords vector instead:

str_count(texts, paste(keywords, collapse = "|"))
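To get the counts into the data frame from the question, something along these lines should work. It is a sketch that assumes the texts are in the same order as the rows of df, and it uses stringr's regex() helper with ignore_case = TRUE to mirror the ignore.case = T from the original loop.

```r
library(stringr)

keywords <- c("ab", "cd", "ef")
texts <- c("ab is ef when ef is ef",
           "something something nothing",
           "cd is cd is ab is ab and ef")
df <- data.frame(location = c("area1", "area2", "area3"),
                 temperature = c("15", "5", "23"))

# Build one alternation pattern from the keywords, matched case-insensitively
pattern <- regex(paste(keywords, collapse = "|"), ignore_case = TRUE)

# One count per text, stored next to the matching entity (assumes same order)
df$count <- str_count(texts, pattern)
df$count
# [1] 4 0 5
```

Note that this also counts keywords inside longer words (for example "ab" in "cabin"). If only whole words should count, wrapping the alternation in word boundaries, as in paste0("\\b(", paste(keywords, collapse = "|"), ")\\b"), is one option.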