I have two datasets. One has 500 entities with several measured variables. The other has 500 texts, each belonging to one entity in the first dataset. I want to search these texts for 3 keywords and count the total number of keyword occurrences in each text.
Here is some example data to work with: keywords is a vector, texts is a list of the texts (my real data is a list; I'm not sure my example list is constructed correctly), and df is the data frame with the variables for my entities:
keywords <- c("ab", "cd", "ef")
texts <- list("ab is ef when ef is ef",
              "something something nothing",
              "cd is cd is ab is ab and ef")
var1 <- c("area1", "area2", "area3")
var2 <- c("15", "5", "23")
df <- data.frame(var1, var2)
colnames(df) <- c("location", "temperature")
The right answer here is that keywords occur 4 times in the first text, 0 times in the second, and 5 times in the third. However, when I try the following it gives the wrong output:
df$count <- 0 # Store the results
# counting for all keywords
for (w in keywords) {
  df$count <- df$count + grepl(w, texts, ignore.case = TRUE)
  print(w)
}
df$count
Any tips on what I can do? Preferably with some example code?
Thanks in advance
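Your texts is a list. Is there a reason for that? Better to make it a character vector. The counting can also be simpler. Note that grepl only reports whether a pattern matches at all (TRUE or FALSE), not how many times, so your loop can add at most 1 per keyword. Maybe try the stringr package. Then you can count all keywords in one call by passing str_count a single pattern that joins the keywords with "|". If you can't set up the pattern that way, you can also count each keyword separately and sum the counts per text.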
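A minimal sketch of both approaches, assuming the stringr package is installed and texts is stored as a character vector. The names pattern, per_keyword and count2 are illustrative only and do not come from the question:

```r
library(stringr)

keywords <- c("ab", "cd", "ef")

# A plain character vector instead of a list
texts <- c("ab is ef when ef is ef",
           "something something nothing",
           "cd is cd is ab is ab and ef")

df <- data.frame(location = c("area1", "area2", "area3"),
                 temperature = c("15", "5", "23"))

# Approach 1: join the keywords into one alternation pattern, "ab|cd|ef",
# and count all matches per text in a single vectorised call
pattern <- regex(paste(keywords, collapse = "|"), ignore_case = TRUE)
df$count <- str_count(texts, pattern)

# Approach 2: count each keyword separately (one column per keyword),
# then sum across keywords for each text
per_keyword <- sapply(keywords, function(k) {
  str_count(texts, regex(k, ignore_case = TRUE))
})
df$count2 <- rowSums(per_keyword)

df
```

Both columns should hold the same per-text totals. Keep in mind that these patterns also match inside longer words (for example "ab" inside "about"); if only whole words should count, the keywords would need word boundaries such as "\\bab\\b". If your real texts is a list of single strings, unlist(texts) turns it into a character vector first.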