R subset/keep all rows with at least two specific text strings


I have a dataframe with different text excerpts.

I want to keep all rows whose text contains at least two terms from my small dictionary ("poverty|report|alarming|inflation"), or the same term at least twice (for example, "report" occurring twice in one text).

texts <- data.frame(text = c("report highlights that poverty is widespread", "there is inflation", "alarming reports", "thanks for listening"), id = 1:4, group = 4:7)

texts[grepl("poverty|report|alarming|inflation", texts$text, ignore.case = TRUE), ]
# I don't want this:
#                                           text id group
# 1 report highlights that poverty is widespread  1     4
# 2                           there is inflation  2     5
# 3                             alarming reports  3     6

But I want this:

#                                           text id group
# 1 report highlights that poverty is widespread  1     4
# 3                             alarming reports  3     6


There are 2 best solutions below

On BEST ANSWER

Try this base R approach:

# Data
texts <- data.frame(text = c("report highlights that poverty is widespread", "there is inflation", "alarming reports", "thanks for listening"), id = 1:4, group = 4:7, stringsAsFactors = FALSE)
# Index: split each text into words and count the words that contain a dictionary term
Index <- apply(texts[, 1, drop = FALSE], 1, function(x) sum(grepl("poverty|report|alarming|inflation",
                                                       unlist(strsplit(x, split = ' ')),
                                                       ignore.case = TRUE)))
# Subset: keep rows with at least two matching words
texts[which(Index >= 2), ]

Output:

                                          text id group
1 report highlights that poverty is widespread  1     4
3                             alarming reports  3     6
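
The approach above counts whitespace-separated words that contain a term, so two terms joined without a space would count only once. As a hedged alternative (a sketch, not part of the original answer), the matches themselves can be counted with `gregexpr()` and `regmatches()`; the names `pattern`, `matches`, `n_matches`, and `result` are illustrative only.

```r
# Same example data as in the question
texts <- data.frame(
  text = c("report highlights that poverty is widespread", "there is inflation",
           "alarming reports", "thanks for listening"),
  id = 1:4, group = 4:7, stringsAsFactors = FALSE
)

pattern <- "poverty|report|alarming|inflation"

# gregexpr() locates every match in each string; regmatches() extracts them
# (a string without any match yields character(0))
matches <- regmatches(texts$text, gregexpr(pattern, texts$text, ignore.case = TRUE))

# Number of matches per row
n_matches <- lengths(matches)

# Keep rows with at least two matches (different terms or the same term twice)
result <- texts[n_matches >= 2, ]
result
```

On the example data this keeps rows 1 and 3, the same rows as the word-based index.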

Does this work?

library(stringr)
library(dplyr)
texts %>% filter(str_count(text, pattern = "poverty|report|alarming|inflation") > 1)
#                                           text id group
# 1 report highlights that poverty is widespread  1     4
# 2                             alarming reports  3     6
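
Note that `str_count()` is case-sensitive by default, whereas the question used `ignore.case`. A minimal sketch of a case-insensitive variant follows, assuming the dictionary is kept as a character vector; `terms`, `pattern`, `extra`, `counts`, and `kept` are hypothetical names introduced for illustration.

```r
library(stringr)

texts <- data.frame(
  text = c("report highlights that poverty is widespread", "there is inflation",
           "alarming reports", "thanks for listening"),
  id = 1:4, group = 4:7, stringsAsFactors = FALSE
)

# Hypothetical dictionary vector, collapsed into a single alternation pattern
terms <- c("poverty", "report", "alarming", "inflation")
pattern <- regex(paste(terms, collapse = "|"), ignore_case = TRUE)

# Extra strings (not from the question) to illustrate repeated and mixed-case terms
extra <- c("Report after report", "Poverty and INFLATION", "no match here")
counts <- str_count(extra, pattern)
counts

# Apply the same count to the example data
kept <- texts[str_count(texts$text, pattern) >= 2, ]
kept
```

Here the repeated term and the mixed-case terms each count as two matches, and `kept` contains the rows with `id` 1 and 3.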