Trying to loop over list of patterns in str_detect( )

I'm trying to create a list of patterns to detect within lists of strings stored in a list-column. For each pattern in the list, I want a function that uses sum(str_detect( )) to count the strings in a list that contain that pattern. I then want to sum the counts that are greater than 1 and divide that by the sum of all the counts. I want to apply this across a list-column in which each observation holds a list of strings. Right now I'm not getting consistent results from RStudio: with reprex( ) I get what I would expect, but without it I do not.

Here is a toy example illustrating my current workflow w/ reprex:

  library(magrittr)
  library(dplyr)
  library(tidyr)
  library(rebus)
  library(foreach)
  library(stringr)

###Creating example tibble
example_tibble <- tibble(id = 1:2, strings = list(c("The cat scratched the dog", "It was a dark and stormy night", "Cats kill birds"), 
                                                  c("A big, scary dog", "The dog chased the kitty")))
###Creating list of patterns to match
PatternsList<-list(c("dog"), c("cat"), c("bird"))
String_Comparison<-function(x, PatternsList){
  DescriptorCounts<-foreach(i = seq_along(PatternsList)) %do% { 
    sum(str_detect(x, regex(pattern = PatternsList[i], ignore_case = TRUE)))
  }
  ###Using if statement instead of filter
  common_descriptors_sum <- if(any(unlist(DescriptorCounts) > 2)) {
    sum(unlist(DescriptorCounts[unlist(DescriptorCounts) > 2]))
  }
  ###Get ratio 
  common_ratio <- common_descriptors_sum / sum(unlist(DescriptorCounts))
  return(common_ratio)
}
ExampleTibble_WithComparedStrings <- example_tibble %>% 
  rowwise() %>% 
  mutate(StringsCompared = list(String_Comparison(strings, PatternsList)))
#> Warning: There were 6 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `StringsCompared = list(String_Comparison(strings,
#>   PatternsList))`.
#> ℹ In row 1.
#> Caused by warning in `regex()`:
#> ! Coercing `pattern` to a plain character vector.
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
ExampleTibble_WithComparedStrings
#> # A tibble: 2 × 3
#> # Rowwise: 
#>      id strings   StringsCompared
#>   <int> <list>    <list>         
#> 1     1 <chr [3]> <dbl [0]>      
#> 2     2 <chr [2]> <dbl [0]>
###Returns NotANumber, which is not what I expect
###Isolating DescriptorCounts to demonstrate issue
DescriptorCounts <- function(x, PatternsList) {
  foreach(i = seq_along(PatternsList)) %do% { 
    sum(str_detect(x, regex(pattern = PatternsList[[i]], ignore_case = TRUE)))
  }
}
###Will generate lists of [0,0,0]
Output <- example_tibble %>% 
  rowwise() %>% 
  mutate(Output = list(DescriptorCounts(x = strings, PatternsList = PatternsList)))
Output$Output
#> [[1]]
#> [[1]][[1]]
#> [1] 1
#> 
#> [[1]][[2]]
#> [1] 2
#> 
#> [[1]][[3]]
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> [1] 2
#> 
#> [[2]][[2]]
#> [1] 0
#> 
#> [[2]][[3]]
#> [1] 0
###Okay the actual values are in there, but irretrievable?????

Created on 2024-02-04 with reprex v2.1.0

However, when I run the same code in RStudio, I get the following:

> ExampleTibble_WithComparedStrings <- example_tibble %>% 
+   rowwise() %>% 
+   mutate(StringsCompared = list(String_Comparison(strings, PatternsList)))
> ExampleTibble_WithComparedStrings
#A tibble: 2 × 3
#Rowwise: 
     id strings   StringsCompared
  <int> <list>    <list>         
1     1 <chr [3]> <dbl [0]>      
2     2 <chr [2]> <dbl [0]>      
> ###Returns NotANumber, which is not what I expect
> ###Isolating DescriptorCounts to demonstrate issue
> DescriptorCounts <- function(x, PatternsList) {
+   foreach(i = seq_along(PatternsList)) %do% { 
+     sum(str_detect(x, regex(pattern = PatternsList[[i]], ignore_case = TRUE)))
+   }
+ }

> ###Will generate lists of [0,0,0]
> Output <- example_tibble %>% 
+   rowwise() %>% 
+   mutate(DescriptorCounts = list(DescriptorCounts(x = strings, PatternsList = PatternsList)))
> Output$DescriptorCounts
[[1]]
[[1]][[1]]
[1] 0

[[1]][[2]]
[1] 0

[[1]][[3]]
[1] 0


[[2]]
[[2]][[1]]
[1] 0

[[2]][[2]]
[1] 0

[[2]][[3]]
[1] 0

The 1st version of Output$Output is actually what I expected. That is to say that within the list at example_tibble$strings[[1]] "dog" is present once, "cat" is present twice, and "bird" is present once. My questions are:

  1. How do I get RStudio itself to produce the first (correct) output, which I obtained when I ran my code through reprex( )?
  2. Once that works, how do I manipulate the lists from DescriptorCounts so that, for each row, I can take the conditional sum of the contents and divide it by sum(DescriptorCounts), ultimately returning a numeric column with this ratio?
    Is it a matter of unlisting the results of DescriptorCounts one more time?
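
On question 1, one possible explanation, offered here as an unconfirmed assumption rather than a diagnosis: as far as I know, both rebus and stringr export a function called regex(), and a bare regex() call uses whichever package was attached last. reprex runs the script in a fresh session, where stringr is attached after rebus, while a long-running RStudio session may have attached them in a different order. The sketch below sidesteps that by qualifying the calls with stringr:: and returns a plain numeric vector instead of a nested list. count_patterns is a hypothetical helper name, not something from the code above.

```r
library(stringr)

# Hypothetical helper: for each pattern, count the strings in x that
# contain it, ignoring case. Returns a numeric vector named by pattern.
count_patterns <- function(x, patterns) {
  vapply(
    patterns,
    function(p) sum(stringr::str_detect(x, stringr::regex(p, ignore_case = TRUE))),
    numeric(1)
  )
}

strings <- c("The cat scratched the dog",
             "It was a dark and stormy night",
             "Cats kill birds")
counts <- count_patterns(strings, c("dog", "cat", "bird"))
print(counts)  # dog 1, cat 2, bird 1

# Lists the attached packages that define regex(); the first entry is
# the one an unqualified regex() call would use in this session.
print(find("regex"))
```

Running find("regex") in both the RStudio console and inside reprex( ) would show whether the two sessions resolve regex() differently.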

For what it's worth, I have confirmed that the common_descriptors_sum and common_ratio portions of the String_Comparison function work in principle with numeric examples.
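
A numeric-only sketch of the ratio step, using the row 1 counts from the reprex output above. It relies on two base R behaviours: an if( ) with no else returns NULL when its condition is FALSE, and NULL divided by a number gives numeric(0), which would be consistent with the <dbl [0]> cells in the tibble. common_ratio_from_counts and threshold are hypothetical names, and the default threshold of 1 and the choice to return 0 when no count passes it are assumptions about the intended behaviour.

```r
counts <- c(1, 2, 1)  # dog, cat, bird counts for row 1 of the reprex

# if() without else yields NULL when no count exceeds 2
common_sum <- if (any(counts > 2)) sum(counts[counts > 2])
print(is.null(common_sum))               # TRUE
print(length(common_sum / sum(counts)))  # 0, i.e. numeric(0)

# Hypothetical alternative: sum() over an empty selection is 0,
# so the if() is not needed and a single number always comes back.
common_ratio_from_counts <- function(counts, threshold = 1) {
  counts <- unlist(counts)  # accepts a list of counts or a numeric vector
  sum(counts[counts > threshold]) / sum(counts)
}

print(common_ratio_from_counts(c(1, 2, 1)))                 # 0.5
print(common_ratio_from_counts(c(1, 2, 1), threshold = 2))  # 0
```

Because this version returns a single number, it could probably be called inside mutate( ) under rowwise( ) without wrapping the result in list( ), which should give a plain numeric column. If every count in a row is 0, the result is 0 / 0, which is NaN.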
