I'm trying to detect a list of patterns within the strings stored in a list-column. I want to write a function that, for each element of the list of patterns, uses sum(str_detect( )) to count the number of strings in a list that contain that particular pattern. Then I want to sum the counts that are >1 and divide that by the sum of all the counts returned by str_detect. I want to iterate this over a list-column, where each row of the column holds the list of strings for one observation. Right now I'm not getting consistent results from RStudio: with reprex( ) I get what I would expect, but without it I do not.
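To make the goal concrete, the calculation for a single character vector could be sketched roughly as below. The helper names `count_matches` and `common_ratio` and the `threshold` argument are hypothetical and do not appear in the code in this post, and the patterns are held in a plain character vector rather than a list. `stringr::regex()` is namespaced explicitly so that the call does not depend on which packages happen to be attached.

```r
library(stringr)

# Hypothetical helper: number of strings that contain each pattern
count_matches <- function(strings, patterns) {
  vapply(
    patterns,
    function(p) sum(str_detect(strings, stringr::regex(p, ignore_case = TRUE))),
    integer(1)
  )
}

# Hypothetical helper: share of all matches that come from patterns
# whose count exceeds `threshold` (assumed to be 1 here)
common_ratio <- function(counts, threshold = 1) {
  total <- sum(counts)
  if (total == 0) return(NA_real_)  # avoid 0 / 0 when nothing matches
  sum(counts[counts > threshold]) / total
}

strings  <- c("The cat scratched the dog",
              "It was a dark and stormy night",
              "Cats kill birds")
patterns <- c("dog", "cat", "bird")

counts <- count_matches(strings, patterns)
ratio  <- common_ratio(counts)
```

Returning `NA_real_` when there are no matches at all is a design choice in this sketch; it keeps the result a length-one number for every row.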
Here is a toy example illustrating my current workflow w/ reprex:
library(magrittr)
library(dplyr)
library(tidyr)
library(rebus)
library(foreach)
library(stringr)
###Creating example tibble
example_tibble <- tibble(
  id = 1:2,
  strings = list(
    c("The cat scratched the dog", "It was a dark and stormy night", "Cats kill birds"),
    c("A big, scary dog", "The dog chased the kitty")
  )
)
###Creating list of patterns to match
PatternsList<-list(c("dog"), c("cat"), c("bird"))
String_Comparison <- function(x, PatternsList) {
  DescriptorCounts <- foreach(i = seq_along(PatternsList)) %do% {
    sum(str_detect(x, regex(pattern = PatternsList[i], ignore_case = TRUE)))
  }
  ###Using if statement instead of filter
  common_descriptors_sum <- if (any(unlist(DescriptorCounts) > 2)) {
    sum(unlist(DescriptorCounts[unlist(DescriptorCounts) > 2]))
  }
  ###Get ratio
  common_ratio <- common_descriptors_sum / sum(unlist(DescriptorCounts))
  return(common_ratio)
}
ExampleTibble_WithComparedStrings <- example_tibble %>%
  rowwise() %>%
  mutate(StringsCompared = list(String_Comparison(strings, PatternsList)))
#> Warning: There were 6 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `StringsCompared = list(String_Comparison(strings,
#> PatternsList))`.
#> ℹ In row 1.
#> Caused by warning in `regex()`:
#> ! Coercing `pattern` to a plain character vector.
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
ExampleTibble_WithComparedStrings
#> # A tibble: 2 × 3
#> # Rowwise:
#> id strings StringsCompared
#> <int> <list> <list>
#> 1 1 <chr [3]> <dbl [0]>
#> 2 2 <chr [2]> <dbl [0]>
###Returns NotANumber, which is not what I expect
###Isolating DescriptorCounts to demonstrate issue
DescriptorCounts <- function(x, PatternsList) {
  foreach(i = seq_along(PatternsList)) %do% {
    sum(str_detect(x, regex(pattern = PatternsList[[i]], ignore_case = TRUE)))
  }
}
###Will generate lists of [0,0,0]
Output <- example_tibble %>%
  rowwise() %>%
  mutate(Output = list(DescriptorCounts(x = strings, PatternsList = PatternsList)))
Output$Output
#> [[1]]
#> [[1]][[1]]
#> [1] 1
#>
#> [[1]][[2]]
#> [1] 2
#>
#> [[1]][[3]]
#> [1] 1
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 2
#>
#> [[2]][[2]]
#> [1] 0
#>
#> [[2]][[3]]
#> [1] 0
###Okay the actual values are in there, but irretrievable?????
Created on 2024-02-04 with reprex v2.1.0

However, when I run the same code in RStudio I get the following:
> ExampleTibble_WithComparedStrings <- example_tibble %>%
+ rowwise() %>%
+ mutate(StringsCompared = list(String_Comparison(strings, PatternsList)))
> ExampleTibble_WithComparedStrings
#A tibble: 2 × 3
#Rowwise:
id strings StringsCompared
<int> <list> <list>
1 1 <chr [3]> <dbl [0]>
2 2 <chr [2]> <dbl [0]>
> ###Returns NotANumber, which is not what I expect
> ###Isolating DescriptorCounts to demonstrate issue
> DescriptorCounts <- function(x, PatternsList) {
+ foreach(i = seq_along(PatternsList)) %do% {
+ sum(str_detect(x, regex(pattern = PatternsList[[i]], ignore_case = TRUE)))
+ }
+ }
> ###Will generate lists of [0,0,0]
> Output <- example_tibble %>%
+ rowwise() %>%
+ mutate(DescriptorCounts = list(DescriptorCounts(x = strings, PatternsList = PatternsList)))
> Output$DescriptorCounts
[[1]]
[[1]][[1]]
[1] 0
[[1]][[2]]
[1] 0
[[1]][[3]]
[1] 0
[[2]]
[[2]][[1]]
[1] 0
[[2]][[2]]
[1] 0
[[2]][[3]]
[1] 0
The 1st version of Output$Output is actually what I expected. That is to say that within the list at example_tibble$strings[[1]] "dog" is present once, "cat" is present twice, and "bird" is present once. My questions are:
- How do I get rStudio itself to produce the first, correct, output, which I obtained when I ran my code through reprex( )?
- Once that is the case how do I manipulate the lists for DescriptorCounts such that I can take the conditional sum of its contents for each row, divide that by sum(DescriptorCounts), ultimately returning a numeric column with this ratio?
Is it a matter of unlisting the results of DescriptorCounts one more time?
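On the second question, here is a minimal sketch of one way to collapse per-row count lists, like the ones printed for Output$Output, into a plain numeric vector. `ratio_from_counts` and `threshold` are hypothetical names, and the threshold of 1 is an assumption taken from the description at the top of the post. One detail that may explain the `<dbl [0]>` entries: in R, an `if` with no `else` evaluates to NULL when its condition is FALSE, and dividing NULL by a number gives `numeric(0)`.

```r
# Counts copied from the reprex output: one list of three counts per row
counts_col <- list(list(1L, 2L, 1L), list(2L, 0L, 0L))

# Hypothetical helper: flatten one row's counts and return a single ratio
ratio_from_counts <- function(counts, threshold = 1) {
  counts <- unlist(counts)
  total <- sum(counts)
  if (total == 0) return(NA_real_)  # no matches at all: ratio is undefined
  # sum() of an empty selection is 0, so no if/else is needed here
  sum(counts[counts > threshold]) / total
}

# vapply() enforces exactly one double per row, giving a numeric vector
ratios <- vapply(counts_col, ratio_from_counts, numeric(1))

# An `if` without `else` yields NULL, and NULL / 4 has length zero
empty_result <- NULL / 4
```

Inside `mutate()`, calling the same helper through `vapply()` or `purrr::map_dbl()` on the list-column (without `rowwise()` and without wrapping the result in `list()`) would be expected to give a `<dbl>` column rather than a `<list>` column.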
For what it's worth, I have confirmed that the common_descriptors_sum and the common_descriptors_ratio portions of the NoteComparison function code work in principle w/ numeric examples.
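Because reprex( ) runs the code in a fresh R session, one difference worth ruling out is which package a bare `regex()` call resolves to in the interactive session, since more than one attached package can define a function with the same name. The sketch below uses only base R introspection plus stringr; that masking is involved here is an assumption, not a confirmed diagnosis.

```r
library(stringr)

# Name of the namespace that a bare `regex` resolves to in this session
regex_source <- environmentName(environment(regex))

# Names that are defined in more than one place on the search path
masked <- conflicts(detail = TRUE)

# Namespacing the call sidesteps the search order entirely
n_dog <- sum(str_detect(
  c("A big, scary dog", "The dog chased the kitty"),
  stringr::regex("dog", ignore_case = TRUE)
))
```

If `regex_source` is anything other than "stringr" in the RStudio session, writing `stringr::regex()` inside the functions, or restarting R and re-running the script from the top, would be a reasonable next check.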