tidyverse/dplyr solution for str_detect case/mutate

I have seen snippets of this floating around, but no full answers as yet, so I thought I would ask.

I'm working on a function to assign a value based on the presence or absence of some keywords ranked by severity, like so:

severity <- c("kw1", "kw2", "kw3", "kw4", "kw5", "kw6")

The function should go through a single column in a dataset and, for each row, assign the first (most severe) entry in the severity list that appears in that row's string.

Following the question "How can I check if multiple strings exist in another string?", I realized you could detect multiple strings with str_detect:

severity_rankings <- severity_df |>
  dplyr::mutate(
    # Classify severity based on strings
    severity_kw = dplyr::case_when(
      if (any(stringr::str_detect(tolower(severity_string),severity))) ~ severity[min(which(str_detect(tolower(severity_string),severity) == TRUE))],
      .default = NA
    ))

But this keeps throwing an error like it's trying to parse the whole column:

Error in `dplyr::mutate()`:
ℹ In argument: `severity_kw = dplyr::case_when(...)`.
Caused by error in `stringr::str_detect()`:
! Can't recycle `string` (size 20) to match `pattern` (size 6).
Run `rlang::last_trace()` to see where the error occurred.

Ultimately, what I would like is an output along these lines:

ID          severity_string severity_kw
1     kw1 with KW2 and kw6         kw1
2                      kw6         kw6
3   kw6 with kW5, kw2 also         kw2
4                      KW3         kw3
5                      KW5         kw5
6   KW4 with kw2, kw1 also         kw1
7                      KW1         kw1
8                      KW2         kw2
9             KW4 with KW5         kw4
10                      KW6         kw6
11 KW6 with KW1 on the side         kw1
12     KW2 with KW4 and KW1         kw1
13             kw5 with kw6         kw5
14                      kw7        <NA>
15              KW3 and KW2         kw2
16                      KW2         kw2
17              KW1 and KW6         kw1
18                      KW3         kw3
19              KW3 and KW1         kw1
20                      kw1         kw1

I'm sure it's bad syntax or the wrong dplyr call on my part, but not sure where to start. Any and all advice would be appreciated.

For generating the initial dataframe:

severity_df <- data.frame(
    ID = c(1:20), 
    severity_string = c("kw1 with KW2 and kw6", "kw6", "kw6 with kW5, kw2 also", "KW3", "KW5",
                        "KW4 with kw2, kw1 also", "KW1", "KW2", "KW4 with KW5", "KW6",
                        "KW6 with KW1 on the side", "KW2 with KW4 and KW1", "kw5 with kw6", "kw7", "KW3 and KW2",
                        "KW2", "KW1 and KW6", "KW3", "KW3 and KW1", "kw1"),
    stringsAsFactors = FALSE
)

Answer by wurli

The issue is that you're calling str_detect() with string and pattern arguments of incompatible lengths: inside mutate(), severity_string is the whole column (20 values), while severity has 6 values. You can reproduce the error like this:

library(stringr)

str_detect(c("foo", "bar"), c("foo", "bar", "baz"))
#> Error in `str_detect()`:
#> ! Can't recycle `string` (size 2) to match `pattern` (size 3).
#> Run `rlang::last_trace()` to see where the error occurred.

I think you've got a misplaced if in there too (case_when() takes condition ~ value pairs, with no if), but that's beside the point of the question.
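
To see why the lengths matter, here is a small sketch of what does work: a single string is recycled against every pattern, so one row at a time can be checked against the whole severity vector. The names hits, first_kw, none and no_kw are just illustrative, and tolower() is used as in the question on the assumption that all keywords are lowercase.

```r
library(stringr)

severity <- c("kw1", "kw2", "kw3", "kw4", "kw5", "kw6")

# A single string is recycled against all six patterns
hits <- str_detect(tolower("kw6 with kW5, kw2 also"), severity)
hits
#> [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

# First (most severe) keyword that matched
first_kw <- severity[which(hits)[1]]
first_kw
#> [1] "kw2"

# With no match, which() is empty, so the index is NA and so is the result
none <- str_detect("kw7", severity)
no_kw <- severity[which(none)[1]]
no_kw
#> [1] NA
```

This only works per row; passing the full 20-value column alongside the 6 patterns is what triggers the recycling error.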

For your use case, I would change tack and use a tool like map_chr() with a bespoke function:

library(dplyr)
library(stringr)

severity_df |>
  mutate(
    severity_kw = severity_string |>
      # For each value of severity_string...
      purrr::map_chr(function(x) {
        # For each value of severity, most severe first...
        for (pattern in severity) {
          # Return the first keyword that matches, ignoring case
          if (str_detect(x, regex(pattern, ignore_case = TRUE))) {
            return(pattern)
          }
        }

        # If no values match, return NA
        NA_character_
      })
  )