I'm working on automating toxicology read-in and import for unique case IDs from flat .csv files in R/RStudio. I've been asked to generate a spreadsheet that compares the differences between the manual toxicology process and the automated process.

I have a data.frame/tibble with a casenumber (the unique ID) column, a column containing the substances created during the automated read-in (auto.tox in the example), and a column containing the current, "production" toxicology substances that were created manually by someone reading and physically typing in substances that returned a result (production.tox in the example).

I need to compare the "auto.tox" column and the "production.tox" columns below to assess for precise string differences between each column in each row (for each casenumber). I believe example data will better help illustrate what I mean.

df <- tribble(
  ~casenumber, ~auto.tox, ~production.tox,
  "2023-1",   c("Gabapentin, Alprazolam, Ethanol"), c("Alprazolam, Gabapentin, Ethanol"),
  "2023-2",   c("Alprazolam, Fentanyl, Ethanol"), c("Fentanyl, Ethanol"),
  "2023-3",   c("Fentanyl, Norfentanyl, 4-ANPP, Acetyl Fentanyl"), c("Fentanyl, Norfentanyl"))

I would like to add a new column called "different_substances," that contains any substance(s) present in the "auto.tox" column that are not present in the "production.tox" column for each casenumber.

I've tried:

  • using dplyr::anti_join() with separate data.frames with substances by case and joining on casenumber
  • using list(setdiff(auto.tox, production.tox))
  • mapply(vecsets::vsetdiff())

None of which gave me the precise output I'm looking for.

The desired output is:

casenumber  different_substances
"2023-1"      NA or "None" #(order doesn't matter)
"2023-2"     "Alprazolam"
"2023-3"     "4-ANPP, Acetyl Fentanyl"

I realize the last substance for 2023-3 should be spelled "acetylfentanyl" but our lookup table has a few misspellings, so the separate word is important, and it's important for "Fentanyl" alone to be recognized as different than "Acetyl Fentanyl".

I'm open to any general suggestions as well and also changing the format of the two substance columns to a more efficient format if needed.

If anyone can help, I'd greatly appreciate it. Thank you.

2

There are 2 best solutions below

0
user438383 On BEST ANSWER

Here is one approach:

library(tidyverse)
# function to split the columns and remove whitespace

f_split = function(x) {
    lapply(str_split(x, ","), str_trim)
}

## function to find chemicals not in one set
overlap = function(x, y) x[!x %in% y]

# apply f_split
auto_tox = f_split(df$auto.tox)
production_tox = f_split(df$production.tox)


df %>% mutate(diff = sapply(mapply(overlap, auto_tox, production_tox), paste0, collapse=", ")) %>%
    mutate(diff = if_else(diff == "", "None", diff))
# A tibble: 3 × 5
  casenumber auto.tox
  <chr>      <chr>
1 2023-1     Gabapentin, Alprazolam, Ethanol
2 2023-2     Alprazolam, Fentanyl, Ethanol
3 2023-3     Fentanyl, Norfentanyl, 4-ANPP, Acetyl Fentanyl
  production.tox                  difff
  <chr>                           <chr>
1 Alprazolam, Gabapentin, Ethanol ""
2 Fentanyl, Ethanol               "Alprazolam"
3 Fentanyl, Norfentanyl           "4-ANPP,Acetyl Fentanyl"
  diff
  <chr>
1 None
2 Alprazolam
3 4-ANPP,Acetyl Fentanyl
2
LMc On
library(dplyr)
library(stringr)

df |>
  rowwise() |>
  mutate(different_substances = setdiff(str_split_1(auto.tox, ', '), 
                                        str_split_1(production.tox, ', ')) |>
           str_flatten_comma() |>
           na_if(x = _, "")) |>
  ungroup()

I would recommend using NA instead of "None" since R has more functionality for dealing with NA.

Output

  casenumber auto.tox                                       production.tox                  different_substances   
  <chr>      <chr>                                          <chr>                           <chr>                  
1 2023-1     Gabapentin, Alprazolam, Ethanol                Alprazolam, Gabapentin, Ethanol NA                     
2 2023-2     Alprazolam, Fentanyl, Ethanol                  Fentanyl, Ethanol               Alprazolam             
3 2023-3     Fentanyl, Norfentanyl, 4-ANPP, Acetyl Fentanyl Fentanyl, Norfentanyl           4-ANPP, Acetyl Fentanyl