Efficient way to implement rule-based value assignment

132 Views Asked by At

I'm trying to come up with an elegant, rule-based way to assign codes to rows in a data frame based on combinations of values in columns, using this data:

library(tidyr)
df <- crossing(yr2018=c("M","S","W"),
               yr2019=c("M","S","W"),
                yr2020=c("M","S","W")) %>%
  print(n=27)

# A tibble: 27 × 3
   yr2018 yr2019 yr2020
   <chr>  <chr>  <chr> 
 1 M      M      M     
 2 M      M      S     
 3 M      M      W     
 4 M      S      M     
 5 M      S      S     
 6 M      S      W     
 7 M      W      M     
 8 M      W      S     
 9 M      W      W     
10 S      M      M     
11 S      M      S     
12 S      M      W     
13 S      S      M     
14 S      S      S     
15 S      S      W     
16 S      W      M     
17 S      W      S     
18 S      W      W     
19 W      M      M     
20 W      M      S     
21 W      M      W     
22 W      S      M     
23 W      S      S     
24 W      S      W     
25 W      W      M     
26 W      W      S     
27 W      W      W     
>

What I want to end up with is a column with codes applied with rules such the following:

  • if all 3 values in yr2018, yr2019, and yr2020 are the same (MMM, SSS, or WWW), then set the new column value to the concatenation of "CON" and whatever the unique value is, so either "CONM", "CONS", or "CONW".
  • if the first and third columns are the same, but the second is different, then concatenate the two unique values together as exactly "MS","MW", or "SW", in that order, depending on which two unique values are in the row, regardless of the order of the values in the columns.
  • if all three are different, regardless of order, then "MSW"
  • if the last two are the same, but different from the first, then concatenate "CON" with the last value, so either "CONM", "CONS", or "CONW"
  • lastly, if the first two are the same and the last different, then concatenate "CON" with the first column, so either "CONM", "CONS", or "CONW"

This feels like a big, ugly if statement, but I'm hoping for something more elegant, especially since my real data is actually 4x5 (625 rows). It also feels like maybe regular expressions, which I struggle with.

I started looking into row-wise functions and found rowwise() as a start to logically reconfigure the data frame, but it looks like the number of functions that can operate that way are limited.

All guidance welcome!

4

There are 4 best solutions below

5
jpsmith On BEST ANSWER

You can use mutate and case_when to efficiently satisfy these conditions. sort in the second logic will organize the letters as you described.

Since case_when evaluates iteratively, you may be able to parse this down to make it more elegant, but as written it should follow your exact conditions:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(new_column = case_when(
    yr2018 == yr2019 & yr2019 == yr2020 ~ paste0("CON", yr2018),
    yr2018 == yr2020 ~ paste(sort(c(yr2019, yr2020)), collapse = ""),
    yr2018 != yr2019 & yr2019 != yr2020 & yr2018 != yr2020 ~ "MSW",
    yr2019 == yr2020 & yr2018 != yr2020 ~ paste0("CON", yr2020),
    yr2018 == yr2019 & yr2018 != yr2020 ~ paste0("CON", yr2018)
  )) 

Output:

   yr2018 yr2019 yr2020 new_column
   <chr>  <chr>  <chr>  <chr>     
 1 M      M      M      CONM      
 2 M      M      S      CONM      
 3 M      M      W      CONM      
 4 M      S      M      MS        
 5 M      S      S      CONS      
 6 M      S      W      MSW       
 7 M      W      M      MW        
 8 M      W      S      MSW       
 9 M      W      W      CONW      
10 S      M      M      CONM      
11 S      M      S      MS        
12 S      M      W      MSW       
13 S      S      M      CONS      
14 S      S      S      CONS      
15 S      S      W      CONS      
16 S      W      M      MSW       
17 S      W      S      SW        
18 S      W      W      CONW      
19 W      M      M      CONM      
20 W      M      S      MSW       
21 W      M      W      MW        
22 W      S      M      MSW       
23 W      S      S      CONS      
24 W      S      W      SW        
25 W      W      M      CONW      
26 W      W      S      CONW      
27 W      W      W      CONW
8
Onyambu On

You could use str_replace:

df %>%
 mutate(new_column = str_replace(exec(str_c, !!!.),".*?(.)\\1+.*", "CON\\1")%>%
     str_replace('((.).)\\2', "\\1"))

# A tibble: 27 × 4
   yr2018 yr2019 yr2020 new_column
   <chr>  <chr>  <chr>  <chr>     
 1 M      M      M      CONM      
 2 M      M      S      CONM      
 3 M      M      W      CONM      
 4 M      S      M      MS        
 5 M      S      S      CONS      
 6 M      S      W      MSW       
 7 M      W      M      MW        
 8 M      W      S      MWS       
 9 M      W      W      CONW      
10 S      M      M      CONM    

You could also use gsubfn::gsubfn:

 df %>%
   mutate(newcol = gsubfn::gsubfn(".*(.)\\1+.*|((.).)\\3", 
                     function(x,y,z)if(nzchar(z))y else str_c('CON', x), 
                           exec(str_c, !!!.), backref = -3))
# A tibble: 27 × 4
   yr2018 yr2019 yr2020 newcol
   <chr>  <chr>  <chr>  <chr> 
 1 M      M      M      CONM  
 2 M      M      S      CONM  
 3 M      M      W      CONM  
 4 M      S      M      MS    
 5 M      S      S      CONS  
 6 M      S      W      MSW   
 7 M      W      M      MW    
 8 M      W      S      MWS   
 9 M      W      W      CONW  
10 S      M      M      CONM  
1
hygtfrde On

This may be a better solution for larger data frames since we can fine-tune each rule, besides the case_when and str_replace answers already posted. The gather() method can convert the data frame into a longer format with 3 columns, adding a 'year' and 'value'. Then we can use rowwise to apply conditions to each row. The 5 rules are applied with case_when based on our new columns year and value. Then ungroup() data back into its original form.

library(tidyr)
library(dplyr)

df <- crossing(yr2018 = c("M", "S", "W"),
               yr2019 = c("M", "S", "W"),
               yr2020 = c("M", "S", "W"))

df <- df %>%
  gather(year, value) %>%
  rowwise() %>%
  mutate(new_column = case_when(
    # Rule 1: All three values are the same
    all(value == value[1]) ~ paste0("CON", value[1]),

    # Rule 2: First and third columns are the same
    year[1] == year[3] & value[1] != value[2] ~ paste0(value[1], value[2]),
    year[1] == year[3] & value[1] != value[3] ~ paste0(value[1], value[3]),
    year[2] == year[3] & value[2] != value[1] ~ paste0(value[2], value[1]),

    # Rule 3: All three values are different
    all(value != value[1]) ~ "MSW",

    # Rule 4: Last two values are the same
    value[2] == value[3] & value[1] != value[2] ~ paste0("CON", value[2]),
    value[1] == value[3] & value[1] != value[2] ~ paste0("CON", value[3]),

    # Rule 5: First two values are the same
    value[1] == value[2] & value[1] != value[3] ~ paste0("CON", value[1])
  )) %>%
  ungroup() %>%
  select(-year, -value)

print(df)
1
GKi On

A way could be to use rle and in case there are consecutive paste this to CON otherwise sort the unique values.

sapply(apply(df, 1, rle, simplify = FALSE), \(x)
       if(is.na(i <- which(x$lengths > 1)[1]))
           paste(sort(unique(x$values)), collapse="")
       else  paste0("CON", x$value[i]) )
# [1] "CONM" "CONM" "CONM" "MS"   "CONS" "MSW"  "MW"   "MSW"  "CONW" "CONM"
#[11] "MS"   "MSW"  "CONS" "CONS" "CONS" "MSW"  "SW"   "CONW" "CONM" "MSW" 
#[21] "MW"   "MSW"  "CONS" "SW"   "CONW" "CONW" "CONW"