Create column based on partial match of strings between tibbles

143 Views Asked by At

I have these tibbles:

library(tidyverse)

data <- tribble(
                 ~code, ~v1, ~v2, ~v3,
  'ENSG00000141510.14',  10,  20,  30,
  'ENSG00000133703.10',  15,  90,  50,
  'ENSG00000187848.11',  18,  17,  16,
  'ENSG00000133703.10',  55,  47,  22
)

code_info <- tribble(
              ~code,   ~name,
  'ENSG00000141510',   'p53',
  'ENSG00000133703',  'KRAS',
  'ENSG00000187848', 'P2XR2'
)

I want to create a new column data$name by comparing the columns data$code and code_info$code. In this sense, the code_info tibble kinda works as a key:value pair for the strings in data$code.

Each row in data will have its name column equal to the name column in code_info whose string in code_info$code is a partial string of the one in data$code.

In this example, the expected result would be:

 # A tibble: 4 x 5
  code                  v1    v2    v3 name 
  <chr>              <dbl> <dbl> <dbl> <chr>
1 ENSG00000141510.14    10    20    30 p53  
2 ENSG00000133703.10    15    90    50 KRAS 
3 ENSG00000187848.11    18    17    16 P2XR2
4 ENSG00000133703.10    55    47    22 KRAS

So far I've been doing this by manually hardcoding a bunch of sequential str_replace, in a mutate, like:

data %>% 
  mutate(
    name = str_replace(code, "ENSG00000141510.14", "p53"), 
    name = str_replace(name, "ENSG00000133703.10", "KRAS"), 
    name = str_replace(name, "ENSG00000187848.11", "P2XR2")
  )
# output:
# # A tibble: 4 x 5
# code                  v1    v2    v3 name 
# <chr>              <dbl> <dbl> <dbl> <chr>
# 1 ENSG00000141510.14    10    20    30 p53  
# 2 ENSG00000133703.10    15    90    50 KRAS 
# 3 ENSG00000187848.11    18    17    16 P2XR2
# 4 ENSG00000133703.10    55    47    22 KRAS

Which does the trick but obviously isn't very scalable. Is there a way to incorporate the information in code_info, so that this is done automatically, without hard-coding the replacement values?

1

There are 1 best solutions below

1
On BEST ANSWER

Try this sloution with left_join() and separate() in order to create a common variable between datasets:

library(tidyverse)
#Data
data <- tribble(
  ~code, ~v1, ~v2, ~v3,
  'ENSG00000141510.14',  10,  20,  30,
  'ENSG00000133703.10',  15,  90,  50,
  'ENSG00000187848.11',  18,  17,  16,
  'ENSG00000133703.10',  55,  47,  22
)

code_info <- tribble(
  ~code,   ~name,
  'ENSG00000141510',   'p53',
  'ENSG00000133703',  'KRAS',
  'ENSG00000187848', 'P2XR2'
)
#Code
data2 <- data %>% mutate(Dup=code) %>%
  separate(Dup,c('V1','V2'),sep = '\\.') %>% select(-V2) %>%
  left_join(code_info %>% rename(V1=code)) %>% select(-V1)

Output:

# A tibble: 4 x 5
  code                  v1    v2    v3 name 
  <chr>              <dbl> <dbl> <dbl> <chr>
1 ENSG00000141510.14    10    20    30 p53  
2 ENSG00000133703.10    15    90    50 KRAS 
3 ENSG00000187848.11    18    17    16 P2XR2
4 ENSG00000133703.10    55    47    22 KRAS