Create column based on partial match of strings between tibbles

Question

Create column based on partial match of strings between tibbles

140 Views Asked by jfaccioni At 04 April 2025 at 14:10

I have these tibbles:

library(tidyverse)

data <- tribble(
                 ~code, ~v1, ~v2, ~v3,
  'ENSG00000141510.14',  10,  20,  30,
  'ENSG00000133703.10',  15,  90,  50,
  'ENSG00000187848.11',  18,  17,  16,
  'ENSG00000133703.10',  55,  47,  22
)

code_info <- tribble(
              ~code,   ~name,
  'ENSG00000141510',   'p53',
  'ENSG00000133703',  'KRAS',
  'ENSG00000187848', 'P2XR2'
)

I want to create a new column data$name by comparing the columns data$code and code_info$code. In this sense, the code_info tibble kinda works as a key:value pair for the strings in data$code.

Each row in data will have its name column equal to the name column in code_info whose string in code_info$code is a partial string of the one in data$code.

In this example, the expected result would be:

 # A tibble: 4 x 5
  code                  v1    v2    v3 name 
  <chr>              <dbl> <dbl> <dbl> <chr>
1 ENSG00000141510.14    10    20    30 p53  
2 ENSG00000133703.10    15    90    50 KRAS 
3 ENSG00000187848.11    18    17    16 P2XR2
4 ENSG00000133703.10    55    47    22 KRAS

So far I've been doing this by manually hardcoding a bunch of sequential str_replace, in a mutate, like:

data %>% 
  mutate(
    name = str_replace(code, "ENSG00000141510.14", "p53"), 
    name = str_replace(name, "ENSG00000133703.10", "KRAS"), 
    name = str_replace(name, "ENSG00000187848.11", "P2XR2")
  )
# output:
# # A tibble: 4 x 5
# code                  v1    v2    v3 name 
# <chr>              <dbl> <dbl> <dbl> <chr>
# 1 ENSG00000141510.14    10    20    30 p53  
# 2 ENSG00000133703.10    15    90    50 KRAS 
# 3 ENSG00000187848.11    18    17    16 P2XR2
# 4 ENSG00000133703.10    55    47    22 KRAS

Which does the trick but obviously isn't very scalable. Is there a way to incorporate the information in code_info, so that this is done automatically, without hard-coding the replacement values?

Original Q&A

There are 1 best solutions below

**Duck** · Accepted Answer

Try this sloution with left_join() and separate() in order to create a common variable between datasets:

library(tidyverse)
#Data
data <- tribble(
  ~code, ~v1, ~v2, ~v3,
  'ENSG00000141510.14',  10,  20,  30,
  'ENSG00000133703.10',  15,  90,  50,
  'ENSG00000187848.11',  18,  17,  16,
  'ENSG00000133703.10',  55,  47,  22
)

code_info <- tribble(
  ~code,   ~name,
  'ENSG00000141510',   'p53',
  'ENSG00000133703',  'KRAS',
  'ENSG00000187848', 'P2XR2'
)
#Code
data2 <- data %>% mutate(Dup=code) %>%
  separate(Dup,c('V1','V2'),sep = '\\.') %>% select(-V2) %>%
  left_join(code_info %>% rename(V1=code)) %>% select(-V1)

Output:

# A tibble: 4 x 5
  code                  v1    v2    v3 name 
  <chr>              <dbl> <dbl> <dbl> <chr>
1 ENSG00000141510.14    10    20    30 p53  
2 ENSG00000133703.10    15    90    50 KRAS 
3 ENSG00000187848.11    18    17    16 P2XR2
4 ENSG00000133703.10    55    47    22 KRAS

Create column based on partial match of strings between tibbles

There are 1 best solutions below

Related Questions in R

Related Questions in TIDYVERSE

Related Questions in STRINGR

Related Questions in DPLYR

Related Questions in TIBBLE

Trending Questions

Popular # Hahtags

Popular Questions