R inner join based on partial matching between 2 string vectors

76 Views Asked by At

I have 2 large dataframes respectively with the following columns:

> colnames(uci_ce_inv) [1] "Ndg" "CTP" "CNT_ID" "NOMINAL_EUR" "FXD_CCYPAIR" "ISIN" "SICOVAM" "TRN_ID" "SOURCE" "INCLUDED" [11] "EXCLUDED"

> colnames(assess_data) [1] "GROUP_SOURCE" "GROUP" "MD_ISS_NAME" "MD_ISS_RATING" "MD_GROUP" "MD_SECTOR" "MD_CCY" "MD_COUNTRY" [9] "MD_REGION" "MD_SEC_TYPE"

I'd like to perform an inner join based on the partial match between uci_ce_inv$TRN_ID and assess_data$GROUP. First of all, in order to do that I define relevant vectors for matching as characters:

uci_ce_inv$TRN_ID <- as.character(uci_ce_inv$TRN_ID) assess_data$GROUP <- as.character(assess_data$GROUP)

I tried different ways to perform the matching but the resulting daframe is always empty:

method 1: uci_ce_inv <- uci_ce_inv %>% fuzzy_inner_join(assess_data, by = c("TRN_ID"="GROUP"), match_fun=stri_detect_fixed)

method 2: uci_ce_inv <- uci_ce_inv %>% fuzzy_inner_join(assess_data, by = c("TRN_ID"="GROUP"), match_fun=str_detect)

method 3: uci_ce_inv <- uci_ce_inv %>% regex_inner_join(assess_data, by = c("TRN_ID"="GROUP"))

method 4: `partial_join <- function(x, y, by_x, pattern_y) { idx_x <- sapply(y[[pattern_y]], grep, x[[by_x]]) idx_y <- sapply(seq_along(idx_x), function(i) rep(i, length(idx_x[[i]])))

  df <- dplyr::bind_cols(x[unlist(idx_x), , drop = F],
                     y[unlist(idx_y), , drop = F])
  return(df)

} a <- partial_join(uci_ce_inv,assess_data, by_x="TRN_ID",pattern_y="GROUP") `

What's going wrong with this codes? Thanks in advance!

e.g.

` uci_ce_inv$TRN_ID=123456789

assess_data$GROUP=ewitwr/kdk/123456789/gfhdhfd`

all columns of assess_data for matchig rows should be added in the related uci_ce_inv row

1

There are 1 best solutions below

7
On

If your tables are `data.table~ objects, you can use this :

result_dt <- uci_ce_inv[assess_data, on = .(TRN_ID = GROUP), nomatch = 0]

The nomatch = 0 ensures the inner join, otherwise you get a left join on assess_data.

If your tables are data frames, this should work:

result_dt <- inner_join(uci_ce_inv, assess_data, by = c("TRN_ID" = "GROUP"))