Complex join between two dataframes

58 Views Asked by At

I am working on a very advanced join of dataframes that is complex for me. I would like to ask you for some help if possible. I have two dataframes, df1 and df2 which I include at the end as dput(). My first dataframe df1 looks like this:

df1
                             name Var
1   RODRIGUEZ PAREDES MARIA BELEN   1
2  VALLEJO JANETA PAOLO ALEXANDER   1
3             MORALES JADAN DIANA   1
4            FREIRE PASPUEL BYRON   1
5             ORTIZ PRADO ESTEBAN   1
6      HENRIQUEZ TRUJILLO AQUILES   1
7            RIVERA OLIVERO ISMAR   1
8               JARAMILLO TATIANA   1
9                   LOZADA TANNYA   1
10 GARCIA BEREGUIAIN MIGUEL ANGEL   1

It is mainly composed of latin names and a variable.

The second dataframe df2 looks like this:

df2
                             name Val1 Val2
1   RODRIGUEZ PAREDES MARIA BELEN    a    b
2           RODRIGUEZ MARIA BELEN    c    b
3  VALLEJO JANETA PAOLO ALEXANDER    a    a
4               VALLEJO ALEXANDER    b    b
5             MORALES JADAN DIANA    a    a
6            FREIRE PASPUEL BYRON    d    c
7                    FREIRE BYRON    a    c
8             ORTIZ PRADO ESTEBAN    a    a
9             ORTIZ-PRADO ESTEBAN    a    a
10     HENRIQUEZ TRUJILLO AQUILES    b    b
11              HENRIQUEZ AQUILES    a    b
12                   HENRIQUEZ A.    c    c
13      JARAMILLO VIVANCO TATIANA    a    a
14                   JARAMILLO T.    a    b
15                  LOZADA TANNYA    a    a
16 GARCIA BEREGUIAIN MIGUEL ANGEL    b    b
17            GARCIA MIGUEL ANGEL    a    a

This dataframe is essential because it contains additional information. Now, I will describe my main issue. I need to join these two dataframes in a complex task and compute a variable about the number of similar observations. Both of them have the key name which will be used for the merge but the join is very troublesome. I will explain better with an example. Let’s take the name RODRIGUEZ PAREDES MARIA BELEN from df1. I need to merge with df2 and compute a variable Number which tells how many similar names exist. In this case RODRIGUEZ PAREDES MARIA BELEN is similar/identical to RODRIGUEZ PAREDES MARIA BELEN and RODRIGUEZ MARIA BELEN from df2 so Number should be equal to 2. In addition, after the comparison I need to bring the variables that match the name. So for RODRIGUEZ PAREDES MARIA BELEN we would have a and b in Val1 and Val2.

This is a complex computation for the variable name and also I do not know which kind of join I should use to bring the other variables.

Also, there is a consideration. For example in the case of name JARAMILLO TATIANA if we compare with df2, we cannot find it. The variable Number should be 2 because there are two similar names, but in the case of the variables Val1 and Val2, they must contain the values of the first closest/identical string found in df2. So for this name, we would have Val1=a and Val2=a because the similar match was found with JARAMILLO VIVANCO TATIANA in df2.

In the end I would like to have a new dataframe like this:

                             name Var Number Val1 Val2
1   RODRIGUEZ PAREDES MARIA BELEN   1      2    a    b
2  VALLEJO JANETA PAOLO ALEXANDER   1      2    a    a
3             MORALES JADAN DIANA   1      1    a    a
4            FREIRE PASPUEL BYRON   1      2    d    c
5             ORTIZ PRADO ESTEBAN   1      2    a    a
6      HENRIQUEZ TRUJILLO AQUILES   1      3    b    b
7            RIVERA OLIVERO ISMAR   1      0 <NA> <NA>
8               JARAMILLO TATIANA   1      2    a    a
9                   LOZADA TANNYA   1      1    a    a
10 GARCIA BEREGUIAIN MIGUEL ANGEL   1      2    b    b

I have tried with left_join() or merge() but it is not possible to complete all variables, overall Number. Also, I checked fuzzyjoin package but it is not clear for me how to use this. If possible I would prefer a dplyr solution, a quick base solution or a fuzzyjoin solution would be great.

Many thanks. The data is next:

#Data 1
df1 <- structure(list(name = c("RODRIGUEZ PAREDES MARIA BELEN", "VALLEJO JANETA PAOLO ALEXANDER", 
"MORALES JADAN DIANA", "FREIRE PASPUEL BYRON", "ORTIZ PRADO ESTEBAN", 
"HENRIQUEZ TRUJILLO AQUILES", "RIVERA OLIVERO ISMAR", "JARAMILLO TATIANA", 
"LOZADA TANNYA", "GARCIA BEREGUIAIN MIGUEL ANGEL"), Var = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-10L))

#Data 2
df2 <- structure(list(name = c("RODRIGUEZ PAREDES MARIA BELEN", "RODRIGUEZ MARIA BELEN", 
"VALLEJO JANETA PAOLO ALEXANDER", "VALLEJO ALEXANDER", "MORALES JADAN DIANA", 
"FREIRE PASPUEL BYRON", "FREIRE BYRON", "ORTIZ PRADO ESTEBAN", 
"ORTIZ-PRADO ESTEBAN", "HENRIQUEZ TRUJILLO AQUILES", "HENRIQUEZ AQUILES", 
"HENRIQUEZ A.", "JARAMILLO VIVANCO TATIANA", "JARAMILLO T.", 
"LOZADA TANNYA", "GARCIA BEREGUIAIN MIGUEL ANGEL", "GARCIA MIGUEL ANGEL"
), Val1 = c("a", "c", "a", "b", "a", "d", "a", "a", "a", "b", 
"a", "c", "a", "a", "a", "b", "a"), Val2 = c("b", "b", "a", "b", 
"a", "c", "c", "a", "a", "b", "b", "c", "a", "b", "a", "b", "a"
)), class = "data.frame", row.names = c(NA, -17L))
1

There are 1 best solutions below

0
On

Here is a suggestion. It is not the eaxact solution. But I think it will bring you further:

You will notice that it differs. But You can go through the code line by line and play with max_dist = .2 or max_dist = .2 or try other method etc...

By going through the lines you will see which name of df2 is matched to name df1:

library(dplyr)
library(fuzzyjoin)

fuzzyjoin::stringdist_left_join(x=df1, y=df2, max_dist = .2, 
                                by="name", 
                                method = 'jaccard', 
                                distance_col = "dist") %>%  
  mutate(id = row_number()) %>% 
  group_by(name.x) %>%   
  add_count(name="Number") %>% 
  mutate(Number= ifelse(is.na(dist), 0, Number)) %>% 
  arrange(id) %>% 
  filter(dist == min(dist) | is.na(dist))
  name                             Var Number Val1  Val2 
   <chr>                          <int>  <dbl> <chr> <chr>
 1 RODRIGUEZ PAREDES MARIA BELEN      1      2 a     b    
 2 VALLEJO JANETA PAOLO ALEXANDER     1      2 a     a    
 3 MORALES JADAN DIANA                1      1 a     a    
 4 FREIRE PASPUEL BYRON               1      1 d     c    
 5 ORTIZ PRADO ESTEBAN                1      2 a     a    
 6 HENRIQUEZ TRUJILLO AQUILES         1      2 b     b    
 7 RIVERA OLIVERO ISMAR               1      0 NA    NA   
 8 JARAMILLO TATIANA                  1      2 a     a    
 9 LOZADA TANNYA                      1      1 a     a    
10 GARCIA BEREGUIAIN MIGUEL ANGEL     1      2 b     b