I am using R to work with some dataframes. My issue is related on how to check if values in a variable in a first dataframe match with values in another dataframe. The match is very different to those like merge or join. I will introduce my dataframes (dput() at end):
My first dataframe is df1. It contains the variable name which I want to contrast with other variable in a second dataframe. It looks like this:
df1
name
1 JUAN GIRON
2 GINA OLEAS
3 JUAN FERNANDO ELIZAGA
4 MARCO TORRES
5 JUAN PABLO GONZALEZ
6 IRMA GOMEZ
The second dataframe is df2. It also contains a variable name which will be used to the contrast with name from df1. It looks like this (In a real situation df2 can be very large with more than 1000 rows):
df2
name val
1 JUANA MARQUEZ 1
2 FERNANDO ELIZAGA 2
3 IRMA GOMEZ 3
4 PABLO GONZALEZ 4
5 GINA LUCIO 5
6 MARK TORRES 6
7 LETICIA BLACIO 7
8 JUAN PABLO GIRON BELTRAN 8
I am looking for a way to check if every row of df1 for name variable is contained or match with any value for name in df2. For example, the value JUAN GIRON after checking with name from df2 should return, give a value of yes because it is contained in the string JUAN PABLO GIRON BELTRAN from df2. The same case would apply for the other values. In the end I would like to have something like this:
df3
name val
1 JUAN GIRON yes
2 GINA OLEAS no
3 JUAN FERNANDO ELIZAGA yes
4 MARCO TORRES no
5 JUAN PABLO GONZALEZ yes
6 IRMA GOMEZ yes
How can I reach that result? I have tried with grepl() concatenating the strings using | but it is not working because some values are returning a yes match when there is not match.
Also, as data can be large, I would like to have a solution with dplyr because the comparison is by row so it can be slow. Or any fast solution is welcome. Many thanks!
Data is next:
#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA",
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA,
-6L), class = "data.frame")
#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA",
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES",
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA,
-8L), class = "data.frame")
Here's an approach that uses a regex pattern and handles names that are either length 2 or 3. There's room for improvement, and I'd love to read other answers to this question.
Alternatively, after defining
busing the loop above:Result: