I am using R
to work with some dataframes. My issue is related on how to check if values in a variable in a first dataframe match with values in another dataframe. The match is very different to those like merge
or join
. I will introduce my dataframes (dput()
at end):
My first dataframe is df1
. It contains the variable name
which I want to contrast with other variable in a second dataframe. It looks like this:
df1
name
1 JUAN GIRON
2 GINA OLEAS
3 JUAN FERNANDO ELIZAGA
4 MARCO TORRES
5 JUAN PABLO GONZALEZ
6 IRMA GOMEZ
The second dataframe is df2
. It also contains a variable name
which will be used to the contrast with name
from df1
. It looks like this (In a real situation df2
can be very large with more than 1000 rows):
df2
name val
1 JUANA MARQUEZ 1
2 FERNANDO ELIZAGA 2
3 IRMA GOMEZ 3
4 PABLO GONZALEZ 4
5 GINA LUCIO 5
6 MARK TORRES 6
7 LETICIA BLACIO 7
8 JUAN PABLO GIRON BELTRAN 8
I am looking for a way to check if every row of df1
for name
variable is contained or match with any value for name
in df2
. For example, the value JUAN GIRON
after checking with name
from df2
should return, give a value of yes
because it is contained in the string JUAN PABLO GIRON BELTRAN
from df2
. The same case would apply for the other values. In the end I would like to have something like this:
df3
name val
1 JUAN GIRON yes
2 GINA OLEAS no
3 JUAN FERNANDO ELIZAGA yes
4 MARCO TORRES no
5 JUAN PABLO GONZALEZ yes
6 IRMA GOMEZ yes
How can I reach that result? I have tried with grepl()
concatenating the strings using |
but it is not working because some values are returning a yes
match when there is not match.
Also, as data can be large, I would like to have a solution with dplyr
because the comparison is by row so it can be slow. Or any fast solution is welcome. Many thanks!
Data is next:
#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA",
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA,
-6L), class = "data.frame")
#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA",
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES",
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA,
-8L), class = "data.frame")
Here's an approach that uses a regex pattern and handles names that are either length 2 or 3. There's room for improvement, and I'd love to read other answers to this question.
Alternatively, after defining
b
using the loop above:Result: