How to check if values in first dataframe are contained or match values in another dataframe

1k Views Asked by At

I am using R to work with some dataframes. My issue is related on how to check if values in a variable in a first dataframe match with values in another dataframe. The match is very different to those like merge or join. I will introduce my dataframes (dput() at end):

My first dataframe is df1. It contains the variable name which I want to contrast with other variable in a second dataframe. It looks like this:

df1
                   name
1            JUAN GIRON
2            GINA OLEAS
3 JUAN FERNANDO ELIZAGA
4          MARCO TORRES
5   JUAN PABLO GONZALEZ
6            IRMA GOMEZ

The second dataframe is df2. It also contains a variable name which will be used to the contrast with name from df1. It looks like this (In a real situation df2 can be very large with more than 1000 rows):

df2
                      name val
1            JUANA MARQUEZ   1
2         FERNANDO ELIZAGA   2
3               IRMA GOMEZ   3
4           PABLO GONZALEZ   4
5               GINA LUCIO   5
6              MARK TORRES   6
7           LETICIA BLACIO   7
8 JUAN PABLO GIRON BELTRAN   8

I am looking for a way to check if every row of df1 for name variable is contained or match with any value for name in df2. For example, the value JUAN GIRON after checking with name from df2 should return, give a value of yes because it is contained in the string JUAN PABLO GIRON BELTRAN from df2. The same case would apply for the other values. In the end I would like to have something like this:

df3
                   name val
1            JUAN GIRON yes
2            GINA OLEAS  no
3 JUAN FERNANDO ELIZAGA yes
4          MARCO TORRES  no
5   JUAN PABLO GONZALEZ yes
6            IRMA GOMEZ yes 

How can I reach that result? I have tried with grepl() concatenating the strings using | but it is not working because some values are returning a yes match when there is not match.

Also, as data can be large, I would like to have a solution with dplyr because the comparison is by row so it can be slow. Or any fast solution is welcome. Many thanks!

Data is next:

#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA", 
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA, 
-6L), class = "data.frame")

#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA", 
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES", 
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA, 
-8L), class = "data.frame")
9

There are 9 best solutions below

3
On

Here's an approach that uses a regex pattern and handles names that are either length 2 or 3. There's room for improvement, and I'd love to read other answers to this question.

# Input
a <- strsplit(df2$name, " ")
# Output
b <- c()

# Define regex pattern
for(i in 1:length(a)){
  if(length(a[[i]]) == 3){
    temp <- paste0(
        a[[i]][1], " ", a[[i]][2], "|",
        a[[i]][1], " ", a[[i]][3], "|",
        a[[i]][2], " ", a[[i]][3])
  } else if(length(a[[i]] == 2)){
    temp <- paste(a[[i]], collapse = " ")
  } else {
    stop("Length of split name was not 2 or 3")
  }
  b <- c(b, temp)
}

df1$val <- grepl(paste(b, collapse = "|"), df1$name)

Alternatively, after defining b using the loop above:

library(dplyr)
patt <- paste(b, collapse = "|")
df1 %>%
    mutate(val = grepl(patt, name))

Result:

> df1
                   name    val
1            JUAN GIRON    TRUE
2            GINA OLEAS   FALSE
3 JUAN FERNANDO ELIZAGA    TRUE
4          MARCO TORRES   FALSE
5   JUAN PABLO GONZALEZ    TRUE
6            IRMA GOMEZ    TRUE
3
On

Perhaps we can do like this

df1 %>%
    mutate(val = c("no", "yes")[1 + (rowSums(
        outer(
            strsplit(name, "\\s+"),
            strsplit(df2$name, "\\s+"),
            Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
        )
    ) > 0)])

which gives

                   name val
1            JUAN GIRON yes
2            GINA OLEAS  no
3 JUAN FERNANDO ELIZAGA yes
4          MARCO TORRES  no
5   JUAN PABLO GONZALEZ yes
6            IRMA GOMEZ yes
0
On

The answer by ThomasIsCoding is great. But using outer() is quite memory consuming and not parallelizable. The following solution uses nested map()'s. Also, the furrr package is used to parallelize the outer map().

A benchmark with a much larger df2 shows that both nesting and parallelization yield a substantial speed-up for about double the speed in total.

Data and Packages

library(tidyverse)
library(furrr)

#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA", 
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA, 
-6L), class = "data.frame")

#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA", 
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES", 
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA, 
-8L), class = "data.frame")

Demo on small data set

plan(multisession, workers = 8) # 8 for my quad-core with hyperthreading

n2 <- df2$name |> 
  str_split("\\s+")

df1 |> 
  mutate(val = name |> 
           str_split("\\s+") |> 
           future_map_lgl(\(n1e) map_lgl(n2, 
                                 \(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
                                 ) |> any()
                      ) |> 
           factor(labels = c("no", "yes"))
         )
#>                    name val
#> 1            JUAN GIRON yes
#> 2            GINA OLEAS  no
#> 3 JUAN FERNANDO ELIZAGA yes
#> 4          MARCO TORRES  no
#> 5   JUAN PABLO GONZALEZ yes
#> 6            IRMA GOMEZ yes

Benchmark Code

# Make df2 much larger
df2xl <- df2 |> 
  list() |> 
  rep(10000) |> 
  bind_rows()

bench::mark(
  Thomas = df1 %>%
    mutate(val = c("no", "yes")[1 + (rowSums(
        outer(
            strsplit(name, "\\s+"),
            strsplit(df2xl$name, "\\s+"),
            Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
        )
    ) > 0)]),
  nested_map = {
    n2 <- df2xl$name |> 
      str_split("\\s+")
    
    df1 |>
      mutate(val = name |>
               str_split("\\s+") |>
               map_lgl(\(n1e) map_lgl(n2,
                                     \(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
                                     ) |> any()
                          ) |>
               factor(labels = c("no", "yes"))
             )
    },
  parallel_nested_map = {
    n2 <- df2xl$name |> 
      str_split("\\s+")
    
    df1 |> 
      mutate(val = name |> 
               str_split("\\s+") |> 
               future_map_lgl(\(n1e) map_lgl(n2, 
                                     \(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
                                     ) |> any()
                          ) |> 
               factor(labels = c("no", "yes"))
             )
    },
  check = F,
  min_iterations = 10,
  filter_gc = F
)

Benchmark Result

#> # A tibble: 3 × 6
#>   expression               min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Thomas                 2.48s    2.53s     0.396   21.51MB     8.31
#> 2 nested_map             1.72s    1.73s     0.563    2.46MB     8.28
#> 3 parallel_nested_map    1.07s    1.22s     0.827    2.86MB     2.56

Created on 2022-04-12 by the reprex package (v2.0.1)

0
On

I'm going for robust here, rather than speed or elegance. I am certain that it could be cleaned up and sped up, but this addresses all of the discussed concerns including the added specification in the comments about handling family and given name matches separately.

First, here is updated data that has the family and given names split:

df1_split <-
  tibble(
    Given = c("JUAN", "GINA", "JUAN FERNANDO"
              , "MARCO", "JUAN PABLO", "IRMA"
              , "JUAN", "JUAN CARLOS")
    , Family = c("GIRON", "OLEAS", "ELIZAGA"
                 , "TORRES", "GONZALEZ", "GOMEZ"
                 , "GOMEZ", "MARTINEZ")
  )
  
df2_split <-
  tibble(
    Given = c("JUANA", "FERNANDO", 
              "IRMA", "PABLO", "GINA", "MARK", 
              "LETICIA", "JUAN PABLO"
              , "FERNANDO CARLOS"
              , "JUAN FERNANDO")
    , Family = c("MARQUEZ", "ELIZAGA", 
                 "GOMEZ", "GONZALEZ", "LUCIO", "TORRES", 
                 "BLACIO", "GIRON BELTRAN"
                 , "MARTINEZ"
                 , "ELIZAGA")
  )

Note that I added a couple of names to highlight some of the problems with the doubled names.

Then, this function will check any name set. It handles the names differently depending on whether they are single (e.g. "Juan") or dual-named (e.g., "Juan Carlos"). For single names to be checked, it just looks if they are present at all. For dual-named names, it checks that they are in the same order if the name to check against has two names as well or if either name is present if the name to check against only has one name.

check_names <- function(to_check, against){
  split_against <-
    str_split(against, " ")
  
  str_split(to_check, " ") %>%
    lapply(function(this_name){
      if(length(this_name) == 1){
        to_ret <-
          sapply(split_against, function(this_against){
            any(this_name == this_against)
          }) %>%
          which
      } else if(length(this_name) == 2){
        to_ret <-
          sapply(split_against, function(this_against){
            if(length(this_against) ==  2){
              return(all(this_against == this_name))
            } else if(length(this_against) ==  1){
              return(any(this_against == this_name))
            } else{
              stop("Names (against) cannot have three words: "
                   , this_against)
            }
          }) %>%
          which
      } else{
        stop("Names (to_check) cannot have three words: "
             , this_name)
      }
    })
  
}

Then, we wrap that function together to pass the family and given names separately. The results are then checked to see if there are any indices that are matches for both the family and given names.

check_both_simple <- function(to_check_given, to_check_family, against_given, against_family){
  checked_given <- check_names(to_check_given, against_given)
  checked_family <- check_names(to_check_family, against_family)
  
  valid_matches <- lapply(1:length(checked_given), function(idx){
    checked_given[[idx]][checked_given[[idx]] %in% checked_family[[idx]]]
  })
  
  to_return <-
    ifelse(sapply(valid_matches, length) > 0
           , "yes"
           , "no")
  
  return(to_return)
  
}

We can then use this in a call from mutate to add the column:

df1_split %>%
  mutate(Match = check_both_simple(Given, Family
                                   , df2_split$Given
                                   , df2_split$Family))

Returns:

# A tibble: 8 × 3
  Given         Family   Match
  <chr>         <chr>    <chr>
1 JUAN          GIRON    yes  
2 GINA          OLEAS    no   
3 JUAN FERNANDO ELIZAGA  yes  
4 MARCO         TORRES   no   
5 JUAN PABLO    GONZALEZ yes  
6 IRMA          GOMEZ    yes  
7 JUAN          GOMEZ    no   
8 JUAN CARLOS   MARTINEZ no   

And should handle all of the odd edge cases discussed in the comments as well.

The nice thing about this approach (and part of why I built it this robustly in the first place), is that you can also set the function to return the matching indices.

check_both_idx <- function(to_check_given, to_check_family, against_given, against_family){
  checked_given <- check_names(to_check_given, against_given)
  checked_family <- check_names(to_check_family, against_family)
  
  valid_matches <- lapply(1:length(checked_given), function(idx){
    checked_given[[idx]][checked_given[[idx]] %in% checked_family[[idx]]]
  })
  
  return(valid_matches)
  
}

Using this, you can actually pull the matches that are found and manually inspect them. This would allow you to identify any additional edge cases where you didn't agree with a found match or where one match may be markedly better than another.

df1_split %>%
  mutate(Match_idx = check_both_idx(Given, Family
                                    , df2_split$Given
                                    , df2_split$Family)
         , Matches = sapply(Match_idx, function(this_idx_set){
           paste(df2_split$Given[this_idx_set]
                 , df2_split$Family[this_idx_set]) %>%
             paste(collapse = "; ")
         })
         , Match = ifelse(sapply(Match_idx, length) > 0
                          , "yes"
                          , "no")
         , Match_idx = sapply(Match_idx, paste, collapse = "; ")
         )

Returns:

# A tibble: 8 × 5
  Given         Family   Match_idx Matches                                   Match
  <chr>         <chr>    <chr>     <chr>                                     <chr>
1 JUAN          GIRON    "8"       "JUAN PABLO GIRON BELTRAN"                yes  
2 GINA          OLEAS    ""        ""                                        no   
3 JUAN FERNANDO ELIZAGA  "2; 10"   "FERNANDO ELIZAGA; JUAN FERNANDO ELIZAGA" yes  
4 MARCO         TORRES   ""        ""                                        no   
5 JUAN PABLO    GONZALEZ "4"       "PABLO GONZALEZ"                          yes  
6 IRMA          GOMEZ    "3"       "IRMA GOMEZ"                              yes  
7 JUAN          GOMEZ    ""        ""                                        no   
8 JUAN CARLOS   MARTINEZ ""        ""                                        no   

Editing to add: The following two sets present some tricky examples that would currently be decided incorrectly by the other answers. These examples arose from the discussion in the comments to clarify what should match.

tricky_1 <-
  tibble(
    Given = c("JUAN", "JUANITA GINA"
              , "JUAN CARLO", "GOMEZ")
    , Family = c("GIRON BELTRAN", "OLEAS"
                 , "MARTINEZ", "IRMA")
  )


tricky_2 <-
  tibble(
    Given = c("JUAN PABLO", "GINA"
              , "CARLO JUAN", "IRMA")
    , Family = c("GIRON", "OLEAS GIRON"
                 , "MARTINEZ", "GOMEZ")
  )

We can view them side by side like so:

bind_cols(
  tricky_1 %>%
    setNames(paste0("toCheck_", names(.)))
  , tricky_2 %>%
    setNames(paste0("against_", names(.)))
) %>%
  mutate(shouldMatch = c("yes", "yes", "no", "no"))

returns:

# A tibble: 4 × 5
  toCheck_Given toCheck_Family against_Given against_Family shouldMatch
  <chr>         <chr>          <chr>         <chr>          <chr>      
1 JUAN          GIRON BELTRAN  JUAN PABLO    GIRON          yes        
2 JUANITA GINA  OLEAS          GINA          OLEAS GIRON    yes        
3 JUAN CARLO    MARTINEZ       CARLO JUAN    MARTINEZ       no         
4 GOMEZ         IRMA           IRMA          GOMEZ          no    

The first two should match because the family and given names each have a 1-2 match in each direction. However, that means that neither the name to check or the name to check against is completely contained in the other. The third shares all of the components, but I believe that "JUAN CARLO" should not match "CARLO JUAN." The fourth has the family and given names reversed, so shouldn't return a match.

The code from my answer handles these cases:

tricky_1 %>%
  mutate(Match_idx = check_both_idx(Given, Family
                                    , tricky_2$Given
                                    , tricky_2$Family)
         , Matches = sapply(Match_idx, function(this_idx_set){
           paste(tricky_2$Given[this_idx_set]
                 , tricky_2$Family[this_idx_set]) %>%
             paste(collapse = "; ")
         })
         , Match = ifelse(sapply(Match_idx, length) > 0
                          , "yes"
                          , "no")
         , Match_idx = sapply(Match_idx, paste, collapse = "; ")
  )

Returns:

# A tibble: 4 × 5
  Given        Family        Match_idx Matches            Match
  <chr>        <chr>         <chr>     <chr>              <chr>
1 JUAN         GIRON BELTRAN "1"       "JUAN PABLO GIRON" yes  
2 JUANITA GINA OLEAS         "2"       "GINA OLEAS GIRON" yes  
3 JUAN CARLO   MARTINEZ      ""        ""                 no   
4 GOMEZ        IRMA          ""        ""                 no  
0
On
library(data.table)
library(stringi)
library(purrr)

setDT(df1)
df1[,val := fifelse(map_lgl(stri_replace_all_fixed(name, " ", "|"), ~any(stri_count_regex(..df2$name, .x) >= 2)), "yes", "no")][]
#>                     name val
#> 1:            JUAN GIRON yes
#> 2:            GINA OLEAS  no
#> 3: JUAN FERNANDO ELIZAGA yes
#> 4:          MARCO TORRES  no
#> 5:   JUAN PABLO GONZALEZ yes
#> 6:            IRMA GOMEZ yes

This creates a vector of partial match counts (needles) per name in df2 (haystack).

[[1]]
[1] 1 0 0 0 0 0 0 2

[[2]]
[1] 0 0 0 0 1 0 0 0

[[3]]
[1] 1 2 0 0 0 0 0 1

[[4]]
[1] 0 0 0 0 0 1 0 0

[[5]]
[1] 1 0 0 2 0 0 0 2

[[6]]
[1] 0 0 2 0 0 0 0 0

I've set the number of minimum matches that result in a "yes" to 2, but this number can be tweaked (given we don't know which part is a family name or given name).

0
On
df2_flat <- df2$name |> stringr::str_split(" ") |> purrr::flatten()                                                                                                                                                                                                 

df1 |> 
    mutate(splitnames = stringr::str_split(name, " ")) |> 
    rowwise() |>
    mutate(val = all(splitnames %in% df2_flat)) |>
    select(-splitnames)

Produces the list you specified

0
On

In case the order of the names does not change .* could be inserted between the names and then use grep in both directions (df1$names in df2$names and df2$names in df1$names) and combine them with or |.

transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
       \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
       Reduce(\(y, x) y | grepl(x, df1$name), gsub(" +", ".*", df2$name), FALSE))])
#                   name val
#1            JUAN GIRON yes
#2            GINA OLEAS  no
#3 JUAN FERNANDO ELIZAGA yes
#4          MARCO TORRES  no
#5   JUAN PABLO GONZALEZ yes
#6            IRMA GOMEZ yes

In case no partly match of the names are allowed surround the names with \b.

transform(df1, val = c("no", "yes")[1+(sapply(
                        gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df1$name),
                        \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
   Reduce(\(y, x) y | grepl(x, df1$name),
     gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df2$name), FALSE))])

In case the order can chaange make a positive look ahead by placing the name in (?=.*NAMME) or also surround the name with \b (?=.*\\bNAME\\b).

transform(df1, val = c("no", "yes")[1+(sapply(
            gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df1$name),
         \(x) any(grep(x, df2$name, perl=TRUE)), USE.NAMES = FALSE) |
         Reduce(\(y, x) y | grepl(x, df1$name, perl=TRUE),
         gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df2$name), FALSE))] )

Its also possible to use agrepl and allow deletions which will be similar to the version assuming that the order of the names does not change and part matches of the name are allowed.

transform(df1, val = c("no", "yes")[1+(
  sapply(df1$name, \(x) any(agrepl(x, df2$name,
       list(cost=99, insertions=0, deletions=99, substitutions=0)))) |
  Reduce(\(y, x) y | agrepl(x, df1$name, list(cost=99, insertions=0,
       deletions=99, substitutions=0)), df2$name, FALSE))])

Another option can be the usage of look up tables:

s1 <- strsplit(df1$name, " ", TRUE)
lup1 <- list2env(split(rep(seq_along(s1), lengths(s1)), unlist(s1)))
s2 <- strsplit(df2$name, " ", TRUE)
lup2 <- list2env(split(rep(seq_along(s2), lengths(s2)), unlist(s2)))
`[<-`(sapply(s1, \(x) any(Reduce(intersect, mget(x, lup2, ifnotfound =
    list(NULL))))), unlist(lapply(s2, \(x) Reduce(intersect, mget(x, lup1,
       ifnotfound = list(NULL))))), TRUE)
#[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE

Benchmark:

Its also possible to limit the comparisons only to those which didn't have a match (GKi1b) where maybe the usage of indices using which instead of using the logical vector twice could fuhrer improve and making an exit of the loop in case all have a hit. In case the names are not unique use unique on the names.

library(dplyr)
bench::mark(
  Thomas = df1 %>%
    mutate(val = c("no", "yes")[1 + (rowSums(
        outer(
            strsplit(name, "\\s+"),
            strsplit(df2$name, "\\s+"),
            Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
        )
    ) > 0)]),
  GKi1 = transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
       \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
       Reduce(\(y, x) y | grepl(x, df1$name), gsub(" +", ".*", df2$name), FALSE))]),
  GKi1b = transform(df1, val = c("no", "yes")[1 +
    Reduce(\(i, x) `[<-`(i, !i, grepl(x, df1$name[!i])), gsub(" +", ".*",
      df2$name), sapply(gsub(" +", ".*", df1$name), \(x) any(grep(x, df2$name)),
        USE.NAMES = FALSE)) ]),
  GKi1c = transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
       \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
       grepl(paste(gsub(" +", ".*", df2$name), collapse = "|"), df1$name) )]),
  GKi2 = transform(df1, val = c("no", "yes")[1+(sapply(
                        gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df1$name),
                        \(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
   Reduce(\(y, x) y | grepl(x, df1$name),
   gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df2$name), FALSE))]),
  GKi3 = transform(df1, val = c("no", "yes")[1+(sapply(
            gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df1$name),
         \(x) any(grep(x, df2$name, perl=TRUE)), USE.NAMES = FALSE) |
         Reduce(\(y, x) y | grepl(x, df1$name, perl=TRUE),
         gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df2$name), FALSE))] ),
  GKi4 = transform(df1, val = c("no", "yes")[1+(
  sapply(df1$name, \(x) any(agrepl(x, df2$name,
       list(cost=99, insertions=0, deletions=99, substitutions=0)))) |
  Reduce(\(y, x) y | agrepl(x, df1$name, list(cost=99, insertions=0,
                                              deletions=99, substitutions=0)), df2$name, FALSE))]),
  GKi5 = {
    s1 <- strsplit(df1$name, " ", TRUE)
    lup1 <- list2env(split(rep(seq_along(s1), lengths(s1)), unlist(s1)))
    s2 <- strsplit(df2$name, " ", TRUE)
    lup2 <- list2env(split(rep(seq_along(s2), lengths(s2)), unlist(s2)))
    transform(df1, val = c("no", "yes")[1+`[<-`(sapply(s1, \(x) any(Reduce(base::intersect, mget(x, lup2, ifnotfound =
    list(NULL))))), unlist(lapply(s2, \(x) Reduce(base::intersect, mget(x, lup1,
       ifnotfound = list(NULL))))), TRUE)])
  }
)
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Thomas        863µs  894µs      919.    4.08KB    21.4    429    10      467ms
2 GKi1          211µs  218µs     3770.        0B    14.6   1803     7      478ms
3 GKi1b         211µs  226µs     3020.        0B    14.6   1448     7      479ms
4 GKi1c         183µs  200µs     3424.        0B    10.3   1667     5      487ms
5 GKi2          262µs  275µs     2755.        0B    12.4   1336     6      485ms
6 GKi3          391µs  409µs     2010.        0B     9.19   875     4      435ms
7 GKi4          374µs  386µs     2295.        0B    16.5   1110     8      484ms
8 GKi5          272µs  285µs     2570.    2.82KB    21.1   1220    10      475ms

All variants are more than 2 times faster than ThomasIsCoding using one CPU-core.

0
On

How I would go about it

library(tidyverse)

df1 %>% 
  mutate(val = sapply(name, \(n) {
    
    result = strsplit(n, " ")[[1]] %>% 
      sapply(., \(sn) { #loop through each name and look in df2
        sum(grepl(sn, df2$name)) #Could be rewritten to a for-loop to break at the first missed name
      })
    
    if(0 %in% result) {
      return("no") #at least one name was not found
    } else {
      return("yes") #defaults to yes, since this can only be reached if no names have been missed (i.e. all have been found)
    }
  }))
0
On

Adding another solution that I didn't see above:

Basically use either left_join(), anti_join and full_join to match the values in the dataframe

#see which values in df2 match df1
left_join(df1,df2,by="name")

#see which values don't match in either df1 and df2
anti_join(df1,df2,by="name")