R: How to handle missing values in a fuzzy matching setting with fedmatch?

I have a problem with missing values in a fuzzy matching setting. I use the merge_plus function from the fedmatch package to find, for each observation in the base dataset (df1), one appropriate observation from a large dataset (df2). The match between the two datasets is based on string distance and uses multiple variables.

  • If missing values are stored as empty strings in both datasets, two missing values are evaluated as a perfect match.
  • If missing values are stored as NA in df1, the "stringdist" compare type does not calculate a matching score, so the first observation from df2 is matched automatically.
  • If missing values are stored as NA in df2, that row is not considered as a matching partner for df1, even if the other variables agree perfectly.
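
The first two behaviours can be reproduced outside fedmatch. The following is a minimal sketch using base R's `utils::adist` (Levenshtein distance); this metric is an assumption chosen for illustration and is not necessarily the one fedmatch uses internally.

```r
# Base R illustration (not fedmatch): how empty strings and NA behave
# under a string-distance comparison.

# Two empty strings: the distance is 0, which is indistinguishable
# from a perfect match between two real values.
d_empty <- drop(utils::adist("", ""))

# NA on one side: the distance itself is NA, so no score is available.
d_na <- drop(utils::adist(NA_character_, "LAKE DRIVE"))

print(d_empty)
print(d_na)
```

So neither encoding expresses "no information": the empty string looks like agreement, and NA removes the comparison altogether.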

The code below demonstrates the problem. Is there a better way to handle missing values?

Thanks for your help!

#
#packages-----------------------------------------------------------------------
if (!require(fedmatch)) {install.packages("fedmatch")}
library(fedmatch)
#
#
#
#######################
#PROBLEM: MISSINGS#####
#######################
#
#
#example data for problem ------------------------------------------------------
firm <- c('ABC cORP','INT PHARMA INC',NA)
firmid <- c('HR373736', 'HR373829', NA)
id1 <- c(1,2,3)
address <- c('STATE STREET','FIRST STREET','LAKE DRIVE')
df1 <- data.frame(firm, firmid, id1, address)
#
#
firm <- c('ABC cORP','INT PHARMA INC','BANK LOCAL')
firmid <- c('HR373736', 'HR373829', 'HR38493')
id2 <- c(1,2,3)
address <- c('STATE STREET', NA,'LAKE DRIVE')
df2 <- data.frame(firm, firmid, id2, address)
#
#
#fuzzy-matching-----------------------------------------------------------------
#number of threads: all cores but one (requires the benchmarkme package)
x <- benchmarkme::get_cpu()
threads <- max((x$no_of_cores - 1), 1)
#
fuzzy_result <- merge_plus(
  data1 = df1,                                                    ##dataset 1 for matching
  data2 = df2,                                                    ##dataset 2 for matching
  match_type = "multivar",                                        ##matching on multiple variables
  by = c("firm", "firmid", "address"),                            ##variables for fuzzy-match
  suffixes = c("_1", "_2"),
  unique_key_1 = "id1",                                           ##id-variable for dataset 1
  unique_key_2 = "id2",                                           ##id-variable for dataset 2
  multivar_settings = build_multivar_settings(
    compare_type = c("stringdist", "indicator", "stringdist"),    ##compare type per variable in `by`
    wgts = c(.4, .3, .3), nthread = threads))                     ##weights per variable in `by`
#
#
#problem-review-----------------------------------------------------------------
result <- fuzzy_result$matches
#
#
#
#
#End
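
One direction I have considered is to score candidate pairs myself and treat a missing field as "no evidence" instead of as a match or a non-match. The sketch below is base R only and is not fedmatch functionality: `field_sim` and `na_aware_score` are hypothetical helpers, the similarity is derived from `utils::adist` (an assumption, not fedmatch's metric), and the weights of missing fields are dropped with the remaining weights renormalised.

```r
# Hypothetical helper (not part of fedmatch): similarity in [0, 1] based on
# base R's Levenshtein distance. Returns NA if either value is NA or empty.
field_sim <- function(a, b) {
  if (is.na(a) || is.na(b) || a == "" || b == "") return(NA_real_)
  d <- drop(utils::adist(a, b))
  1 - d / max(nchar(a), nchar(b))
}

# Hypothetical helper: weighted score over the fields observed on both sides.
# Weights of missing fields are dropped; the rest are renormalised.
na_aware_score <- function(row1, row2, vars, wgts) {
  a <- vapply(vars, function(v) as.character(row1[[v]]), character(1))
  b <- vapply(vars, function(v) as.character(row2[[v]]), character(1))
  sims <- mapply(field_sim, a, b)
  ok <- !is.na(sims)
  if (!any(ok)) return(NA_real_)
  sum(sims[ok] * wgts[ok]) / sum(wgts[ok])
}

# Small example data in the spirit of df1 and df2 above
ex1 <- data.frame(firm = c("ABC cORP", NA), firmid = c("HR373736", NA),
                  address = c("STATE STREET", "LAKE DRIVE"),
                  stringsAsFactors = FALSE)
ex2 <- data.frame(firm = c("ABC cORP", "BANK LOCAL"),
                  firmid = c("HR373736", "HR38493"),
                  address = c("STATE STREET", "LAKE DRIVE"),
                  stringsAsFactors = FALSE)
vars <- c("firm", "firmid", "address")
wgts <- c(.4, .3, .3)

s_full  <- na_aware_score(ex1[1, ], ex2[1, ], vars, wgts)  # all fields compared
s_part  <- na_aware_score(ex1[2, ], ex2[2, ], vars, wgts)  # only address compared
s_cross <- na_aware_score(ex1[2, ], ex2[1, ], vars, wgts)  # addresses differ
```

This scores one pair at a time, so it would need to be applied to every candidate pair, which may be too slow for a large df2. A score built from a single observed field is also weaker evidence than one built from all three, which the renormalised score does not show.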