Merging datasets by name when names have different formats in R

836 Views Asked by At

I have two different dataframes in R that I am trying to merge together. One is just a set of names and the other is a set of names with corresponding information about each person.

So say I want to take this first dataframe:

Name
1. Blow, Joe
2. Smith, John
3. Jones, Tom 
etc....

and merge it to this one:

   DonorName  CandidateName DonationAmount CandidateParty
1   blow joe Bush, George W          3,000     Republican
2   guy some  Obama, Barack          5,000       Democrat
3 smith john    Reid, Harry          4,000       Democrat

such that I'd have a new list that includes only people on my first list with the information from the second. Were the two "Name" values formatted in the same way, I could just use merge(), but would there be a way to somehow use agrep() or pmatch() to do this?

Also, the 2nd dataframe I'm working with has about 25 million rows in it and 6 columns, so would making a for loop be the fastest way to go about this?

Reproducible versions of the example data:

first <- data.frame(Name=c("Blow, Joe","Smith, John","Jones, Tom"),
         stringsAsFactors=FALSE)

second <- read.csv(text="
DonorName|CandidateName|DonationAmount|CandidateParty
blow joe|Bush, George W|3,000|Republican
guy some|Obama, Barack|5,000|Democrat
smith john|Reid, Harry|4,000|Democrat",header=TRUE,sep="|",
stringsAsFactors=FALSE)
2

There are 2 best solutions below

5
On BEST ANSWER

solution:

first$DonorName <- gsub(", "," ",tolower(first$Name),fixed=TRUE)

require(dplyr)

result <- inner_join(first,second,by="DonorName")

will give you what you need if the data is as you've provided it.

result
         Name  DonorName  CandidateName DonationAmount CandidateParty
1   Blow, Joe   blow joe Bush, George W          3,000     Republican
2 Smith, John smith john    Reid, Harry          4,000       Democrat

"fast way to go about this"

The dplyr method as above:

f_dplyr <- function(left,right){
   left$DonorName <- gsub(", "," ",tolower(left$Name),fixed=TRUE)
   inner_join(left,right,by="DonorName")
}

data.table method, setting key on first.

f_dt <- function(left,right){
   left[,DonorName :=  gsub(", "," ",tolower(Name),fixed=TRUE)]
   setkey(left,DonorName)
   left[right,nomatch=0L]
}

data.table method, setting both keys.

f_dt2 <- function(left,right){
   left[,DonorName :=  gsub(", "," ",tolower(Name),fixed=TRUE)]
   setkey(left,DonorName)
   setkey(right,DonorName)
   left[right,nomatch=0L]
}

base method relying on sapply:

f_base <- function(){
  second[second$DonorName %in%
  sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]
}

let's make second df a bit more realistic at 1M obs for a fairish comparision:

second <- cbind(second[rep(1:3,1000000),],data.frame(varn= 1:1000000))
left <- as.data.table(first)
right <- as.data.table(second)

library(microbenchmark)

microbenchmark(
          f_base(),
          f_dplyr(first,second),
          f_dt(left,right),
          f_dt2(left,right),
          times=20)

And we get:

Unit: milliseconds
                   expr       min        lq    median        uq       max neval
               f_base() 2880.6152 3031.0345 3097.3776 3185.7903 3904.4649    20
 f_dplyr(first, second)  292.8271  362.7379  454.6864  533.9147  774.1897    20
      f_dt(left, right)  489.6288  531.4152  605.4148  788.9724 1340.0016    20
     f_dt2(left, right)  472.3126  515.4398  552.8019  659.7249  901.8133    20

On my machine, with this ?contrived example we gain about 2.5 seconds over base methods. sapply simplifies and doesn't scale very well in my experience... this gap likely gets bigger when you increase the number of unique groups in first and second.

Please feel free to edit if you come up with more efficient use. I don't pretend to know, but I always try to learn something.

0
On

Without dplyr:

second[second$DonorName %in%
  sapply(tolower(first[[1]]), gsub, pattern = ",", replacement = "", fixed = TRUE), ]

Result:

#     DonorName  CandidateName DonationAmount CandidateParty
# 1   blow joe  Bush, George W          3,000     Republican
# 3 smith john     Reid, Harry          4,000       Democrat