I have two different dataframes in R that I am trying to merge together. One is just a set of names and the other is a set of names with corresponding information about each person.
So say I want to take this first dataframe:
Name
1. Blow, Joe
2. Smith, John
3. Jones, Tom
etc....
and merge it to this one:
DonorName CandidateName DonationAmount CandidateParty
1 blow joe Bush, George W 3,000 Republican
2 guy some Obama, Barack 5,000 Democrat
3 smith john Reid, Harry 4,000 Democrat
such that I'd have a new list that includes only people on my first list with the information from the second. Were the two "Name" values formatted in the same way, I could just use merge(), but would there be a way to somehow use agrep() or pmatch() to do this?
Also, the second dataframe I'm working with has about 25 million rows and 6 columns, so would a for loop be the fastest way to go about this?
Reproducible versions of the example data:
first <- data.frame(Name = c("Blow, Joe", "Smith, John", "Jones, Tom"),
                    stringsAsFactors = FALSE)
second <- read.csv(text = "
DonorName|CandidateName|DonationAmount|CandidateParty
blow joe|Bush, George W|3,000|Republican
guy some|Obama, Barack|5,000|Democrat
smith john|Reid, Harry|4,000|Democrat",
                   header = TRUE, sep = "|", stringsAsFactors = FALSE)
solution:
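If the only real mismatch between the two name columns is case and the ", " separator, normalizing first$Name and joining on it, something along these lines,

library(dplyr)

# normalize "Blow, Joe" -> "blow joe" so it lines up with second$DonorName
first$DonorName <- tolower(gsub(", ", " ", first$Name))

inner_join(first, second, by = "DonorName")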
will give you what you need if the data is as you've provided it.
"fast way to go about this"
The dplyr method, as above:
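Something like:

# the same normalize-and-join as in the solution above
dplyr_result <- inner_join(first, second, by = "DonorName")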
The data.table method, setting a key on first:
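Something along these lines:

library(data.table)
setDT(first)
setDT(second)
setkey(first, DonorName)

# second has no key, so its DonorName column is matched against first's key;
# nomatch = 0L drops any donor that isn't in first
dt_keyed_first <- first[second, nomatch = 0L]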
The data.table method, setting both keys:
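Carrying on from the block above:

setkey(first, DonorName)
setkey(second, DonorName)

# with both tables keyed, the keys on each side drive the join
dt_keyed_both <- first[second, nomatch = 0L]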
The base method, relying on sapply:
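One way to write that, looking up each normalized name with which() inside sapply:

# find the row(s) of second matching each name in first, then subset
hits <- sapply(first$DonorName, function(nm) which(second$DonorName == nm))
base_result <- second[unlist(hits), ]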
Let's make the second df a bit more realistic at 1M obs for a fairer comparison:
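For example, padding second with a million rows of junk donors (the filler values here are arbitrary) and timing the approaches with the microbenchmark package:

set.seed(1)
# a million throw-away donor names plus placeholder columns, stacked under second
filler <- data.frame(
  DonorName      = paste(sample(letters, 1e6, replace = TRUE),
                         sample(letters, 1e6, replace = TRUE)),
  CandidateName  = "Other, Someone",
  DonationAmount = "1,000",
  CandidateParty = "Independent",
  stringsAsFactors = FALSE)
second <- rbind(second, filler)

library(microbenchmark)
microbenchmark(
  dplyr = inner_join(first, second, by = "DonorName"),
  dt    = first[second, on = "DonorName", nomatch = 0L],
  base  = second[unlist(sapply(first$DonorName,
                               function(nm) which(second$DonorName == nm))), ],
  times = 10)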
On my machine, with this contrived example, we gain about 2.5 seconds over the base method. sapply simplifies and doesn't scale very well in my experience; this gap likely gets bigger when you increase the number of unique groups in first and second.
Please feel free to edit if you come up with a more efficient approach. I don't pretend to know, but I always try to learn something.