I have two datasets that I am trying to merge. They are not complete datasets, so this means that individuals are missing records.
Here is data1
(example is a subset of my real data):
squirrel_id age ageclass trialdate year OFT1 MIS1
10342 1 Y 2008-05-19 2008 0.605 -4.19
10342 2 A 2009-05-31 2009 -1.85 1.14
10342 3 A 2010-05-22 2010 -2.39 2.38
Here is data2
(example is a subset of my real data):
squirrel_id focal_age focal_ageclass focal_date focal_yr PC1 PC2
10342 1 Y 2008-07-14 2008 0.0932 -2.67
10342 3 A 2010-03-13 2010 -2.38 0.216
10342 3 A 2010-04-20 2010 0.0203 1.80
I'm trying to do two things:
- merge these two datasets so that I retain NAs when records are incomplete (i.e.,
data1
has 1 record atage==3
, whiledata2
has 2 records whenage==3
) - consolidate columns to make the dataset more streamlined (i.e., columns with different names in the datasets represent the same things:
age==focal_age
,ageclass==focal_ageclass
,trialnumber==focalseq
,ageclass==focal_ageclass
,year==focal_yr
)
Desired output - I am trying to have a final dataset that looks like this (where for age==3
the data1
record is only shown once, not twice):
squirrel_id age ageclass date year OFT1 MIS1 PC1 PC2
10342 1 Y 2008-05-19 2008 0.605 -4.19 NA NA
10342 1 Y 2008-07-14 2008 NA NA 0.0932 -2.67
10342 2 A 2009-05-31 2009 -1.85 1.14 NA NA
10342 3 A 2010-05-22 2010 -2.39 2.38 NA NA
10342 3 A 2010-03-13 2010 NA NA -2.38 0.216
10342 3 A 2010-04-20 2010 NA NA 0.0203 1.80
I am able to get partway here by doing:
data3<-full_join(data1, data2,
by=c("squirrel_id"="squirrel_id",
"year"="focal_yr",
"age"="focal_age",
"ageclass"="focal_ageclass"))
but this repeats the data1
values for age==3
for both age==3
rows in data2
(instead of just matching the first row only), giving this (not desired) output:
squirrel_id age ageclass trialdate focal_date year OFT1 MIS1 PC1 PC2
10342 1 Y 2008-05-19 2008-07-14 2008 0.605 -4.19 0.0932 -2.67
10342 2 A 2009-05-31 NA 2009 -1.85 1.14 NA NA
10342 3 A 2010-05-22 2010-03-13 2010 -2.39 2.38 -2.38 0.216
10342 3 A 2010-05-22 2010-04-20 2010 -2.39 2.38 0.0203 1.80
Updated Question: How do I have the matching records add NAs for all rows when doing a full_join
? Note that I'd rather a dplyr
solution, as I don't work in data.table
(like the answer to this OP) and I want to retain the rows that don't match (unlike this other OP).
Here is a
data.table
approachsample data
code