R: Merging 2 dataframes and applying reference data to all rows that match by one level

82 Views Asked by At

I have two data frames: one ("grny") that is mainly a reference but also has some data in the "yield" column I'm after, and another ("txie") that will have "yield" data with a few NA's for missing data. I want to merge them so that all cells from a row with a common value in "site" are complete.

Where most of the year-by-year data is:

txie<-data.frame (site=c(rep("smithfield",2),rep("belleville",3)),
yield=c((rnorm(4, mean=8)),NA),
year=c(1999:2000,1992:1994),
prim=c(rep("nt",2),rep(NA,3)))

Mostly reference with some year-by-year yield data:

grny<-data.frame (site=c("smithfield","belleville",rep("nashua",3)),
yield=c(rep(NA,2),rnorm(3,mean=9)),
year=c(rep(NA,2),1990:1992),
prim=c(NA,"nt",sample(c("nt","ct"),3,rep=TRUE)),
lat=(c(rnorm(2,mean=45,sd=10),rep(49.1,3))))

What I want:

         site    yield year prim  lib      lat
1  smithfield 7.009178 1999   nt 1109     43.61828
2  smithfield 8.472677 2000   nt 1109     43.61828
3  belleville 8.857462 1992   nt 122      74.08792
4  belleville 7.368488 1993   nt 122      74.08792
5  belleville       NA 1994   nt 122      74.08792
6  nashua     7.494519 1990   nt 554      49.10000
8  nashua     8.696066 1991   ct 554      49.10000
9  nashua     8.051670 1992   nt 554      49.10000

What I've tried:

rbind.fill(txie,grny) #this appends rows to the correct columns but leaves NA's everywhere because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x,y) merge(txie,grny, by="site", all.y=TRUE), list(txie,grny)) #this merges by rows but creates new variables from x and y.
merge(x = txie, y = grny, by = "site", all = TRUE) #this does the same as  the above (new variables from each x and y ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE)#this does similar to above but merges based on the x df  (new variables from each x and y ending in .x or .y)
setkey(setDT(grny),site)[txie]# this gives a similar result to the all.x line

For example, with the outer join merge I end up with:

     site  yield.x year.x prim.x  yield.y year.y prim.y      lat
1 belleville 6.766628   1992   <NA>       NA     NA     nt 34.92136
2 belleville 6.845789   1993   <NA>       NA     NA     nt 34.92136
3 belleville       NA   1994   <NA>       NA     NA     nt 34.92136
4 smithfield 8.841339   1999     nt       NA     NA   <NA> 49.81872
5 smithfield 7.313310   2000     nt       NA     NA   <NA> 49.81872
6     nashua       NA     NA   <NA> 9.173229   1990     ct 49.10000
7     nashua       NA     NA   <NA> 9.196018   1991     nt 49.10000
8     nashua       NA     NA   <NA> 7.336645   1992     ct 49.10000

Stipulations: I would like to keep NA's that were already in the "yield" column (eg. nashua in 1994). Any answers or can someone show me where an example of this kind of merge (with data already in one or more shared columns you're not merging by, each df bringing in new columns except the "by" variable) is?

Thanks!!!

1

There are 1 best solutions below

1
On BEST ANSWER

Using the dplyr package, you could do a full_join and then use the coalesce function to get the non NA value in the pairs of columns yield.x vs yield.y, prim.x vs prim.y etc..

library(dplyr)
full_join(txie,grny,by="site") %>%
mutate(year = coalesce(year.x,.$year.y),
yield = coalesce(yield.x,yield.y),
prim = coalesce(prim.x,prim.y)) %>% 
select(-c(year.x,year.y,yield.x,yield.y,prim.x,prim.y)) 

        site      lat year     yield prim
1 smithfield 59.71994 1999  7.920844   nt
2 smithfield 59.71994 2000 10.122713   nt
3 belleville 34.93358 1992  8.622351   nt
4 belleville 34.93358 1993  7.360470   nt
5 belleville 34.93358 1994        NA   nt
6     nashua 49.10000 1990  9.083390   ct
7     nashua 49.10000 1991  8.073866   nt
8     nashua 49.10000 1992  8.725625   nt