I have two data frames: one ("grny") that is mainly a reference but also has some data in the "yield" column I'm after, and another ("txie") that will have "yield" data with a few NA's for missing data. I want to merge them so that all cells from a row with a common value in "site" are complete.
Where most of the year-by-year data is:
txie<-data.frame (site=c(rep("smithfield",2),rep("belleville",3)),
yield=c((rnorm(4, mean=8)),NA),
year=c(1999:2000,1992:1994),
prim=c(rep("nt",2),rep(NA,3)))
Mostly reference with some year-by-year yield data:
grny<-data.frame (site=c("smithfield","belleville",rep("nashua",3)),
yield=c(rep(NA,2),rnorm(3,mean=9)),
year=c(rep(NA,2),1990:1992),
prim=c(NA,"nt",sample(c("nt","ct"),3,rep=TRUE)),
lat=(c(rnorm(2,mean=45,sd=10),rep(49.1,3))))
What I want:
site yield year prim lib lat
1 smithfield 7.009178 1999 nt 1109 43.61828
2 smithfield 8.472677 2000 nt 1109 43.61828
3 belleville 8.857462 1992 nt 122 74.08792
4 belleville 7.368488 1993 nt 122 74.08792
5 belleville NA 1994 nt 122 74.08792
6 nashua 7.494519 1990 nt 554 49.10000
8 nashua 8.696066 1991 ct 554 49.10000
9 nashua 8.051670 1992 nt 554 49.10000
What I've tried:
rbind.fill(txie,grny) #this appends rows to the correct columns but leaves NA's everywhere because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x,y) merge(txie,grny, by="site", all.y=TRUE), list(txie,grny)) #this merges by rows but creates new variables from x and y.
merge(x = txie, y = grny, by = "site", all = TRUE) #this does the same as the above (new variables from each x and y ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE)#this does similar to above but merges based on the x df (new variables from each x and y ending in .x or .y)
setkey(setDT(grny),site)[txie]# this gives a similar result to the all.x line
For example, with the outer join merge I end up with:
site yield.x year.x prim.x yield.y year.y prim.y lat
1 belleville 6.766628 1992 <NA> NA NA nt 34.92136
2 belleville 6.845789 1993 <NA> NA NA nt 34.92136
3 belleville NA 1994 <NA> NA NA nt 34.92136
4 smithfield 8.841339 1999 nt NA NA <NA> 49.81872
5 smithfield 7.313310 2000 nt NA NA <NA> 49.81872
6 nashua NA NA <NA> 9.173229 1990 ct 49.10000
7 nashua NA NA <NA> 9.196018 1991 nt 49.10000
8 nashua NA NA <NA> 7.336645 1992 ct 49.10000
Stipulations: I would like to keep NA's that were already in the "yield" column (eg. nashua in 1994). Any answers or can someone show me where an example of this kind of merge (with data already in one or more shared columns you're not merging by, each df bringing in new columns except the "by" variable) is?
Thanks!!!
Using the
dplyr
package, you could do afull_join
and then use thecoalesce
function to get the non NA value in the pairs of columnsyield.x
vsyield.y
,prim.x
vsprim.y
etc..