Master view of multiple dataframes with common columns

163 Views Asked by At

I have three dataframes like below:

df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','B','C','E','I'),col2=c(4,6,8,2,9))
df1 <- data.frame(col1=c('A','D','C','E','I'),col2=c(4,7,8,2,9))

The differences between any two files could be as below:

anti_join(df2, df3)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    B    6
# 2    I    9

anti_join(df3, df2)
# Joining, by = c("col1", "col2")
# [1] col1 col2
# <0 rows> (or 0-length row.names)

anti_join(df1, df2)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    D    7

anti_join(df2, df1)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    B    6

I would like to create a master dataframe with all the values in col1 and col2 specific to each dataframe. If there is no such value present, it should populate NA.

  col1 df1_col2 df2_col2 df3_col2
1    A        4        4        4 
2    B       NA        6       NA  
3    C        8        8        8
4    E        2        2        2 
5    I        9        9       NA
6    D        7       NA       NA

The essence of the above output could be established from the above anti_join commands. However, it does not provide the complete picture at once. Any thoughts on how to achieve this?

Edit: For multiple values in col2 for col1, the output is a little messier. For example, A has values 4, 3.

df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','A','B','C','E','I'),col2=c(4,3,6,8,2,9))
df1 <- data.frame(col1=c('A','A','D','C','E','I'),col2=c(4,3,7,8,2,9))

lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
lst_of_frames %>%
  imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
  reduce(full_join, by = "col1")

It gives the below output.

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        4        3        4
# 3    A        3        4        4
# 4    A        3        3        4
# 5    D        7       NA       NA
# 6    C        8        8        8
# 7    E        2        2        2
# 8    I        9        9       NA
# 9    B       NA        6       NA

The interesting part of the output is:

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        4        3        4
# 3    A        3        4        4
# 4    A        3        3        4

whereas the expected output is:

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        3        3       NA
2

There are 2 best solutions below

8
On BEST ANSWER

Similar to @tamtam's answer, but a little programmatic if you have a dynamic list of frames.

lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
# lst_of_frames <- tibble::lst(df1, df2, df3)    # thanks, @user63230
library(dplyr)
library(purrr)  # imap, reduce
lst_of_frames %>%
  imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
  reduce(full_join, by = "col1")
#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    D        7       NA       NA
# 3    C        8        8        8
# 4    E        2        2        2
# 5    I        9        9       NA
# 6    B       NA        6       NA

It's important (for automatically renaming the columns) that the list-of-frames be a named list; my assumption was the name of the frame variable list(df1=df1), but it could just as easily be list(A=df1) to produce a column named A_col2 in the end.

3
On

You may use the full_join function from the dplyr package.

df_master <- df1 %>% 
  full_join(df2, by = "col1") %>% 
  full_join(df3, by = "col1") %>% 
  select(col1, df1_col2 = col2.x, 
         df2_col2 = col2.y,
         df3_col2 = col2)

  col1 df1_col2 df2_col2 df3_col2
1    A        4        4        4
2    D        7       NA       NA
3    C        8        8        8
4    E        2        2        2
5    I        9        9       NA
6    B       NA        6       NA