A list of data frames:
my_list <- list(structure(list("_uuid" = c("xxxyz",
"xxxyz", "zzuio", "iiopz"), country = c("USA",
"USA", "Canada", "Switzerland")), class = "data.frame", row.names = c(NA, -4L)),
structure(list("_uuid" = c("xxxyz", "ppuip",
"zzuio"), country = c("USA", "Canada", "Canada")), class = "data.frame", row.names = c(NA,
-3L)))
my_list
[[1]]
_uuid country
1 xxxyz USA
2 xxxyz USA
3 zzuio Canada
4 iiopz Switzerland
[[2]]
_uuid country
1 xxxyz USA
2 ppuip Canada
3 zzuio Canada
I want to remove duplicated rows both within and between the data frames stored in that list.
This works to remove duplicates within each data frame:
my_list <- lapply(my_list, function(z) z[!duplicated(z[["_uuid"]]),])
my_list
[[1]]
_uuid country
1 xxxyz USA
3 zzuio Canada
4 iiopz Switzerland
[[2]]
_uuid country
1 xxxyz USA
2 ppuip Canada
3 zzuio Canada
But there are still duplicates between data frames. I want to remove them all, with the following desired output:
[[1]]
_uuid country
iiopz Switzerland
[[2]]
_uuid country
xxxyz USA
zzuio Canada
ppuip Canada
Notes:
- I want to eliminate duplicates on _uuid (other variables can be duplicated)
- I need a solution that does not require merging the data frames before checking for duplicates
- If possible, I wish to retain the last observation. For example, in the desired output above, "zzuio Canada" existed in both df, but was kept in the last df only, that is, df 2.
- I have more than 100 dfs, with variable names that don't necessarily match between dfs. That said, the id is always called "_uuid"
- I need to reassign the result to the same object (in the case above, my_list)
Here's a shot, starting with a reduction and then Map-applying it to the original list of frames.
The reduction gives us the ids from frames further along in my_list that we need to remove "here". For the last frame, there are no ids to remove, because no frames follow it; for the first frame, there are 3 ids that are seen later in the list, so they need to be removed from "this" (first) frame. (Side note: perhaps the name previous_ids is a misnomer ...)
With this, removing the duplicates is a single Map call over the frames and their ids to remove.
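A minimal sketch of both steps, rebuilding the example data from the question so that it runs on its own. The names later_ids, ids, and rmv are illustrative, and using fromLast = TRUE inside each frame is an assumption made so that the last observation is the one retained.

```r
# Example data from the question; check.names = FALSE keeps "_uuid" as-is.
my_list <- list(
  data.frame(`_uuid` = c("xxxyz", "xxxyz", "zzuio", "iiopz"),
             country = c("USA", "USA", "Canada", "Switzerland"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(`_uuid` = c("xxxyz", "ppuip", "zzuio"),
             country = c("USA", "Canada", "Canada"),
             check.names = FALSE, stringsAsFactors = FALSE)
)

# Step 1: reduce from the right, accumulating the union of ids.
# With init = character(0), the accumulated list has n + 1 elements;
# dropping the first one aligns it with my_list, so that later_ids[[i]]
# holds every id found in the frames after frame i.
ids <- lapply(my_list, function(dat) dat[["_uuid"]])
later_ids <- Reduce(union, ids, character(0),
                    accumulate = TRUE, right = TRUE)[-1]

# Step 2: for each frame, drop rows whose id appears in a later frame,
# and drop within-frame duplicates, keeping the last occurrence.
my_list <- Map(function(dat, rmv) {
  id <- dat[["_uuid"]]
  keep <- !(id %in% rmv) & !duplicated(id, fromLast = TRUE)
  dat[keep, , drop = FALSE]
}, my_list, later_ids)

# my_list[[1]] now holds only iiopz;
# my_list[[2]] holds xxxyz, ppuip, zzuio.
my_list
```

Only the id vectors are combined, so the frames themselves are never merged and their other columns do not need to match.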
Using your updated data, this still works. The only thing is that, since your ID field is non-standard (it starts with a _, and R does not like that), we need to either use backticks, as in dat$`_uuid`, or use [[, as in dat[["_uuid"]].
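A small illustration of the two access styles, using a hypothetical two-row frame named dat:

```r
# Hypothetical frame; check.names = FALSE stops data.frame() from
# renaming the non-syntactic column "_uuid" (it would become "X_uuid").
dat <- data.frame(`_uuid` = c("a", "b"), check.names = FALSE,
                  stringsAsFactors = FALSE)

by_backtick <- dat$`_uuid`    # backticks quote the non-syntactic name
by_bracket  <- dat[["_uuid"]] # [[ takes the name as an ordinary string

# Writing dat$_uuid without backticks is a syntax error.
identical(by_backtick, by_bracket)
```

The [[ form is often more convenient inside functions, since the column name can be passed around as a plain string.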