How to compare nested lists in dataframe per row in R?

95 Views Asked by EmmaV97 At 03 January 2024 at 13:47

I have a data frame in which two columns are list columns, and I want to compare the two lists within each row. For example, I have 3 groups; for each group, certain numbers are expected and certain numbers are observed. I want to compare the Observed column with the Expected column to see which numbers were expected but not observed.

Group	Expected	Observed
A	4:8	c(4, 5, 7)
B	7:12	c(7, 8, 9, 10, 12)
C	6:10	c(6, 8, 10)

I want an extra column called Missing that contains all the values that are in Expected but not in Observed.

Group	Expected	Observed	Missing
A	4:8	c(4, 5, 7)	c(6, 8)
B	7:12	c(7, 8, 9, 10, 12)	11
C	6:11	c(6, 8, 11)	c(7, 9, 10)

I have tried setdiff() and base R subsetting with %in%, since both can find the values that differ between two vectors. However, I cannot get either of them to compare the lists row by row.

df$Missing <- setdiff(df$Expected, df$Observed) 

df$Missing <- df$Expected[!(df$Expected %in% df$Observed)]

Both of these options return the full Expected list for every row. This is the output that I get:

Group	Expected	Observed	Missing
A	4:8	c(4, 5, 7)	4:8
B	7:12	c(7, 8, 9, 10, 12)	7:12
C	6:11	c(6, 8, 11)	6:11

Is there any way that I can compare the two lists (Observed vs. Expected) per group, so I can see which values are missing per group? Thank you in advance for any help!

There are 3 best solutions below

jblood94 On 03 January 2024 at 13:59

With a data.table anti-join:

library(data.table)

dt[
  dt[,.(x = unlist(Expected)), Group][
    !dt[,.(x = unlist(Observed)), Group], on = .(Group, x)
  ][, .(x = .(x)), Group], on = "Group", Missing := x
][]
#>    Group          Expected       Observed  Missing
#> 1:     A         4,5,6,7,8          4,5,7      6,8
#> 2:     B  7, 8, 9,10,11,12  7, 8, 9,10,12       11
#> 3:     C  6, 7, 8, 9,10,11        6, 8,10  7, 9,11
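
For readers unfamiliar with anti-joins, the same idea can be sketched in base R: unnest both list columns into long (Group, x) tables, keep the Expected rows that have no matching row in Observed, then collapse back to one list per group. This is only an illustration of the technique, not the code benchmarked in this answer; the names long_exp, long_obs and missing_long are made up for the example, and the data frame is rebuilt so the block runs on its own.

```r
# Assumed sample data, mirroring the data.table used in this answer
df <- data.frame(Group = c("A", "B", "C"))
df$Expected <- list(4:8, 7:12, 6:11)
df$Observed <- list(c(4L, 5L, 7L), c(7L, 8L, 9L, 10L, 12L), c(6L, 8L, 10L))

# Unnest each list column into a long table with one row per value
long_exp <- data.frame(Group = rep(df$Group, lengths(df$Expected)),
                       x = unlist(df$Expected))
long_obs <- data.frame(Group = rep(df$Group, lengths(df$Observed)),
                       x = unlist(df$Observed))

# Anti-join: keep Expected rows whose (Group, x) pair is absent from Observed
is_missing <- !(paste(long_exp$Group, long_exp$x) %in%
                  paste(long_obs$Group, long_obs$x))
missing_long <- long_exp[is_missing, ]

# Collapse back to one vector per group; the factor levels keep groups
# that have nothing missing (they get an empty vector)
df$Missing <- unname(split(missing_long$x,
                           factor(missing_long$Group, levels = df$Group)))
df$Missing
```

The long format is what makes the join possible: once every value is its own row, "expected but not observed" becomes an ordinary row-level lookup on the pair (Group, x).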

With data.table grouping operations:

library(collapse) # for %!in%

dt[,Missing := .(.(Expected[[1]][Expected %!in% Observed])), Group][]
#>    Group          Expected       Observed  Missing
#> 1:     A         4,5,6,7,8          4,5,7      6,8
#> 2:     B  7, 8, 9,10,11,12  7, 8, 9,10,12       11
#> 3:     C  6, 7, 8, 9,10,11        6, 8,10  7, 9,11

# Equivalent, using setdiff() per group (returns a new data.table)
dt[
  ,.(
    Expected = Expected,
    Observed = Observed,
    Missing = .(setdiff(Expected[[1]], Observed[[1]]))
  ), Group
][]
#>    Group          Expected       Observed  Missing
#> 1:     A         4,5,6,7,8          4,5,7      6,8
#> 2:     B  7, 8, 9,10,11,12  7, 8, 9,10,12       11
#> 3:     C  6, 7, 8, 9,10,11        6, 8,10  7, 9,11

Data:

dt <- data.table(
  Group = LETTERS[1:3],
  Expected = list(4:8, 7:12, 6:11),
  Observed = list((4:7)[-3], (7:12)[-5], c(6L, 8L, 10L))
)

Benchmarking on a larger dataset:

dt <- data.table(
  Group = 1:1e4,
  Expected = lapply(sample(10, 1e4, 1), seq, 20)
)[, Observed := lapply(Expected, \(x) sample(x, sample(length(x), 1)))]

bench::mark(
  Map = {dt$Missing <- Map(setdiff, dt$Expected, dt$Observed); dt},
  setdiff = dt[,.(Expected = Expected, Observed = Observed, Missing = .(setdiff(Expected[[1]], Observed[[1]]))), Group],
  `%!in%` = dt[,Missing := .(.(Expected[[1]][Expected %!in% Observed])), Group],
  antiJoin = dt[
    dt[,.(x = unlist(Expected)), Group][
      !dt[,.(x = unlist(Observed)), Group], on = .(Group, x)
    ][, .(x = .(x)), Group], on = "Group", Missing := x
  ]
)

#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Map          82.4ms   88.9ms      9.81     1.7MB     25.5
#> 2 setdiff     106.8ms  111.7ms      8.84    1.98MB     15.9
#> 3 %!in%        30.6ms   33.7ms     28.9   149.84KB     15.4
#> 4 antiJoin     53.8ms     58ms     16.8    12.04MB     14.9

In this benchmark, %!in% is the fastest (median 33.7ms) and allocates the least memory (149.84KB).

Hoel On 03 January 2024 at 14:07

Inspired by @RonakShah, here is a tidyverse approach:

library(dplyr) # for %>%

df %>% 
  dplyr::mutate(Missing = purrr::map2(Expected, Observed, setdiff))

  Group            Expected        Observed Missing
1     A       4, 5, 6, 7, 8         4, 5, 7    6, 8
2     B 7, 8, 9, 10, 11, 12 7, 8, 9, 10, 12      11
3     C      6, 7, 8, 9, 10        6, 8, 10    7, 9
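
Because Missing is a list column, it can be awkward to filter on or export. As a small follow-up sketch, the code below derives a count and a printable label from it. Map() stands in for purrr::map2() so that the block needs no packages, and the column names n_missing and missing_label are hypothetical, chosen only for this example.

```r
# Assumed sample data, matching the accepted answer's df
df <- data.frame(Group = c("A", "B", "C"))
df$Expected <- list(4:8, 7:12, 6:10)
df$Observed <- list(c(4, 5, 7), c(7, 8, 9, 10, 12), c(6, 8, 10))

# Row-wise set difference (base R counterpart of purrr::map2)
df$Missing <- Map(setdiff, df$Expected, df$Observed)

# Number of missing values per row
df$n_missing <- lengths(df$Missing)

# Comma-separated text version of each row's missing values
df$missing_label <- vapply(df$Missing, toString, character(1))

df[, c("Group", "n_missing", "missing_label")]
```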

Ronak Shah On 03 January 2024 at 13:55 (Accepted Answer)

Since the columns are lists, apply setdiff() pairwise with Map():

df$Missing <- Map(setdiff, df$Expected, df$Observed)
df

#  Group            Expected        Observed Missing
#1     A       4, 5, 6, 7, 8         4, 5, 7    6, 8
#2     B 7, 8, 9, 10, 11, 12 7, 8, 9, 10, 12      11
#3     C      6, 7, 8, 9, 10        6, 8, 10    7, 9

data

It is easier to help if you provide data in reproducible format.

df <- structure(list(Group = c("A", "B", "C"), Expected = list(4:8, 
    7:12, 6:10), Observed = list(c(4, 5, 7), c(7, 8, 9, 10, 12
), c(6, 8, 10))), row.names = c(NA, -3L), class = "data.frame")
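
To see why the attempts in the question returned the full Expected column, it may help to compare the two calls side by side. Called on whole list columns, setdiff() treats each list element (an entire vector) as a single item, and none of the Expected vectors equals any Observed vector, so nothing is removed. Map() instead pairs the lists up row by row. A minimal sketch, with the data frame rebuilt so the block runs on its own (the names whole and by_row are only for illustration):

```r
# Assumed sample data, same values as the df above
df <- data.frame(Group = c("A", "B", "C"))
df$Expected <- list(4:8, 7:12, 6:10)
df$Observed <- list(c(4, 5, 7), c(7, 8, 9, 10, 12), c(6, 8, 10))

# Whole-column call: compares vectors as units, so every element
# of Expected survives
whole <- setdiff(df$Expected, df$Observed)
length(whole)

# Row-wise call: setdiff(Expected[[i]], Observed[[i]]) for each row i
by_row <- Map(setdiff, df$Expected, df$Observed)
by_row
```

Note that Expected holds integers (4:8) while Observed holds doubles (c(4, 5, 7)); setdiff() still matches them by value.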
