Intersection of items between groups in a data.table

317 Views Asked by At

I have a data.table with two columns, one with a groupID and the other with a color. I want to find the length of the intersections or a pairwise intersection operation between all groups. There are similar posts online but nothing exactly as far as what I am looking for.

require(data.table)


set.seed(1)
x <- data.table(
  groupID = paste0(sample(LETTERS), sample(LETTERS, replace = TRUE)),
  color = sapply(1:length(LETTERS), function(x) sample(colors()[1:10])[1:sample(5:10)[1]])
)

x <- x[, .(color = unlist(color)), keyby = groupID]

The table below doesn't have correct values but it would look something like this:

groups <- x[, .N, keyby = groupID][,groupID]; results <- CJ(groups, groups)
results[, intersectionLength := sapply(1:nrow(results), function(x) sample(5:10)[1])]

EDIT

This post has a similar question. How could I apply this to my problem?

2

There are 2 best solutions below

0
akrun On BEST ANSWER

Here is one optioin with Map to compare the pairwise elements of group columns to extract the intersecting 'color' values and get the length of it

library(data.table)
CJ(group1 = unique(x$groupID), group2 = unique(x$groupID))[,
   .(group1, group2, intersectionLength = unlist(Map(function(u, v) 
   length(intersect(x$color[x$groupID == u], 
      x$color[x$groupID == v])), group1, group2)))]
4
chinsoon12 On

Here is another option to remove dupes in cases where groups are flipped:

ans <- x[x, on=.(color), allow.cartesian=TRUE][groupID!=i.groupID, 
    .(intersectionLength=uniqueN(color)), 
    .(g1=pmin(groupID, i.groupID), g2=pmax(groupID, i.groupID))]

output:

     g1 g2 intersectionLength
  1: AT AT                  6
  2: AT CZ                  3
  3: AT DO                  6
  4: AT EW                  5
  5: AT FT                  4
 ---                         
347: XL YL                  3
348: XL ZF                  6
349: YL YL                  5
350: YL ZF                  3
351: ZF ZF                  7