Intersection of items between groups in a data.table

309 Views Asked by At

I have a data.table with two columns, one with a groupID and the other with a color. I want to find the length of the intersections or a pairwise intersection operation between all groups. There are similar posts online but nothing exactly as far as what I am looking for.

require(data.table)


set.seed(1)
x <- data.table(
  groupID = paste0(sample(LETTERS), sample(LETTERS, replace = TRUE)),
  color = sapply(1:length(LETTERS), function(x) sample(colors()[1:10])[1:sample(5:10)[1]])
)

x <- x[, .(color = unlist(color)), keyby = groupID]

The table below doesn't have correct values but it would look something like this:

groups <- x[, .N, keyby = groupID][,groupID]; results <- CJ(groups, groups)
results[, intersectionLength := sapply(1:nrow(results), function(x) sample(5:10)[1])]

EDIT

This post has a similar question. How could I apply this to my problem?

2

There are 2 best solutions below

0
On BEST ANSWER

Here is one optioin with Map to compare the pairwise elements of group columns to extract the intersecting 'color' values and get the length of it

library(data.table)
CJ(group1 = unique(x$groupID), group2 = unique(x$groupID))[,
   .(group1, group2, intersectionLength = unlist(Map(function(u, v) 
   length(intersect(x$color[x$groupID == u], 
      x$color[x$groupID == v])), group1, group2)))]
4
On

Here is another option to remove dupes in cases where groups are flipped:

ans <- x[x, on=.(color), allow.cartesian=TRUE][groupID!=i.groupID, 
    .(intersectionLength=uniqueN(color)), 
    .(g1=pmin(groupID, i.groupID), g2=pmax(groupID, i.groupID))]

output:

     g1 g2 intersectionLength
  1: AT AT                  6
  2: AT CZ                  3
  3: AT DO                  6
  4: AT EW                  5
  5: AT FT                  4
 ---                         
347: XL YL                  3
348: XL ZF                  6
349: YL YL                  5
350: YL ZF                  3
351: ZF ZF                  7