How do I count the number of times any two given values occur together in a row in R?

353 Views Asked by At

I am working with a data frame like this, with the ID column indicating a specific publication:

ID AuthorA AuthorB AuthorC
1   Chris   Lee     Jill
2   Jill    Tom     Lee 
3   Tom     Chris   Lee
4   Lee     Jill    NA
5   Jill    Chris   NA

I would like to generate a source, target, and count column for a social network analysis. In other words, count the number of times two authors appear on the same publication. The data frame I am working with, however, has 18 author columns. This should be the final output:

Source Target Count
Chris   Lee     2
Chris   Jill    2
Lee     Jill    3
Jill    Tom     1
Tom     Lee     2
Tom     Chris   1
2

There are 2 best solutions below

7
On BEST ANSWER

For every row you can create all combination of names and count their frequency with table.

result <- stack(table(unlist(apply(df[-1], 1, function(x) {
                 vec <- na.omit(x)
                 if(length(vec) < 2) return(NULL)
                  combn(vec, 2, function(y) paste0(sort(y), collapse = '-'))
            }))))[2:1]
result
#         ind values
#1 Chris-Jill      2
#2  Chris-Lee      2
#3  Chris-Tom      1
#4   Jill-Lee      3
#5   Jill-Tom      1
#6    Lee-Tom      2

To get them in separate columns you can use separate :

tidyr::separate(result, ind, c('Source', 'Target'), sep = '-')

#  Source Target values
#2  Chris   Jill      2
#3  Chris    Lee      2
#4  Chris    Tom      1
#6   Jill    Lee      3
#7   Jill    Tom      1
#9    Lee    Tom      2
0
On

Here's another approach in base R. It involves using crossprod which gets the frequencies you're after, just not in the format you're after.

Here's an example of what the crossprod output looks like:

crossprod(table(cbind(1:nrow(m), stack(lapply(m[, -1], as.character)))[, 1:2]))
#        values
# values  Chris Jill Lee Tom
#   Chris     3    2   2   1
#   Jill      2    4   3   1
#   Lee       2    3   4   2
#   Tom       1    1   2   2

Looking at the above, you can see that the values you're interested in are in the upper or lower triangle of the resulting matrix. You can use this information to write a function like the following to get your tabulations by pairs of values:

pair_table <- function(data) {
  m <- cbind(1:nrow(data), stack(lapply(data, as.character)))
  count <- crossprod(table(m[, 1:2]))
  count[upper.tri(count, diag = TRUE)] <- NA
  na.omit(as.data.frame(as.table(count)))
}

Using the function:

pair_table(m[, -1])
#    values values.1 Freq
# 2    Jill    Chris    2
# 3     Lee    Chris    2
# 4     Tom    Chris    1
# 7     Lee     Jill    3
# 8     Tom     Jill    1
# 12    Tom      Lee    2

Sample data used in this answer.

 m <- structure(list(ID = 1:5, AuthorA = structure(c(1L, 2L, 4L, 3L,                       
     2L), .Label = c("Chris", "Jill", "Lee", "Tom"), class = "factor"),                    
         AuthorB = structure(c(3L, 4L, 1L, 2L, 1L), .Label = c("Chris",                    
         "Jill", "Lee", "Tom"), class = "factor"), AuthorC = structure(c(1L,               
         2L, 2L, NA, NA), .Label = c("Jill", "Lee"), class = "factor")), row.names = c(NA, 
     5L), class = "data.frame")