Find correlation between *values* in columns


Consider a dataset where each row is a basket of 3 fruits.

library(data.table)
baskets <- data.table(fruit_1 = c('orange', 'apple', 'apple', 'pear')
                      ,fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi')
                      ,fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry'))

What would be an efficient way to calculate correlations between different fruits? In other words, how often do different fruits appear together in the same basket/row? I'm trying to get the pairwise correlation for every pair of fruits (for example, "apples and pears", "apples and kiwis", etc.).

The best approach I can think of now is to make indicator variables/binary columns for each fruit and then do the correlation of those. Is there a better way than that, computationally or otherwise?

EDIT: I updated this part to show a table that looks like my desired result. I would probably want an "agreement/disagreement score" or something similar instead of the correlation, but you get the idea.

# One 0/1 indicator column per fruit: 1 if the fruit appears anywhere in the row
baskets[, apple := as.integer(fruit_1 == 'apple' | fruit_2 == 'apple' | fruit_3 == 'apple')]
baskets[, pear  := as.integer(fruit_1 == 'pear'  | fruit_2 == 'pear'  | fruit_3 == 'pear')]
baskets[, kiwi  := as.integer(fruit_1 == 'kiwi'  | fruit_2 == 'kiwi'  | fruit_3 == 'kiwi')]

#looking for a table like this, but with every combination of fruit and imagining thousands of rows
desired_result = data.frame(fruit_1 = c('apple', 'pear', 'kiwi'),
                            fruit_2 = c('pear', 'kiwi', 'apple'),
                            similarity = c(cor(baskets$apple, baskets$pear),
                                           cor(baskets$pear, baskets$kiwi),
                                           cor(baskets$kiwi, baskets$apple)
                                           )
                            )


This feels like an okay solution, but not a great one, so I wanted to see what better options there are. data.table is highly preferable because I'm much more comfortable with it, but I'm open to anything.


There are 2 best solutions below

ThomasIsCoding On

You can try cor along with as.data.frame.table, e.g.,

subset(
    as.data.frame.table(
        cor(table(row(baskets), unlist(baskets))),
        responseName = "similarity",
        stringsAsFactors = FALSE
    ),
    Var1 < Var2
)

and you will obtain

        Var1      Var2 similarity
6      apple blueberry -0.5773503
11     apple      kiwi -0.3333333
12 blueberry      kiwi  0.5773503
16     apple    orange  0.3333333
17 blueberry    orange -0.5773503
18      kiwi    orange -1.0000000
21     apple      pear -0.3333333
22 blueberry      pear -0.5773503
23      kiwi      pear -0.3333333
24    orange      pear  0.3333333
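
Since the question mentions a preference for data.table, here is a minimal sketch of the same indicator-matrix idea written with melt and dcast. It is an alternative formulation, not the code used above; the names long, wide, ind, cm, res, fruit_a and fruit_b are made up for illustration, and it assumes every fruit is missing from at least one basket (otherwise its indicator is constant and cor returns NA for it).

```r
library(data.table)

baskets <- data.table(
  fruit_1 = c('orange', 'apple', 'apple', 'pear'),
  fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi'),
  fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry')
)

# Long format: one row per (basket, fruit) pair.
# copy() keeps the helper column out of the original table.
long <- melt(copy(baskets)[, basket := .I],
             id.vars = 'basket', value.name = 'fruit')

# Drop repeats so a fruit listed twice in one basket still counts once
long <- unique(long[, .(basket, fruit)])

# Wide format: one row per basket, one 0/1 column per fruit
wide <- dcast(long, basket ~ fruit, fun.aggregate = length, value.var = 'fruit')

# Pairwise Pearson correlation of the indicator columns
ind <- as.matrix(wide[, !'basket'])
cm  <- cor(ind)

# Long result, keeping each unordered pair once (upper triangle only)
idx <- which(upper.tri(cm), arr.ind = TRUE)
res <- data.table(
  fruit_a    = rownames(cm)[idx[, 'row']],
  fruit_b    = colnames(cm)[idx[, 'col']],
  similarity = cm[idx]
)
res
```

On the example data this gives the same ten pairs as the table above. For thousands of baskets the melt/dcast step stays in data.table, and only the final cor call works on a dense baskets-by-fruits matrix.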

thelatemail On

I think a co-occurrence table, generated using crossprod, might be helpful: it gives the counts of how often each value appears and how often each pair of values appears together.

out <- crossprod(table(
    data.frame(basket=seq_len(nrow(baskets)), fruit=unlist(baskets))
))
out
#           fruit
#fruit       apple blueberry kiwi orange pear
#  apple         3         1    2      1    2
#  blueberry     1         2    2      0    1
#  kiwi          2         2    3      0    2
#  orange        1         0    0      1    1
#  pear          2         1    2      1    3

The diagonal will be the raw frequency data:

diag(out)
#    apple blueberry      kiwi    orange      pear 
#        3         2         3         1         3 

You can get it in a long form too if you like:

as.data.frame.table(out)[lower.tri(out),]
#       fruit   fruit.1 Freq
#2  blueberry     apple    1
#3       kiwi     apple    2
#4     orange     apple    1
#5       pear     apple    2
#8       kiwi blueberry    2
#9     orange blueberry    0
#10      pear blueberry    1
#14    orange      kiwi    0
#15      pear      kiwi    2
#20      pear    orange    1

And if you want to include the diagonal, you can do so:

as.data.frame.table(out)[lower.tri(out, diag=TRUE),]
#       fruit   fruit.1 Freq
#1      apple     apple    3
#2  blueberry     apple    1
#...
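
If a correlation-style score is wanted rather than raw counts, the Pearson correlation of two 0/1 indicators (the phi coefficient) can be computed from the co-occurrence counts alone. The following is a rough sketch under two assumptions: each fruit appears at most once per basket, and no fruit appears in every basket (that would make the denominator zero). The names n, k and phi are illustrative, and a plain data.frame is used here only to keep the example self-contained.

```r
baskets <- data.frame(
  fruit_1 = c('orange', 'apple', 'apple', 'pear'),
  fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi'),
  fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry')
)

# Co-occurrence counts, as in the answer above
out <- crossprod(table(
  data.frame(basket = seq_len(nrow(baskets)), fruit = unlist(baskets))
))

n <- nrow(baskets)   # number of baskets
k <- diag(out)       # number of baskets containing each fruit

# phi[a, b] = (n * n_ab - n_a * n_b) / sqrt(n_a * (n - n_a) * n_b * (n - n_b))
phi <- (n * out - outer(k, k)) / sqrt(outer(k * (n - k), k * (n - k)))
round(phi, 3)
```

For apple and kiwi this works out to (4 * 2 - 3 * 3) / sqrt(3 * 1 * 3 * 1) = -1/3, which matches what cor reports on the indicator columns, so the counts in out are enough to recover the correlations without building the correlation from the full indicator table.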