Pairwise Cohen's kappa of values in two dataframes


I have two dataframes that look like the toy examples below:

import pandas as pd

data1 = {'subject': ['A', 'B', 'C', 'D'],
         'group': ['red', 'red', 'blue', 'blue'],
         'lists': [[0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]}

data2 = {'subject': ['a', 'b', 'c', 'd'],
         'group': ['red', 'red', 'blue', 'blue'],
         'lists': [[0, 1, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0]]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

I would like to calculate Cohen's kappa for each pair of subjects. For example, I would like to score subject "A" in df1 against subjects "a", "b", "c", and "d" in df2, and so on for every other subject. Like this:

from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(df1['lists'][0], df2['lists'][0])
cohen_kappa_score(df1['lists'][0], df2['lists'][1])
cohen_kappa_score(df1['lists'][0], df2['lists'][2])
...

Importantly, I would like to represent these pairwise Cohen's kappa scores in a new dataframe where both the columns and rows are all the subjects ("A", "B", "C", "D", "a", "b", "c", "d"), so that I can see whether the scores are more consistent between dataframes or within dataframes. I will eventually convert this dataframe into a heatmap organized by "group".

This post for a similar R problem looks promising, but I don't know how to implement it in Python. Similarly, I have not yet figured out how to adapt this Python solution, which appears similar enough.

Best answer

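Use concat to stack both dataframes and pdist to score every pair of rows. Note that pdist never compares a row with itself, so squareform fills the diagonal with 0.0 as a placeholder rather than a real kappa value:

import numpy as np
import pandas as pd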
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import cohen_kappa_score

s = (pd.concat([df1, df2])
       .set_index(['subject', 'group'])['lists']
     )

out = pd.DataFrame(squareform(pdist(np.vstack(s.to_list()),
                                    cohen_kappa_score)),
                   index=s.index, columns=s.index)

print(out)

Output:

subject          A    B    C    D    a    b    c    d
group          red  red blue blue  red  red blue blue
subject group                                        
A       red    0.0  0.0  0.0  0.4  0.4 -0.5 -0.5 -0.5
B       red    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
C       blue   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
D       blue   0.4  0.0  0.0  0.0  1.0  0.4 -0.8  0.4
a       red    0.4  0.0  0.0  1.0  0.0  0.4 -0.8  0.4
b       red   -0.5  0.0  0.0  0.4  0.4  0.0 -0.5  1.0
c       blue  -0.5  0.0  0.0 -0.8 -0.8 -0.5  0.0 -0.5
d       blue  -0.5  0.0  0.0  0.4  0.4  1.0 -0.5  0.0
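
Since the question asks for a heatmap organized by "group", here is a minimal sketch of two follow-up steps: marking the placeholder diagonal as not computed, and ordering rows and columns by group. It uses a small stand-in for `out` with values copied from the A, D, a, and c entries above; the names `marked` and `by_group` are illustrative only and do not come from the original answer.

```python
import numpy as np
import pandas as pd

# Small stand-in for `out` (values copied from rows/columns A, D, a, c above).
idx = pd.MultiIndex.from_tuples(
    [("A", "red"), ("D", "blue"), ("a", "red"), ("c", "blue")],
    names=["subject", "group"],
)
out = pd.DataFrame(
    [[0.0, 0.4, 0.4, -0.5],
     [0.4, 0.0, 1.0, -0.8],
     [0.4, 1.0, 0.0, -0.8],
     [-0.5, -0.8, -0.8, 0.0]],
    index=idx, columns=idx,
)

# pdist never evaluates a row against itself, so mark the diagonal as
# "not computed" instead of leaving the placeholder 0.0.
values = out.to_numpy(copy=True)
np.fill_diagonal(values, np.nan)
marked = pd.DataFrame(values, index=out.index, columns=out.columns)

# Order rows and columns by group so same-group subjects sit together.
by_group = marked.sort_index(level="group").sort_index(axis=1, level="group")
print(by_group)
```

The sorted frame can then be passed to a plotting function of your choice. Whether NaN or some other value is the right diagonal marker depends on how the heatmap should treat self-comparisons.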

Another answer

You could first do a cross merge to generate every df1/df2 pair, then apply cohen_kappa_score to each pair. This covers only the between-dataframe pairs:

(np.vectorize(cohen_kappa_score)(*df1.merge(df2, on=None, how='cross')
  .filter(regex='lists').to_numpy().T))

array([ 0.4, -0.5, -0.5, -0.5,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  1. ,  0.4, -0.8,  0.4])

The easiest way is to use a list comprehension:

[cohen_kappa_score(i,j) for i in df1['lists'] for j in df2['lists']]

[0.3999999999999999, -0.5, -0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.3999999999999999, -0.8000000000000003, 0.3999999999999999]

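Edit: to keep the subject labels, build the dataframe from a nested dict comprehension:

pd.DataFrame({i.subject: {j.subject: cohen_kappa_score(i.lists, j.lists)
                          for j in df2.itertuples()}
              for i in df1.itertuples()})

     A    B    C    D
a  0.4  0.0  0.0  1.0
b -0.5  0.0  0.0  0.4
c -0.5  0.0  0.0 -0.8
d -0.5  0.0  0.0  0.4

Another answer, looping over every pair of rows in the concatenated dataframe:

from sklearn.metrics import cohen_kappa_score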
import pandas as pd
import numpy as np


def calculate_kappa(df1, df2):
    # Stack both dataframes so every subject is compared with every other.
    df = pd.concat([df1, df2]).reset_index(drop=True)
    # Score all ordered pairs, including each subject against itself.
    kappa_scores = [
        cohen_kappa_score(row1, row2)
        for row1 in df["lists"]
        for row2 in df["lists"]
    ]
    kappa_matrix = np.array(kappa_scores).reshape((len(df), len(df)))
    # Label rows and columns with (subject, group).
    labels = df.set_index(["subject", "group"]).index
    return pd.DataFrame(kappa_matrix, index=labels, columns=labels)


data1 = {
    "subject": ["A", "B", "C", "D"],
    "group": ["red", "red", "blue", "blue"],
    "lists": [[0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]],
}
data2 = {
    "subject": ["a", "b", "c", "d"],
    "group": ["red", "red", "blue", "blue"],
    "lists": [[0, 1, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0]],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
kappa_df = calculate_kappa(df1, df2)
print(kappa_df)

Output:

subject          A    B    C    D    a    b    c    d
group          red  red blue blue  red  red blue blue
subject group                                        
A       red    1.0  0.0  0.0  0.4  0.4 -0.5 -0.5 -0.5
B       red    0.0  NaN  0.0  0.0  0.0  0.0  0.0  0.0
C       blue   0.0  0.0  NaN  0.0  0.0  0.0  0.0  0.0
D       blue   0.4  0.0  0.0  1.0  1.0  0.4 -0.8  0.4
a       red    0.4  0.0  0.0  1.0  1.0  0.4 -0.8  0.4
b       red   -0.5  0.0  0.0  0.4  0.4  1.0 -0.5  1.0
c       blue  -0.5  0.0  0.0 -0.8 -0.8 -0.5  1.0 -0.5
d       blue  -0.5  0.0  0.0  0.4  0.4  1.0 -0.5  1.0

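The two NaN values are B against itself and C against itself. B is [0, 0, 0] and C is [1, 1, 1], so each self-comparison involves a single label: observed and chance agreement are both 1, and kappa = (1 - 1) / (1 - 1) is undefined. B against C is 0.0 because the two lists never agree and chance agreement is also 0.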
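
To make the NaN cases concrete, here is a minimal sketch of unweighted Cohen's kappa in plain Python, using the usual definition kappa = (p_o - p_e) / (1 - p_e). The helper `kappa_by_hand` is a hypothetical name for illustration; it is not how scikit-learn implements `cohen_kappa_score`, and it assumes two equal-length, non-empty lists.

```python
import math
from collections import Counter


def kappa_by_hand(r1, r2):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(r1)
    # Observed agreement: share of positions where the two lists match.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement: for each label, multiply its counts in both lists.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[label] * c2[label] for label in c1.keys() | c2.keys()) / n**2
    if p_e == 1:
        # Both lists use one identical label: 0/0, so kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)


print(kappa_by_hand([0, 1, 1], [0, 1, 0]))  # A vs a
print(kappa_by_hand([0, 0, 0], [0, 0, 0]))  # B vs B
print(kappa_by_hand([0, 0, 0], [1, 1, 1]))  # B vs C
```

For A vs a, p_o is 2/3 and p_e is 4/9, which gives roughly 0.4 and matches the table. For B vs B, both p_o and p_e are 1, which is the 0/0 case.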