I have two dataframes df1 and df2 that contain the edge lists of two networks, g1 and g2, which share the same nodes but have different connections. For each node I want to compare the Jaccard index between the two networks.
I define the function that computes the Jaccard index:
def compute_jaccard_index(set_1, set_2):
    n = len(set_1.intersection(set_2))
    return n / float(len(set_1) + len(set_2) - n)
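For instance, taking node 2's neighbour sets from the edge lists shown further down (a small worked example, not part of the original post):

```python
def compute_jaccard_index(set_1, set_2):
    # Jaccard index: |intersection| / |union|
    n = len(set_1.intersection(set_2))
    return n / float(len(set_1) + len(set_2) - n)

# Node 2's neighbours: {3, 4, 7} in df1 and {4, 7} in df2
set_1 = {3, 4, 7}
set_2 = {4, 7}
print(compute_jaccard_index(set_1, set_2))  # 2 shared of 3 total -> 0.666...
```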
df1
   i  j
0  0  2
1  0  5
2  1  2
3  2  3
4  2  4
5  2  7
df2
   i  j
0  0  2
1  0  5
2  0  1
3  1  3
4  2  4
5  2  7
What I am doing is the following:

import pandas as pd

tmp1 = pd.unique(df1['i'])
tmp2 = pd.unique(df2['i'])
JI = []
for i in tmp1:
    tmp11 = df1[df1['i'] == i]
    tmp22 = df2[df2['i'] == i]
    set_1 = set(tmp11['j'])   # sets, not lists, so .intersection() works
    set_2 = set(tmp22['j'])
    JI.append(compute_jaccard_index(set_1, set_2))
I am wondering if there is a more efficient way to do this.
I've always found it faster to take advantage of scipy's sparse matrices and vectorize the operations rather than depending on Python's set functions. Here is a simple function that converts DataFrame edge lists into sparse matrices (both directed and undirected):
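The answer's original snippet is not shown here; a minimal sketch of such a converter (the helper name `edgelist_to_sparse` and its signature are my own) could be:

```python
import numpy as np
import pandas as pd
from scipy import sparse

def edgelist_to_sparse(df, n_nodes, directed=True):
    """Build a binary CSR adjacency matrix from an edge-list DataFrame
    with integer columns 'i' (source) and 'j' (target)."""
    A = sparse.coo_matrix(
        (np.ones(len(df)), (df['i'].to_numpy(), df['j'].to_numpy())),
        shape=(n_nodes, n_nodes),
    ).tocsr()
    if not directed:
        A = A + A.T        # mirror each edge for an undirected graph
    A.data[:] = 1          # clamp duplicate edges to 0/1
    return A
```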
then it is just simple vector operations on the binary adjacency matrices:
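Concretely (my own sketch of that vectorized step), the per-node Jaccard index is the row-wise intersection over union of the two binary matrices:

```python
import numpy as np
from scipy import sparse

def jaccard_sparse(A1, A2):
    """Per-row Jaccard index of two binary sparse adjacency matrices."""
    inter = np.asarray(A1.multiply(A2).sum(axis=1)).ravel()  # |N1 & N2|
    deg1 = np.asarray(A1.sum(axis=1)).ravel()
    deg2 = np.asarray(A2.sum(axis=1)).ravel()
    union = deg1 + deg2 - inter                              # |N1 | N2|
    with np.errstate(invalid='ignore'):
        # Nodes with no neighbours in either graph get 0 instead of 0/0.
        return np.where(union > 0, inter / union, 0.0)
```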
For comparison, I've made a random pandas edge list function and wrapped your code into the following functions:
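Those helpers are not shown either; hypothetical stand-ins (the names `random_edgelist` and `jaccard_sets` are mine) might look like:

```python
import numpy as np
import pandas as pd

def compute_jaccard_index(set_1, set_2):
    n = len(set_1.intersection(set_2))
    return n / float(len(set_1) + len(set_2) - n)

def random_edgelist(n_nodes, n_edges, seed=None):
    """Random edge list with integer columns 'i' and 'j'."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({'i': rng.integers(0, n_nodes, size=n_edges),
                         'j': rng.integers(0, n_nodes, size=n_edges)})

def jaccard_sets(df1, df2):
    """The question's set-based loop, wrapped in a function for timing."""
    return [compute_jaccard_index(set(df1.loc[df1['i'] == i, 'j']),
                                  set(df2.loc[df2['i'] == i, 'j']))
            for i in pd.unique(df1['i'])]
```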
We can then compare the runtime by making two random networks:
And calculating the Jaccard similarity for each node using your set based method:
And the sparse method (including the overhead of converting to sparse matrices):
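Pulling the steps above together, a self-contained timing comparison might look like the following (function names and network sizes are illustrative, not the answer's originals):

```python
import timeit
import numpy as np
import pandas as pd
from scipy import sparse

def compute_jaccard_index(set_1, set_2):
    n = len(set_1.intersection(set_2))
    return n / float(len(set_1) + len(set_2) - n)

def jaccard_sets(df1, df2):
    # The question's set-based loop.
    return [compute_jaccard_index(set(df1.loc[df1['i'] == i, 'j']),
                                  set(df2.loc[df2['i'] == i, 'j']))
            for i in pd.unique(df1['i'])]

def edgelist_to_sparse(df, n_nodes):
    A = sparse.coo_matrix(
        (np.ones(len(df)), (df['i'].to_numpy(), df['j'].to_numpy())),
        shape=(n_nodes, n_nodes)).tocsr()
    A.data[:] = 1          # clamp duplicate edges to 0/1
    return A

def jaccard_sparse(df1, df2, n_nodes):
    # Includes the overhead of converting the edge lists to sparse matrices.
    A1 = edgelist_to_sparse(df1, n_nodes)
    A2 = edgelist_to_sparse(df2, n_nodes)
    inter = np.asarray(A1.multiply(A2).sum(axis=1)).ravel()
    union = (np.asarray(A1.sum(axis=1)).ravel()
             + np.asarray(A2.sum(axis=1)).ravel() - inter)
    with np.errstate(invalid='ignore'):
        return np.where(union > 0, inter / union, 0.0)

# Two random networks on the same node set
rng = np.random.default_rng(0)
n = 1_000
df1 = pd.DataFrame({'i': rng.integers(0, n, 10_000),
                    'j': rng.integers(0, n, 10_000)})
df2 = pd.DataFrame({'i': rng.integers(0, n, 10_000),
                    'j': rng.integers(0, n, 10_000)})

t_sets = timeit.timeit(lambda: jaccard_sets(df1, df2), number=3)
t_sparse = timeit.timeit(lambda: jaccard_sparse(df1, df2, n), number=3)
print(f"set loop: {t_sets:.3f}s, sparse: {t_sparse:.3f}s")
```

Both methods deduplicate repeated edges (the sets implicitly, the sparse version by clamping to 0/1), so they agree node for node; only the speed differs.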
As you can see, the sparse matrix code is about 1000 times faster.