I have a dataframe of users and their friends that looks like:
user_id | friend_id
1 3
1 4
2 3
2 5
3 4
I want to write a function in python to compute the number of mutual friends for each pair:
user_id | friend_id | num_mutual
1 3 1
1 4 1
2 3 0
2 5 0
3 4 1
Currently I have:
def find_mutual(df):
num_mutual = []
for i in range(len(df)):
user, friend = df.loc[i, 'user_id'], df.loc[i, 'friend_id']
user_list = df[df.user_id == user].friend_id.tolist() + df[df.friend_id == user].user_id.tolist()
friend_list = df[df.user_id == friend].friend_id.tolist() + df[df.friend_id == friend].user_id.tolist()
mutual = len(list(set(user_list) & set(friend_list)))
num_mutual.append(mutual)
return num_mutual
It works fine for small datasets, but I'm running it on a dataset with millions of rows. It takes forever to run everything. I know it's not the ideal way to find the count. Is there a better algorithm in Python? Thanks in advance!
The [ugly] idea is to construct a 4 point path that starts with a
user_idand ends with the sameuser_id. If such a path exists, then 2 starting points have mutual friends.We start with:
Then you can do:
A more elegant and straightforward way to use a graph approach
Then you can identify number of mutual friends as number of paths for 2 adjacent nodes (2 friends) for which a 3 node path exists: