Get similarity within a column based on another column


I have a table with three columns: Source, Target, Similarity. The first two are strings, the last one is a float. The table was produced by comparing source elements with target elements and scoring their similarity; for each source element it keeps the 20 most similar target elements. It looks like this (not the full table):

Source Target Similarity
Source_1 Target_23 0.82
Source_1 Target_12 0.32
Source_1 Target_2 0.02
Source_2 Target_23 0.72
Source_2 Target_14 0.52
Source_2 Target_12 0.12

Based on this information, I would like to find, for each source element, its 5 most similar other source elements. I don't want to compute similarity between source elements directly, as that is a computationally expensive process. The idea is instead that if Source_1 and Source_2 are both highly similar to the same target element, then they should be similar to each other as well.

What's the best way of doing this?

I've tried ranking the targets of each source in descending order of similarity and selecting the source elements that share the same top 5 targets, irrespective of their order within the top 5. I feel there should be a better way to use the similarity score than just for ranking.
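For reference, the top-5 set comparison described above can be sketched roughly like this. It is only a minimal illustration: the `rows` list just repeats the sample table, and the helper names `top_k_targets` and `same_top_k` are made up for this example.

```python
from collections import defaultdict

# Sample rows in the (Source, Target, Similarity) layout of the table above.
rows = [
    ("Source_1", "Target_23", 0.82),
    ("Source_1", "Target_12", 0.32),
    ("Source_1", "Target_2", 0.02),
    ("Source_2", "Target_23", 0.72),
    ("Source_2", "Target_14", 0.52),
    ("Source_2", "Target_12", 0.12),
]


def top_k_targets(rows, k=5):
    """Map each source to the set of its k highest-scoring targets."""
    by_source = defaultdict(list)
    for source, target, sim in rows:
        by_source[source].append((sim, target))
    return {
        source: {target for _, target in sorted(pairs, reverse=True)[:k]}
        for source, pairs in by_source.items()
    }


def same_top_k(rows, selected, k=5):
    """Return the other sources whose top-k target set equals the selected one's."""
    tops = top_k_targets(rows, k)
    return [s for s, targets in tops.items()
            if s != selected and targets == tops[selected]]


# With k=1 only the single best target of each source is compared.
print(same_top_k(rows, "Source_1", k=1))
```

Because the sets have to match exactly, this is quite strict: with the sample rows and `k=3`, Source_1 and Source_2 no longer match even though they share two of their three targets.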

I've also tried finding the sources that have the top target of the selected source element among their own top 5 targets, provided the similarity exceeds a threshold. This worked well, but again I feel I'm not utilising all the information I have, since the remaining targets of each source are ignored (see the code snippet below).

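    # Assumes similarity_json[source] holds parallel "Target"/"Similarity" lists, best match first.
    selected_source = "Source_1"
    source_similars = []
    top_target = similarity_json[selected_source]["Target"][0]
    for source in Source:  # Source is the list of source names
        top5 = similarity_json[source]["Target"][:5]
        if top_target in top5 and similarity_json[source]["Similarity"][top5.index(top_target)] > 0.5:
            source_similars.append(source)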
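One possible way to use all the scores rather than only the ranks would be to treat each source as a sparse vector of target similarities and compare sources by cosine similarity. The sketch below is only an illustration under assumptions: targets missing from a source's top-20 list are treated as similarity 0 (which may understate the true overlap), and the helper names `build_vectors`, `cosine` and `most_similar_sources` are hypothetical.

```python
import math
from collections import defaultdict

# Sample rows in the (Source, Target, Similarity) layout of the table above.
rows = [
    ("Source_1", "Target_23", 0.82),
    ("Source_1", "Target_12", 0.32),
    ("Source_1", "Target_2", 0.02),
    ("Source_2", "Target_23", 0.72),
    ("Source_2", "Target_14", 0.52),
    ("Source_2", "Target_12", 0.12),
]


def build_vectors(rows):
    """Turn the rows into {source: {target: similarity}} sparse vectors."""
    vectors = defaultdict(dict)
    for source, target, sim in rows:
        vectors[source][target] = sim
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; absent targets count as 0."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_sources(rows, selected, n=5):
    """Return up to n (source, score) pairs, most similar to `selected` first."""
    vectors = build_vectors(rows)
    scores = [(cosine(vectors[selected], vec), source)
              for source, vec in vectors.items() if source != selected]
    scores.sort(reverse=True)
    return [(source, score) for score, source in scores[:n]]


print(most_similar_sources(rows, "Source_1"))
```

This still never compares source elements directly; it only reuses the source-target scores that already exist. Every pair of sources is scored, so with many sources it may be worth restricting the comparison to sources that share at least one target.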