Context: I have a dictionary of about 130 lists, where each key maps to a list of indexes.
{'key1': [0, 1, 2], 'key2': [2, 3, 4], 'key3': [5, 6], …, 'key130': [0, 450, 1103, 500, …]}
Lists are all different sizes.
This is a two-part problem:
1. I want some form of data structure to store the number of overlaps between lists.
2. If possible, I want a diagram that shows the overlap.
PART 1:
The answers to the most similar StackOverflow questions suggested finding list similarities with set.intersection:
List1 = [10,10,11,12,15,16,18,19]
List2 = [10,11,13,15,16,19,20]
List3 = [10,11,11,12,15,19,21,23]
print(set(List1).intersection(List2))  # compare list 1 and list 2
Which gives you:
set([10, 11, 15, 16, 19])
I could then use a for loop to go through the dictionary, compare each list with the next one, and take the length of the intersection. This would give me a dictionary such as:
{'key1_key2': 1, 'key2_key3': 0, 'key3_key4'…, 'key130_key1': 29}
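A minimal sketch of that counting step, using the reproducible example from further down as stand-in data. Note two assumptions that differ from the description above: it counts every pair of keys with itertools.combinations rather than only each key and the next one, and it uses (key_a, key_b) tuples as dictionary keys instead of 'key1_key2' strings. The names sets and pair_overlaps are made up for illustration.

```python
from itertools import combinations

# Sample data copied from the reproducible example in the question.
data = {
    'key1': [10, 10, 11, 12, 15, 16, 18, 19],
    'key2': [10, 11, 13, 15, 16, 19, 20],
    'key3': [10, 11, 11, 12, 15, 19, 21, 23],
    'key4': [],
    'key5': [0],
    'key6': [10, 55, 66, 77],
}

# Convert each list to a set once, so duplicates are ignored and
# each intersection is cheap to compute.
sets = {key: set(values) for key, values in data.items()}

# pair_overlaps (hypothetical name) maps (key_a, key_b) -> number of shared values.
# combinations() yields each unordered pair exactly once.
pair_overlaps = {
    (a, b): len(sets[a] & sets[b])
    for a, b in combinations(sets, 2)
}

print(pair_overlaps[('key1', 'key2')])  # 5
```

With 130 keys this is 8385 intersections, which should still be quick for lists of this size. Because duplicates are dropped by the set conversion, repeated indexes in a list count only once.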
PART 2:
I imagine a comparison table would be the best way to visualize the similarities:
          Key1   Key2   Key3   …   Key130
Key1      X      X      X          X
Key2      0      X      X          X
Key3      4      6      X          X
…                              X   …
Key130                             X
However, I couldn’t find many results on how this can be achieved.
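One way such a table could be built with only the standard library is sketched below. It assumes the reproducible example data stands in for the real dictionary and follows the layout above: overlap counts below the diagonal, X everywhere else. The names sets, keys, rows and width are made up for illustration.

```python
# Sample data copied from the reproducible example in the question.
data = {
    'key1': [10, 10, 11, 12, 15, 16, 18, 19],
    'key2': [10, 11, 13, 15, 16, 19, 20],
    'key3': [10, 11, 11, 12, 15, 19, 21, 23],
    'key4': [],
    'key5': [0],
    'key6': [10, 55, 66, 77],
}

sets = {key: set(values) for key, values in data.items()}
keys = list(sets)

# One row per key: overlap counts below the diagonal, 'X' on and above it.
rows = []
for i, row_key in enumerate(keys):
    row = []
    for j, col_key in enumerate(keys):
        if j < i:
            row.append(str(len(sets[row_key] & sets[col_key])))
        else:
            row.append('X')
    rows.append(row)

# Print as a fixed-width text table with a header line of column keys.
width = max(len(k) for k in keys) + 2
print(''.ljust(width) + ''.join(k.ljust(width) for k in keys))
for row_key, row in zip(keys, rows):
    print(row_key.ljust(width) + ''.join(cell.ljust(width) for cell in row))
```

If pandas is available, the same nested list could be passed to pandas.DataFrame(rows, index=keys, columns=keys) to get a labeled table, though with 130 keys a printed table may be too wide to read comfortably.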
Another option is UpSetPlot, which allows for a pretty nice, if perhaps slightly excessive, comparison in this case: https://upsetplot.readthedocs.io/en/stable/
Of course, I’m sure both diagrams would need the similarity results to be stored a bit differently? I’m not too sure about the comparison table, but UpSetPlot would need the dictionary (?) to be a pandas Series. I would be interested in both diagrams to see how they would look.
Reproducible Example:
{'key1': [10,10,11,12,15,16,18,19], 'key2': [10,11,13,15,16,19,20], 'key3':[10,11,11,12,15,19,21,23], 'key4':[], 'key5':[0], 'key6':[10,55,66,77]}
Some of the more useful resources I looked at:
How to compare more than 2 Lists in Python?
Python -Intersection of multiple lists?
Python comparing multiple lists into Comparison Table
If there are some other sites that I missed that would be applicable to this Q, please let me know. Thank you in advance!
Output:
Definitely not the solution with the best performance. But it is easy to implement.