Obtaining a HashMap or dictionary and a diagram in Python to visualize the overlaps between multiple lists

83 Views Asked by At

Context: I roughly have a dictionary of about 130 lists in the form of a key and a list of indexes.

{‘key1’:[0,1,2], ‘key2’: [2, 3, 4], ‘key3’:[5, 6],…, ‘key130’:[0, 450, 1103, 500,…]}

Lists are all different sizes.

This is a two-part problem where:

  1. I want some form of data structure to store the number of overlaps between lists

  2. If possible, I want a diagram that shows the overlap

PART 1:

The most similar StackOverflow questions answers were that we could find list similarities by utilizing set.intersection

List1 = [10,10,11,12,15,16,18,19]

List2 = [10,11,13,15,16,19,20]

List3 = [10,11,11,12,15,19,21,23]

print(set(List1).intersection(List2)) #compare between list 2 and 3

Which gives you:

set([10, 11, 15, 16, 19])

I could then use a for loop to traverse through each list to compare it with the next list in the dictionary and get the length of the list. This would then give me a dictionary such as:

{‘key1_key2’:1, ‘key2_key3’:0, ‘key3_key4’…, ‘key130_key1’: [29]}

PART 2:

I have in my head that a comparison table would be the best to visualize the similarities:


    Key1    Key2    Key3    …   Key130
Key1    X   X   X       X
Key2    0   X   X       X
Key3    4   6   X       X
…               X   …
Key130                  X

However, I couldn’t find many results on how this can be achieved.

Another option was UpSetPlot as it can allow for pretty nice yet perhaps a little excessive comparison in this case: https://upsetplot.readthedocs.io/en/stable/

Of course, I’m sure both diagrams would need the similarities result to be stored a bit differently? I’m not too sure for the Comparison Table but UpSetPlot would need the dictionary (?) to be a pandaSeries. I would be interested in both diagrams to test how it would look.

Reproducible Example:

{'key1': [10,10,11,12,15,16,18,19], 'key2': [10,11,13,15,16,19,20], 'key3':[10,11,11,12,15,19,21,23], 'key4':[], 'key5':[0], 'key6':[10,55,66,77]}

Some of the more useful resources I looked at:
How to compare more than 2 Lists in Python? Python -Intersection of multiple lists? Python comparing multiple lists into Comparison Table

If there are some other sites that I missed that would be applicable to this Q, please let me know. Thank you in advance!

2

There are 2 best solutions below

0
On BEST ANSWER
import numpy as np
import pandas as pd

d = {'key1':[0,1,2], 'key2': [2, 3, 4], 'key3':[5, 6]}
s = []
[s.append(list(set(x) & set(y))) for x in d.values() for y in d.values()]

matrix1 = np.array(s, dtype = object)
matrix2 = matrix1.reshape(int(np.sqrt(len(matrix1))),int(np.sqrt(len(matrix1))))
matrix2 = np.vectorize(len)(matrix2)

df = pd.DataFrame(matrix2)
df.columns = d.keys()
df.index = d.keys()

print(df)

Output:

      key1  key2  key3
key1     3     1     0
key2     1     3     0
key3     0     0     2

Definitely not the solution with the best performance. But it is easy to implement.

0
On

The structure I would use would be a nested set of dictionaries, where the first level was all the comparisons for a given first key, and each value in the first dictionary is a dictionary containing the intersection of that value's key with each of the other keys (including itself).

You can optionally choose to store the comparison length for each pair of keys just once in the dictionary. If you do this, then when you look up a comparison, you have to consider the keys in both orders when looking up their intersections. If a comparison doesn't exist for two keys in one order (ex: [a][b]), then it will contains a comparison in the other order ([b][a]). If you'd rather be able to look up the keys without worrying about the order, you can store the values twice so that either order is valid.

Here's code that does all this, including having the option to create a sparse data structure that only contains the counts in one order for each key pair:

from pprint import pprint

# Build the comparison structure
def build_comparisons(input, sparse=True):
    comparisons = {}
    for key1 in input.keys():
        comparisons[key1] = {}
        for key2 in input.keys():
            if key2 in comparisons and key1 in comparisons[key2]:
                # If we've already computed the intersection for this pair of keys...
                if sparse:
                    # For a sparse structure, don't include values for keys that
                    # already exist in the structure in the opposite order of keys
                    continue
                # Use the already computed value
                comparisons[key1][key2] = comparisons[key2][key1]
            elif key1 == key2:
                # If the keys are the same, use the length of that key's value
                comparisons[key1][key2] = len(input[key1])
            else:
                # Compute the intersection between the two keys and take its length
                l1 = set(input[key1])
                l2 = set(input[key2])
                comparisons[key1][key2] = len(l1.intersection(l2))
    return comparisons

# Look up the intersection for a keypair.  This function is only necessary
# to make it easier to look up values in a sparse structure
def get_comparison(comparisons, key1, key2):
    if key1 in comparisons and key2 in comparisons[key1]:
        return comparisons[key1][key2]
    return comparisons[key2][key1]

# Display the lengths of each intersection in a table format
def display_comparisons(comparisons):

    cell_width = max([len(key) for key in comparisons.keys()]) + 2

    def print_cell(val):
        print(f"{str(val):<{cell_width}}", end='')

    print_cell('')
    [print_cell(key) for key in comparisons.keys()]
    print()
    for key1 in comparisons.keys():
        print_cell(key1)
        for key2 in comparisons.keys():
            print_cell(get_comparison(comparisons, key1, key2))
        print()

input = {'key1': [10,10,11,12,15,16,18,19], 'key2': [10,11,13,15,16,19,20], 'key3':[10,11,11,12,15,19,21,23], 'key4':[], 'key5':[0], 'key6':[10,55,66,77]}

# Build the comparison structure
comparisons = build_comparisons(input)

# Show the resulting data structure
pprint(comparisons)

print()

# Display the lengths of each intersection in a table format
display_comparisons(comparisons)

Here's the resulting output:

{'key1': {'key1': 8, 'key2': 5, 'key3': 5, 'key4': 0, 'key5': 0, 'key6': 1},
 'key2': {'key2': 7, 'key3': 4, 'key4': 0, 'key5': 0, 'key6': 1},
 'key3': {'key3': 8, 'key4': 0, 'key5': 0, 'key6': 1},
 'key4': {'key4': 0, 'key5': 0, 'key6': 0},
 'key5': {'key5': 1, 'key6': 0},
 'key6': {'key6': 4}}

      key1  key2  key3  key4  key5  key6  
key1  8     5     5     0     0     1     
key2  5     7     4     0     0     1     
key3  5     4     8     0     0     1     
key4  0     0     0     0     0     0     
key5  0     0     0     0     1     0     
key6  1     1     1     0     0     4     

If you set sparse to False when computing the data structure, then when you print it, you get this instead:

{'key1': {'key1': 8, 'key2': 5, 'key3': 5, 'key4': 0, 'key5': 0, 'key6': 1},
 'key2': {'key1': 5, 'key2': 7, 'key3': 4, 'key4': 0, 'key5': 0, 'key6': 1},
 'key3': {'key1': 5, 'key2': 4, 'key3': 8, 'key4': 0, 'key5': 0, 'key6': 1},
 'key4': {'key1': 0, 'key2': 0, 'key3': 0, 'key4': 0, 'key5': 0, 'key6': 0},
 'key5': {'key1': 0, 'key2': 0, 'key3': 0, 'key4': 0, 'key5': 1, 'key6': 0},
 'key6': {'key1': 1, 'key2': 1, 'key3': 1, 'key4': 0, 'key5': 0, 'key6': 4}}