Combing Subsets into Larger Sets

52 Views Asked by At

Ok I am super stuck on this one. I need a function that does the following.

Takes in a list of list.

input = [['A'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'B'], ['X', 'Y'], ['A', 'B', 'C'], ['X'], ['A'], ['X', 'A', 'B']

Then reduce the list down to supersets while keeping count of the number of subsets that where rolled into the supersets. It would continue to do this until no more supersets could be made.

The rules are the order of items in the subsets have to stay in order and have to match in order in the larger superset. So for this input it would reduce down to in dictionary form:

output = {['A', 'B', 'C'] : 6, ['X', 'Y'] : 2, ['X', 'A', 'B'] : 5}.

Basically the superset counts itself and duplicates of itself and all subsets it is inclusive of So.... ['A', 'B', 'C'] = 6 (['A', 'B', 'C']*2 + ['A', 'B']*2 + ['A']*2), ['X', 'Y'] = 3 (['X', 'Y']*1 + ['X']*1), ['X', 'A', 'B'] = 5 (['X', 'A', 'B']*1 + ['A', 'B']*2 + ['X']*1 + ['A']*1)

The subsets that have been rolled up should not be part of the final output. Once they are rolled up they are destroyed essentially. The above example is hardcoded whereas my actual dataset contains 1000's of subsets.

If I am not explaining it clearly let me try again. Each of these sets represents a series of locations a person visited and in what order they visited them. What I am trying to do is figure out which supersets would represent these locations and subsequently the number of people that superset represents. So Jack when to Atlanta, then Chicago, then Miami. Jill went to Chicago then Miami. and Fred went to Atlanta. The superset Atlanta, Chicago, Miami would repent there are 3 people associated with the Atlanta, Chicago, Miami route. If there was a larger superset of Atlanta, Chicago, Miami, Denver. Then these 3 people would also be associated without route, and since that route is inclusive of Atlanta, Chicago, Miami I no longer need it. If there is a route Atlanta, Chicago, Miami, Seattle, these three people would also be associated with this route and the Atlanta, Chicago, Miami route is still no longer need. That might have made if more confusing if so, sorry. But that is the just of what I am trying to do.

This problem is killing me so if you can develop this function that would be super helpful. Happy to send some BTC (not much but a thank you amount) to the first one who gets it right.

I have tried a million combos and can wrap my head around how to proceed. Major fail also for GPT.

1

There are 1 best solutions below

3
Tim Roberts On

Assuming my rule about deciding the supersets is valid, here is code that does most of the work:

master = [['A'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'B'], ['X', 'Y'], ['A', 'B', 'C'], ['X'], ['A'], ['X', 'A', 'B']]

# Make them tuples so they can be keys.
master = list(map(tuple,master))
mlen = max(len(s) for s in master)
supers = {s:[] for s in master if len(s)==mlen}

def is_subset(sub,sup):
    return all(t in sup for t in sub)

for s in master:
    for t,v in supers.items():
        if is_subset(s,t):
            v.append(s)

from pprint import pprint
pprint(supers)

Output:

{('A', 'B', 'C'): [('A',),
                   ('A', 'B'),
                   ('A', 'B', 'C'),
                   ('A', 'B'),
                   ('A', 'B', 'C'),
                   ('A',)],
 ('X', 'A', 'B'): [('A',),
                   ('A', 'B'),
                   ('A', 'B'),
                   ('X',),
                   ('A',),
                   ('X', 'A', 'B')]}

You might be able to speed this up by converting things to sets instead of tuples, but performance is not likely to be an issue with code like this.

FOLLOWUP

Since you have not specified HOW to choose the supersets, we are forced to guess what you were thinking. Since you didn't like the length-based choice, I have attempted another one here.

In this case, I go through the whole list. If there is a set that is not a subset of any other set, then I make it a superset. The rest of the processing is identical.

master = [['A'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'B'], ['X', 'Y'], ['A', 'B', 'C'], ['X'], ['A'], ['X', 'A', 'B']]

# Make them tuples so they can be keys.
master = list(map(tuple,master))

def is_subset(sub,sup):
    return all(t in sup for t in sub)

# Find all the sets that are not subsets of anyone else.

supers = {}
for a in master:
    for b in master:
        if a == b:
            continue
        if is_subset(a,b):
            break
    else:
        supers[a] = []

for s in master:
    for t,v in supers.items():
        if is_subset(s,t):
            v.append(s)

from pprint import pprint
pprint(supers)

for k,v in supers.items():
    print(len(v), k)

Output:

{('A', 'B', 'C'): [('A',),
                   ('A', 'B'),
                   ('A', 'B', 'C'),
                   ('A', 'B'),
                   ('A', 'B', 'C'),
                   ('A',)],
 ('X', 'A', 'B'): [('A',),
                   ('A', 'B'),
                   ('A', 'B'),
                   ('X',),
                   ('A',),
                   ('X', 'A', 'B')],
 ('X', 'Y'): [('X', 'Y'), ('X',)]}
6 ('A', 'B', 'C')
2 ('X', 'Y')
6 ('X', 'A', 'B')