Sorting with equivalence classes in Python

Question

700 Views Asked by Draconis At 20 July 2019 at 20:56

Suppose I have a custom data structure Data that reveals two relevant properties: tag indicates which equivalence class this item belongs in, and rank indicates how good this item is.

I have an unordered set of Data objects, and want to retrieve the n objects with the highest rank—but with at most one object from each equivalence class.

(Objects in the same equivalence class don't necessarily compare equal, and don't necessarily have the same rank, but I don't want any two elements in my output to come from the same class. In other words, the relation that produces these equivalence classes isn't ==.)

My first approach looks something like this:

1. Sort the list by descending rank
2. Create an empty set s
3. For each element in the list:
   - Check if its tag is in s; if so, move on
   - Add its tag to s
   - Yield that element
   - If we've yielded n elements, stop
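
The steps above can be sketched as a generator. This is only a minimal sketch: it assumes each item exposes tag and rank attributes as described, and the function name top_n_distinct is hypothetical.

```python
from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))


def top_n_distinct(items, n):
    """Yield up to n items in descending rank order, at most one per tag."""
    if n <= 0:
        return
    seen = set()  # tags that have already been yielded
    for item in sorted(items, key=lambda d: d.rank, reverse=True):
        if item.tag in seen:
            continue  # a higher-ranked item from this class was already yielded
        seen.add(item.tag)
        yield item
        if len(seen) == n:
            return


items = {Data('a', 200), Data('b', 150), Data('a', 100), Data('c', 10)}
result = list(top_n_distinct(items, 2))
print(result)
```

Because it is a generator, the caller can also stop early without the function needing to know n in advance, for example with itertools.islice.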

However, this feels awkward, like there should be some better way (potentially using itertools and higher-order functions). The order of the resulting n elements isn't important.

What's the Pythonic solution to this problem?

Toy example:

from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))
n = 3

algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }
expected_output = { Data('a', 200), Data('b', 50), Data('c', 10) }

There are 4 best solutions below

logicOnAbstractions On 20 July 2019 at 21:41

If it's a class definition you control, I believe the most Pythonic way would be this:

from random import shuffle

class Data:

    def __init__(self, order=1):
        self.order = order

    def __repr__(self):
        return "Order: " + str(self.order)

if __name__ == '__main__':
    # Build ten objects with distinct 'order' values, then shuffle them
    d = [Data(order=i) for i in range(10)]
    shuffle(d)

    print(d)

    # Sort by the 'order' attribute of each object
    print(sorted(d, key=lambda data: data.order))

Output:

[Order: 5, Order: 2, Order: 6, Order: 0, Order: 4, Order: 7, Order: 3, Order: 9, Order: 1, Order: 8]
[Order: 0, Order: 1, Order: 2, Order: 3, Order: 4, Order: 5, Order: 6, Order: 7, Order: 8, Order: 9]

So essentially, add an attribute to sort by to the class. Define the string representation (just to make it easier to see what's going on). Then use Python's sorted() on a list of those objects, with a lambda function as the key to indicate the attribute that each object should be sorted by.

Note: comparison must be defined for the type of that attribute; here it's an int, which already supports it. If the attribute's type does not define comparison, you would have to implement __lt__, __gt__, etc. for that type. See the docs for details.
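
As an illustration of that note, here is a minimal sketch of an attribute type that defines its own ordering. The Priority class and its level names are hypothetical and not part of the original answer. sorted() only needs __lt__; functools.total_ordering derives the remaining comparisons from __eq__ and __lt__.

```python
from functools import total_ordering


@total_ordering
class Priority:
    """Hypothetical rank type, ordered by position in a fixed list of levels."""
    LEVELS = ('low', 'medium', 'high')

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, Priority):
            return NotImplemented
        return self.name == other.name

    def __lt__(self, other):
        if not isinstance(other, Priority):
            return NotImplemented
        return self.LEVELS.index(self.name) < self.LEVELS.index(other.name)

    def __repr__(self):
        return 'Priority(%r)' % self.name


class Data:
    def __init__(self, order):
        self.order = order


d = [Data(Priority('high')), Data(Priority('low')), Data(Priority('medium'))]
ordered = [data.order.name for data in sorted(d, key=lambda data: data.order)]
print(ordered)
```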

jferard On 20 July 2019 at 22:04

Create a dict max_by_tag that stores the item with the max rank by tag:

>>> from collections import namedtuple
>>> Data = namedtuple('Data', ('tag', 'rank'))
>>> n = 3
>>> algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }
>>> max_by_tag = {}
>>> for item in algorithm_input:
...     if item.tag not in max_by_tag or item.rank > max_by_tag[item.tag].rank:
...         max_by_tag[item.tag] = item

>>> max_by_tag
{'a': Data(tag='a', rank=200), 'b': Data(tag='b', rank=50), 'c': Data(tag='c', rank=10), 'd': Data(tag='d', rank=5)}

Then use the heapq module:

>>> import heapq
>>> heapq.nlargest(n, max_by_tag.values(), key=lambda data: data.rank)
[Data(tag='a', rank=200), Data(tag='b', rank=50), Data(tag='c', rank=10)]
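
The two steps can be combined into one helper. This is a sketch only: the function name top_n_by_class is hypothetical, and it assumes every item exposes tag and rank. It avoids sorting the whole input, since building the dict takes one pass over the items and heapq.nlargest then only looks at one candidate per tag.

```python
import heapq
from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))


def top_n_by_class(items, n):
    """Return the n highest-ranked items, at most one per tag."""
    max_by_tag = {}
    for item in items:
        best = max_by_tag.get(item.tag)
        # Keep only the highest-ranked item seen so far for each tag
        if best is None or item.rank > best.rank:
            max_by_tag[item.tag] = item
    return heapq.nlargest(n, max_by_tag.values(), key=lambda d: d.rank)


# Ranks of different classes interleave here, which a plain sort-and-slice would get wrong
items = [Data('a', 200), Data('b', 150), Data('a', 100), Data('b', 300), Data('c', 10)]
top = top_n_by_class(items, 2)
print(top)
```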

Sunitha On 20 July 2019 at 22:40

Store the sorted input in an OrderedDict (with tag as the key and the Data item as the value). This results in only one Data from each equivalence class being stored in the OrderedDict.

>>> from collections import namedtuple, OrderedDict
>>> Data = namedtuple('Data', ('tag', 'rank'))
>>> n = 3
>>> algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }
>>> 
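>>> # Sort by ascending rank so that, for each tag, the highest-ranked item is the one kept
>>> by_tag = OrderedDict((d.tag, d) for d in sorted(algorithm_input, key=lambda d: d.rank))
>>> # Then take the n highest-ranked survivors
>>> set(sorted(by_tag.values(), key=lambda d: d.rank, reverse=True)[:n])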
{Data(tag='b', rank=50), Data(tag='a', rank=200), Data(tag='c', rank=10)}

Andrej Kesely On 20 July 2019 at 21:32 (Accepted Answer)

You could use itertools.groupby. First we sort the items by your criteria and then group them by tag (keeping only the first item from each group):

from itertools import groupby
from collections import namedtuple

Data = namedtuple('Data', ('tag', 'rank'))

n = 3

algorithm_input = { Data('a', 200), Data('a', 100), Data('b', 50), Data('c', 10), Data('d', 5) }

# 1. sort the data by tag (ascending), then by rank (descending), so that each
#    tag's items are adjacent (groupby only merges consecutive items), best first
s = sorted(algorithm_input, key=lambda k: (k.tag, -k.rank))

# 2. group the data by tag and keep the first (highest-ranked) item from each group
best = [next(g) for _, g in groupby(s, lambda k: k.tag)]

# 3. keep the 'n' highest-ranked of those items
out = sorted(best, key=lambda k: k.rank, reverse=True)[:n]

print(out)

Prints:

[Data(tag='a', rank=200), Data(tag='b', rank=50), Data(tag='c', rank=10)]

EDIT: Changed the sorting key.
