How to filter a collection of objects by field value?

598 Views Asked by At

How in Python to organize and filter a collection of objects by a field value? I need to filter by being equal to an exact value and by being less than a value.

And how to do it effectively? If I store my objects in a list I need to iterate over a whole list, potentially holding hundreds of thousands of objects.

@dataclass
class Person:
  name: str
  salary: float
  is_boss: bool


# if to store objects in a list...
collection = [Person("Jack", 50000, 0), ..., Person("Jane", 120000, 1)]

# filtering in O(n), sloooooow
target = 100000
filtered_collection = [x for x in collection if salary < target]

PS: Actually my use case is group by by a certain field, i.e. is_boss and filter by another, i.e. salary. How to do that? Should I user itertools.groupby over sorted lists and make my objects comparable?

2

There are 2 best solutions below

2
On

Arrays have a sort method - All you have to do is create a function that detirmes if an object is greater than another object - let me show you

class Foo:
    def __init__(bar):
        this.bar = bar

fooArray = [Foo(10),Foo(8),Foo(9)]
def sortFoo(foo):
    return foo.bar

fooArray.sort(key=sortFoo)
7
On

If you maintain your list in sorted order (which ideally means few insertions or removals, because mid-list insertion/removal is itself O(n)), you can find the set of Persons below a given salary with the bisect module.

from bisect import bisect
from operator import attrgetter

# if to store objects in a list...
collection = [Person("Jack", 50000, 0), ..., Person("Jane", 120000, 1)]
collection.sort(key=attrgetter('salary'))  # O(n log n) initial sort

# filtering searches in O(log n):
target = 100000
filtered_collection = collection[:bisect(collection, target, key=attrgetter('salary'))]

Note: The key argument to the various bisect module functions is only supported as of 3.10. In prior versions, you'd need to define the rich comparison operators for Person in terms of salary and search for a faked out Person object, or maintain ugly separate sorted lists, one of salary alone, and a parallel list of the Person objects.

For adding individual elements to the collection, you could use bisect's insort function. Or you could just add a bunch of items to the end of the list in bulk and resort it on the same key as before (Python's sorting algorithm, TimSort, gets near O(n) performance when the collection is mostly in order already, so the cost is not as high as you might think).

I'll note that in practice, this sort of scenario (massive data that can be arbitrarily ordered by multiple fields) usually calls for a database; you might consider using sqlite3 (eventually switching to a more production-grade database like MySQL or PostGres if needed), which, with appropriate indexes defined, would let you do O(log n) SELECTs on any indexed field; you could convert to Person objects on extraction for the data you actually need to work with. The B-trees that true DBMS solutions provide get you O(log n) effort for inserts, deletes and selects on the index fields, where Python built-in collection types make you choose; only one of insertion/deletion or searching can be truly O(log n), while the other is O(n).