Sorted set searching and ordering

240 Views Asked by At

Python Sorted Containers package contains SortedSet, that can be initialized as follows:

__init__(iterable=None, key=None)

key: Function used to extract comparison key from values. Sorted set compares values directly when the key function is none.

I want to store some strings that should be sorted by timestamp. If I store tuples like ("string", 1234), where 1234 is some monotonically increasing number like the difference between now and epoch, how can I search for the string in the set, and also keep it sorted using the 2nd element of the tuple? In other words, the timestamp shouldn't be used for searching, and the string shouldn't be used for ordering.

2

There are 2 best solutions below

0
On BEST ANSWER

Keep the timestamps in an auxiliary dict, rather than in the SortedSet itself. For example, something like this:

import sortedcontainers

class StringTimes:
    def __init__(self):
        self.timestamps = {}
        self.strings = sortedcontainers.SortedSet(key=timestamps.__getitem__)

    def add(self, string, time):
        # If the element is already in the set, we need to get rid of it first, so we
        # don't corrupt the ordering when we update the timestamp.
        self.strings.discard(string)

        self.timestamps[string] = time
        self.strings.add(string)

    def remove(self, string):
        self.strings.remove(string)
        del self.timestamps[string]

    def __contains__(self, string):
        return string in self.strings

    def __iter__(self):
        return iter(self.strings)

Since we're using a separate dict with string keys, we don't actually need the set part of SortedSet, so we could save some memory and just use a SortedList. (If new timestamps are guaranteed to be monotonically increasing, as your comment suggests, then we don't need sortedcontainers at all. We can just rely on dict ordering. I wouldn't count on that assumption holding, though - it's too likely to break once you introduce multithreading or need to sync updates across a network or something.)

0
On

Why not use Pandas?

# pip install pandas
import pandas as pd

s = [('A', 3), ('B', 1), ('C', 2), ('D', 5), ('E', 4)]
df = pd.DataFrame(s, columns=['key', 'ts'])
print(df)

# Output
  key  ts
0   A   3
1   B   1
2   C   2
3   D   5
4   E   4

Sort by timestamps:

>>> df.sort_values('ts')
  key  ts
1   B   1
2   C   2
0   A   3
4   E   4
3   D   5

New values:

s1 = [('C', 6), ('A', 7), ('D', 8)]
df1 = pd.DataFrame(s1, columns=['key', 'ts'])
print(df1)

# Output
  key  ts
0   C   6
1   A   7
2   D   8

Merge new values:

>>> pd.concat([df, df1], ignore_index=True)
  key  ts
0   A   3
1   B   1
2   C   2
3   D   5
4   E   4
5   C   6
6   A   7
7   D   8

Remove duplicate keys:

>>> pd.concat([df, df1], ignore_index=True).drop_duplicates('key', keep='last')
  key  ts
1   B   1
4   E   4
5   C   6
6   A   7
7   D   8