sortedcontainers.SortedKeyList: Return Element with Matching Value of Key


I am using sortedcontainers SortedKeyList to store a list of dictionaries. The dictionaries are sorted within the list by the value of a (nested) key (i.e., "token_str").

The SortedKeyList appears to be properly stored in self.tokenized_files (e.g., as shown below), which has been verified via the debugger.

However, when I attempt to retrieve an element (i.e., dictionary) from the list with a given value of the key (i.e., token_str), I get TypeError: string indices must be integers, not 'str'.

How can I retrieve an element (i.e., dictionary) from a SortedKeyList using the value of the nested key? Preferably, using an efficient search algorithm, such as bisect, rather than iterating through the entire list.

self.sorted_index_by_token_str = None
self.tokenized_files = SortedKeyList([
    {'full_path': '/path/to/file/mdb_00033k__filename1.pdf', 'tokenized_filename': {'basic_filename': 'filename1.pdf', 'token': {'token_str': 'mdb_00033k'}}},
    {'full_path': '/path/to/file/mdb_0027zz__filename2.pdf', 'tokenized_filename': {'basic_filename': 'filename2.pdf', 'token': {'token_str': 'mdb_0027zz'}}},
])


def generate_index(self) -> SortedKeyList:
    """Creates an index comprising a list of dicts of tokenized files sorted by token_str.
    """

    self.sorted_index_by_token_str = SortedKeyList(
        self.tokenized_files,
        key=lambda x: x["tokenized_filename"]["token"]["token_str"],
    )
    # Evaluate: x["tokenized_filename"]["token"]["token_str"]
    # Returns: 'mdb_0027zz'
    return self.sorted_index_by_token_str


def find_tokenized_file(self) -> dict:
    """Finds a tokenized file in the index by performing a binary search on token_str.
    """

    test = self.sorted_index_by_token_str.index("mdb_0027zz")


# ERROR:
# test = self.sorted_index_by_token_str.index("mdb_0027zz")
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# key = self._key(value)
# ^^^^^^^^^^^^^^^^
# key=lambda x: x["tokenized_filename"]["token"]["token_str"],
# ~^^^^^^^^^^^^^^^^^^^^^^
# TypeError: string indices must be integers, not 'str'

There is 1 best solution below


The lookup value needs to be a dictionary with the same nested structure as the stored elements, not the bare token_str as I had believed, because SortedKeyList applies the key function to whatever value is passed to index or bisect_left. The bisect_left method then returns the insertion index, and a subsequent check confirms whether the element at that index actually has the requested token_str.


        self.sorted_index_by_token_str = SortedKeyList(
            key=lambda x: x["tokenized_filename"]["token"]["token_str"]
        )

        key = {"tokenized_filename": {"token": {"token_str": token_str}}}
        index = self.sorted_index_by_token_str.bisect_left(key)

        # Check if the token_str is found at the returned index
        if (
            index < len(self.sorted_index_by_token_str)
            and self.sorted_index_by_token_str[index]["tokenized_filename"]["token"][
                "token_str"
            ]
            == token_str
        ):
            return self.sorted_index_by_token_str[index]

        # token_str not found
        return None
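
To illustrate why the bare string fails while the nested probe works, here is a minimal, self-contained sketch. The names token_key and make_entry are hypothetical helpers introduced only for this demo, and the entries are trimmed to the fields the key function reads.

```python
from sortedcontainers import SortedKeyList


def token_key(entry):
    # Same lookup path as the key lambda in the question.
    return entry["tokenized_filename"]["token"]["token_str"]


def make_entry(token_str):
    # Hypothetical helper: builds a minimal entry with only the nested token.
    return {"tokenized_filename": {"token": {"token_str": token_str}}}


index = SortedKeyList(
    [make_entry("mdb_0027zz"), make_entry("mdb_00033k")],
    key=token_key,
)

# index() calls the key function on its argument, so a bare string ends up
# being subscripted with a string and raises TypeError.
try:
    index.index("mdb_0027zz")
    raised_type_error = False
except TypeError:
    raised_type_error = True

# A probe with the same nested shape passes through the key function cleanly.
position = index.bisect_left(make_entry("mdb_0027zz"))
```

Note that bisect_left only reports where the value would be inserted, so the equality check shown above is still needed to tell a hit from a miss.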

OR, using the SortedKeyList.bisect_key_left method (per @user2357112), which takes the key value directly, so no nested probe dictionary is needed:

index = self.sorted_index_by_token_str.bisect_key_left(token_str)

# Check if the token_str is found at the returned index
if (
    index < len(self.sorted_index_by_token_str)
    and self.sorted_index_by_token_str[index]["tokenized_filename"]["token"][
        "token_str"
    ]
    == token_str
):
    return self.sorted_index_by_token_str[index]

# token_str not found
return None
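
Putting the bisect_key_left variant together, a runnable sketch might look like the following. The function name find_by_token and the module-level token_key are assumptions made for this example, not part of the original class; the sample data is taken from the question.

```python
from typing import Optional

from sortedcontainers import SortedKeyList


def token_key(entry: dict) -> str:
    # Same lookup path as the key lambda in the question.
    return entry["tokenized_filename"]["token"]["token_str"]


def find_by_token(index: SortedKeyList, token_str: str) -> Optional[dict]:
    """Return the entry whose token_str matches, or None if absent."""
    # bisect_key_left takes the key value itself, not a full element.
    pos = index.bisect_key_left(token_str)
    if pos < len(index) and token_key(index[pos]) == token_str:
        return index[pos]
    return None


files = SortedKeyList(
    [
        {
            "full_path": "/path/to/file/mdb_00033k__filename1.pdf",
            "tokenized_filename": {
                "basic_filename": "filename1.pdf",
                "token": {"token_str": "mdb_00033k"},
            },
        },
        {
            "full_path": "/path/to/file/mdb_0027zz__filename2.pdf",
            "tokenized_filename": {
                "basic_filename": "filename2.pdf",
                "token": {"token_str": "mdb_0027zz"},
            },
        },
    ],
    key=token_key,
)

hit = find_by_token(files, "mdb_0027zz")
miss = find_by_token(files, "mdb_missing")
```

If several entries can share a token_str, this returns only the first of them; the remaining matches sit immediately after it in the list.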