Run a query against all values within nested lists of a multi-valued dictionary


I have a 'collections.defaultdict' (see x below) that is a multi-valued dictionary. All values associated with each unique key are stored in a list.

    >>>x
    defaultdict(<type 'list'>, {'a': ['aa', 'ab', 'ac'], 'b': ['ba', 'bc'], 'c': ['ca', 'cb', 'cc', 'cd']})

I want to use the Python fuzzywuzzy package in order to search a target string against all the values nested in the multi-valued dictionary and return the top 5 matches based on fuzzywuzzy's built-in edit distance formula.

    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process
    query = 'bc'
    choices = x
    result = process.extract(query, choices, limit=5)

I will then take the closest match (the value with the highest fuzz ratio score) and identify which key that value is associated with. In this example, the closest match is of course 'bc' and the associated key is 'b'.

My question is: How do I run the fuzzywuzzy query against all of the values within the nested lists of the dictionary? When I run the fuzzywuzzy process above, I get a TypeError: expected string or buffer.


There are 2 best solutions below

Best answer:

To get all the values from the lists in your dictionary as one flat sequence, import chain:

    from itertools import chain

and change the line

    choices = x

to

    choices = chain.from_iterable(x.values())

If the same value can appear under more than one key in your real data, consider wrapping this in set(...) so that each value is scored only once.

result:

[('bc', 100), ('ba', 50), ('ca', 50), ('cb', 50), ('cc', 50)]
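
As a rough illustration of the de-duplication suggestion, here is a minimal sketch that flattens the nested lists and drops repeated values while keeping first-seen order. The helper name flatten_unique and the extra 'ab' under key 'c' are assumptions added for the example; they are not part of the original data.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical data: same shape as x in the question, but with 'ab'
# deliberately repeated under two keys to show the overlap case.
x = defaultdict(list, {
    'a': ['aa', 'ab', 'ac'],
    'b': ['ba', 'bc'],
    'c': ['ca', 'cb', 'ab'],
})


def flatten_unique(multi):
    """Return each value from the nested lists once, in first-seen order."""
    seen = set()
    unique = []
    for value in chain.from_iterable(multi.values()):
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


# A plain list of strings, suitable as the choices argument.
choices = flatten_unique(x)
```

A plain set(...) would also work if the order of the choices does not matter.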

Second answer:

You could do this as follows:

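    from collections import defaultdict
    from itertools import chain

    from fuzzywuzzy import process

    x = defaultdict(list, {'a': ['aa', 'ab', 'ac'], 'b': ['ba', 'bc'], 'c': ['ca', 'cb', 'cc', 'cd']})
    query = 'bc'

    # Build a reverse lookup: value -> list of keys whose list contains it
    reverse = defaultdict(list)
    for k1, v1 in x.items():
        for v2 in v1:
            reverse[v2].append(k1)

    # Flatten all the nested lists into one list of candidate strings
    choices = list(chain.from_iterable(x.values()))

    # extractOne returns the best (value, score) pair
    match = process.extractOne(query, choices)

    print(match[0])
    print(reverse[match[0]])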

This would display:

bc
['b']

It first creates an inverse of your dictionary, which makes it easy to find the key(s) for whichever entry fuzzywuzzy matches. It then builds a list of all of the values and passes it to extractOne. The returned match can then be looked up in the reversed dictionary to display a list of all of the keys containing it. If bc were found in more than one of your lists, it would display all of those keys.
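
The same reverse-lookup idea can be extended to the "top 5 matches" part of the question. The sketch below is only an illustration: it uses difflib.SequenceMatcher from the standard library as a stand-in scorer so that it runs without fuzzywuzzy installed, and the names score and top_matches are hypothetical. The scores it produces are not guaranteed to match fuzzywuzzy's.

```python
from collections import defaultdict
from difflib import SequenceMatcher

x = defaultdict(list, {
    'a': ['aa', 'ab', 'ac'],
    'b': ['ba', 'bc'],
    'c': ['ca', 'cb', 'cc', 'cd'],
})


def score(a, b):
    # Stand-in scorer (assumption): difflib similarity scaled to 0-100.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def top_matches(query, multi, limit=5):
    """Return up to `limit` (value, score, keys) tuples, best score first."""
    # Reverse lookup: value -> list of keys whose list contains it.
    reverse = defaultdict(list)
    for key, values in multi.items():
        for value in values:
            reverse[value].append(key)

    # Score each distinct value once, then sort by score, highest first.
    scored = [(value, score(query, value), keys)
              for value, keys in reverse.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


results = top_matches('bc', x)
best_value, best_score, best_keys = results[0]
print(best_value, best_score, best_keys)
```

With fuzzywuzzy available, fuzz.ratio could presumably replace the body of score; the rest of the sketch would stay the same.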