Fuzzy Searching a Column in Pandas


Is there a way to search for a value in a dataframe column using FuzzyWuzzy or a similar library? I'm trying to find a value in one column that corresponds to the value in another while taking fuzzy matching into account.

For example, if I have state names in one column and state codes in another, how would I find the state code for Florida ("FL") while allowing for abbreviations like "Flor"?

In other words, I want to match "Flor" against the state names and get the corresponding state code "FL".

Any help is greatly appreciated.

There is 1 best solution below

If the abbreviations are all prefixes, you can use the .startswith() string method against either the short or long version of the state.

>>> test_value = "Flor"
>>> test_value.upper().startswith("FL")
True
>>> "Florida".lower().startswith(test_value.lower())
True

However, if you have more complex abbreviations, difflib.get_close_matches will probably do what you want!

>>> import pandas as pd
>>> import difflib
>>> df = pd.DataFrame({"states": ("Florida", "Texas"), "st": ("FL", "TX")})
>>> df
    states  st
0  Florida  FL
1    Texas  TX
>>> difflib.get_close_matches("Flor", df["states"].to_list())
['Florida']
>>> difflib.get_close_matches("x", df["states"].to_list(), cutoff=0.2)
['Texas']
>>> df.loc[df["states"] == "Texas", "st"].iloc[0]
'TX'

You will probably want a try/except IndexError around reading the first element of the list returned by difflib (the list is empty when nothing clears the cutoff), and you may need to tweak the cutoff to get fewer false matches between similar state names (perhaps offer all the candidate states to the user, or require more letters for similar states).
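
As a rough illustration of that advice, the lookup can return every candidate and let the caller decide what to do with an empty result, instead of indexing `[0]` directly. The helper name `closest_names` and the two-state list are assumptions for the example; `n` and `cutoff` are the real keyword arguments of `difflib.get_close_matches`.

```python
import difflib

def closest_names(partial, names, cutoff=0.6, n=3):
    """Return up to `n` names ranked by similarity to `partial`.

    The list is empty when no name scores at least `cutoff`.
    """
    return difflib.get_close_matches(partial, names, n=n, cutoff=cutoff)

names = ["Florida", "Texas"]
candidates = closest_names("Flor", names)
# Guard against the empty list instead of catching IndexError.
best = candidates[0] if candidates else None
```

When `candidates` holds more than one name, showing the whole list to the user is one way to handle states that are close to each other.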

You may also see the best results by combining the two: test prefixes first, then fall back to the fuzzy match.

Putting it all together

import difflib

def state_from_partial(test_text, df, col_fullnames, col_shortnames):
    if len(test_text) < 2:
        raise ValueError("must have at least 2 characters")

    # if there are exactly two characters, try to match the short name directly
    # (`in` on a Series tests the index, so compare against .values instead)
    if len(test_text) == 2 and test_text.upper() in df[col_shortnames].values:
        return test_text.upper()

    states = df[col_fullnames].to_list()
    match = None
    # prefix matching is left disabled: it picks the wrong state for ambiguous
    # prefixes, at least for states starting with M or New
    #for state in states:
    #    if state.lower().startswith(test_text.lower()):
    #        match = state
    #        break  # leave loop and go on to look up the short name

    if not match:
        try:  # see if there's a fuzzy match
            match = difflib.get_close_matches(test_text, states)[0]  # cutoff=0.6
        except IndexError:
            pass  # consider matching against a list of problematic states with different cutoff

    if match:
        return df.loc[df[col_fullnames] == match, col_shortnames].iloc[0]

    raise ValueError("couldn't find a state matching partial: {}".format(test_text))

Beware of states which start with 'New' or 'M' (and probably others), which are all pretty close and will probably want special handling. Testing will do wonders here.
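
One possible form of that special handling is to collect every state sharing the prefix and only accept the input when exactly one remains. This is a sketch under assumptions: `prefix_candidates` and the shortened list of M states are hypothetical and only serve the example.

```python
def prefix_candidates(partial, names):
    """Return every name that starts with `partial`, ignoring case."""
    lowered = partial.lower()
    return [name for name in names if name.lower().startswith(lowered)]

m_states = ["Maine", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana"]

hits = prefix_candidates("Miss", m_states)
if len(hits) == 1:
    print("match:", hits[0])
else:
    # Zero or several candidates: ask for more letters or show the options.
    print("ambiguous or unknown, candidates:", hits)
```

With "Miss" both Mississippi and Missouri come back, so the caller can ask for more letters; "Missi" narrows it to a single state.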