How can I compare one column of a dataframe to multiple other columns using SequenceMatcher?

81 Views Asked by At

I have a dataframe with 6 columns, the first two are an id and a name column, the remaining 4 are potential matches for the name column.

id name match1 match2 match3 match4

 id                                          name                                       match1                          match2                                        match3                       match4
  1                            NXP Semiconductors                                          NaN                             NaN                                           NaN                          NaN
  2 Cincinnati Children's Hospital Medical Center                    Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center      SSM Health SLU Hospital
  3                     Seminole Tribe of Florida The State Board of Administration of Florida                             NaN                                           NaN                          NaN
  4                             Miami-Dade County                               County of Will                County of Orange                                           NaN                          NaN
  5                      University of California             California Teacher's Association                 Yale University                         University of Toronto University System of Georgia
  6                        Bon Appetit Management                             Waste Management                Sculptor Capital                                           NaN                          NaN

I'd like to use SequenceMatcher to compare the name column with each match column if there is a value and return the match value with the highest ratio, or closest match, in a new column at the end of the dataframe.

So the output would be something like this:

 id                                          name                                       match1                          match2                                        match3                       match4                                    best match
  1                            NXP Semiconductors                                          NaN                             NaN                                           NaN                          NaN                                           NaN
  2 Cincinnati Children's Hospital Medical Center                    Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center      SSM Health SLU Hospital Cincinnati Children's Hospital Medical Center
  3                     Seminole Tribe of Florida The State Board of Administration of Florida                             NaN                                           NaN                          NaN  The State Board of Administration of Florida
  4                             Miami-Dade County                               County of Will                County of Orange                                           NaN                          NaN                              County of Orange
  5                      University of California             California Teacher's Association                 Yale University                         University of Toronto University System of Georgia              California Teacher's Association
  6                        Bon Appetit Management                             Waste Management                Sculptor Capital                                           NaN                          NaN                              Waste Management

I've gotten the data into the dataframe and have been able to compare one column to a single other column using the apply method:

df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x[0].strip(), x[1].strip()).ratio(), axis=1) 

However, I'm not sure how to loop over multiple columns in the same row. I also thought about trying to reformat my data so it that the method above would work, something like this:

name     match
name1    match1
name1    match2
name1    match3

However, I was running into issues dealing with the NaN values. Open to suggestions on the best route to accomplish this.

1

There are 1 best solutions below

0
On

I ended up solving this using the second idea of reformatting the table. Using the melt function I was able to get a two column table of the name field with each possible match. From there I used the original lambda function to compare the two columns and output a ratio. From there it was relatively easy to go through and see the most likely matches, although it did require some manual effort.

df = pd.read_csv('output.csv')
df1 = df.melt(id_vars = ['id', 'name'], var_name = 'match').dropna().drop('match',1).sort_values('name')
df1['diff'] = df1.apply(lambda x: diff.SequenceMatcher(None, x[1].strip(), x[2].strip()).ratio(), axis=1) 
df1.to_csv('comparison-output.csv', encoding='utf-8')