How to use FuzzyWuzzy to match and merge two datasets in Python

72 Views Asked by At

I have three excel datasets that I am trying to merge. These three datasets all have names in column A, however across all three datasets there are punctuation, different spelling, different spaces, etc. I would like to merge these three datasets based on similar names in column A using FuzzyWuzzy.

NAME            Age 
Jason Kai         15
George Jameson    22
Michael C. Henry  21
Max Jones         61
Tom Reyes         46 

NAME        Gender 
Jason K.         M
George Jamson    M
Michael  Henry   M
Max Jones        M
Tom Reyes,       M

NAME     Height(inch)
Jason Ka         76
George Jameson   65
Michael Henry    68
M. Jones         60
Tom Reyes,       80


I would like the merged data to look like this:

NAME             Age    Gender      Height(inch)
Jason Kai         15         M           76
George Jameson    22         M           65
Michael C. Henry  21         M           68
Max Jones         61         M           60
Tom Reyes         46         M           80
 

Whats the best way to go about this?

I tried a normal pd.merge but that only matches and merges the columns exactly

1

There are 1 best solutions below

0
On

Here is a small example how this could be done. The names in my dataframes are not always in the same order which added an extra difficulty to this. The program loops through every name and adds values to the original dataframe if fuzzywuzzy finds a similarity score of above 75%.

import numpy as np
from fuzzywuzzy import fuzz
import pandas as pd

df1=pd.read_csv("df1")
df2=pd.read_csv("df2")
df3=pd.read_csv("df3")
df1["Gender"]=np.nan
df1["Height(inch)"]=np.nan
sim_score=75

for row_counter in range(df1.shape[0]):
    searched_name = df1.iloc[row_counter]["NAME"]
    genders = df2.apply(lambda x: x["Gender"] if fuzz.partial_ratio(x['NAME'], searched_name) > sim_score else np.NAN , axis=1)
    heigths = df3.apply(lambda x: x["Height(inch)"] if fuzz.partial_ratio(x['NAME'], searched_name) > sim_score else np.NAN, axis=1)
    df1["Gender"].fillna(genders, inplace=True)
    df1["Height(inch)"].fillna(heigths, inplace=True)
print(df1)

Weakness would be if multiple names are very similar but belong to different persons. Here you could add some more needed accuracy or an additional fallback mechanism.

-------df1------
NAME,Age
Jason Kai,         15
George Jameson,    22
Michael C. Henry,  21
Max Jones,         61
Tom Reyes,         46
------df2------
NAME,Gender
Michael  Henry,   M
Jason K,          F
Max Jones,        M
Toms Reyes,       M
George Jamson,    D
------df3-----
NAME,Height(inch)
Jason Ka,         76
M. Jones,        60
George Jameson,   65
Michael Henry,    68
Tom Reyes!,      80


               NAME  Age       Gender  Height(inch)
0         Jason Kai   15            M          76.0
1    George Jameson   22            F           NaN
2  Michael C. Henry   21            M          65.0
3         Max Jones   61            M          68.0
4         Tom Reyes   46            D          80.0