I have three excel datasets that I am trying to merge. These three datasets all have names in column A, however across all three datasets there are punctuation, different spelling, different spaces, etc. I would like to merge these three datasets based on similar names in column A using FuzzyWuzzy.
NAME Age
Jason Kai 15
George Jameson 22
Michael C. Henry 21
Max Jones 61
Tom Reyes 46
NAME Gender
Jason K. M
George Jamson M
Michael Henry M
Max Jones M
Tom Reyes, M
NAME Height(inch)
Jason Ka 76
George Jameson 65
Michael Henry 68
M. Jones 60
Tom Reyes, 80
I would like the merged data to look like this:
NAME Age Gender Height(inch)
Jason Kai 15 M 76
George Jameson 22 M 65
Michael C. Henry 21 M 68
Max Jones 61 M 60
Tom Reyes 46 M 80
Whats the best way to go about this?
I tried a normal pd.merge but that only matches and merges the columns exactly
Here is a small example how this could be done. The names in my dataframes are not always in the same order which added an extra difficulty to this. The program loops through every name and adds values to the original dataframe if fuzzywuzzy finds a similarity score of above 75%.
Weakness would be if multiple names are very similar but belong to different persons. Here you could add some more needed accuracy or an additional fallback mechanism.