Help with a FuzzyLookup between two different CSV files (which contain company info).
CSV #1, data1.csv, has two columns (5 thousand rows):
Name, ID
CSV #2, data2.csv, has two columns (1.05 million rows):
LegalName, AcctNumber
The goal is to do a fuzzy lookup of Name against LegalName
(showing the two best-matching LegalNames, each with a match percentage, would be ideal).
I saw that there is the following library: https://github.com/seatgeek/fuzzywuzzy
Plus, the following GitHub repo shows a similar idea to what I want to do:
https://github.com/Cheukting/fuzzy-match-company-name
which uses pandas and FuzzyWuzzy.
Hers was helpful since she also filters out the most common words in company names (such as "a", "the", "LLC", "Ltd", "and", etc.).
The only thing is that, from my understanding, hers looks for duplicates within a single list, not across two different ones. So it would be ideal to apply the same concept across data1.csv and data2.csv.
Also, feel free to contribute a method that uses a different library! Any help is welcome :)
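To make the target concrete, here is a minimal sketch of the cross-file lookup I'm after, using only the standard library (difflib stands in for FuzzyWuzzy so the snippet has no dependencies; token_sort_score is only a rough approximation of fuzz.token_sort_ratio, and the names below are made-up sample data, not rows from my files):

```python
import difflib

def token_sort_score(a, b):
    """Order-insensitive 0-100 similarity, a rough stand-in for token_sort_ratio."""
    ta = " ".join(sorted(a.upper().split()))
    tb = " ".join(sorted(b.upper().split()))
    return int(round(100 * difflib.SequenceMatcher(None, ta, tb).ratio()))

def fuzzy_lookup(names, legal_names, top_n=2):
    """For each Name, return the top_n closest LegalNames with their scores."""
    results = {}
    for name in names:
        scored = [(ln, token_sort_score(name, ln)) for ln in legal_names]
        scored.sort(key=lambda p: p[1], reverse=True)
        results[name] = scored[:top_n]
    return results

# Made-up sample values standing in for the Name / LegalName columns:
names = ["ACME Ltd", "Widget Co"]
legal = ["ACME LIMITED", "THE WIDGET COMPANY", "ACME LTD", "GLOBEX LLC"]
for name, matches in fuzzy_lookup(names, legal).items():
    print(name, "->", matches)
```

In practice the two lists would come from the CSVs (e.g. via csv.DictReader on data1.csv and data2.csv), and a brute-force scan of 5k x 1.05M pairs would be very slow, which is where her grouping trick or a faster scorer library would come in.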
Below is her code:
The data we used is found at http://download.companieshouse.gov.uk/en_output.html. It is an openly licensed, publicly available dataset that contains a list of registered (limited liability) companies in Great Britain. (The version shown here is a snapshot from May 2018.)
import pandas as pd
pd.set_option('display.max_columns', 1000)
df = pd.read_csv("BasicCompanyDataAsOneFile-2020-10-01.csv")
df.head()
This is a huge table with lots of rows; it may take a while to load.
df.columns
df['RegAddress.PostTown'].value_counts().head(30)
Frequency of words
Since we have lots of companies, we will only use companies in Cambridge as an example.
First we find the 30 most common words in all the company names. Since we expect these words to repeat a lot even across companies that are not the same, we cannot match company names using them. Instead, we deduct from the matching score of a pair if any of these keywords is present in the names.
from collections import Counter
all_names = df['CompanyName'][df['RegAddress.PostTown']=='CAMBRIDGE'].unique()
names_freq = Counter()
for name in all_names:
    names_freq.update(str(name).split(" "))
key_words = [word for (word,_) in names_freq.most_common(30)]
print(key_words)
Show length
len(all_names)
Matching by Grouping
Then we group the names by their first character. As the list is too long, it would take forever to match them all at once (15889 x 15889 pairs to consider). The workaround is to match them by group, assuming that if two names do not match on the first character, it is unlikely that they are the same name.
all_main_name = pd.DataFrame(columns=['sort_gp','names','alias','score'])
all_names.sort()
all_main_name['names'] = all_names
all_main_name['sort_gp'] = all_main_name['names'].apply(lambda x: x[0])
Fuzzy Matching
Here, for each group, we use fuzzywuzzy.fuzz.token_sort_ratio to match the names. Unlike the basic fuzzywuzzy.fuzz.ratio, which uses Levenshtein distance to calculate the differences, it allows the tokens (words) in a name to swap order and still give a 'perfect' match. (ref: https://github.com/seatgeek/fuzzywuzzy)
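A quick illustration of why token order stops mattering, using only the standard library (difflib here is just a rough stand-in for FuzzyWuzzy's scorer, not her actual code):

```python
import difflib

def ratio(a, b):
    """Plain 0-100 sequence similarity."""
    return int(round(100 * difflib.SequenceMatcher(None, a, b).ratio()))

a = "CAMBRIDGE CONSULTANTS LIMITED"
b = "LIMITED CAMBRIDGE CONSULTANTS"
plain = ratio(a, b)  # penalised by the different word order
token_sort = ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))
print(plain, token_sort)  # token_sort is 100: same words, different order
```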
from fuzzywuzzy import fuzz
all_sort_gp = all_main_name['sort_gp'].unique()
def no_key_word(name):
    """Return True if the name contains none of the common keywords."""
    output = True
    for key in key_words:
        if key in name:
            output = False
    return output
for sortgp in all_sort_gp:
    this_gp = all_main_name.groupby(['sort_gp']).get_group(sortgp)
    gp_start = this_gp.index.min()
    gp_end = this_gp.index.max()
    for i in range(gp_start, gp_end + 1):
        # if self has no alias yet, assign it as its own alias
        if pd.isna(all_main_name['alias'].iloc[i]):
            all_main_name.loc[i, 'alias'] = all_main_name['names'].iloc[i]
            all_main_name.loc[i, 'score'] = 100
        # if a later name has no alias yet and fuzzy-matches, assign it this one's alias
        for j in range(i + 1, gp_end + 1):
            if pd.isna(all_main_name['alias'].iloc[j]):
                fuzz_score = fuzz.token_sort_ratio(all_main_name['names'].iloc[i], all_main_name['names'].iloc[j])
                # penalise matches that rely on the common keywords
                if not no_key_word(all_main_name['names'].iloc[j]):
                    fuzz_score -= 10
                if fuzz_score > 85:
                    all_main_name.loc[j, 'alias'] = all_main_name['alias'].iloc[i]
                    all_main_name.loc[j, 'score'] = fuzz_score
        if i % (len(all_names) // 10) == 0:
            print("progress: %.2f" % (100 * i / len(all_names)) + "%")
all_main_name.to_csv('company_in_cambridge.csv')
Preview
all_main_name[(all_main_name['names']!=all_main_name['alias']) & (all_main_name['alias'].notna())]
Thanks!
Expected answer:
all_main_name[(all_main_name['names']!=all_main_name['alias']) & (all_main_name['alias'].notna())]
sort_gp, names, alias, score
955, A, AMADEUS EII LP, AMADEUS EI LP, 96
956, A, AMADEUS EIII LP, AMADEUS EI LP, 93
957, A, AMADEUS GI LP, AMADEUS EI LP, 92
958, A, AMADEUS HI LP, AMADEUS EI LP, 92
960, A, AMADEUS II 'A', AMADEUS I, 86
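For what it's worth, the same first-character grouping trick she uses could also be applied across my two files, so each Name is only compared against the LegalNames in its bucket instead of all 1.05 million. A hedged stdlib sketch of that idea (the names below are sample data, and difflib again stands in for FuzzyWuzzy's scorer):

```python
import difflib
from collections import defaultdict

def token_sort_score(a, b):
    """Order-insensitive 0-100 similarity (rough stand-in for token_sort_ratio)."""
    ta = " ".join(sorted(a.upper().split()))
    tb = " ".join(sorted(b.upper().split()))
    return int(round(100 * difflib.SequenceMatcher(None, ta, tb).ratio()))

def blocked_lookup(names, legal_names, top_n=2):
    """Match each name only against legal names sharing its first character."""
    buckets = defaultdict(list)
    for ln in legal_names:
        if ln:
            buckets[ln[0].upper()].append(ln)
    out = {}
    for name in names:
        candidates = buckets.get(name[0].upper(), []) if name else []
        scored = sorted(((c, token_sort_score(name, c)) for c in candidates),
                        key=lambda p: p[1], reverse=True)
        out[name] = scored[:top_n]
    return out

matches = blocked_lookup(["ACME LTD"], ["ACME LIMITED", "AARDVARK PLC", "BRAVO LTD"])
print(matches)
```

The blocking assumption (first characters agree) will miss matches like "THE ACME LTD" vs "ACME LIMITED", so it trades some recall for a big cut in comparisons, just as her notebook notes.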