I am trying to do fuzzy matching between two columns: each value in col="co_zip22" is compared against all the rows in col="co_zip23" to find the best match along with a match score. co_zip is a key I created by combining the company name and zip columns, and I am trying to find out whether a company from 2022 is present in our 2023 records or not.
I have made a file with two columns, co_zip22 and co_zip23, to do the fuzzy match. We don't have any unique identifiers, so I created a string from the company name and zip. Below is my code. It works fine on a small number of records, but on the full data set it just keeps running, and it has been running for 2 days now.
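import pandas as pd
from fuzzywuzzy import process  # assumed source of process.extract; thefuzz exposes the same call

similarity = []
for i in df.co_zip22:  # full data set
    # best single match for this 2022 key among all 2023 keys
    ratio = process.extract(i, df.co_zip23, limit=1)
    similarity.append(ratio[0][1])  # keep only the match score
df['similarity'] = pd.Series(similarity)
df.head(3)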
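
The loop above scores every 2022 key against every 2023 key, so the work grows with the product of the two column lengths. One possible way to cut that down is sketched below, under assumptions that may not hold for the real data: it assumes each key looks like "<company name>_<zip>" and that a company keeps the same zip in both years, so a 2022 key only needs to be compared with 2023 keys in the same zip. The helper names (`zip_of`, `score`, `best_scores`) and the sample keys are hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for the fuzzy scorer, so the numbers will differ from what `process.extract` returns.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def zip_of(key):
    # ASSUMPTION: keys look like "<company name>_<zip>"; adjust to the real format.
    return key.rsplit("_", 1)[-1]


def score(a, b):
    # Stand-in scorer from the standard library, scaled to 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_scores(keys_22, keys_23):
    # Bucket the distinct 2023 keys by zip so each 2022 key is compared
    # only against candidates that share its zip.
    buckets = defaultdict(list)
    for key in set(keys_23):
        buckets[zip_of(key)].append(key)

    cache = {}  # repeated 2022 keys are scored once
    result = []
    for key in keys_22:
        if key not in cache:
            candidates = buckets.get(zip_of(key), [])
            # No candidate in the same zip gives a score of 0.
            cache[key] = max((score(key, c) for c in candidates), default=0)
        result.append(cache[key])
    return result


# Hypothetical sample keys, for illustration only.
keys_22 = ["acme corp_10001", "globex inc_60601", "initech_73301"]
keys_23 = ["acme corporation_10001", "globex inc_60601", "umbrella_94105"]
similarity = best_scores(keys_22, keys_23)
```

With the real data, the two lists would come from the co_zip22 and co_zip23 columns (with missing values dropped), and the result could be assigned back with df['similarity'] = pd.Series(similarity) as in the original code. The trade-off is that a company whose zip changed between years gets a score of 0 here, so this only fits if the zip part of the key is reliable.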