Fastest way to fuzzy match strings between two pandas data frames


I have two data frames, each with a column of names:

df1["name"]   -> 3000 rows

df2["name"]   -> 64000 rows

I am using fuzzywuzzy to get the best match in df2 for each df1 entry, using the following code:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

matches = [process.extract(x, df2["name"], limit=1) for x in df1["name"]]

But this is taking forever to finish. Is there any faster way to do the fuzzy matching of strings in pandas?


There are 2 best solutions below

Best answer

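One change you can make is to use a generator expression: round brackets instead of square brackets. This does not reduce the total matching work, but each match is computed only when you consume it, so the first results arrive immediately and the full result list is never held in memory at once.

matches = (process.extract(x, df2["name"], limit=1) for x in df1["name"])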
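To see what the round brackets actually change, here is a minimal sketch. It assumes the standard library's difflib.SequenceMatcher as a stand-in scorer so that it runs without third-party packages; this is not fuzzywuzzy and its scores will differ. The best_match helper, the calls counter, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher

calls = 0  # counts how many pairwise comparisons have actually run


def best_match(query, choices):
    """Return the (choice, score) pair most similar to query.

    Stand-in for process.extract(..., limit=1); an assumption, not fuzzywuzzy.
    """
    global calls
    best = None
    for choice in choices:
        calls += 1
        score = SequenceMatcher(None, query, choice).ratio()
        if best is None or score > best[1]:
            best = (choice, score)
    return best


# Hypothetical stand-ins for df1["name"] and df2["name"].
queries = ["jon smith", "alice jonson"]
choices = ["john smith", "alice johnson", "bob brown"]

# Building the generator does no matching work at all.
lazy = (best_match(q, choices) for q in queries)
calls_before = calls

# Consuming it performs every comparison, the same as the list version would.
results = list(lazy)
calls_after = calls
```

The counter stays at zero until the generator is consumed, and then reaches the same total as a list comprehension would. The generator helps with memory and with time to first result, not with total running time.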

Edit: one more suggestion: you can parallelize the operation with the multiprocessing library.
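A rough sketch of the parallel approach follows, under some assumptions. It uses concurrent.futures.ProcessPoolExecutor from the standard library (a wrapper around multiprocessing) and difflib.SequenceMatcher as a stand-in scorer so that it runs without fuzzywuzzy installed. The function names best_match and parallel_best_matches, and the workers and chunksize defaults, are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def best_match(query, choices):
    """Return (query, best_choice, score) for one query string.

    difflib is a stand-in scorer here (an assumption, not fuzzywuzzy).
    """
    scored = ((c, SequenceMatcher(None, query, c).ratio()) for c in choices)
    choice, score = max(scored, key=lambda pair: pair[1])
    return query, choice, score


def parallel_best_matches(queries, choices, workers=2, chunksize=64):
    """Match every query against choices, spreading queries over processes."""
    choices = list(choices)
    # best_match is a module-level function, so the partial can be pickled
    # and sent to worker processes.
    worker = partial(best_match, choices=choices)
    if workers <= 1:
        # Serial fallback: same results, no process start-up cost.
        return [worker(q) for q in queries]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # The choices list travels with each chunk of queries, so a larger
        # chunksize means fewer, bigger tasks and less pickling overhead.
        return list(pool.map(worker, queries, chunksize=chunksize))
```

With the question's data this would be called as parallel_best_matches(df1["name"], df2["name"], workers=4), placed under an if __name__ == "__main__": guard so that worker processes can import the script safely. Each query is independent of the others, which is why the work splits cleanly across processes.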

Second answer

You can use Python's multiprocessing module to speed it up. pandas doesn't use multiple cores by default.