I am trying to compare two databases. One has 70,000 rows of client names and the other holds local case dockets with about 13,000 rows. In the docket database there are two columns I wish to compare against the client names: plaintiffs and defendants. Unfortunately, the names of these parties are often slightly different from what we hold in our own database. For example, the docket_df plaintiff value may appear as 'Stack Overflow.com INC' while our client list records them as 'StackOverflow.com Incorporated'. After researching approximate matching, rapidfuzz appeared to be a good fit for what I need done, but I am certainly open to trying another library if it turns out not to have the processing capability this project requires.
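For context, this is the kind of quick check I ran with rapidfuzz on the example pair above; different scorers give noticeably different numbers on names like this, and the choice of scorers here is just illustrative:

from rapidfuzz import fuzz

a = 'Stack Overflow.com INC'
b = 'StackOverflow.com Incorporated'
print(fuzz.ratio(a, b))            # plain normalized edit-distance similarity
print(fuzz.token_set_ratio(a, b))  # compares token sets, ignoring word order and duplicates
print(fuzz.WRatio(a, b))           # weighted combination of several scorers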
I have attempted to implement rapidfuzz for approximate matching, but something appears to be bottlenecking and slowing the processing down significantly. I also implemented parallel processing to help with the size of the data. The print statement print('Beginning Parallel Processing:') appears as expected, and everything seems to work as anticipated up to that point; the stall happens once parallel processing begins. I will mark the location in my code. How would I optimize my code to achieve the most efficient comparison of the two databases using approximate matching techniques? I am also trying to become better at debugging, so any suggestions for better practices are welcome -- especially anything related to fixing this code. Code below:
Libs:
from rapidfuzz import fuzz, process
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import re
import numpy as np
import pyodbc
Functions:
def query(query, connection):
    '''Run a SQL query and return the result as a DataFrame'''
    return pd.read_sql(query, connection)
# selects the best approximate match
def fuzzy_match(client, choices, threshold):
    '''Function for finding the best approximate matches'''
    best_match, score = process.extractOne(client, choices)
    return best_match if score >= threshold else None
# returns the best approximate matching with correlating Party value
def process_batch(batch_clients):
    '''Takes batches of clients and returns approximate match value based on their fuzzy threshold'''
    # store the matches
    matches = []
    # convert our df column lists to a series
    plaintiff_series = pd.Series(plaintiff_choices)
    defendant_series = pd.Series(defendant_choices)
    batch_series = pd.Series(batch_clients)
    plaintiff_match, plaintiff_score = fuzzy_match(batch_series, plaintiff_series)
    defendant_match, defendant_score = fuzzy_match(batch_series, defendant_series)
    high_confidence_threshold = 75
    low_confidence_threshold = 60
    plaintiff_high_confidence_mask = plaintiff_score >= high_confidence_threshold
    plaintiff_low_confidence_mask = (low_confidence_threshold <= plaintiff_score) & (plaintiff_score < high_confidence_threshold)
    defendant_high_confidence_mask = defendant_score >= high_confidence_threshold
    defendant_low_confidence_mask = (low_confidence_threshold <= defendant_score) & (defendant_score < high_confidence_threshold)
    matches.extend(["High Confidence - Plaintiff" if x else "Low Confidence - Plaintiff" if y else "No Match" for x, y in zip(plaintiff_high_confidence_mask, plaintiff_low_confidence_mask)])
    matches.extend(["High Confidence - Defendant" if x else "Low Confidence - Defendant" if y else "No Match" for x, y in zip(defendant_high_confidence_mask, defendant_low_confidence_mask)])
    return matches
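(While writing fuzzy_match I was not sure of the return shape of process.extractOne, so a quick check like the one below may be worth running; on the rapidfuzz version I have, it appears to return a (match, score, index) tuple rather than just (match, score).)

from rapidfuzz import process

result = process.extractOne('StackOverflow.com Incorporated', ['Stack Overflow.com INC', 'Acme LLC'])
print(result)  # e.g. ('Stack Overflow.com INC', <score>, 0) -- a three-item tuple on my version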
Code used to attempt to perform comparison:
try:
    # we are going to connect to database
    with pyodbc.connect(connection_string) as connection:
        # confirmation we have connected
        print("Connected to the database")
        # download the table data we want to use
        client_query = pd.read_sql("SELECT client_name FROM dbo.client_data_t;", connection)
        print("Clients were loaded to the DataFrame")
    if connection:
        connection.close()
        # confirmation message
        print("Connection Closed!")
except Exception as e:
    # if the connection is opened we want to close it
    if connection:
        connection.close()
    # notify us of the error, and
    print("We received an error")
    # the error itself without the code expl.
    print(e)
finally:
    fuzzy_threshold = 60
    # Split client_names into batches for parallel processing
    batch_size = 5000
    # create lists of the target columns
    client_names = client_query['client_name'].tolist()
    plaintiff_choices = docket_df['plaintiff'].tolist()
    defendant_choices = docket_df['defendant'].tolist()
    # create batches for parallel processing
    client_batches = [client_names[i:i+batch_size] for i in range(0, len(client_names), batch_size)]
    print("Batches readied")
    # match results list variable
    match_results = []
    # Use parallel processing to match clients
    print("Beginning Parallel Processing:")  #### THIS PRINTS AND THEN IT STALLS ####
    with ProcessPoolExecutor() as executor:
        for batch_matches in executor.map(process_batch, client_batches):
            match_results.extend(batch_matches)
    print("Parallel Processing Complete")
    # add the list to the client_query df
    client_query["match_result"] = match_results
    # let's see the results
    print(client_query["match_result"].value_counts())
    # if there is connection we would like to close this
    if connection:
        connection.close()
        # confirmation message
        print("Connection Closed!")
EDIT
I am going to leave this open in case there is a rapidfuzz-based solution that could help someone else who would rather use that library, but I did find a solution by implementing TF-IDF for approximate matching. It brought my processing time down from never completing to 4 minutes and 50 seconds. The article by Josh Taylor that I based this solution on: Fuzzy Matching at scale
The libs required:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
import sparse_dot_topn as ct
import re
import numpy as np
import pandas as pd
import pyodbc
import time
Functions related to TF-IDF:
def ngrams(string, n=3):
    string = string.encode("ascii", errors="ignore").decode()  # remove non-ascii characters
    string = string.lower()  # make lowercase for processing
    chars_to_remove = [")", "(", ",", ".", "|", "[", "]", "{", "}", "'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)  # remove the list of chars defined above
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title()  # now we will normalize the case
    string = re.sub(' +', ' ', string).strip()  # gets rid of multiple spaces and replaces them with a singular space
    string = ' ' + string + ' '
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
def getNearestN(query):
    '''Vectorize the query strings and return the nearest-neighbour distances and indices'''
    queryTFIDF = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF)
    return distances, indices
def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B into CSR format (a no-op if they are already CSR)
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
    idx_dtype = np.int32
    nnz_max = M * ntop
    indptr = np.zeros(M + 1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)
    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data
    )
    return csr_matrix((data, indices, indptr), shape=(M, N))
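As a quick sanity check on the ngrams helper above, this is roughly what it produces for one of the example names (overlapping character trigrams after the cleanup steps):

print(ngrams('Stack Overflow.com INC'))
# roughly: [' St', 'Sta', 'tac', 'ack', 'ck ', 'k O', ' Ov', ...]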
Code for implementing the approximate matching comparison:
# let's grab the unique values from the dataframe columns of interest
plaintiffs = docket_df['plaintiff'].unique()
# now time to vectorize the data
vectorizer = TfidfVectorizer(min_df=1,analyzer=ngrams,lowercase=False)
# we are going to create two separate TF-IDF matrices, one for plaintiff and one for defendant matching
plaintiff_tfidf = vectorizer.fit_transform(plaintiffs)
# now let's calculate our nearest neighbor to help improve approximate matching
nbrs = NearestNeighbors(n_neighbors=1,n_jobs=-1).fit(plaintiff_tfidf)
# now we will grab the client names; the article recommended using `set` to help further improve performance
client_col = 'client_name'
unique_clients = set(client_query[client_col].values)
# now we are going to track the performance
start = time.time()
print("Beginning Matching...")
distances, indices = getNearestN(unique_clients)
end = time.time()-start
print(f"Completed in: {end} seconds")
# now we need to convert the result back to a list
unique_clients = list(unique_clients)
matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), plaintiffs[j], unique_clients[i]]
    matches.append(temp)
plaintiff_match_results = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched Plaintiff', 'Client Name'])
###########
# repeat the code above but now we need to perform this analysis on defendants
# let's grab the unique values from the dataframe columns of interest
defendants = docket_df['defendant'].unique()
# now time to vectorize the data
vectorizer = TfidfVectorizer(min_df=1,analyzer=ngrams,lowercase=False)
# build the second TF-IDF matrix, this time for defendant matching
defendant_tfidf = vectorizer.fit_transform(defendants)
# now let's calculate our nearest neighbor to help improve approximate matching
nbrs = NearestNeighbors(n_neighbors=1,n_jobs=-1).fit(defendant_tfidf)
# now we will grab the client names; the article recommended using `set` to help further improve performance
client_col = 'client_name'
unique_clients = set(client_query[client_col].values)
# now we are going to track the performance
start = time.time()
print("Beginning Defendant Matching...")
distances, indices = getNearestN(unique_clients)
end = time.time()-start
print(f"Completed in: {end} seconds")
# now we need to convert the result back to a list
unique_clients = list(unique_clients)
matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), defendants[j], unique_clients[i]]
    matches.append(temp)
defendant_match_results = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched Defendant', 'Client Name'])
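Since my first attempt bucketed matches into high and low confidence, I do something similar on the TF-IDF output afterwards by cutting on the distance column; the 0.3 and 0.6 cutoffs below are just placeholder values I am still tuning for my data, not anything from the article:

conf_col = 'Match confidence (lower is better)'
conditions = [plaintiff_match_results[conf_col] <= 0.3,
              plaintiff_match_results[conf_col] <= 0.6]
labels = ['High Confidence - Plaintiff', 'Low Confidence - Plaintiff']
plaintiff_match_results['match_result'] = np.select(conditions, labels, default='No Match')
print(plaintiff_match_results['match_result'].value_counts())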
Again, a shoutout and credit to Josh Taylor at towardsdatascience.com for putting together an excellent article that helps you achieve approximate matching at scale. My code is adapted from his excellent presentation of the code and concepts!