I am trying to compare two databases. One has 70,000 rows of client names and the other holds local case dockets with about 13,000 rows. In the docket database there are two columns I wish to compare against the client names: plaintiffs and defendants. Unfortunately, the names of these parties are often slightly different from what we hold in our own database. For example, the docket_df plaintiff value may appear as 'Stack Overflow.com INC' while our client list records them as 'StackOverflow.com Incorporated'. After researching approximate matching, rapidfuzz appeared to be a good fit for what I need done, but I am certainly open to trying another library if it turns out not to have the processing capability this project requires.
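For context, this is the kind of quick check I ran with rapidfuzz on the example pair above; different scorers give noticeably different numbers on names like this, and the choice of scorers here is just illustrative:

from rapidfuzz import fuzz

a = 'Stack Overflow.com INC'
b = 'StackOverflow.com Incorporated'
print(fuzz.ratio(a, b))            # plain normalized edit-distance similarity
print(fuzz.token_set_ratio(a, b))  # compares token sets, ignoring word order and duplicates
print(fuzz.WRatio(a, b))           # weighted combination of several scorers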
I have attempted to implement rapidfuzz for approximate matching, but something appears to be bottlenecking and slowing the processing down significantly. I also implemented parallel processing to help with the size of the data. The print statement print('Beginning Parallel Processing:') appears as expected, and everything seems to work as anticipated up to that point; the stall happens once parallel processing begins. I will mark the location in my code. How would I optimize my code to achieve the most efficient comparison of the two databases using approximate matching techniques? I am also trying to become better at debugging, so any suggestions for better practices are welcome -- especially anything related to fixing this code. Code below:
Libs:
from rapidfuzz import fuzz, process
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import re
import numpy as np
import pyodbc
Functions:
def query(query, connection):
    '''Run a SQL query and return the result as a DataFrame'''
    return pd.read_sql(query, connection)
# selects the best approximate match
def fuzzy_match(client, choices, threshold):
    '''Function for finding the best approximate matches'''
    best_match, score = process.extractOne(client, choices)
    return best_match if score >= threshold else None
# returns the best approximate matching with correlating Party value
def process_batch(batch_clients):
    '''Takes batches of clients and returns approximate match value based on their fuzzy threshold'''
    # store the matches
    matches = []
    # convert our df column lists to a series
    plaintiff_series = pd.Series(plaintiff_choices)
    defendant_series = pd.Series(defendant_choices)
    batch_series = pd.Series(batch_clients)
    plaintiff_match, plaintiff_score = fuzzy_match(batch_series, plaintiff_series)
    defendant_match, defendant_score = fuzzy_match(batch_series, defendant_series)
    high_confidence_threshold = 75
    low_confidence_threshold = 60
    plaintiff_high_confidence_mask = plaintiff_score >= high_confidence_threshold
    plaintiff_low_confidence_mask = (low_confidence_threshold <= plaintiff_score) & (plaintiff_score < high_confidence_threshold)
    defendant_high_confidence_mask = defendant_score >= high_confidence_threshold
    defendant_low_confidence_mask = (low_confidence_threshold <= defendant_score) & (defendant_score < high_confidence_threshold)
    matches.extend(["High Confidence - Plaintiff" if x else "Low Confidence - Plaintiff" if y else "No Match" for x, y in zip(plaintiff_high_confidence_mask, plaintiff_low_confidence_mask)])
    matches.extend(["High Confidence - Defendant" if x else "Low Confidence - Defendant" if y else "No Match" for x, y in zip(defendant_high_confidence_mask, defendant_low_confidence_mask)])
    return matches
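(While writing fuzzy_match I was not sure of the return shape of process.extractOne, so a quick check like the one below may be worth running; on the rapidfuzz version I have, it appears to return a (match, score, index) tuple rather than just (match, score).)

from rapidfuzz import process

result = process.extractOne('StackOverflow.com Incorporated', ['Stack Overflow.com INC', 'Acme LLC'])
print(result)  # e.g. ('Stack Overflow.com INC', <score>, 0) -- a three-item tuple on my version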
Code used to attempt to perform comparison:
try:
    # we are going to connect to database
    with pyodbc.connect(connection_string) as connection:
        # confirmation we have connected
        print("Connected to the database")
        # download the table data we want to use
        client_query = pd.read_sql("SELECT client_name FROM dbo.client_data_t;", connection)
        print("Clients were loaded to the DataFrame")
    if connection:
        connection.close()
        # confirmation message
        print("Connection Closed!")
except Exception as e:
    # if the connection is opened we want to close it
    if connection:
        connection.close()
    # notify us of the error, and
    print("We received an error")
    # the error itself without the code expl.
    print(e)
finally:
    fuzzy_threshold = 60
    # Split client_names into batches for parallel processing
    batch_size = 5000
    # create lists of the target columns
    client_names = client_query['client_name'].tolist()
    plaintiff_choices = docket_df['plaintiff'].tolist()
    defendant_choices = docket_df['defendant'].tolist()
    # create batches for parallel processing
    client_batches = [client_names[i:i+batch_size] for i in range(0, len(client_names), batch_size)]
    print("Batches readied")
    # match results list variable
    match_results = []
    # Use parallel processing to match clients
    print("Beginning Parallel Processing:")  #### THIS PRINTS AND THEN IT STALLS ####
    with ProcessPoolExecutor() as executor:
        for batch_matches in executor.map(process_batch, client_batches):
            match_results.extend(batch_matches)
    print("Parallel Processing Complete")
    # add the list to the client_query df
    client_query["match_result"] = match_results
    # let's see the results
    print(client_query["match_result"].value_counts())
    # if there is connection we would like to close this
    if connection:
        connection.close()
        # confirmation message
        print("Connection Closed!")
EDIT
I am going to leave this open in case there is a rapidfuzz-based solution that could help someone else who would rather use that library, but I did find a solution by implementing TF-IDF for approximate matching. It brought my processing time down from never completing to 4 minutes and 50 seconds. The article by Josh Taylor that I based this solution on: Fuzzy Matching at scale
The libs required:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
import sparse_dot_topn as ct
import re
import numpy as np
import pandas as pd
import pyodbc
import time
Functions related to TF-IDF:
def ngrams(string, n=3):
    string = string.encode("ascii", errors="ignore").decode()  # remove non-ascii characters
    string = string.lower()  # make lowercase for processing
    chars_to_remove = [")", "(", ",", ".", "|", "[", "]", "{", "}", "'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)  # remove the list of chars defined above
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title()  # now we will normalize the case
    string = re.sub(' +', ' ', string).strip()  # gets rid of multiple spaces and replaces them with a singular space
    string = ' ' + string + ' '
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
def getNearestN(query):
    '''Vectorize the query strings and return the nearest-neighbour distances and indices'''
    queryTFIDF = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF)
    return distances, indices
def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B into CSR format (a no-op if they are already CSR)
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
    idx_dtype = np.int32
    nnz_max = M * ntop
    indptr = np.zeros(M + 1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)
    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data
    )
    return csr_matrix((data, indices, indptr), shape=(M, N))
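As a quick sanity check on the ngrams helper above, this is roughly what it produces for one of the example names (overlapping character trigrams after the cleanup steps):

print(ngrams('Stack Overflow.com INC'))
# roughly: [' St', 'Sta', 'tac', 'ack', 'ck ', 'k O', ' Ov', ...]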
Code for implementing the approximate matching comparison:
# let's grab the unique values from the dataframe columns of interest
plaintiffs = docket_df['plaintiff'].unique()
# now time to vectorize the data
vectorizer = TfidfVectorizer(min_df=1,analyzer=ngrams,lowercase=False)
# we are going to create two separate TF-IDF matrices, one for plaintiff and one for defendant matching
plaintiff_tfidf = vectorizer.fit_transform(plaintiffs)
# now let's calculate our nearest neighbor to help improve approximate matching
nbrs = NearestNeighbors(n_neighbors=1,n_jobs=-1).fit(plaintiff_tfidf)
# now we will grab the client names; the article recommended using `set` to help further improve performance
client_col = 'client_name'
unique_clients = set(client_query[client_col].values)
# now we are going to track the performance
start = time.time()
print("Beginning Matching...")
distances, indices = getNearestN(unique_clients)
end = time.time()-start
print(f"Completed in: {end} seconds")
# now we need to convert the result back to a list
unique_clients = list(unique_clients)
matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), plaintiffs[j], unique_clients[i]]
    matches.append(temp)
plaintiff_match_results = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched Plaintiff', 'Client Name'])
###########
# repeat the code above but now we need to perform this analysis on defendants
# let's grab the unique values from the dataframe columns of interest
defendants = docket_df['defendant'].unique()
# now time to vectorize the data
vectorizer = TfidfVectorizer(min_df=1,analyzer=ngrams,lowercase=False)
# build the second TF-IDF matrix, this time for defendant matching
defendant_tfidf = vectorizer.fit_transform(defendants)
# now let's calculate our nearest neighbor to help improve approximate matching
nbrs = NearestNeighbors(n_neighbors=1,n_jobs=-1).fit(defendant_tfidf)
# now we will grab the client names; the article recommended using `set` to help further improve performance
client_col = 'client_name'
unique_clients = set(client_query[client_col].values)
# now we are going to track the performance
start = time.time()
print("Beginning Defendant Matching...")
distances, indices = getNearestN(unique_clients)
end = time.time()-start
print(f"Completed in: {end} seconds")
# now we need to convert the result back to a list
unique_clients = list(unique_clients)
matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), defendants[j], unique_clients[i]]
    matches.append(temp)
defendant_match_results = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched Defendant', 'Client Name'])
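Since my first attempt bucketed matches into high and low confidence, I do something similar on the TF-IDF output afterwards by cutting on the distance column; the 0.3 and 0.6 cutoffs below are just placeholder values I am still tuning for my data, not anything from the article:

conf_col = 'Match confidence (lower is better)'
conditions = [plaintiff_match_results[conf_col] <= 0.3,
              plaintiff_match_results[conf_col] <= 0.6]
labels = ['High Confidence - Plaintiff', 'Low Confidence - Plaintiff']
plaintiff_match_results['match_result'] = np.select(conditions, labels, default='No Match')
print(plaintiff_match_results['match_result'].value_counts())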
Again, a shoutout and credit to Josh Taylor at towardsdatascience.com for putting together an excellent article that helps you achieve approximate matching at scale. My code is adapted from his excellent presentation of the code and concepts!