Identifying Duplicate Customers Based on Similarity (Spark Dataframe)

231 Views Asked by Steve At 17 August 2025 at 05:07

I have a Spark DataFrame that contains customer information. Some customers appear more than once, but that is hard to detect programmatically without some form of fuzzy matching such as Levenshtein distance.

In the example below, John Smith and Johnny Smith are the same person, but their "first_name" and "address" fields differ slightly. Other details such as birthdate and phone number are not necessarily present. As a result, I can only identify the same person with some probability.

+----------+---------+-----------------------+-------------------+------------+-----------+
|first_name|last_name|birthdate              |address            |phone_number|client_uuid|
+----------+---------+-----------------------+-------------------+------------+-----------+
|John      |Smith    |1998-01-01 12:29:42.835|123 Bakersville    |555-555-5555|null       |
|Jay       |Leno     |1955-11-12 12:30:12.946|null               |null        |null       |
|Johnny    |Smith    |null                   |123 Bakersville St.|null        |null       |
+----------+---------+-----------------------+-------------------+------------+-----------+
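To make the fuzzy-matching idea concrete, here is a minimal plain-Python sketch (not Spark code) of Levenshtein distance turned into a per-field similarity score. The helper names `levenshtein` and `field_similarity` are made up for illustration, and returning None for a missing value is just one possible way to treat nulls. Spark SQL also ships a built-in `levenshtein` column function that could play the same role on a DataFrame.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def field_similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Similarity in [0, 1]; None when either value is missing (assumption)."""
    if a is None or b is None:
        return None
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For the rows above, "John" vs "Johnny" is 2 edits apart and the two addresses are 4 edits apart, so both fields score well above an unrelated pair, while a null field contributes no evidence either way.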

Let's say I want to attempt this anyway. I would like the end result to fill in the final field, "client_uuid". My ideal result would look something like this:

+----------+---------+-----------------------+-------------------+------------+-----------+
|first_name|last_name|birthdate              |address            |phone_number|client_uuid|
+----------+---------+-----------------------+-------------------+------------+-----------+
|John      |Smith    |1998-01-01 12:29:42.835|123 Bakersville    |555-555-5555|CLIENT_123 |
|Jay       |Leno     |1955-11-12 12:30:12.946|null               |null        |CLIENT_456 |
|Johnny    |Smith    |null                   |123 Bakersville St.|null        |CLIENT_123 |
+----------+---------+-----------------------+-------------------+------------+-----------+

I realize that this is not an easy problem and that it involves tackling many small problems at once. In fact, it is not really a Spark DataFrame problem, but bonus points if someone finds a solution that uses Spark DataFrames.

One solution I am contemplating is to transform each customer record into a vector and use cosine similarity to measure how similar each pair of records is. If a pair's similarity is above some threshold, I would assign both records the same generated UUID.
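A rough plain-Python sketch of that idea, to show the moving parts rather than a Spark implementation. Everything specific here is an assumption made for illustration: records are vectorized as character-trigram counts over the non-null name and address fields, the 0.6 threshold is arbitrary, and a small union-find groups records so that matches chain together (A~B and B~C end up with one ID).

```python
import math
import uuid
from collections import Counter
from itertools import combinations

FIELDS = ("first_name", "last_name", "address")  # assumed matching fields


def trigram_vector(record: dict) -> Counter:
    """Bag of character trigrams over the record's non-null fields."""
    text = " ".join(str(record[f]).lower() for f in FIELDS if record.get(f))
    padded = f"  {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def assign_client_uuids(records, threshold=0.6):
    """Return copies of the records with a shared client_uuid per cluster."""
    vectors = [trigram_vector(r) for r in records]
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair; merge the two clusters when similar enough.
    for i, j in combinations(range(len(records)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    ids = {}
    return [
        {**r, "client_uuid": ids.setdefault(find(i), str(uuid.uuid4()))}
        for i, r in enumerate(records)
    ]


rows = [
    {"first_name": "John", "last_name": "Smith", "address": "123 Bakersville"},
    {"first_name": "Jay", "last_name": "Leno", "address": None},
    {"first_name": "Johnny", "last_name": "Smith", "address": "123 Bakersville St."},
]
labelled = assign_client_uuids(rows)
```

With these three rows, the two Smith records share far more trigrams with each other than either does with Jay Leno, so they receive the same ID. The all-pairs loop is quadratic in the number of records, so on a real dataset the candidate pairs would have to be narrowed first.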

I'm sure this isn't a new problem, so I would be interested in hearing other approaches as well. If this is a solved problem and there is already a snippet or library that handles it, that would be even better.
