Use the Python Dedupe package to check a single record


I am using the Dedupe Python package to check incoming records for duplicates. I have trained on approximately 500,000 records from a CSV file and, using the package, clustered those 500,000 records into different clusters. I then attempted to use the settings file produced by training to dedupe a new record (`data` in the code). A code snippet is below.

import os

import dedupe

# Path to the settings file written out by the training run
settings_file = 'dedupe_learned_settings'

deduper = None
if os.path.exists(settings_file):
    with open(settings_file, 'rb') as sf:
        deduper = dedupe.StaticDedupe(sf)

clustered_dupes = deduper.match(data, 0)

Here, `data` is the single new record I have to check for duplicates. It looks like:

{1:{'SequenceID': 6855406, 'ApplicationID': 7065902, 'CustomerID': 6153222, 'Name': 'X', 'col1': '-42332423', 'col2': '0', 'col3': '0', 'col4': '0', 'col5': '24G0859681', 'col6': '0', 'col7': 'xyz12345', 'col8': 'xyz', 'col9': '1234', 'col10': 'xyz10'}}

This throws the error:

No records have been blocked together. Is the data you are trying to match like the data you trained on?

How do I use this clustered data to check whether a new record is a duplicate or not? Is it possible to do this the way we would with any ML model? I have looked at multiple sources but haven't found a solution; most of them talk about training, not about how to use the clustered data to check a single record.

Is there another way out?

Some links that I have referred: link1 link2 link3

Any help is appreciated.

1 Answer

You would need to pass the initially trained data along with the new record as input, and cluster based on the pre-trained settings.
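A minimal sketch of that idea, assuming a `dedupe` version whose `StaticDedupe` exposes `match()` as in the question (the settings-file name and variable names below are placeholders, not part of the original post). The key point is that `match()` needs the existing records and the new record together in one dict, re-keyed so IDs don't collide:

```python
def build_match_input(existing_records, new_record):
    """Merge existing records with one new record under a unique key.

    existing_records: dict mapping record IDs (ints) to field dicts,
    i.e. the same data you trained on. Returns the key assigned to the
    new record and the combined dict to pass to match().
    """
    combined = dict(existing_records)       # shallow copy, keep original keys
    new_key = max(combined, default=0) + 1  # any key not already in use
    combined[new_key] = new_record
    return new_key, combined


# Hypothetical usage against a trained settings file:
#
#   with open('dedupe_learned_settings', 'rb') as sf:
#       deduper = dedupe.StaticDedupe(sf)
#   new_key, combined = build_match_input(trained_data, data[1])
#   clusters = deduper.match(combined, 0.5)  # threshold > 0, tune for your data
#   is_dupe = any(new_key in ids for ids, scores in clusters)
```

Passing a single record on its own fails because there is nothing to block it against, which is what the "No records have been blocked together" error is complaining about.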