Scaling the dedupe package to large datasets using a MySQL DB


For a while now I have been trying to build a working example of dedupe/Gazetteer that scales to a semi-large dataset backed by a SQL database (starting from the examples provided with the package), and have been unsuccessful. I would really appreciate it if anyone could help or share a working sample.

Things I have tried so far:

  • I have tried the SQL example. I had to split some of the SQL statements into separate CREATE and INSERT statements to satisfy GTID consistency, but everything else follows the example. The problem is that, after seemingly running successfully up to the clustering step, it fails with the following error:
    "dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?" Nothing I tried fixed this (I am training and testing on exactly the same data, so the error makes no sense to me).

  • For the large-scale gazetteer I tried starting from this example, but I get this error: "TypeError: train() takes at most 3 arguments (4 given)". The only change I made is connecting to a MySQL DB instead. I also cannot find any guidance on how to actually scale all parts of gazetteer matching (or I just do not understand how this example helps with that).
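To clarify the GTID change in the first bullet: with enforce_gtid_consistency=ON, MySQL rejects `CREATE TABLE ... SELECT`, so I split each such statement into a separate CREATE followed by an `INSERT ... SELECT`. A simplified, runnable sketch of the pattern (using sqlite3 as a stand-in for MySQL; the table and column names are placeholders, not my real schema):

```python
import sqlite3

# Stand-in database; in my setup this is a MySQL connection instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE donors (donor_id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO donors VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# Instead of the single statement from the example:
#   CREATE TABLE processed_donors AS SELECT * FROM donors
# I run a CREATE and an INSERT ... SELECT as two separate statements:
cur.execute(
    "CREATE TABLE processed_donors (donor_id INTEGER PRIMARY KEY, name TEXT)"
)
cur.execute("INSERT INTO processed_donors SELECT donor_id, name FROM donors")

print(cur.execute("SELECT COUNT(*) FROM processed_donors").fetchone()[0])  # 2
```

As far as I can tell this split preserves the example's behavior; everything downstream reads from the processed table exactly as before.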
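On the TypeError in the second bullet, my suspicion is a version mismatch: the example calls train() with more positional arguments than my installed dedupe release accepts. A minimal stand-in reproducing that shape of failure (this class is NOT dedupe's actual API, just an illustration of the mismatch I think is happening):

```python
# Hypothetical stand-in: a train() that accepts only two parameters besides
# self, called the way an example written for a different release might call it.
class Gazetteer:
    def train(self, recall=0.95, index_predicates=True):
        return recall, index_predicates

g = Gazetteer()
try:
    # Three positional arguments against a (self + 2)-parameter method:
    g.train(0.5, 1, True)
except TypeError as e:
    print(type(e).__name__)  # TypeError
```

If that is the cause, checking `inspect.signature` of the installed train() against the example's call would confirm it, but I have not been able to verify which release the example targets.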

Has anyone actually been able to scale these to large datasets using MySQL?

Please let me know if I need to provide more info or code snippets.

Thanks in advance.
