Scaling the dedupe package to large datasets using a MySQL DB


For a while now I have been trying to build a working example of dedupe/Gazetteer that scales to a semi-large dataset backed by a SQL database (starting from the examples provided with the package), and have been unsuccessful. I would really appreciate it if anyone could help or share a working sample.

Things I have tried so far:

  • I have tried the SQL example. I had to split some of the SQL statements into separate CREATE and INSERT statements to satisfy GTID consistency, but everything else follows the example. The problem is that, after seemingly running successfully up to the clustering step, it fails with the following error:
    "dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?" Nothing I tried fixed this (I am training and testing on exactly the same data, so the error makes no sense to me).

  • For the large-scale gazetteer I tried starting from this example, but I get this error: "TypeError: train() takes at most 3 arguments (4 given)". The only change I made is connecting to a MySQL DB instead. I also cannot find any guidance on how to actually scale all parts of gazetteer matching (or I just do not understand how this example helps with that).
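To clarify the GTID change in the first bullet: with enforce_gtid_consistency=ON, MySQL rejects `CREATE TABLE ... SELECT`, so I split each such statement into a separate CREATE followed by an `INSERT ... SELECT`. A simplified, runnable sketch of the pattern (using sqlite3 as a stand-in for MySQL; the table and column names are placeholders, not my real schema):

```python
import sqlite3

# Stand-in database; in my setup this is a MySQL connection instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE donors (donor_id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO donors VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# Instead of the single statement from the example:
#   CREATE TABLE processed_donors AS SELECT * FROM donors
# I run a CREATE and an INSERT ... SELECT as two separate statements:
cur.execute(
    "CREATE TABLE processed_donors (donor_id INTEGER PRIMARY KEY, name TEXT)"
)
cur.execute("INSERT INTO processed_donors SELECT donor_id, name FROM donors")

print(cur.execute("SELECT COUNT(*) FROM processed_donors").fetchone()[0])  # 2
```

As far as I can tell this split preserves the example's behavior; everything downstream reads from the processed table exactly as before.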
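On the TypeError in the second bullet, my suspicion is a version mismatch: the example calls train() with more positional arguments than my installed dedupe release accepts. A minimal stand-in reproducing that shape of failure (this class is NOT dedupe's actual API, just an illustration of the mismatch I think is happening):

```python
# Hypothetical stand-in: a train() that accepts only two parameters besides
# self, called the way an example written for a different release might call it.
class Gazetteer:
    def train(self, recall=0.95, index_predicates=True):
        return recall, index_predicates

g = Gazetteer()
try:
    # Three positional arguments against a (self + 2)-parameter method:
    g.train(0.5, 1, True)
except TypeError as e:
    print(type(e).__name__)  # TypeError
```

If that is the cause, checking `inspect.signature` of the installed train() against the example's call would confirm it, but I have not been able to verify which release the example targets.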

Has anyone actually been able to scale these to large datasets using MySQL?

Please let me know if I need to provide more info or code snippets.

Thanks in advance.
