Reusing Dedupe training for Gazetteer matching


I'm using the Dedupe library to clean up some data. However, once the first deduplication is done using the Dedupe object, I understand we are supposed to use the Gazetteer object to match any new incoming data against the clustered data.

For the sake of explaining the issue, let's assume that:

  • The first batch of data is 500k rows of restaurants, with name, address, and phone number fields.
  • The second batch of data is, for instance, 1k new restaurants that did not exist at the time, but that I now want to match against the first 500k.

The pipeline, as I understand it, goes something like this:

  • Step 1) Initial deduplication
    • Train a Dedupe object on a sample of the 500k restaurants
    • Cluster the 500k rows with a Dedupe / Static Dedupe object
  • Step 2) Incremental deduplication
    • Train a Gazetteer object on a sample of the 500k restaurants vs 1k new restaurants
    • Match incoming 1k rows against 500k previous rows
    • Assign the existing canonical ID to each of the 1k rows that actually matched an existing restaurant
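
To make the last bullet of step 2 concrete, here is a minimal sketch of what I mean by assigning canonical IDs. It is plain Python and does not call Dedupe at all. The shape of `search_results` (a list of `(new_record_id, [(existing_record_id, score), ...])` pairs) is my assumption about what the matching step hands back, and `assign_canonical_ids`, `record_to_cluster`, and the 0.5 threshold are hypothetical names and values chosen only for illustration.

```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple

# One candidate match: (ID of an already-clustered record, match score).
Match = Tuple[Hashable, float]


def assign_canonical_ids(
    search_results: Iterable[Tuple[Hashable, Sequence[Match]]],
    record_to_cluster: Dict[Hashable, int],
    threshold: float = 0.5,
) -> Dict[Hashable, int]:
    """Map each new record to a canonical (cluster) ID.

    A new record inherits the cluster of its best-scoring match when that
    score reaches `threshold`; otherwise it gets a fresh cluster ID.
    Assumes cluster IDs are integers (an assumption of this sketch).
    """
    assigned: Dict[Hashable, int] = {}
    # Fresh cluster IDs start right after the largest existing one.
    next_cluster = max(record_to_cluster.values(), default=-1) + 1

    for new_id, candidates in search_results:
        good = [(rid, score) for rid, score in candidates if score >= threshold]
        if good:
            best_id, _ = max(good, key=lambda match: match[1])
            assigned[new_id] = record_to_cluster[best_id]
        else:
            assigned[new_id] = next_cluster
            next_cluster += 1
    return assigned


# Hypothetical data: three clustered restaurants, three incoming ones.
record_to_cluster = {"r1": 0, "r2": 0, "r3": 1}
search_results = [
    ("n1", [("r2", 0.91), ("r3", 0.40)]),  # confident match with r2
    ("n2", []),                            # no candidate at all
    ("n3", [("r3", 0.30)]),                # candidate below the threshold
]
new_ids = assign_canonical_ids(search_results, record_to_cluster)
print(new_ids)  # {'n1': 0, 'n2': 2, 'n3': 3}
```

In this sketch, rows that do not match anything simply become new canonical entities, which is how I would expect the 1k new restaurants to be handled if they are not duplicates of anything in the 500k.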

So, my questions are:

  • Is the pipeline actually correct?
  • Do I have to retrain the Gazetteer each time new data comes in?
    • Can't I reuse the blocking rules that were learned during the first step? Or at least the same labelled pairs? This assumes, of course, that the fields are the same and that the data goes through exactly the same preprocessing.
  • I understand I could keep redoing step 1 on the full dataset, but from what I have read, that is not best practice.

@fgregg I went through all the Stack Overflow questions and GitHub issues (most recent one being this one), but could not find any helpful answers.

Thanks!
