I'm using the Dedupe library to clean up some data. My understanding is that once the initial deduplication is done with a Dedupe object, any new incoming data should be matched against the clustered data using a Gazetteer object.
For the sake of explaining the issue, let's assume that:
- The first batch of data is 500k rows of restaurants, with name, address, and phone number fields.
- The second batch of data is, for instance, 1k new restaurants that did not exist at the time of the first batch, but that I now want to match against the first 500k.
The pipeline goes something like this:
- Step 1) Initial deduplication
  - Train a Dedupe object on a sample of the 500k restaurants
  - Cluster the 500k rows with a Dedupe / StaticDedupe object
- Step 2) Incremental deduplication
  - Train a Gazetteer object on a sample of the 500k existing restaurants vs the 1k new restaurants
  - Match the incoming 1k rows against the 500k previous rows
  - Assign canonical IDs to those of the 1k rows that actually matched an existing restaurant
So, the questions are:
- Is this pipeline actually correct?
- Do I have to retrain the Gazetteer each time new data comes in?
- Can I reuse the blocking rules learned during the first step, or at least the same labelled pairs? (Assuming, of course, that the fields are the same and the data goes through exactly the same preprocessing.)
- I understand I could keep redoing step 1, but from what I've read, that is not considered best practice.
@fgregg I went through all the Stack Overflow and GitHub issues (the most recent being this one), but could not find a helpful answer.
Thanks !