Python dedupe library for big data


I am running the Dedupe package on large datasets (4 million records / 5 fields) with the following two objectives (both workflows are sketched after the list):

  1. Deduplicate records (3.5 million)
  2. Record-link incremental data (~100K records) against ~1.1 million existing records
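For context, both workflows use roughly the calls below (a minimal sketch assuming dedupe 2.x; the field names are placeholders, and `data`, `incremental_data`, `existing_data` are assumed to be already-loaded dicts keyed by record id):

```python
import dedupe

# Placeholder field definition for the 5 fields (names are illustrative only).
fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
    {'field': 'city', 'type': 'String'},
    {'field': 'state', 'type': 'Exact'},
    {'field': 'zip', 'type': 'Exact'},
]

# Objective 1: deduplication within one dataset.
deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)             # data: {record_id: {field: value, ...}}
dedupe.console_label(deduper)              # manual labelling
deduper.train()
clusters = deduper.partition(data, threshold=0.5)

# Objective 2: record linkage of the ~100K incremental records against ~1.1M.
linker = dedupe.RecordLink(fields)
linker.prepare_training(incremental_data, existing_data)
dedupe.console_label(linker)
linker.train()
links = linker.join(incremental_data, existing_data,
                    threshold=0.5, constraint='one-to-one')
```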

Note: Everything runs in memory on Spark and DBFS.

  1. I was able to run end-to-end dedupe on 60K records.
  2. The program hangs for 100K records in the cluster() step, with a warning that components larger than 30K nodes are re-filtered.

Summary of steps:

  1. Build the blocking indexes

  2. pairs(data) - 3.5 million candidate pairs for 100K records

  3. score(pairs) - works fine; tested with 2 million records as input and the scored pairs came out as expected

  4. cluster(scores) - hangs with the warning quoted below any time I try to pass more than 60K records (the corresponding dedupe 2.x calls are sketched after this list)
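For reference, steps 2-4 correspond to the following staged calls in dedupe 2.x (a sketch assuming `deduper` is the trained `dedupe.Dedupe` object from the sketch above and `data` is the in-memory record dict):

```python
# Step 2: generate candidate record pairs from the blocking rules.
pairs = deduper.pairs(data)      # generator of ((id_1, record_1), (id_2, record_2))

# Step 3: score the candidate pairs with the trained model.
scores = deduper.score(pairs)    # structured numpy array of pair ids and scores

# Step 4: connected-components clustering -- the step that hangs above.
clusters = deduper.cluster(scores, threshold=0.5)
for record_ids, confidences in clusters:
    print(record_ids, confidences)
```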

Kindly suggest any pointers or big-data examples that I can refer to. MySQL is currently not the primary plan.

Warning: "3730000 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0"

There is 1 answer below.


We are now using the PostgreSQL approach; refer to https://github.com/dedupeio/dedupe-examples/tree/master/pgsql_big_dedupe_example
Version used: 2.0.13. With 18K total records on 16 cores and 64 GB of RAM, it takes about 20 minutes to run, including manual labelling, without any memory crash.

One issue: version 2.0.14 throws an error due to a compatibility problem (discussed in other threads).

Version 2.0.14 was also noticeably slower.

If you are running with more than 10K records, the PostgreSQL approach will give better performance.
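Roughly, the linked example follows the pattern sketched below. This is not a drop-in script: the table and column names and the connection parameters are placeholders, `deduper` is assumed to be an already-trained `dedupe.Dedupe` instance, and the real example uses COPY, two-step candidate tables, and more careful bookkeeping.

```python
import psycopg2
import psycopg2.extras

# Placeholder connections; the linked example uses separate read/write connections.
# Any index predicates would also need deduper.fingerprinter.index() calls first.
read_con = psycopg2.connect(dbname='dedupe_db')
write_con = psycopg2.connect(dbname='dedupe_db')

# 1. Write blocking keys to a table instead of holding them in memory.
with write_con.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS blocking_map "
                "(block_key TEXT, record_id INTEGER)")
write_con.commit()

with read_con.cursor('read_records',
                     cursor_factory=psycopg2.extras.RealDictCursor) as read_cur, \
        write_con.cursor() as write_cur:
    read_cur.execute("SELECT record_id, name, address, city, state, zip FROM records")
    records = ((row['record_id'], row) for row in read_cur)
    # fingerprinter yields (block_key, record_id) tuples; the real example
    # bulk-loads them with COPY, executemany is used here only for brevity.
    write_cur.executemany("INSERT INTO blocking_map VALUES (%s, %s)",
                          deduper.fingerprinter(records))
write_con.commit()

# 2. A self-join on blocking_map yields candidate pairs; streaming them through
#    score() and cluster() keeps the full pair set out of Python memory.
def record_pairs(cursor):
    for row in cursor:
        a, b = row['record_a'], row['record_b']
        yield ((a['record_id'], a), (b['record_id'], b))

with read_con.cursor('read_pairs',
                     cursor_factory=psycopg2.extras.RealDictCursor) as pair_cur:
    pair_cur.execute("""
        WITH candidate_ids AS (
            SELECT DISTINCT bm_a.record_id AS id_a, bm_b.record_id AS id_b
            FROM blocking_map bm_a
            JOIN blocking_map bm_b
              ON bm_a.block_key = bm_b.block_key
             AND bm_a.record_id < bm_b.record_id
        )
        SELECT row_to_json(a) AS record_a, row_to_json(b) AS record_b
        FROM candidate_ids
        JOIN records a ON a.record_id = candidate_ids.id_a
        JOIN records b ON b.record_id = candidate_ids.id_b
    """)
    clusters = deduper.cluster(deduper.score(record_pairs(pair_cur)),
                               threshold=0.5)
    for record_ids, confidences in clusters:
        print(record_ids, confidences)
```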