Python dedupe library for big data


I am running the Dedupe package on large datasets (4 million records / 5 fields) with the following two objectives (both workflows are sketched after the list):

  1. Deduplicate records (3.5 million)
  2. Record-link incremental data (~100K records) against ~1.1 million existing records
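For context, both workflows use roughly the calls below (a minimal sketch assuming dedupe 2.x; the field names are placeholders, and `data`, `incremental_data`, `existing_data` are assumed to be already-loaded dicts keyed by record id):

```python
import dedupe

# Placeholder field definition for the 5 fields (names are illustrative only).
fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
    {'field': 'city', 'type': 'String'},
    {'field': 'state', 'type': 'Exact'},
    {'field': 'zip', 'type': 'Exact'},
]

# Objective 1: deduplication within one dataset.
deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)             # data: {record_id: {field: value, ...}}
dedupe.console_label(deduper)              # manual labelling
deduper.train()
clusters = deduper.partition(data, threshold=0.5)

# Objective 2: record linkage of the ~100K incremental records against ~1.1M.
linker = dedupe.RecordLink(fields)
linker.prepare_training(incremental_data, existing_data)
dedupe.console_label(linker)
linker.train()
links = linker.join(incremental_data, existing_data,
                    threshold=0.5, constraint='one-to-one')
```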

Note: Everything runs in memory on Spark and DBFS.

  1. I was able to run end-to-end dedupe on 60K records.
  2. The program hangs for 100K records in the cluster() step, with a warning that components larger than 30K nodes are re-filtered.

Summary of steps:

  1. Build the blocking indexes

  2. pairs(data) - 3.5 million candidate pairs for 100K records

  3. score(pairs) - works fine; tested with 2 million records as input and the scored pairs came out as expected

  4. cluster(scores) - hangs with the warning quoted below any time I try to pass more than 60K records (the corresponding dedupe 2.x calls are sketched after this list)
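For reference, steps 2-4 correspond to the following staged calls in dedupe 2.x (a sketch assuming `deduper` is the trained `dedupe.Dedupe` object from the sketch above and `data` is the in-memory record dict):

```python
# Step 2: generate candidate record pairs from the blocking rules.
pairs = deduper.pairs(data)      # generator of ((id_1, record_1), (id_2, record_2))

# Step 3: score the candidate pairs with the trained model.
scores = deduper.score(pairs)    # structured numpy array of pair ids and scores

# Step 4: connected-components clustering -- the step that hangs above.
clusters = deduper.cluster(scores, threshold=0.5)
for record_ids, confidences in clusters:
    print(record_ids, confidences)
```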

Kindly suggest any pointers or big-data examples that I can refer to. MySQL is currently not the primary plan.

Warning: "3730000 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0"

There is 1 answer below.


We are now using the PostgreSQL approach; refer to https://github.com/dedupeio/dedupe-examples/tree/master/pgsql_big_dedupe_example
Version used: 2.0.13. With 18K total records on 16 cores and 64 GB of RAM, it takes about 20 minutes to run, including manual labelling, without any memory crash.

One issue: version 2.0.14 throws an error due to a compatibility problem (discussed in other threads).

Version 2.0.14 was also noticeably slower.

If you are running with more than 10K records, the PostgreSQL approach will give better performance.
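Roughly, the linked example follows the pattern sketched below. This is not a drop-in script: the table and column names and the connection parameters are placeholders, `deduper` is assumed to be an already-trained `dedupe.Dedupe` instance, and the real example uses COPY, two-step candidate tables, and more careful bookkeeping.

```python
import psycopg2
import psycopg2.extras

# Placeholder connections; the linked example uses separate read/write connections.
# Any index predicates would also need deduper.fingerprinter.index() calls first.
read_con = psycopg2.connect(dbname='dedupe_db')
write_con = psycopg2.connect(dbname='dedupe_db')

# 1. Write blocking keys to a table instead of holding them in memory.
with write_con.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS blocking_map "
                "(block_key TEXT, record_id INTEGER)")
write_con.commit()

with read_con.cursor('read_records',
                     cursor_factory=psycopg2.extras.RealDictCursor) as read_cur, \
        write_con.cursor() as write_cur:
    read_cur.execute("SELECT record_id, name, address, city, state, zip FROM records")
    records = ((row['record_id'], row) for row in read_cur)
    # fingerprinter yields (block_key, record_id) tuples; the real example
    # bulk-loads them with COPY, executemany is used here only for brevity.
    write_cur.executemany("INSERT INTO blocking_map VALUES (%s, %s)",
                          deduper.fingerprinter(records))
write_con.commit()

# 2. A self-join on blocking_map yields candidate pairs; streaming them through
#    score() and cluster() keeps the full pair set out of Python memory.
def record_pairs(cursor):
    for row in cursor:
        a, b = row['record_a'], row['record_b']
        yield ((a['record_id'], a), (b['record_id'], b))

with read_con.cursor('read_pairs',
                     cursor_factory=psycopg2.extras.RealDictCursor) as pair_cur:
    pair_cur.execute("""
        WITH candidate_ids AS (
            SELECT DISTINCT bm_a.record_id AS id_a, bm_b.record_id AS id_b
            FROM blocking_map bm_a
            JOIN blocking_map bm_b
              ON bm_a.block_key = bm_b.block_key
             AND bm_a.record_id < bm_b.record_id
        )
        SELECT row_to_json(a) AS record_a, row_to_json(b) AS record_b
        FROM candidate_ids
        JOIN records a ON a.record_id = candidate_ids.id_a
        JOIN records b ON b.record_id = candidate_ids.id_b
    """)
    clusters = deduper.cluster(deduper.score(record_pairs(pair_cur)),
                               threshold=0.5)
    for record_ids, confidences in clusters:
        print(record_ids, confidences)
```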