Dedupe library, Blocking issue, Missing matches

175 Views Asked by Ahmad Al-Shalabi At 01 July 2025 at 11:59

I have a CSV file with 3M rows and two columns: an Arabic Student_name and an Id.

I want to cluster similar names that refer to the same student. The names may contain spelling typos or extra spaces, for example.

The clustered output has a lot of missed matches. For example, when two names are identical except for one extra space, the result file sometimes puts them in the same cluster and sometimes in different clusters.

Let's say there are five (5) similar names with small spelling differences. In the output file, three (3) of them end up in one cluster and the rest in different clusters, although their differences are of the same kind. This happens even if I sort the names alphabetically.
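To make the extra-space case concrete, here is a minimal sketch of a cleanup step that could be run on the names before clustering. It is an assumption, not part of the original pipeline and not a dedupe API: `normalize_name` is a hypothetical helper built only on the Python standard library. It collapses whitespace and strips Arabic diacritics and the tatweel character, so names that differ only in those respects become identical strings.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, carries no meaning


def normalize_name(name: str) -> str:
    """Hypothetical pre-clustering cleanup (not a dedupe function)."""
    # NFKC folds presentation forms and non-breaking spaces into plain forms.
    name = unicodedata.normalize("NFKC", name)
    # Drop combining marks (Arabic diacritics such as fatha, damma, shadda).
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Remove the elongation character.
    name = name.replace(TATWEEL, "")
    # Collapse runs of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", name).strip()


same = normalize_name("محمد  أحمد") == normalize_name(" محمد أحمد ")
```

After this step, `same` is true for the two spellings above, so whitespace alone can no longer separate them. Real spelling typos are not handled here and would still need fuzzy matching.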

I guess that the issue is in the blocking function.

Is that right? Could you please guide me on how to fix it? How can I increase the block size?
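As a rough illustration of why blocking could produce this symptom, the toy sketch below groups names by a blocking key. Everything in it is an assumption for illustration: `block_key` (the first three characters of the name) and `build_blocks` are hypothetical and are not the rules the library learns. The point is only that records are compared within a block, so two names that land in different blocks never get a chance to be matched.

```python
from collections import defaultdict


def block_key(name: str) -> str:
    # Toy blocking rule (assumption): the first three characters of the name.
    return name[:3]


def build_blocks(names):
    # Group names by their blocking key; only names in the same
    # group would be compared against each other later.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    return dict(blocks)


names = ["محمد أحمد", " محمد أحمد", "محمد احمد"]
blocks = build_blocks(names)
```

Here the second name has a leading space, so its key differs and it ends up alone in its own block, while the first and third names (which differ by a real spelling variation) share a block. This would be consistent with near-identical names being split across clusters.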

I tried increasing the max_components variable in the Cluster function, but I ended up with a memory error.

Thanks in advance.
