Dedupe library, Blocking issue, Missing matches


I have a CSV file with 3M rows and two columns: an Arabic Student_name and an Id.

I want to cluster similar names that refer to the same student. The names may contain spelling typos or extra spaces, for example.

The clustered output has a lot of missed matches. For example, when two names are identical except for one extra space in one of them, the result file sometimes puts them in the same cluster and sometimes in different clusters.
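To make the extra-space case concrete, here is a minimal sketch of a normalization step that could run on the names before clustering. It uses only the Python standard library; `normalize_name` is a hypothetical helper name of mine and is not part of the Dedupe library.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Return the name with Unicode forms and whitespace made consistent.

    Hypothetical helper for illustration; not part of the Dedupe library.
    """
    # NFKC folds compatibility characters (e.g. Arabic presentation forms,
    # non-breaking spaces) into their standard equivalents.
    name = unicodedata.normalize("NFKC", name)
    # Collapse every run of whitespace into one space, then trim the ends.
    return re.sub(r"\s+", " ", name).strip()


# Two spellings that differ only by extra spaces become identical strings.
with_extra_space = "محمد  علي "
without_extra_space = "محمد علي"
same = normalize_name(with_extra_space) == normalize_name(without_extra_space)
```

With this applied, names that differ only in spacing become identical strings, so any remaining split clusters would have to come from actual spelling differences.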

Say there are five (5) similar names with small spelling differences. The output file puts three (3) of them in one cluster and the rest in different clusters, although the differences between them are similar. This happens even if I sort the names alphabetically.

I suspect the issue is in the blocking function.
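To illustrate what I mean by a blocking problem, here is a toy sketch. It assumes a simple prefix-based blocking rule, which is my own simplification and not the set of predicates Dedupe actually learns. The names are hypothetical Latin-script placeholders. Records that land in different blocks are never compared, so they can never end up in the same cluster.

```python
from collections import defaultdict


def block_by_prefix(names, prefix_len=3):
    """Group names by their first `prefix_len` characters (toy blocking rule)."""
    blocks = defaultdict(list)
    for name in names:
        blocks[name[:prefix_len]].append(name)
    return dict(blocks)


# Hypothetical example names with small spelling differences.
names = ["mohamed ali", "mohamad ali", "muhamed ali"]
blocks = block_by_prefix(names)

# Only names sharing a block are ever compared with each other.
for key, members in blocks.items():
    print(key, members)
```

Here the third name differs from the first by one letter, just as the second does, but because its typo falls inside the prefix it is placed in its own block and is never compared with the other two.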

Is that right? Could you please guide me on how to fix it? How can I increase the block size?

I tried increasing the max_components variable in the Cluster function, but I ended up with a memory error.

Thanks in advance.
