Python PostgreSQL dedupe consuming a lot of time. Can there be any optimization?

I am using the PostgreSQL example code from the dedupe library. For 10,000 rows it takes 163 seconds, and I found that most of that time is spent in this part:

import collections

full_data = []
cluster_membership = collections.defaultdict(lambda: 'x')  # unused in this snippet

for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        # Scans every row of data for every record in every cluster.
        for row in data:
            if record_id == int(row[0]):
                # Prepend the cluster id to the matching row.
                full_data.append((cluster_id,) + tuple(row))

Is there any possible optimization for this part that produces the same result with lower time complexity? And will this script work for 150 million records?
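A minimal sketch of one possible fix, using the names clustered_dupes and data from the snippet above: build a record_id to cluster_id dictionary once, then make a single pass over data. This replaces the nested scan, roughly O(clusters × cluster size × rows), with an O(rows) pass plus constant-time dictionary lookups.

# Build the lookup once: one entry per record that belongs to a cluster.
cluster_lookup = {}
for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        cluster_lookup[int(record_id)] = cluster_id

# Single pass over the rows; only rows that belong to a cluster are kept,
# matching the behavior of the original loop.
full_data = []
for row in data:
    cluster_id = cluster_lookup.get(int(row[0]))
    if cluster_id is not None:
        full_data.append((cluster_id,) + tuple(row))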
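For 150 million records, even the single pass above means pushing every row through Python, so a common approach is to stage the cluster assignments in a table and let PostgreSQL do the join. A hypothetical sketch with psycopg2; the connection string, source_table, and the id column are assumptions, not from the post:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# Stage the cluster assignments, then join in SQL instead of in Python.
cur.execute("CREATE TEMP TABLE entity_map (record_id INTEGER, cluster_id INTEGER)")
execute_values(cur, "INSERT INTO entity_map VALUES %s",
               [(int(record_id), cluster_id)
                for cluster_id, (cluster, score) in enumerate(clustered_dupes)
                for record_id in cluster])

# source_table is a placeholder for wherever the rows in data came from.
cur.execute("""SELECT e.cluster_id, s.*
               FROM source_table s
               JOIN entity_map e ON e.record_id = s.id""")
full_data = cur.fetchall()

Whether the clustering step itself scales to 150 million records is a separate question; dedupe's own large-scale PostgreSQL example does its blocking in the database for exactly this reason.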
