I am using the postgres dedupe example code. For 10,000 rows it takes 163 seconds, and profiling shows that most of the time is spent in this part:
    full_data = []
    cluster_membership = collections.defaultdict(lambda: 'x')
    for cluster_id, (cluster, score) in enumerate(clustered_dupes):
        for record_id in cluster:
            for row in data:
                if record_id == int(row[0]):
                    row = list(row)
                    row.insert(0, cluster_id)
                    row = tuple(row)
                    full_data.append(row)
Is there an optimization that produces the same result with a lower time complexity? And will this script scale to 150 million records?
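For reference, here is the kind of dictionary-based rewrite I have in mind: index `data` once by record id, then look each record up in O(1) instead of rescanning every row per id, turning the inner scan into a single-pass O(n + m) build. This is only a sketch with made-up sample data mirroring the shapes above; it assumes record ids are unique in `data`, and the output order should match the original loop since it still walks clusters first:

```python
# Made-up sample inputs: rows whose first field is the record id,
# and (cluster, score) pairs as produced by dedupe.
data = [("1", "alice"), ("2", "alicia"), ("3", "bob")]
clustered_dupes = [((1, 2), 0.9)]

# One-time index from record id to row (assumes ids are unique in data).
rows_by_id = {int(row[0]): row for row in data}

full_data = []
for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        if record_id in rows_by_id:
            # Same shape as the original: cluster_id prepended to the row tuple.
            full_data.append((cluster_id,) + tuple(rows_by_id[record_id]))

print(full_data)
```

If this is correct, the quadratic scan disappears, though at 150 million records the id-to-row dict would have to fit in memory, so the join may belong in the database instead.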