Integrity check of two unordered datasets produced by a SELECT statement on MySQL and SQL Server via hash functions


Background information: I have implemented a data transfer from MySQL (via pymysql) to SQL Server. Basically, the user provides a SELECT statement, the statement is executed on the source database, and the output is transferred to a specified target table. Now I would like to implement an integrity check via hash functions.

The problem: In theory this is quite easy: I could compute a hash over the two result sets in Python and compare them. However, the datasets are large, so I transfer them in chunks to save memory, which means I cannot compute the hash over the whole dataset at once. And since I have no influence on the structure of the SELECT statement the user provides, I cannot guarantee any ordering; the rows may arrive in a different order on each side, so the individual chunks differ as well.

Solutions I have considered:

  1. Read every row, compute a small per-row hash, collect them all in a list, and then compute a final hash value over the list. But this is not very elegant and still consumes a lot of memory (a constant-memory variant is sketched right after this list).

  2. Outsource the hash calculation to the DBMS. So far I have not found good resources on this, and the hash values produced by MySQL and SQL Server would obviously have to match (a hedged attempt is sketched further below).
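For option 1, a constant-memory variant is to combine the per-row digests with a commutative operation, e.g. addition modulo 2^128, instead of collecting them in a list: the result is then independent of row and chunk order. Here is a minimal sketch of what I mean; the `canonical_row` helper and the `|` separator are illustrative choices, not fixed:

```python
import hashlib

MOD = 2 ** 128  # keep the accumulator bounded to 128 bits


def canonical_row(row):
    # Illustrative canonicalization only: every value rendered as text,
    # joined with '|'. In practice numbers, dates and NULLs must be
    # formatted identically on the MySQL and the SQL Server side.
    return "|".join("" if v is None else str(v) for v in row)


def dataset_hash(chunks):
    """Order-independent hash over an iterable of row chunks.

    Each row is hashed on its own and the digests are combined by
    addition modulo 2**128. Addition is commutative, so the result
    does not depend on row or chunk order, and memory use is O(1).
    """
    acc = 0
    count = 0
    for chunk in chunks:
        for row in chunk:
            digest = hashlib.md5(canonical_row(row).encode("utf-8")).digest()
            acc = (acc + int.from_bytes(digest, "big")) % MOD
            count += 1
    return count, acc  # compare both; the row count guards against edge cases
```

I would run `dataset_hash` over the source cursor's chunks and again over the target table's chunks, and compare the resulting `(count, acc)` pairs.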

The problem can be abstracted: the question boils down to finding the best way to compute a hash of a large, unordered dataset while only having access to one chunk at a time. So I'm also happy with answers that are not directly related to MySQL or SQL Server. Thank you.
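Coming back to option 2, the same commutative idea could be pushed into each DBMS so that only a single (count, sum) pair crosses the wire. Both engines can hash a row (`MD5()` in MySQL, `HASHBYTES('MD5', ...)` in SQL Server) and sum a 4-byte prefix of the digest as an unsigned integer. This is only a sketch under strong assumptions: the column list, separator, and table names are placeholders, and the hard part is making both engines render every value to a byte-identical string before hashing (NULLs, number and date formats, encodings):

```python
import pymysql
import pyodbc

# First 8 hex chars of MD5 = first 4 digest bytes, read big-endian
# as an unsigned 32-bit value; summed exactly (SUM over integers in
# MySQL returns a DECIMAL).
MYSQL_QUERY = """
SELECT COUNT(*),
       SUM(CAST(CONV(LEFT(MD5(CONCAT_WS('|', col_a, col_b)), 8), 16, 10)
                AS UNSIGNED))
FROM source_table
"""

# SUBSTRING takes the same first 4 digest bytes; converting a
# VARBINARY(4) to BIGINT zero-extends, i.e. an unsigned reading.
# Note CONCAT treats NULL as '' while MySQL's CONCAT_WS skips NULLs,
# so NULL handling must be aligned explicitly in real queries.
MSSQL_QUERY = """
SELECT COUNT_BIG(*),
       SUM(CONVERT(BIGINT,
           SUBSTRING(HASHBYTES('MD5', CONCAT(col_a, '|', col_b)), 1, 4)))
FROM target_table
"""


def checksums_match(mysql_conn, mssql_conn):
    # Assumes non-empty tables (SUM is NULL over zero rows).
    with mysql_conn.cursor() as cur:
        cur.execute(MYSQL_QUERY)
        src = tuple(int(v) for v in cur.fetchone())
    cur = mssql_conn.cursor()
    cur.execute(MSSQL_QUERY)
    tgt = tuple(int(v) for v in cur.fetchone())
    return src == tgt
```

Summing 32-bit values into a BIGINT only overflows beyond roughly two billion rows, which seems acceptable here, but I am not sure whether this cross-engine canonicalization can be made robust in general.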

I have already implemented such a check for comparing two SQL tables. In that case, if you know the primary key, you can sort both tables by it, load them in chunks, compute a hash per chunk, and compare the corresponding chunks.
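For reference, that sorted-table variant looks roughly like this (a sketch; the chunk iterables stand in for whatever `ORDER BY pk`-paginated queries each side uses, and `repr(row)` is a placeholder for proper canonicalization):

```python
import hashlib
from itertools import zip_longest


def chunk_hashes(chunks):
    # One digest per chunk. This only works because both sides iterate
    # in the same primary-key order with the same chunk size, so the
    # chunk boundaries line up.
    for chunk in chunks:
        h = hashlib.md5()
        for row in chunk:
            h.update(repr(row).encode("utf-8"))
        yield h.hexdigest()


def tables_match(source_chunks, target_chunks):
    return all(a == b for a, b in
               zip_longest(chunk_hashes(source_chunks),
                           chunk_hashes(target_chunks)))
```

This breaks down exactly when no shared ordering is available, which is the situation in the question above.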
