SQL for validating/reconciling data across two tables, when each is a copy of the other

6.4k Views Asked by At

I am looking to validate data across 2 tables, where the TableA (in the users source database - DB2) is a copy of the TableB (in the Data warehouse - DB2/Netezza).

Due to the process of copying the data across not being stable and being run on a daily basis instead of in real-time, there are discrepancies in the target DB, which I need to figure out and find. Also, do not wish to put a big load on the server, which might disrupt normal operations of the organization.

I am doing the following currently:

a) SUM checks on all numeric fields 
b) MAX / MIN checks
c) Row Count checks (COUNT(*))
d) Checking the Data Types on both the DB's to be equal
e) Checking the DISTINCT count of the PRIMARY fields
f) Check MAX of Date on which the row was copied over (we put a new column to specify when that row was copied over)

Aside from the above, is there any other simpler way of doing it in a single SQL Query, which does not use up too many resources on the Database Servers?

Also, does anyone know of a way to match a row to row in DB2-DB2 and DB2-Netezza tables, like one can do with a SAS table?

Note:

a) Netezza does not have the concept of UNIQUE or PRIMARY, so Referential Integrity and UNIQUE checking is not valid.
b) All the tables have > 100M rows, so running a simple RowCount on each table can sometimes take upto 30-45 mins
1

There are 1 best solutions below

2
On

I'm wondering whether you have a field or two in your data that would let you identify it as the delta data or differential data, if you don't then you might need to review that data model. Also, I would definitely pay attention to the infrastructure first, and make that data copying process stable and reliable.

On the other hand, it's kind of hard to imagine why to implement a copy of data in a data warehose somewhere else, that data is supposed to be queried for analytical purposes, with data warehouse reporting tools, and it's usually at least a day old. So nothing real time there.

The query you are looking for probably exists, but in the system you describe with >100M rows and 45 mins RowCounts, maybe, just maybe it might take forever and a day.

Bottom line is, with all due respect you might be looking in the wrong direction.