I would like to automate this processes for multiple datasets. I am not sure if there exists a tool like Alteryx or any software where I can do the following.
- Compare 2 similar datasets on a block aggregate level.
- If mismatch exists, zoom into that block to find where the inconsistencies occur.
- Get rows of data with ID where inconsistency happens. The goal is to explore on a high level because of the huge amount of data and open a more comprenhesive view of the inconsistency to point out the error easily.
Here is the data that exists in both SIT and PROD:
| ID | Block | Amt A | Amt B | Amt C | .... |
|---|---|---|---|---|---|
| 1 | X | 1000 | 400 | 1000 | .... |
| 2 | Y | 2000 | 500 | 4000 | .... |
| 3 | Z | 3000 | 500 | 3000 | .... |
| 2 | Y | 2000 | 3000 | 2000 | .... |
| 3 | Z | 3000 | 4500 | 3000 | .... |
- This is my PROD data after aggregating on 'block'
| Block | Counts | Sum Amt A | Sum Amt B |
|---|---|---|---|
| X | 1000 | 40000 | 10000 |
| Y | 2000 | 45000 | 20000 |
| Z | 3000 | 50000 | 30000 |
- This is my SIT data after aggregating on 'block'
| Block | Counts | Sum Amt A | Sum Amt B |
|---|---|---|---|
| X | 1000 | 40000 | 15000 |
| Y | 2000 | 47000 | 20000 |
| Z | 3000 | 50000 | 30000 |
FYI - There are multiple columns like block that could be used on aggregation. One could choose any column, get an aggregation to find inconsistencies and then explore into that particular block.
It's a typical full join, then filter process, you can do it in SQL, Python dataframe or Alteryx, pretty much anything
Same principle in Alteryx, just a join and a selection filter will do