How to include sample rows with unequal values in the Spark version of datacompy


I am starting to explore Spark to speed up processing and have recently been trying out the Spark version of datacompy.

The only issue I am having is that I cannot find a way to print a sample of the unequal values when mismatches are found. In the non-Spark version of datacompy, this is part of the standard report.

It does not appear to be an option in the online documentation. Has anyone else had this issue and found a solution?

Thanks in advance for any help!

Non-spark datacompy report

DataComPy Comparison

DataFrame Summary

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7

Columns with Unequal Values or Types

       Column  original dtype  new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt         float64    float64          1    0.0500            0
1   float_fld         float64    float64          4    0.0005            3
2        name          object     object          2    0.0000            0

Sample Rows with Unequal Values

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
5  10000001238                   NaN          111.000
2  10000001236                   NaN            1.000
1  10000001235                1.0000              NaN

       acct_id  name (original)            name (new)
0  10000001234   George Maharis  George Michael Bluth
3  10000001237       Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)

       acct_id  dollar_amt           name  float_fld    date_fld
4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

Spark datacompy report

****** Column Summary ******

Number of columns in common with matching schemas: 3
Number of columns in common with schema differences: 2
Number of columns in base but not compare: 0
Number of columns in compare but not base: 0

****** Schema Differences ******
Base Column Name  Compare Column Name     Base Dtype  Compare Dtype
----------------  ----------------------  ----------  -------------
open_dt           AM00_DATE_ACCOUNT_OPEN  date        bigint
tbal_cd           AM0B_FC_TBAL            string      double

****** Row Summary ******
Number of rows in common: 5
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 0
Number of duplicate rows found in compare: 0

****** Row Comparison ******
Number of rows with some columns unequal: 5
Number of rows with all columns equal: 0

****** Column Comparison ******
Number of columns compared with unexpected differences in some values: 1
Number of columns compared with all values equal but known differences found: 2
Number of columns compared with all values completely equal: 0

****** Columns with Unequal Values ******
Base Column Name  Compare Column Name     Base Dtype  Compare Dtype  # Matches  # Known Diffs  # Mismatches
----------------  ----------------------  ----------  -------------  ---------  -------------  ------------
clsd_reas_cd      AM00_STATC_CLOSED       string      string                 2              2             1
open_dt           AM00_DATE_ACCOUNT_OPEN  date        bigint                 0              5             0
tbal_cd           AM0B_FC_TBAL            string      double                 0              5             0
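
One possible workaround, sketched here, is to bring a limited set of matched rows back to the driver and pick out the mismatches by hand, which roughly reproduces the "Sample Rows with Unequal Values" section of the non-Spark report. This is plain Python and not part of datacompy: the function names `values_match` and `sample_unequal_rows` are hypothetical, the rows are assumed to be dictionaries (for example from `Row.asDict()` on a small collected sample), and the account values are illustrative ones borrowed from the non-Spark report. It may also be worth checking whether the installed version exposes the mismatched rows directly as a DataFrame (I believe the legacy `SparkCompare` class has a `rows_both_mismatch` property, but treat that as an assumption to verify).

```python
import math


def values_match(a, b, abs_tol=0.0):
    """Return True when two cell values should be treated as equal."""
    a_missing = a is None or (isinstance(a, float) and math.isnan(a))
    b_missing = b is None or (isinstance(b, float) and math.isnan(b))
    if a_missing or b_missing:
        return a_missing and b_missing  # a missing value only matches another missing value
    numeric = (int, float)
    if isinstance(a, numeric) and isinstance(b, numeric):
        return math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
    return a == b


def sample_unequal_rows(base_rows, compare_rows, join_key, columns,
                        abs_tol=0.0, sample_count=10):
    """Collect up to sample_count (key, base value, compare value) tuples per column."""
    compare_by_key = {row[join_key]: row for row in compare_rows}
    samples = {col: [] for col in columns}
    for base in base_rows:
        other = compare_by_key.get(base[join_key])
        if other is None:
            continue  # row exists only in base, so there is nothing to compare
        for col in columns:
            if len(samples[col]) >= sample_count:
                continue  # this column already has enough samples
            if not values_match(base.get(col), other.get(col), abs_tol):
                samples[col].append((base[join_key], base.get(col), other.get(col)))
    return samples


# Illustrative rows only; in practice these would come from a small collected sample.
base = [
    {"acct_id": 10000001234, "dollar_amt": 123.45, "name": "George Maharis"},
    {"acct_id": 10000001237, "dollar_amt": 123456.0, "name": "Bob Loblaw"},
]
compare = [
    {"acct_id": 10000001234, "dollar_amt": 123.4, "name": "George Michael Bluth"},
    {"acct_id": 10000001237, "dollar_amt": 123456.0, "name": "Robert Loblaw"},
]

samples = sample_unequal_rows(base, compare, "acct_id",
                              ["dollar_amt", "name"], abs_tol=0.0001)
for column, rows in samples.items():
    print(column, rows)
```

Because the comparison runs on the driver, this only makes sense for a small sample (for instance after a `limit`), not for the full dataset.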
