I am starting to explore Spark to speed up processing and have recently been trying the Spark version of datacompy.
The only issue I am having is that I cannot find a way to print a sample of the unequal values when mismatches are found. In the non-Spark version of datacompy, this sample is part of the standard report.
It does not appear to be an option when I review the manual on the web; has anyone else had this issue and found a solution?
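To make the request concrete, here is a minimal sketch in plain Python of what a "sample of unequal values" amounts to: match rows on a key, compare one column, and keep the first few pairs that differ. This is not datacompy's API. The helper names `sample_unequal` and `_values_equal` and their parameters are hypothetical, and the two example rows are taken from the `name` sample in the non-Spark report below.

```python
import math


def _values_equal(a, b, abs_tol):
    # Numbers compare within an absolute tolerance; two NaNs count as equal.
    # Everything else (strings, None, ...) must match exactly.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        if math.isnan(a) and math.isnan(b):
            return True
        return abs(a - b) <= abs_tol
    return a == b


def sample_unequal(base_rows, compare_rows, key, column, abs_tol=0.0, limit=10):
    """Return up to `limit` (key, base_value, compare_value) tuples that differ.

    Hypothetical helper, not part of datacompy. Rows are plain dicts matched
    on `key`; rows present on only one side are skipped.
    """
    compare_by_key = {row[key]: row for row in compare_rows}
    sample = []
    for row in base_rows:
        other = compare_by_key.get(row[key])
        if other is None:
            continue  # only in base, so there is no value to compare against
        if not _values_equal(row[column], other[column], abs_tol):
            sample.append((row[key], row[column], other[column]))
            if len(sample) == limit:
                break
    return sample


# Values copied from the "name" sample in the non-Spark report.
original = [
    {"acct_id": 10000001234, "name": "George Maharis"},
    {"acct_id": 10000001237, "name": "Bob Loblaw"},
]
new = [
    {"acct_id": 10000001234, "name": "George Michael Bluth"},
    {"acct_id": 10000001237, "name": "Robert Loblaw"},
]

for mismatch in sample_unequal(original, new, key="acct_id", column="name"):
    print(mismatch)
```

In Spark, the same idea would presumably be a join on the key followed by a filter on the compared columns; the sketch only pins down the output being asked for.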
Thanks in advance for any help!
Non-spark datacompy report
DataComPy Comparison
DataFrame Summary
   DataFrame  Columns  Rows
0  original   5        6
1  new        4        5
Column Summary
Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0
Row Summary
Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0
Column Comparison
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7
Columns with Unequal Values or Types
   Column      original dtype  new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt  float64         float64    1          0.0500    0
1  float_fld   float64         float64    4          0.0005    3
2  name        object          object     2          0.0000    0
**Sample Rows with Unequal Values**

   acct_id      dollar_amt (original)  dollar_amt (new)
0  10000001234  123.45                 123.4

   acct_id      float_fld (original)  float_fld (new)
0  10000001234  14530.1555            14530.155
5  10000001238  NaN                   111.000
2  10000001236  NaN                   1.000
1  10000001235  1.0000                NaN

   acct_id      name (original)  name (new)
0  10000001234  George Maharis   George Michael Bluth
3  10000001237  Bob Loblaw       Robert Loblaw
Sample Rows Only in original (First 10 Columns)
   acct_id      dollar_amt  name           float_fld  date_fld
4  10000001238  1.05        Lucille Bluth  NaN        2017-01-01
Spark datacompy report
****** Column Summary ******
Number of columns in common with matching schemas: 3
Number of columns in common with schema differences: 2
Number of columns in base but not compare: 0
Number of columns in compare but not base: 0
****** Schema Differences ******
Base Column Name  Compare Column Name     Base Dtype  Compare Dtype
open_dt           AM00_DATE_ACCOUNT_OPEN  date        bigint
tbal_cd           AM0B_FC_TBAL            string      double
****** Row Summary ******
Number of rows in common: 5
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 0
Number of duplicate rows found in compare: 0

****** Row Comparison ******
Number of rows with some columns unequal: 5
Number of rows with all columns equal: 0

****** Column Comparison ******
Number of columns compared with unexpected differences in some values: 1
Number of columns compared with all values equal but known differences found: 2
Number of columns compared with all values completely equal: 0
****** Columns with Unequal Values ******
Base Column Name  Compare Column Name     Base Dtype  Compare Dtype  # Matches  # Known Diffs  # Mismatches
clsd_reas_cd      AM00_STATC_CLOSED       string      string         2          2              1
open_dt           AM00_DATE_ACCOUNT_OPEN  date        bigint         0          5              0
tbal_cd           AM0B_FC_TBAL            string      double         0          5              0