I am running 100 queries (test cases) to check for data quality in hive/impala. The majority of the queries check for null values based on some conditions. I am using conditional aggregation to count the trivial test cases like below. I want to add a more complex query condition to this type of check. I also would like to see the counts if there are nulls.
I want to know how to incorporate the more complex query and also add a count if there are nulls present. Expected output below.
What I have so far:
SELECT (CASE WHEN COUNT(*) = COUNT(car_type) THEN 'PASS' ELSE 'FAIL' END) as car_type_test,
(CASE WHEN COUNT(*) = COUNT(car_color) THEN 'PASS' ELSE 'FAIL' END) as car_color_test,
(CASE WHEN COUNT(*) = COUNT(car_sale) THEN 'PASS' ELSE 'FAIL' END) as car_sale_test
FROM car_data;
More complex type query to add:
SELECT Count(*),
car_job
FROM car_data
WHERE car_job NOT IN ( "car_type", "car_license", "car_cancellation",
"car_color", "car_contract", "car_metal", "car_number" )
OR car_job IS NULL
GROUP BY car_job
Example expected output:
car_type_test car_color_test car_sale_test car_job_test
PASS PASS PASS FAIL
102
I would recommend putting this on one row instead of two: