AWS Glue Data Quality rule should fail but passes for a null-containing column


I have a CSV file with 8 columns, from which I purposely deleted some cell values.

When I ran a Glue Data Quality job, the IsComplete rule passed for one column (which it should not have), while the rules on the other columns failed as expected. Why does this happen?

Sample data: (screenshot of the CSV, not included)

Ruleset used (DQDL):

Rules = [
    IsComplete "Gender",
    Completeness "Gender" = 1,
    IsComplete "Age",
    IsComplete "Spending Score (1-100)",
    IsComplete "annual income ($)",
    IsComplete "Family Size",
    IsComplete "work experience"
]

DQ outcome: (screenshot of the results, not included)

By rights, Rule 1 and Rule 2 should fail as well.


There are 2 best solutions below

Answer from EnterPassword:

Blank CSV values are treated as empty strings, I believe, which may not equate to NULL in the data quality check. You may need to cast that column, or provide a schema, for the rule to work correctly.
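The distinction can be demonstrated in plain Python, with no Glue dependencies (the column name is borrowed from the question's ruleset): a blank CSV cell comes through as an empty string, which a null-based completeness check still counts as present.

```python
# Sketch: why a blank CSV cell can pass a null-based completeness check.
# CSV readers yield "" for a blank cell, and "" is not None.
rows = [
    {"Gender": "Male"},
    {"Gender": ""},    # blank cell as read from a CSV
    {"Gender": None},  # a true null
]

null_count = sum(1 for r in rows if r["Gender"] is None)
blank_count = sum(1 for r in rows if r["Gender"] == "")

print(null_count)   # 1 -- only the true null reads as incomplete
print(blank_count)  # 1 -- the blank cell slips through as ""
```

If the blank were converted to None (or the column cast to a non-string type that cannot hold ""), both rows would register as null and the rule would fail as expected.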

Answer from Mallikarjun:

To resolve this, convert empty strings in the data to NULL/None in the DynamicFrame before evaluating the data quality ruleset.

from awsglue.transforms import Map

def empty_string_to_null(row):
    # Replace empty-string values with None so they register as nulls
    for key, value in row.items():
        if value == "":
            row[key] = None
    return row

# Convert empty strings to null across the whole frame
modified_dynamic_frame = Map.apply(frame=dynamic_frame, f=empty_string_to_null)
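The mapper itself is plain Python, so it can be sanity-checked without a Glue environment (the helper is repeated here so the snippet runs standalone):

```python
def empty_string_to_null(row):
    # Replace empty-string values with None so they register as nulls
    for key, value in row.items():
        if value == "":
            row[key] = None
    return row

sample = {"Gender": "", "Age": "23"}
print(empty_string_to_null(sample))  # {'Gender': None, 'Age': '23'}
```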

"""
EvaluateDataQualityNode = EvaluateDataQuality().process_rows(
    frame=modified_dynamic_frame,
    ruleset=Rules,........
)
"""