I have just started with PyDeequ and I want to create checks for a Spark DataFrame that has ~1800 features. To find out which checks I should perform, I do the following:
```python
suggestionResult = ConstraintSuggestionRunner(spark) \
    .onData(df) \
    .addConstraintRule(DEFAULT()) \
    .run()
```
This gives me suggestions for all the checks that I could run on my data. Now the goal is twofold:
- I may want to run all the checks provided by `suggestionResult`.
- I may want to run a particular check, e.g. a NonNegative or Unique check, for a series of features.
I am completely unsure how to do this; after trying several ways, it still doesn't work. I know it is certainly possible to run all suggested checks at once, but only in Scala (see this); I need to do it in PyDeequ, as per my first point above.
I did attempt the following, but it didn't work and gave me an error about duplicate analyzers:

```python
from functools import reduce

check_list = [check.isNonNegative, check.isPositive]
checkResultBuilder = VerificationSuite(spark).onData(df)
for col in sub_cols:
    checkResultBuilder = reduce(
        lambda vbuilder, checker: vbuilder.addCheck(checker(col)),
        check_list,
        checkResultBuilder)
checkResultBuilder.run()
```
If helpful for anyone, here's a full example showing how to generate suggested data-quality constraints and then check all of them.

Note: this example uses PyDeequ, which is the Python implementation of Scala's Deequ. The question specifically mentions Deequ, but PyDeequ has a very similar suite of APIs. I built this solution partially off @mlin's solution.
First, let's create a string that is a concatenation of all the suggested constraints:
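The original code block for this step is not shown above, so here is a minimal, Spark-free sketch of the concatenation. The `suggestionResult` dict below is a hypothetical stand-in shaped like what `ConstraintSuggestionRunner(spark).onData(df).addConstraintRule(DEFAULT()).run()` returns (the `constraint_suggestions` / `code_for_constraint` keys and the column names are assumptions for illustration):

```python
# Hypothetical suggestion output, shaped like PyDeequ's
# ConstraintSuggestionRunner result; column names are made up.
suggestionResult = {
    "constraint_suggestions": [
        {"column_name": "age", "code_for_constraint": '.isNonNegative("age")'},
        {"column_name": "id", "code_for_constraint": '.isUnique("id")'},
    ]
}

# Concatenate every suggested constraint snippet into one string.
pydeequ_validation_string = "".join(
    s["code_for_constraint"] for s in suggestionResult["constraint_suggestions"]
)
print(pydeequ_validation_string)  # .isNonNegative("age").isUnique("id")
```

Each snippet already begins with a dot, so the joined string reads as one fluent chain of constraint calls.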
At this point, our `pydeequ_validation_string` is a chain of constraint calls, one snippet per suggestion.

Now, let's take our `pydeequ_validation_string` and use it to check all these constraints at once. Here's a function to do this. Note, I'm first concatenating our string with `"check"`, and then using Python's `eval` to evaluate this string as code.
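The function body is not shown above, so here is a Spark-free sketch of the `eval` trick it describes, using a stub `Check` class in place of `pydeequ.checks.Check` (the stub, its methods, and the column names are illustrative assumptions):

```python
class Check:
    """Minimal stand-in for pydeequ.checks.Check, used only to
    illustrate the eval trick. In real code you would build
    Check(spark, CheckLevel.Warning, "Review anomalies")."""
    def __init__(self):
        self.constraints = []

    def isNonNegative(self, col):
        self.constraints.append(f"isNonNegative({col})")
        return self  # fluent API, like PyDeequ's

    def isUnique(self, col):
        self.constraints.append(f"isUnique({col})")
        return self


pydeequ_validation_string = '.isNonNegative("age").isUnique("id")'
check = Check()
# Prepend "check" so eval sees a complete fluent expression,
# e.g. eval('check.isNonNegative("age").isUnique("id")'):
check = eval("check" + pydeequ_validation_string)
print(check.constraints)  # ['isNonNegative(age)', 'isUnique(id)']
```

In real PyDeequ code, the evaluated `check` would then be passed to `VerificationSuite(spark).onData(df).addCheck(check).run()` to run every suggested constraint in a single pass.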