I have just started with pydeequ and I want to create checks for a Spark DataFrame that has ~1800 features. To find out which checks I should perform, I do the following:
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

suggestionResult = ConstraintSuggestionRunner(spark) \
    .onData(df) \
    .addConstraintRule(DEFAULT()) \
    .run()
This gives me suggestions for all the checks I could run on my data. My goal is twofold:
- I may want to run the checks provided by suggestionResult.
- I may want to run a particular check, e.g. NonNegative or Unique, for a series of features.
I am completely unsure how to do this; after trying several approaches, it still doesn't work. I know it is certainly possible to run all suggested checks at once, but only in Scala (see this); I need to do it in pydeequ, as per my first point.
I attempted the following, but it didn't work; it gave me an error about duplicate analyzers:
from functools import reduce

check_list = [check.isNonNegative, check.isPositive]
checkResultBuilder = VerificationSuite(spark).onData(df)
for col in sub_cols:
    # For each column, add every check in check_list to the builder
    checkResultBuilder = reduce(
        lambda vbuilder, checker: vbuilder.addCheck(checker(col)),
        check_list, checkResultBuilder)
checkResultBuilder.run()
You can use the method listed here: https://github.com/awslabs/python-deequ/issues/23, then pass the arguments as a list called args and unpack it as *args. The Constraint Suggestion Runner returns a dictionary with the constraints under the constraint_suggestions key, which you can unpack further with a little work reading deeper inside the dictionary. Use eval(str) to turn the string form of the extra parameters into the proper objects and get(attr) to add the constraint given the name.
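As a rough sketch of the idea above (not the exact code from the linked issue), something like the following could work. It assumes each entry under constraint_suggestions carries a code_for_constraint string shaped like .isComplete("b"), and that every Check method returns the check itself so calls can be chained. The helper names parse_constraint_code, apply_suggestions and apply_to_columns are made up for illustration, and Python's built-in getattr is used to look up the constraint method by name. Since eval runs arbitrary code, only use it on suggestion output you trust.

```python
import ast


def parse_constraint_code(code, namespace=None):
    """Split a string such as '.isContainedIn("a", ["x", "y"])' into
    (method_name, positional_args).

    `namespace` holds any names the argument strings refer to
    (assumption: e.g. {"ConstrainableDataTypes": ConstrainableDataTypes}).
    """
    source = "check" + code.strip()
    call = ast.parse(source, mode="eval").body
    # Expect exactly one method call on a bare name: check.<method>(...)
    if not (isinstance(call, ast.Call)
            and isinstance(call.func, ast.Attribute)
            and isinstance(call.func.value, ast.Name)):
        raise ValueError(f"unexpected constraint code: {code!r}")
    ns = dict(namespace or {})
    # eval turns the string form of each argument into a real object
    args = [eval(ast.get_source_segment(source, arg), ns) for arg in call.args]
    return call.func.attr, args


def apply_suggestions(check, suggestion_result, namespace=None):
    """Goal 1: add every suggested constraint to one check object."""
    for suggestion in suggestion_result["constraint_suggestions"]:
        name, args = parse_constraint_code(
            suggestion["code_for_constraint"], namespace)
        # Look the method up by name and unpack the arguments as *args
        check = getattr(check, name)(*args)
    return check


def apply_to_columns(check, constraint_names, columns):
    """Goal 2: add chosen single-column constraints for a series of columns."""
    for column in columns:
        for name in constraint_names:
            check = getattr(check, name)(column)
    return check


# Hypothetical usage with pydeequ (not executed here):
# from pydeequ.checks import Check, CheckLevel
# from pydeequ.verification import VerificationSuite
# check = Check(spark, CheckLevel.Warning, "suggested checks")
# check = apply_suggestions(check, suggestionResult)
# check = apply_to_columns(check, ["isNonNegative", "isUnique"], sub_cols)
# result = VerificationSuite(spark).onData(df).addCheck(check).run()
```

Both helpers build up a single check that is added to the VerificationSuite once, instead of calling addCheck once per column and constraint as in the question. Whether that removes the duplicate-analyzer error in your setup is something to verify.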