Is there any way to define data quality rules that can be applied to DataFrames? The template for defining a rule should be simple enough for a non-technical user to fill in, and we would then translate the rules into PySpark code and run them over the data.
I was thinking along the lines of the example below.
| ID | ProjectID | RuleID | Attribute1 | Value1 | Condition1 | Attribute2 | Value2 | Condition2 | Type | ModifyAttribute | ModificationLogic | CustomUDF |
|----|-----------|--------|------------|--------|------------|------------|--------|------------|------|-----------------|-------------------|-----------|
| 1  | 1 | 1 | SerialNum  | 6             | EQUAL      |        |               |    | MODIFY | SerialNum | SUBSTR(serialNum,1,6) | |
| 2  | 1 | 2 | DriverName | ['A','B','C'] | VALUEMATCH | Source | ['D','E','F'] | IN | REJECT |           |                       | |
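For example, I would expect rules 1 and 2 above to end up as something like the following PySpark (just a hand-written sketch of the intent; the sample DataFrame and my reading of the EQUAL condition are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical DataFrame with the columns the rules refer to
    df = spark.createDataFrame(
        [("12345678", "A", "D"), ("123456", "X", "Y")],
        ["SerialNum", "DriverName", "Source"],
    )

    # Rule 1 (MODIFY): one reading -- if SerialNum does not already have length 6,
    # replace it with SUBSTR(SerialNum, 1, 6)
    df = df.withColumn(
        "SerialNum",
        F.when(F.length("SerialNum") == 6, F.col("SerialNum"))
         .otherwise(F.substring("SerialNum", 1, 6)),
    )

    # Rule 2 (REJECT): drop rows where DriverName is in ['A','B','C']
    # and Source is in ['D','E','F']
    df = df.filter(
        ~(F.col("DriverName").isin("A", "B", "C") & F.col("Source").isin("D", "E", "F"))
    )

    df.show()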
If there are any tools or a domain-specific language for defining such rules, that would help. A template for rules that work across attributes and across multiple tables (joins, e.g. a country lookup) would also be helpful.
Surprised no one has taken a shot at answering this yet. Typically, for a use case like this, I would use ConfigParser. Depending on your architecture, you can define sections and rules that can easily be read and executed. But that is something a developer would find easy to use, rather than a normal user.
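As a minimal sketch of what I mean (assuming a rules.ini laid out roughly like your template, one section per rule):

    import configparser

    # Hypothetical rules.ini:
    #
    # [rule_2]
    # attribute1 = DriverName
    # value1     = A,B,C
    # condition1 = VALUEMATCH
    # attribute2 = Source
    # value2     = D,E,F
    # condition2 = IN
    # type       = REJECT

    config = configparser.ConfigParser()
    config.read("rules.ini")

    for section in config.sections():
        rule = dict(config[section])
        # Dispatch on rule["type"] here and build the matching PySpark expression
        print(section, rule)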
With that out of the way: since Python is a scripting language with a lot of flexibility, you can simply create an Excel file in the format you have given and have it drive the flow of your data manipulation. I hope this helps in some way. Let me know if you need more info.
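A rough sketch of that Excel-driven flow might look like the following (the file names and the exact dispatch logic are assumptions; the columns follow your template, and the loop would need to be extended for Attribute2/Condition2 and CustomUDF):

    import ast
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical inputs: the rules sheet uses the column layout from the question
    rules = pd.read_excel("dq_rules.xlsx")
    df = spark.read.parquet("input_data.parquet")

    for _, rule in rules.iterrows():
        if rule["Type"] == "MODIFY":
            # Treat ModificationLogic as a Spark SQL expression, e.g. SUBSTR(SerialNum, 1, 6)
            df = df.withColumn(rule["ModifyAttribute"], F.expr(rule["ModificationLogic"]))
        elif rule["Type"] == "REJECT":
            # Value1 holds a list literal such as ['A','B','C']
            values = ast.literal_eval(rule["Value1"])
            df = df.filter(~F.col(rule["Attribute1"]).isin(values))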