Define Data Quality Rules for Big Data


Is there any way to define data quality rules that can be applied over DataFrames? The template for defining a rule should be simple enough for a layperson to fill in, and we can then take these rules, convert them to PySpark code, and run them over the data.

I was thinking along the lines of the table below.

ID  ProjectID   RuleID  Attribute1  Value1          Condition1  Attribute2  Value2          Condition2  Type    ModifyAttribute ModificationLogic   CustomUDF
1   1           1       SerialNum   6               EQUAL                                               MODIFY  SerialNum   SUBSTR(serialNum,1,6)   
2   1           2       DriverName  ['A','B','C']   VALUEMATCH  Source      ['D','E','F']   IN          REJECT  

If there are any tools or a domain-specific language for defining these rules, that would help. A template for rules that work across attributes and across multiple tables (e.g., a join against a country lookup) would also be helpful.
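For context, here is a minimal PySpark sketch of how rule rows like the two above might be translated into DataFrame operations. The column names and the MODIFY/REJECT types come from the table; the `apply_rule` function and the `expr`-based modification are just one possible illustration, not an existing library.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def apply_rule(df: DataFrame, rule: dict) -> DataFrame:
    """Translate one rule row into a DataFrame transformation."""
    if rule["Type"] == "MODIFY":
        # e.g. ModificationLogic = "SUBSTR(SerialNum, 1, 6)" used as a Spark SQL expression
        return df.withColumn(rule["ModifyAttribute"], F.expr(rule["ModificationLogic"]))
    if rule["Type"] == "REJECT":
        # e.g. drop rows where DriverName matches one of the listed values
        return df.filter(~F.col(rule["Attribute1"]).isin(rule["Value1"]))
    raise ValueError(f"Unknown rule type: {rule['Type']}")

# Rules mirroring the two rows in the table above
rules = [
    {"Type": "MODIFY", "ModifyAttribute": "SerialNum",
     "ModificationLogic": "SUBSTR(SerialNum, 1, 6)"},
    {"Type": "REJECT", "Attribute1": "DriverName", "Value1": ["A", "B", "C"]},
]

df = spark.createDataFrame(
    [("1234567890", "A"), ("9876543210", "X")], ["SerialNum", "DriverName"]
)
for rule in rules:
    df = apply_rule(df, rule)
df.show()
```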


There is 1 best solution below


Surprised no one has taken a shot at answering this yet. Typically, for a use case like this, I would use ConfigParser. Depending on your architecture, you can define sections and rules that can easily be read and executed. That said, it is something a developer would find easy to use rather than a non-technical user.
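As a rough illustration of the ConfigParser idea, each rule can become a section in an INI file whose keys mirror the columns from the question. The section names, keys, and values here are made up for the example.

```python
import configparser

RULES_INI = """
[rule_modify_serialnum]
type = MODIFY
modify_attribute = SerialNum
modification_logic = SUBSTR(SerialNum, 1, 6)

[rule_reject_drivers]
type = REJECT
attribute1 = DriverName
value1 = A,B,C
condition1 = VALUEMATCH
"""

parser = configparser.ConfigParser()
parser.read_string(RULES_INI)

# Each section becomes a plain dict that a rule executor could dispatch on
for section in parser.sections():
    rule = dict(parser[section])
    print(section, "->", rule)
```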

With that out of the way: since Python is a scripting language with a lot of flexibility, for your use case you can simply create an Excel sheet in the format you have given and let it dictate the flow of your data manipulation. I hope this helps in some way. Let me know if you need more info.
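A hedged sketch of the Excel-driven flow, assuming a hypothetical rules.xlsx laid out with the columns from the question (pandas.read_excel needs the openpyxl package for .xlsx files):

```python
import pandas as pd

# Read the rule sheet; each row is one rule in the column layout shown in the question
rules = pd.read_excel("rules.xlsx").to_dict(orient="records")

for rule in rules:
    # Each rule dict can then drive a dispatcher like the apply_rule() sketch
    # shown after the question above
    print(rule["RuleID"], rule["Type"], rule.get("ModificationLogic"))
```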