Data Quality Process - defining rules


I am working on a Data Quality Monitoring project, which is new to me. I started with data profiling to analyse my data and get a global view of it. Next, I thought about defining some data quality rules, but I'm a little confused about how to implement these rules. Could you guide me a little, as I'm totally new to this?


This is quite an ambiguous question, but I'll try to give a few tips on how to start. Since you are new to data quality and already want implementation hints, let's start there.

Purpose: a data quality monitoring system aims to a) recognize errors and b) trigger the next step for handling them.

First, build a data quality rule for your data set. A rule can operate at the attribute, record, table, or cross-table level. Let's start with an attribute-level rule. Implement a rule that recognizes when an attribute value does not contain '@'. Run it against email attributes and create an error record for each row whose email attribute lacks an '@'. An error record should have these attributes:

ErrorInstanceID; ErrorName; ErrorCategory; ErrorRule; ErrorLevel; ErrorReaction; ErrorScript; SourceSystem; SourceTable; SourceRecord; SourceAttribute; ErrorDate;

"asd2321sa1"; "Email Format Invalid"; "AttributeError"; "Does not contain @"; "Warning|Alert"; "Request new email at next login"; "ScriptID x"; "Excel1"; "Sheet1"; "RowID=34"; "Column=Email"; "1.1.2022"

MONITORING SYSTEM

You need to make the above scripts configurable so that you can easily change systems, tables, columns, and rules. When run on top of data sets, they will all populate error records into the same structures, resulting in a consistent, historical store of all errors. You should be able to build reports about existing errors in specific systems, trends of errors appearing or getting fixed, and so on.
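One way to sketch that configurability, assuming rules are declared as plain config entries (the field names and the lambda-based checks are my own assumptions) and every rule writes into one shared error store:

```python
# Hypothetical config-driven runner: each entry names a system, table,
# column and a rule; all rules append failures to one shared store.
RULES = [
    {"system": "Excel1", "table": "Sheet1", "column": "Email",
     "rule": "Does not contain @", "check": lambda v: "@" in v},
    {"system": "CRM", "table": "Customer", "column": "Phone",
     "rule": "Empty value", "check": lambda v: bool(v.strip())},
]

def run_rules(datasets, rules, error_store):
    """Apply every configured rule to its target table; append one
    error per failing row to the shared error store."""
    for r in rules:
        table = datasets.get((r["system"], r["table"]), {})
        for row_id, row in table.items():
            if not r["check"](row.get(r["column"], "")):
                error_store.append({
                    "Rule": r["rule"], "System": r["system"],
                    "Table": r["table"], "Record": row_id,
                    "Attribute": r["column"],
                })

# Usage: two data sets, each with one failing row
datasets = {
    ("Excel1", "Sheet1"): {34: {"Email": "broken"}},
    ("CRM", "Customer"): {7: {"Phone": "  "}},
}
store = []
run_rules(datasets, RULES, store)
print(len(store))  # both rules fired once -> 2 errors
```

Because every rule writes to the same structure, adding a new system or column is just a new config entry, and the error store accumulates a consistent history across all of them.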

Next, you need to start building a full-scale data quality metadata repository with a proper data model, and design suitable historical versioning for the above information. You need to store information such as which rules were run and when, and which systems and tables they checked, so that you can detect which systems have been included in monitoring and recognize when systems are not being monitored with the correct rules. In practice, this is quality monitoring for the data quality monitoring system itself. You should have statistics on which systems are monitored with which rules, when those rules were last run, and aggregates of inspected tables, records, and errors.
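A very small sketch of what such a metadata repository could look like, using an in-memory SQLite database (the table and column names are assumptions for illustration, not the book's model):

```python
import sqlite3

# Minimal data quality metadata repository: which rules exist,
# and which runs inspected which system/table, when, with what result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dq_rule (
    rule_id    INTEGER PRIMARY KEY,
    rule_name  TEXT,
    rule_level TEXT   -- attribute / record / table / cross-table
);
CREATE TABLE dq_run (
    run_id          INTEGER PRIMARY KEY,
    rule_id         INTEGER REFERENCES dq_rule(rule_id),
    source_system   TEXT,
    source_table    TEXT,
    run_started     TEXT,
    records_checked INTEGER,
    errors_found    INTEGER
);
""")
conn.execute("INSERT INTO dq_rule VALUES (1, 'Email Format Invalid', 'attribute')")
conn.execute("INSERT INTO dq_run VALUES (1, 1, 'Excel1', 'Sheet1', "
             "'2022-01-01T10:00', 1000, 12)")

# Which systems are monitored, by which rules, when last run,
# and how many errors in total?
for row in conn.execute("""
    SELECT source_system, rule_name, MAX(run_started), SUM(errors_found)
    FROM dq_run JOIN dq_rule USING (rule_id)
    GROUP BY source_system, rule_name
"""):
    print(row)
```

Queries over `dq_run` give you exactly the "monitoring of the monitoring" statistics mentioned above: coverage per system, last run per rule, and error aggregates over time.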

Typically, it's more important to focus on errors that need immediate attention and "alert" an end user to go fix the issue, or that trigger a new workflow or flag in the source system. For example, plain invalid emails might be categorized as warnings and surface only as aggregate statistics: we have 2134223 invalid emails, and nobody cares. However, it is much more important to recognize an invalid email for a person who has chosen to receive his bills as digital invoices by email. Alert. That kind of error (Invalid Email AND Email Invoicing) should trigger an alert and set a flag in the CRM so end users can try to get the email fixed, rather than just piling up as passive error records. This kind of rule should be run on top of all systems that store customer contact and billing preferences.
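A sketch of that cross-condition alert rule, assuming customer records carry an email and an invoicing preference (the field names `email` and `invoice_by_email` are hypothetical):

```python
# Alert rule combining two conditions: invalid email AND the customer
# is billed by email. Only the combination warrants an alert; a bad
# email alone stays a warning-level statistic.
def email_invoicing_alerts(customers):
    """Yield an alert (an action request, not a passive error record)
    for each customer whose billing depends on an invalid email."""
    for cust_id, c in customers.items():
        invalid_email = "@" not in c.get("email", "")
        if invalid_email and c.get("invoice_by_email", False):
            yield {"customer": cust_id,
                   "action": "set CRM flag: fix email",
                   "reason": "Invalid Email AND Email Invoicing"}

# Usage: only customer 1 combines both conditions
customers = {
    1: {"email": "bad-address", "invoice_by_email": True},   # alert
    2: {"email": "bad-address", "invoice_by_email": False},  # warning only
    3: {"email": "ok@example.com", "invoice_by_email": True},
}
alerts = list(email_invoicing_alerts(customers))
print(len(alerts))  # 1
```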

For a technical person, I can recommend this book. It goes deeper into the technical and logical issues of data quality assessment and monitoring systems, and it also includes a small metadata model for data quality metadata structures. https://www.amazon.com/Data-Quality-Assessment-Arkady-Maydanchik/dp/0977140024/