I am working on a Data Quality Monitoring project, which is new to me. I started with data profiling to analyse my data and get a global view of it. Next, I thought about defining some data quality rules, but I'm a bit confused about how to implement them. Could you guide me a little, as I'm totally new to this?
Data Quality Process - defining rules
454 Views Asked by AudioBubble At
This is quite an ambiguous question, but I'll try to give a few tips on how to start. Since you are new to data quality and already want implementation hints, let's start from that.
Purpose: a data quality monitoring system needs to a) recognize an error and b) trigger the next step for handling it.
First, build a data quality rule for your data set. The rule can be an attribute-, record-, table- or cross-table-level rule. Let's start with an attribute-level rule. Implement a rule that recognizes that an attribute's content does not have '@' in it. Run it against email attributes and create an error record for each row that does not have '@' in the email attribute. The error record should have these attributes:
ErrorInstanceID; ErrorName; ErrorCategory; ErrorRule; ErrorLevel; ErrorReaction; ErrorScript; SourceSystem; SourceTable; SourceRecord; SourceAttribute; ErrorDate;
"asd2321sa1"; "Email Format Invalid"; "AttributeError"; "Does not contain @"; "Warning|Alert"; "Request new email at next login"; "ScriptID x"; "Excel1"; "Sheet1"; "RowID=34"; "Column=Email"; "1.1.2022"
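A minimal sketch of such an attribute-level rule in Python, assuming the data set is already loaded as a list of dictionaries. The function name `check_email_has_at`, the `RowID` key, the random ID scheme and the ISO date format are illustrative assumptions; the field names follow the error record above.

```python
import uuid
from datetime import date


def check_email_has_at(rows, source_system, source_table,
                       id_field="RowID", attribute="Email"):
    """Return one error record per row whose email value lacks '@'."""
    errors = []
    for row in rows:
        value = row.get(attribute)
        if value is None or "@" not in str(value):
            errors.append({
                # Hypothetical ID scheme: first 10 hex chars of a random UUID.
                "ErrorInstanceID": uuid.uuid4().hex[:10],
                "ErrorName": "Email Format Invalid",
                "ErrorCategory": "AttributeError",
                "ErrorRule": "Does not contain @",
                "ErrorLevel": "Warning",
                "ErrorReaction": "Request new email at next login",
                "ErrorScript": "ScriptID x",
                "SourceSystem": source_system,
                "SourceTable": source_table,
                "SourceRecord": f"{id_field}={row.get(id_field)}",
                "SourceAttribute": f"Column={attribute}",
                # ISO date is an assumption; use whatever format you store.
                "ErrorDate": date.today().isoformat(),
            })
    return errors


# Sample rows are made up for illustration.
rows = [
    {"RowID": 34, "Email": "john.example.com"},
    {"RowID": 35, "Email": "jane@example.com"},
]
errors = check_email_has_at(rows, "Excel1", "Sheet1")
```

Only the row with `RowID` 34 produces an error record here. In a real setup the records would be written to an error table rather than kept in a list.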
MONITORING SYSTEM
You need to make the above scripts configurable so that you can easily change systems, tables and columns as well as rules. When run on top of data sets, they will all populate error records into the same structures, resulting in consistent, historical storage of all errors. You should be able to build reports about existing errors in specific systems, trends of errors appearing or getting fixed, and so on.
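One way to make rules configurable, sketched under assumptions: rules live in a registry keyed by a rule ID, and a separate list says which rule runs on which system, table and column. The names `RULES`, `CHECKS`, `run_checks` and the `not_empty` rule are hypothetical, and in practice the configuration would more likely come from a file or a database table than from Python literals.

```python
from datetime import date

# Rule registry: each rule has descriptive metadata and a check that
# returns True when the value is valid.
RULES = {
    "contains_at": {
        "name": "Email Format Invalid",
        "description": "Does not contain @",
        "level": "Warning",
        "check": lambda v: v is not None and "@" in str(v),
    },
    "not_empty": {
        "name": "Value Missing",
        "description": "Is null or blank",
        "level": "Warning",
        "check": lambda v: v is not None and str(v).strip() != "",
    },
}

# Which rule is applied where; changing this list changes the monitoring.
CHECKS = [
    {"system": "Excel1", "table": "Sheet1", "column": "Email", "rule": "contains_at"},
    {"system": "Excel1", "table": "Sheet1", "column": "Name", "rule": "not_empty"},
]


def run_checks(checks, datasets, id_field="RowID"):
    """Apply each configured check; datasets maps (system, table) to rows."""
    errors = []
    for cfg in checks:
        rule = RULES[cfg["rule"]]
        for row in datasets[(cfg["system"], cfg["table"])]:
            if not rule["check"](row.get(cfg["column"])):
                # Every rule writes the same record structure.
                errors.append({
                    "ErrorName": rule["name"],
                    "ErrorRule": rule["description"],
                    "ErrorLevel": rule["level"],
                    "SourceSystem": cfg["system"],
                    "SourceTable": cfg["table"],
                    "SourceRecord": f"{id_field}={row.get(id_field)}",
                    "SourceAttribute": f"Column={cfg['column']}",
                    "ErrorDate": date.today().isoformat(),
                })
    return errors


# Made-up sample data for illustration.
datasets = {
    ("Excel1", "Sheet1"): [
        {"RowID": 1, "Email": "a@b.com", "Name": ""},
        {"RowID": 2, "Email": "nope", "Name": "Bob"},
    ],
}
errors = run_checks(CHECKS, datasets)
```

Because every rule emits the same record shape, reporting on errors per system or per rule becomes a simple grouping over one structure.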
Next, you need to start building a full-scale data quality metadata repository with a proper data model, and design suitable historical versioning for the above information. You need to store information such as which rules were run and when, which systems and tables they checked, and so on. This lets you detect which systems have been included in monitoring and recognize systems that are not monitored with the correct rules. In practice, this is quality monitoring for the data quality monitoring system itself. You should have statistics on which systems are monitored with specific rules, when the rules were last run, and aggregates of inspected tables, records and errors.
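As a rough illustration of the run metadata, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The `rule_run` table, its columns and the helper functions are assumptions made for this example, not a recommended data model; a real repository would need a fuller model and versioning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rule_run (
        run_id            INTEGER PRIMARY KEY,
        rule_id           TEXT NOT NULL,
        source_system     TEXT NOT NULL,
        source_table      TEXT NOT NULL,
        run_at            TEXT NOT NULL,   -- ISO 8601 timestamp
        records_inspected INTEGER NOT NULL,
        errors_found      INTEGER NOT NULL
    )
""")


def log_run(conn, rule_id, system, table, run_at, inspected, errors):
    """Record one execution of a rule against one table."""
    conn.execute(
        "INSERT INTO rule_run (rule_id, source_system, source_table, run_at,"
        " records_inspected, errors_found) VALUES (?, ?, ?, ?, ?, ?)",
        (rule_id, system, table, run_at, inspected, errors),
    )


def coverage(conn):
    """Per system and rule: last run time, total records and total errors."""
    return conn.execute(
        "SELECT source_system, rule_id, MAX(run_at),"
        " SUM(records_inspected), SUM(errors_found)"
        " FROM rule_run GROUP BY source_system, rule_id"
        " ORDER BY source_system, rule_id"
    ).fetchall()


def unmonitored(conn, expected):
    """Return the (system, rule) pairs that should be monitored but never ran."""
    seen = set(conn.execute(
        "SELECT DISTINCT source_system, rule_id FROM rule_run").fetchall())
    return set(expected) - seen


# Made-up run history for illustration.
log_run(conn, "contains_at", "Excel1", "Sheet1", "2022-01-01T08:00:00", 100, 5)
log_run(conn, "contains_at", "Excel1", "Sheet1", "2022-01-02T08:00:00", 120, 3)
log_run(conn, "contains_at", "CRM", "Customer", "2022-01-02T09:00:00", 50, 0)

stats = coverage(conn)
gaps = unmonitored(conn, {("CRM", "contains_at"), ("Billing", "contains_at")})
```

Here `gaps` contains the hypothetical `Billing` system, which was expected to be checked with `contains_at` but has no logged run.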
Typically, it's more important to focus on errors that need immediate attention and either "alert" an end user to go and fix the issue, or trigger a new workflow or a flag in the source system. For example, invalid emails might be categorized as warnings and kept only as aggregate statistics. We have 2134223 invalid emails. Nobody cares. However, it might be more important to recognize the invalid email of a person who has ordered his bills as digital invoices to that email. Alert. That kind of error (Invalid Email AND Email Invoicing) should trigger an alert and set a flag in the CRM so that end users try to get the email fixed. The goal is that no error records of this kind remain. This kind of rule should be run on top of all systems that store customer contact and billing preferences.
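The combined condition (Invalid Email AND Email Invoicing) is a record-level rule. A minimal sketch, assuming customer records are dictionaries with hypothetical `customer_id`, `email` and `invoice_by_email` fields; how the alert is delivered or how the CRM flag is set depends on your systems and is not shown.

```python
def find_invoice_email_alerts(customers):
    """Flag customers who receive invoices by email but have an invalid email."""
    alerts = []
    for customer in customers:
        email = customer.get("email") or ""
        email_invalid = "@" not in email
        if customer.get("invoice_by_email") and email_invalid:
            alerts.append({
                "customer_id": customer["customer_id"],
                "ErrorName": "Invalid Email AND Email Invoicing",
                "ErrorLevel": "Alert",
                "ErrorReaction": "Set flag in CRM to get email fixed",
            })
    return alerts


# Made-up sample records for illustration.
customers = [
    {"customer_id": 1, "email": "bad-address", "invoice_by_email": True},
    {"customer_id": 2, "email": "bad-address", "invoice_by_email": False},
    {"customer_id": 3, "email": "ok@example.com", "invoice_by_email": True},
    {"customer_id": 4, "email": None, "invoice_by_email": True},
]
alerts = find_invoice_email_alerts(customers)
```

Customer 2 also has an invalid email but is not on email invoicing, so it only counts toward the aggregate warning statistics and raises no alert.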
For a technical person, I can recommend this book. It goes deeper into the technical and logical issues of data quality assessment and monitoring systems, and it also includes a small metadata model for data quality metadata structures. https://www.amazon.com/Data-Quality-Assessment-Arkady-Maydanchik/dp/0977140024/