I am working on a Data Quality Monitoring project which is new me. I started with a Data Profiling to analyse my data and have a global view of it. Next, i thought about defining some data quality rules, but i'm a little bit confused about how to implement these rules. If u guys can guide me a little bit as i'm totally new to this.
Data Quality Process - defining rules
476 Views Asked by AudioBubble At
1
There are 1 best solutions below
Related Questions in PYTHON
- new thread blocks main thread
- Extracting viewCount & SubscriberCount from YouTube API V3 for a given channel, where channelID does not equal userID
- Display images on Django Template Site
- Difference between list() and dict() with generators
- How can I serialize a numpy array while preserving matrix dimensions?
- Protractor did not run properly when using browser.wait, msg: "Wait timed out after XXXms"
- Why is my program adding int as string (4+7 = 47)?
- store numpy array in mysql
- how to omit the less frequent words from a dictionary in python?
- Update a text file with ( new words+ \n ) after the words is appended into a list
- python how to write list of lists to file
- Removing URL features from tokens in NLTK
- Optimizing for Social Leaderboards
- Python : Get size of string in bytes
- What is the code of the sorted function?
Related Questions in MONITORING
- How to get raw hadoop metrics
- Nodejs ZMQ monitoring sockets
- Ambari Monitoring raw data
- Monitoring an applications performance within Visual studio
- how to monitor mesos frameworks
- PDF report in zabbix 2.2.9
- See data that an app is secretly sending to web server in the background
- Monitor Hadoop Cluster using Collectl
- Questions Nagios Monitoring
- NewRelic says "No data reporting for this application"
- How to monitor API calls on EC2?
- IIS Monitoring Through Zabbix
- Hadoop and Spark Monitoring and alert tool(Open source tools)
- Centreon/Icinga: command by services
- New Relic not logging custom parameters on transactions
Related Questions in DATA-QUALITY
- Working with inaccurate (incorrect) dataset
- from free text to list of value
- R - estimating missing values
- Data quality database model
- Informatica Data Quality - Match Analysis
- Data quality with Ruby
- Data Quality Scan in Dataplex with Python library
- How to actually measure/compute data quality
- AWS data quality not getting results when job name contains a "/"
- Data quality rule on REDCap returns two associated discrepancies
- "Error Category: UNCLASSIFIED_ERROR; TypeError: unhashable type: 'list'"
- Define Data Quality Rules for Big Data
- How to get sum of multiple rows in a table dynamically
- Spark Compatible Data Quality Framework for Narrow Data
- Data Quality Framework in AWS
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
This is quite ambiguous question but I try to guess a few tips how to start. Since you are a new to data quality and want already implementation hints, lets start from that.
Purpose: Data quality monitoring system wants to a) recognize error and b) trigger next step how to handle it.
First, build a data quality rule for your data set. The rule can be attribute, record, table or cross-table rule. Lets start with attribute level rule. Implement a rule that recognizes that attribute content does not have '@' in it. Run it to email attributes and create an error record for each row that does not have '@' in email attribute. Error record should have these attributes:
ErrorInstanceID; ErrorName; ErrorCategory; ErrorRule; ErrorLevel; ErrorReaction; ErrorScript; SourceSystem; SourceTable; SourceRecord; SourceAttribute; ErrorDate;
"asd2321sa1"; "Email Format Invalid"; "AttributeError"; "Does not contain @"; "Warning|Alert"; "Request new email at next login"; "ScriptID x"; "Excel1"; "Sheet1"; "RowID=34"; "Column=Email"; "1.1.2022"
MONITORING SYSTEM
You need to make above scripts configurable so that you can change systems, tables and columns as well as rules easily. When ran on top of data sets, they will all populate error records to the same structures resulting in a consistent and historical storage of all errors. You should be able to build reports about existing errors in specific systems, trends of errors appearing or getting fixed and so on.
Next, you need to start building a full-sale data quality metadata repository with a proper data model and design a suitable historical versioning for the above information. You need to store information like which rules were ran and when, which systems and tables they checked, and so on. To detect which systems have bee included in monitoring and also to recognize if systems are not monitored with correct rules. In practice, quality monitoring for data quality monitoring system. You should have statistics which systems are monitored with specific rules, when they were ran last time, aggregates of inspected tables, records and errors.
Typically, its more important to focus on errors that need immediate attention and "alert" an end-user to go fix the issue or triggers a new workflow or flag in source system. For example, invalid emails might be categorized as alerts and be just aggregate statistics. We have 2134223 invalid emails. Nobody cares. However, it might be more important to recognize invalid email of a person who has ordered his bills as digital invoices to his email. Alert. That kind of error (Invalid Email AND Email Invoicing) should trigger an alert and set up a flag in CRM for end users to try get email fixed. There should not be any error records for this error. But this kind of rule should be ran on top of all systems that store customer contact and billind preferences.
For a technical person, I could recommend this book. It's a good book that goes deeper in technical and logical issues of data quality assessment and monitoring systems. There is also a small metadata model for data quality metadata structures. https://www.amazon.com/Data-Quality-Assessment-Arkady-Maydanchik/dp/0977140024/