I'm checking out Deequ, which seems like a really nice library. Is it possible to load constraints from a CSV file or an ORC table in HDFS?
Let's say I have a table with these types:
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)
and I want to put constraints like:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("id") // should never be NULL
.isUnique("id") // should not contain duplicates
But I want to load the .isComplete("id") and .isUnique("id") parts from a CSV file, so that the business can add the constraints and we can run the tests based on their input:
val verificationResult = VerificationSuite()
.onData(data)
.addChecks(Seq(checks))
.run()
I've managed to get the constraints from suggestionResult.constraintSuggestions:
val allConstraints = suggestionResult.constraintSuggestions
.flatMap { case (_, suggestions) => suggestions.map { _.constraint }}
.toSeq
which gives a List like, for example:
allConstraints = List(CompletenessConstraint(Completeness(id,None)), ComplianceConstraint(Compliance('id' has no negative values,id >= 0,None)))
However, that list is generated from suggestionResult.constraintSuggestions; I want to be able to create a list like that based on the inputs from a CSV file. Can anyone help me?
To sum things up: Basically I just want to add:
val checks = Check(CheckLevel.Error, "unit testing my data")
.isComplete("columnName1")
.isUnique("columnName1")
.isComplete("columnName2")
dynamically, based on a file that contains, for example:
columnName;isUnique;isComplete (header)
columnName1;true;true
columnName2;false;true
It depends on how complicated you want to allow the constraints to be. In general, deequ allows you to use arbitrary Scala code for the validation function of a constraint, so it's difficult (and dangerous from a security perspective) to load that from a file.
I think you would have to come up with your own schema and semantics for the CSV file; at least, that is not directly supported in deequ.
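As a rough sketch of what that could look like for your example layout (columnName;isUnique;isComplete): read the rule file with Spark and fold its rows into a Check. The helper name buildChecksFromCsv and the file path are just placeholders, not anything deequ provides.
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: reads a semicolon-separated rule file with the columns
// columnName;isUnique;isComplete and folds each row into a single Check.
def buildChecksFromCsv(spark: SparkSession, path: String): Check = {
  val rules = spark.read
    .option("header", "true")
    .option("delimiter", ";")
    .csv(path)   // for an ORC table you could use spark.read.orc(...) instead
    .collect()   // the rule file is assumed to be small

  rules.foldLeft(Check(CheckLevel.Error, "unit testing my data")) { (check, row) =>
    val column     = row.getAs[String]("columnName")
    val isUnique   = row.getAs[String]("isUnique").toBoolean
    val isComplete = row.getAs[String]("isComplete").toBoolean

    val withCompleteness = if (isComplete) check.isComplete(column) else check
    if (isUnique) withCompleteness.isUnique(column) else withCompleteness
  }
}
The resulting Check then plugs into your existing verification run:
val verificationResult = VerificationSuite()
  .onData(data)
  .addChecks(Seq(buildChecksFromCsv(spark, "constraints.csv")))
  .run()
Every new kind of constraint the business should be able to express would mean adding a column to the file and a corresponding branch in the fold.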