Great Expectations: get invalid records


I'm testing Great Expectations to retrieve the records that violate my defined rules. The documentation says we can specify include_unexpected_rows or return_unexpected_index_query in the result format, but neither works for me. I'm applying the expectation to a Spark DataFrame. My code is below:

import great_expectations as ge
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset

df = spark.read.table("data_quality_test")
df_ge = SparkDFDataset(df)
result_format = {
    "result_format": "COMPLETE",
    "include_unexpected_rows": True,
}
result = df_ge.expect_column_values_to_be_in_type_list("page_title", ["DateType"], result_format=result_format)
print(result)

Could anyone help me figure out what the problem might be?

Best answer, by James

I think there are two things going on in your example:

  1. To get complete rows back, you need an expectation that evaluates individual rows. In Spark, expect_column_values_to_be_in_type_list only checks the type of the column as a whole, so there are no per-row results to return.
  2. You have to use the newer GX datasource API to get complete rows. It's a bit more verbose (I understand that is being addressed in the upcoming 1.0 API), but it would look like this (note that I switched to expect_column_values_to_be_in_set so that the check runs row-wise):
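import great_expectations as gx

# `spark` is assumed to be an existing SparkSession (e.g. in a notebook).
context = gx.get_context()
# Register a Spark datasource and an in-memory dataframe asset.
asset = context.sources.add_spark("spark").add_dataframe_asset("data_quality_test")
df = spark.read.table("data_quality_test")

# Build a validator over a batch that wraps the dataframe.
validator = context.get_validator(batch_request=asset.build_batch_request(dataframe=df))
result_format = {
    "result_format": "COMPLETE",
    "include_unexpected_rows": True,
}
# expect_column_values_to_be_in_set is evaluated per row, so unexpected rows can be returned.
result = validator.expect_column_values_to_be_in_set("page_title", ["foo"], result_format=result_format)
print(result)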
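
To illustrate the first point, here is a minimal plain-Python sketch of why a column-level check has no rows to hand back while a row-wise check does. It is not Great Expectations code: the sample rows, the function names (column_type_check, values_in_set_check), and the keys of the returned dictionaries are all made up for illustration, and "StringType" is only an assumed type for page_title.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

# Hypothetical sample data standing in for the data_quality_test table.
rows: List[Row] = [
    {"id": 1, "page_title": "foo"},
    {"id": 2, "page_title": "bar"},
    {"id": 3, "page_title": "foo"},
]


def column_type_check(declared_type: str, allowed_types: Iterable[str]) -> Dict[str, Any]:
    """Column-level check: inspects only the declared column type, never the rows."""
    return {
        "success": declared_type in set(allowed_types),
        "observed_type": declared_type,
    }


def values_in_set_check(data: List[Row], column: str, allowed: Iterable[Any]) -> Dict[str, Any]:
    """Row-level check: evaluates every row, so the failing rows can be collected."""
    allowed_set = set(allowed)
    unexpected = [row for row in data if row[column] not in allowed_set]
    return {"success": not unexpected, "unexpected_rows": unexpected}


# The type check fails as a whole, but there is no per-row information in its result.
type_result = column_type_check("StringType", ["DateType"])

# The set check fails too, and it can say exactly which rows were responsible.
set_result = values_in_set_check(rows, "page_title", ["foo"])

print(type_result)
print(set_result)
```

The type check only ever sees one value for the whole column (its declared type), so a failure cannot be traced to particular records. The set membership check visits each row, which is what makes returning the offending rows possible.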