I just started working with Great Expectations and PySpark, so please bear with me if I have done something wrong. What I have done so far is read an Excel file and apply a couple of Great Expectations quality checks to it. The sample data looks like this:
| Col1 | Col2 |
|---|---|
| val1 | |
| val2 | |
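In case it's easier to reproduce without the file, this is roughly the in-memory equivalent of the sheet; I'm assuming pandas reads the two blank Col2 cells as NaN:

```python
import numpy as np
import pandas as pd

# Rough in-memory equivalent of sampledata.xlsx: read_excel
# returns the empty Col2 cells as NaN (float) values
df_pandas = pd.DataFrame({"Col1": ["val1", "val2"], "Col2": [np.nan, np.nan]})
```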
I'm using the following snippet to read the file and apply the expect_column_values_to_not_be_null method:
```python
from pyspark.sql import SparkSession
import pandas as pd
from great_expectations.dataset import SparkDFDataset

# Create a SparkSession
spark = SparkSession.builder.appName("ReadExcel").getOrCreate()

# Read the Excel file into a pandas dataframe
df_pandas = pd.read_excel("sampledata.xlsx", sheet_name='sheet')

# Convert the pandas dataframe to a Spark dataframe and wrap it
# in a SparkDFDataset so Great Expectations can validate it
df_spark = spark.createDataFrame(df_pandas)
dfForSparkFromGe = SparkDFDataset(df_spark)

mandatory_cols = [
    "Col1",
    "Col2",
]

def check_not_null_for_mandatory_cols(cols):
    for col in cols:
        try:
            check = dfForSparkFromGe.expect_column_values_to_not_be_null(col)
            if check.success:
                print(f"no null values found for {col}")
            else:
                raise Exception(
                    f"{check.result['unexpected_count']} of "
                    f"{check.result['element_count']} are null for {col}: FAILED")
        except AssertionError as e:
            print(e)

check_not_null_for_mandatory_cols(mandatory_cols)
```
From this I'm expecting the program to throw an exception, since there are no values for Col2 in the sample dataset. Even when I inspect the dfForSparkFromGe dataframe, I can see the missing values as NaN, so based on the docs it should throw an exception.
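To narrow it down, here is a quick check of whether Spark sees those values as SQL nulls or as floating-point NaNs (this assumes Col2 comes through as a numeric column, since isnan only works on numeric types):

```python
from pyspark.sql.functions import col, isnan

# Compare SQL nulls with floating-point NaNs in Col2;
# Spark treats these as two different things
null_count = df_spark.filter(col("Col2").isNull()).count()
nan_count = df_spark.filter(isnan(col("Col2"))).count()
print(f"nulls: {null_count}, NaNs: {nan_count}")
```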
Am I missing something here?