PyDeequ Integration with PySpark: Error 'JavaPackage' object is not callable


I'm trying to integrate PyDeequ with PySpark in my Streamlit application to run data quality checks on a CSV file. I want to use PyDeequ to test completeness, correctness, uniqueness, outliers, and date validity. However, the analysis fails with the error 'JavaPackage' object is not callable. Below are the relevant code snippet, the error message with its traceback, and the specific tests I'm trying to perform:

import streamlit as st
from pyspark.sql import SparkSession
from pydeequ import AnalysisRunner
from pydeequ.analyzers import Completeness

def create_spark_session():
    return SparkSession.builder.appName("DataQualityCheck").getOrCreate()

def read_csv_data(spark, uploaded_file):
    df = spark.read.csv(uploaded_file, header=True, inferSchema=True)
    return df

def main():
    st.title("Data Quality Checker")
    uploaded_file = st.file_uploader("Choose a CSV file:", key="csv_uploader", type="csv")
    if uploaded_file is not None:
        spark = create_spark_session()
        df = read_csv_data(spark, uploaded_file)
        analysis_runner = AnalysisRunner(spark)
        analysis_result = analysis_runner.onData(df).addAnalyzer(Completeness("MRN")).run()
        completeness_results = analysis_result['Completeness']
        
        completeness_mrn = completeness_results['MRN']
        completeness_percent_mrn = completeness_mrn['completeness']
        missing_count_mrn = completeness_mrn['count']
        
if __name__ == "__main__":
    main()

TypeError: 'JavaPackage' object is not callable
Traceback:
File "E:\Deequ\pydeequ_env\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 542, in _run_script
    exec(code, module.__dict__)
File "E:\data_quality.py", line 43, in <module>
    completeness_mrn = completeness_results['MRN']
File "E:\Deequ\pydeequ_env\lib\site-packages\pydeequ\analyzers.py", line 52, in onData
    return AnalysisRunBuilder(self._spark_session, df)
File "E:\Deequ\pydeequ_env\lib\site-packages\pydeequ\analyzers.py", line 124, in __init__
    self._AnalysisRunBuilder = self._jvm.com.amazon.deequ.analyzers.runners.AnalysisRunBu

Data Quality Tests:

  1. Completeness: Ensure that certain columns (e.g., "MRN" and "Date of Admission") have complete data.
  2. Correctness: Verify that data in specific columns adheres to the required format or correctness rules (e.g., "MRN" format correctness).
  3. Uniqueness: Check if certain columns contain unique values (e.g., "MRN" uniqueness).
  4. Outlier Detection: Identify any outliers in numerical columns (e.g., "Billing Amount").
  5. Future Dates: Ensure that dates in a certain column (e.g., "Date of Admission") are not in the future.
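
Most of the tests listed above map onto PyDeequ's constraint API rather than onto individual analyzers. The following is a rough sketch under several assumptions: the 8-digit MRN rule is invented purely for illustration, the helper names build_check, run_checks, and mrn_looks_valid are hypothetical, and the exact behaviour of satisfies without an explicit assertion may differ between PyDeequ versions. Outlier detection is left out because it does not correspond to a single built-in constraint here; it would need a rule of your own, for example a bound on a maximum or a quantile.

```python
import re

# Hypothetical rule for illustration only: an MRN is exactly 8 digits.
MRN_PATTERN = r"^[0-9]{8}$"

# Spark SQL predicate; backticks quote the column name that contains spaces.
NO_FUTURE_DATES = "`Date of Admission` <= current_date()"


def mrn_looks_valid(value):
    """Local sanity check of the pattern, independent of Spark."""
    return re.match(MRN_PATTERN, value) is not None


def build_check(spark):
    # Imported lazily so this module loads even before Spark is configured.
    from pydeequ.checks import Check, CheckLevel

    return (
        Check(spark, CheckLevel.Warning, "Data quality checks")
        .isComplete("MRN")                       # 1. completeness
        .isComplete("Date of Admission")
        .hasPattern("MRN", MRN_PATTERN)          # 2. format correctness
        .isUnique("MRN")                         # 3. uniqueness
        .satisfies(NO_FUTURE_DATES, "admission date not in future")  # 5.
    )


def run_checks(spark, df):
    from pydeequ.verification import VerificationResult, VerificationSuite

    result = VerificationSuite(spark).onData(df).addCheck(build_check(spark)).run()
    # One row per constraint, with its status and message.
    return VerificationResult.checkResultsAsDataFrame(spark, result)
```

None of this will run until the session itself can load the Deequ classes, so the 'JavaPackage' error has to be resolved first.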

I have installed PyDeequ version 1.2.0 and downgraded PySpark to version 3.3.1 in my environment. Could someone please help me understand why I'm encountering this error and how to resolve it?
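
A common cause of this error is that the Deequ JAR is not on the Spark classpath. The session in create_spark_session is built without spark.jars.packages, so the com.amazon.deequ path seen in the traceback would resolve to an empty JavaPackage instead of a class. Below is a minimal sketch of a session configured along the lines of the PyDeequ README. It assumes that pydeequ reads the SPARK_VERSION environment variable when it is imported, and the helper names deequ_spark_conf and completeness_metrics are made up for illustration. Treat it as a likely fix under those assumptions, not a confirmed diagnosis.

```python
import os


def deequ_spark_conf(deequ_coord, f2j_coord):
    """Return the Spark settings that put the Deequ JAR on the classpath."""
    return {
        "spark.jars.packages": deequ_coord,
        "spark.jars.excludes": f2j_coord,
    }


def create_spark_session(spark_version="3.3"):
    # Assumption: pydeequ uses SPARK_VERSION to pick the matching Deequ build.
    os.environ.setdefault("SPARK_VERSION", spark_version)

    # Imported lazily so SPARK_VERSION is set before pydeequ is loaded.
    import pydeequ
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("DataQualityCheck")
    conf = deequ_spark_conf(pydeequ.deequ_maven_coord, pydeequ.f2j_maven_coord)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def completeness_metrics(spark, df, column="MRN"):
    # AnalysisRunner lives in pydeequ.analyzers.
    from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness

    result = AnalysisRunner(spark).onData(df).addAnalyzer(Completeness(column)).run()
    # The result is read back as a DataFrame of metrics, not indexed like a dict.
    return AnalyzerContext.successMetricsAsDataFrame(spark, result)
```

Two caveats apply. The JAR settings only take effect when the JVM starts, so if a session was already created in the same process without them, getOrCreate may hand back that session unchanged and the Streamlit process would need a restart. Separately, st.file_uploader returns an in-memory file object rather than a path, so the upload probably has to be written to a temporary file before spark.read.csv can read it.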
