How do you configure the environment to submit a PyDeequ job to Spark/YARN (client mode) from a Jupyter notebook? There is no comprehensive explanation other than those that assume an AWS-managed environment. How do you set up the environment for use outside AWS?
Simply following an example such as Testing data quality at scale with PyDeequ raises errors like TypeError: 'JavaPackage' object is not callable.
from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("review_id")) \
    .addAnalyzer(ApproxCountDistinct("review_id")) \
    .addAnalyzer(Mean("star_rating")) \
    .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0")) \
    .addAnalyzer(Correlation("total_votes", "star_rating")) \
    .addAnalyzer(Correlation("total_votes", "helpful_votes")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_499599/1388970492.py in <module>
1 from pydeequ.analyzers import *
----> 2 analysisResult = AnalysisRunner(spark) \
3 .onData(df) \
4 .addAnalyzer(Size()) \
5 .addAnalyzer(Completeness("review_id")) \
~/home/repository/git/oonisim/aws/venv/lib/python3.8/site-packages/pydeequ/analyzers.py in onData(self, df)
50 """
51 df = ensure_pyspark_df(self._spark_session, df)
---> 52 return AnalysisRunBuilder(self._spark_session, df)
53
54
~/home/repository/git/oonisim/aws/venv/lib/python3.8/site-packages/pydeequ/analyzers.py in __init__(self, spark_session, df)
122 self._jspark_session = spark_session._jsparkSession
123 self._df = df
--> 124 self._AnalysisRunBuilder = self._jvm.com.amazon.deequ.analyzers.runners.AnalysisRunBuilder(df._jdf)
125
126 def addAnalyzer(self, analyzer: _AnalyzerObject):
TypeError: 'JavaPackage' object is not callable
HADOOP_CONF_DIR
Copy the contents of
$HADOOP_HOME/etc/hadoop
from the Hadoop/YARN master node to the local host and set the HADOOP_CONF_DIR
environment variable to point to that directory, as in the sketch below.
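For example, in the notebook before the Spark session is created (a minimal sketch; ~/hadoop/conf is an assumed location for the copied files):

import os

# Assumed local copy of the cluster's Hadoop configuration files
# (core-site.xml, yarn-site.xml, hdfs-site.xml, ...) from the master node.
os.environ["HADOOP_CONF_DIR"] = os.path.expanduser("~/hadoop/conf")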
PYTHONPATH

pyspark
Python needs to be able to load the pyspark modules. Install pyspark with pip or conda, which also installs the Spark runtime libraries (for standalone use). Alternatively, use the pyspark modules under
$SPARK_HOME/python/lib
from the Spark installation, as sketched below.
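If pyspark was not installed with pip or conda, one way to make the bundled modules importable is to point Python at the Spark installation (a sketch; /opt/spark is an assumed install location, adjust to your environment):

import glob
import os
import sys

# Assumed Spark installation directory.
os.environ["SPARK_HOME"] = "/opt/spark"

# Make the pyspark package and the bundled py4j zip importable.
sys.path.append("/opt/spark/python")
sys.path.extend(glob.glob("/opt/spark/python/lib/py4j-*.zip"))

import pyspark  # should now resolve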
PyDeequ

Install pydeequ with pip or conda. Note that installing the Python package alone is not enough to use PyDeequ; the Deequ JAR is also required (see below).
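For example, from a notebook cell (using Jupyter's ! shell escape, shown commented out so the cell stays plain Python):

# !pip install pydeequ

import pydeequ  # confirms only that the Python package is importable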
Deequ JAR files
Add the Deequ JAR to the library path. To use PyDeequ, you need the Deequ JAR file. Download the build matching your Spark/Deequ version from the Maven repository com.amazon.deequ.
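As one option, fetch the JAR directly from Maven Central (a sketch; 1.2.2-spark-3.0 is an example version, pick the build that matches your Spark):

import os
import urllib.request

# Example Deequ build for Spark 3.0; substitute the one matching your Spark.
version = "1.2.2-spark-3.0"
jar = f"deequ-{version}.jar"
url = f"https://repo1.maven.org/maven2/com/amazon/deequ/deequ/{version}/{jar}"

dest = os.path.expanduser(f"~/jars/{jar}")
os.makedirs(os.path.dirname(dest), exist_ok=True)
urllib.request.urlretrieve(url, dest)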
Spark Session
Specify the Deequ JAR file in the Spark jar properties (e.g. spark.jars) when creating the Spark session.
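A minimal sketch for YARN client mode, assuming the JAR downloaded in the previous step (some PyDeequ releases can instead resolve the coordinate themselves via pydeequ.deequ_maven_coord, reading the SPARK_VERSION environment variable):

import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")  # client deploy mode is the default from a notebook
    .appName("pydeequ-yarn-example")
    # Ship the Deequ JAR to the cluster; path from the download step above.
    .config("spark.jars", os.path.expanduser("~/jars/deequ-1.2.2-spark-3.0.jar"))
    # Alternative: let Spark resolve the JAR from Maven:
    # .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    # .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)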
Deequ Job
This uses an excerpt of the Amazon product review data.
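With the session configured as above, the analyzer example from the question runs without the TypeError. A sketch, assuming a local Parquet excerpt (the file name is a placeholder):

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size, Completeness

# Placeholder path to a local excerpt of the Amazon product review data.
df = spark.read.parquet("amazon_reviews_excerpt.parquet")

result = (
    AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("review_id"))
    .run()
)
AnalyzerContext.successMetricsAsDataFrame(spark, result).show()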