I have a PySpark notebook running on AWS EMR. In my specific case, I want to use pyspark2pmml to create PMML for a model I just trained. However, I get the following error (when running pyspark2pmml.PMMLBuilder, though I don't think that matters):
JPMML-SparkML not found on classpath
Traceback (most recent call last):
File "/tmp/1623111492721-0/lib/python3.7/site-packages/pyspark2pmml/__init__.py", line 14, in __init__
raise RuntimeError("JPMML-SparkML not found on classpath")
RuntimeError: JPMML-SparkML not found on classpath
I know that this is caused by my Spark session not having a reference to the needed class. What I don't know is how to start a Spark session with that class available. I found one other answer that used %%configure -f, but that changed other settings, which in turn kept me from using sc.install_pypi_package, which I also needed.
Is there a way that I could have started the Spark session with that JPMML class available, but without changing any other settings?
So, here's an answer, but not the one I want.
To add that class to the classpath, I can start my work with this:
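The configuration cell I mean here would look roughly like the following. This is a sketch: the Maven coordinates are the real ones for JPMML-SparkML (org.jpmml:jpmml-sparkml), but the version shown is a placeholder and must be matched to your Spark version.

```
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.jpmml:jpmml-sparkml:1.5.13"
    }
}
```

Because %%configure -f restarts the session with only the settings given, this is exactly what clobbers the other defaults mentioned below.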
That creates the issue I referenced above, where I lose the ability to run sc.install_pypi_package. However, I can add that package in a more manual way. The first step was to create a zip file of just the needed modules, using the source from the project's GitHub (in this case, just the pyspark2pmml directory instead of the whole repository). Then that module can be added using sc.addPyFile.
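The manual packaging step might look something like this. The helper name and paths are my own illustration; the key detail is that the package directory itself must be the top-level entry inside the zip so the import resolves on the executors.

```python
import zipfile
from pathlib import Path

def zip_module(src_dir, out_zip):
    """Zip a single Python package directory so Spark can ship it to executors."""
    src = Path(src_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*.py")):
            # Keep the package name as the top-level folder inside the zip,
            # so `import pyspark2pmml` works after sc.addPyFile().
            zf.write(path, path.relative_to(src.parent))

# Hypothetical usage on the driver (paths are placeholders):
# zip_module("pyspark2pmml", "pyspark2pmml.zip")
# sc.addPyFile("pyspark2pmml.zip")  # or an s3:// path reachable from the cluster
```

sc.addPyFile accepts local, HDFS, or S3 paths, so the zip can also be staged in S3 once and reused across sessions.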
After this, I can run the original commands exactly as expected.