How to add to classpath of running PySpark session

I have a PySpark notebook running in AWS EMR. In my specific case, I want to use pyspark2pmml to create a PMML file for a model I just trained. However, I get the following error (it happens when running pyspark2pmml.PMMLBuilder, but I don't think the specific call matters).

JPMML-SparkML not found on classpath
Traceback (most recent call last):
  File "/tmp/1623111492721-0/lib/python3.7/site-packages/pyspark2pmml/__init__.py", line 14, in __init__
    raise RuntimeError("JPMML-SparkML not found on classpath")
RuntimeError: JPMML-SparkML not found on classpath
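
For context, the call that triggers this is roughly the standard PMMLBuilder usage; the DataFrame and fitted PipelineModel names below are just placeholders for my own objects:

from pyspark2pmml import PMMLBuilder

# training_df and pipeline_model stand in for my own training data and fitted PipelineModel
pmml_builder = PMMLBuilder(sc, training_df, pipeline_model)  # raises the RuntimeError above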

I know that this is caused by my Spark session not having a reference to the needed class on its classpath. What I don't know is how to start a Spark session with that class available. I found one other answer that uses %%configure -f, but it changed other settings, which in turn kept me from using sc.install_pypi_package, which I also needed.

Is there a way that I could have started the Spark session with that JPMML class available, but without changing any other settings?

1 Answer

So, here's an answer, but not the one I want.

To add that class to the classpath, I can start my work with this:

%%configure -f
{
    "jars": [
        "{some_path_to_s3}/jpmml-sparkml-executable-1.5.13.jar"
    ]
}

That creates the issue I referenced above, where I lose the ability to use sc.install_pypi_package. However, I can add that package in a more manual way. The first step was to create a zip file of just the needed modules from the project's GitHub repository (in this case, just the pyspark2pmml directory, rather than the whole repository zip); a sketch of building that zip is shown after the addPyFile call below. Then that module can be added using sc.addPyFile:

sc.addPyFile('{some_path_to_s3}/pyspark2pmml.zip')
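
For reference, here is a minimal sketch of how that zip can be built locally before uploading it to S3, assuming a local clone of the pyspark2pmml repository (the directory names are placeholders):

import shutil

# Archive only the pyspark2pmml package directory so it sits at the root of the zip,
# which is what sc.addPyFile expects; 'pyspark2pmml-master' is a placeholder for the local clone
shutil.make_archive('pyspark2pmml', 'zip', root_dir='pyspark2pmml-master', base_dir='pyspark2pmml')

The resulting pyspark2pmml.zip then gets uploaded to the S3 path passed to sc.addPyFile.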

After this, I can run the original commands exactly as I expected.
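
For completeness, those commands look roughly like the standard pyspark2pmml usage; training_df and pipeline_model are placeholders for my own DataFrame and fitted PipelineModel:

from pyspark2pmml import PMMLBuilder

# With the JPMML-SparkML jar on the classpath and pyspark2pmml.zip added via addPyFile,
# the builder now initializes and can write the model out as PMML
pmml_builder = PMMLBuilder(sc, training_df, pipeline_model)
pmml_builder.buildFile("model.pmml")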