I found the same discussion in the comments section of "Create a custom Transformer in PySpark ML", but there is no clear answer. There is also an unresolved JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-17025.
Given that the PySpark ML pipeline provides no option for saving a custom transformer written in Python, what other options are there to get it done? How can I implement the `_to_java` method in my Python class so that it returns a compatible Java object?
As of Spark 2.3.0 there's a much, much better way to do this.
Simply extend `DefaultParamsWritable` and `DefaultParamsReadable` and your class will automatically have `write` and `read` methods that will save your params and will be used by the `PipelineModel` serialization system.

The docs were not really clear, and I had to do a bit of source reading to understand this was the way that deserialization worked.
- `PipelineModel.read` instantiates a `PipelineModelReader`.
- `PipelineModelReader` loads metadata and checks if language is `'Python'`. If it's not, then the typical `JavaMLReader` is used (what most of these answers are designed for).
- Otherwise, `PipelineSharedReadWrite` is used, which calls `DefaultParamsReader.loadParamsInstance`.

`loadParamsInstance` will find `class` from the saved metadata. It will import that class and call `.load(path)` on it. You can extend `DefaultParamsReadable` and get a `load` method backed by `DefaultParamsReader.load` automatically. If you do have specialized deserialization logic you need to implement, I would look at that `load` method as a starting place.

On the opposite side:
- `PipelineModel.write` will check if all stages are Java (implement `JavaMLWritable`). If so, the typical `JavaMLWriter` is used (what most of these answers are designed for).
- Otherwise, `PipelineModelWriter` is used, which checks that all stages implement `MLWritable` and calls `PipelineSharedReadWrite.saveImpl`.
- `PipelineSharedReadWrite.saveImpl` will call `.write().save(path)` on each stage.

You can extend `DefaultParamsWritable` to get the `DefaultParamsWritable.write` method that saves metadata for your class and params in the right format. If you have custom serialization logic you need to implement, I would look at that and `DefaultParamsWriter` as a starting point.

Ok, so finally, you have a pretty simple transformer that extends Params and all your parameters are stored in the typical Params fashion:
Now we can use it:
Result: