I found the same discussion in the comments section of "Create a custom Transformer in PySpark ML", but there is no clear answer. There is also an unresolved JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-17025.
Given that the PySpark ML pipeline provides no option for saving a custom transformer written in Python, what are the other options to get it done? How can I implement the `_to_java` method in my Python class so that it returns a compatible Java object?
As of Spark 2.3.0 there's a much, much better way to do this.
Simply extend `DefaultParamsWritable` and `DefaultParamsReadable` and your class will automatically have `write` and `read` methods that will save your params and will be used by the `PipelineModel` serialization system.

The docs were not really clear, and I had to do a bit of source reading to understand this was the way that deserialization worked.
- `PipelineModel.read` instantiates a `PipelineModelReader`.
- `PipelineModelReader` loads metadata and checks if language is `'Python'`. If it's not, then the typical `JavaMLReader` is used (what most of these answers are designed for).
- Otherwise, `PipelineSharedReadWrite` is used, which calls `DefaultParamsReader.loadParamsInstance`.

`loadParamsInstance` will find `class` from the saved metadata. It will look up that class and call `.load(path)` on it. You can extend `DefaultParamsReadable` and get a `load` method backed by `DefaultParamsReader.load` automatically. If you do have specialized deserialization logic you need to implement, I would look at that `load` method as a starting place.

On the opposite side:
- `PipelineModel.write` will check if all stages are Java (implement `JavaMLWritable`). If so, the typical `JavaMLWriter` is used (what most of these answers are designed for).
- Otherwise, `PipelineModelWriter` is used, which checks that all stages implement `MLWritable` and calls `PipelineSharedReadWrite.saveImpl`.
- `PipelineSharedReadWrite.saveImpl` will call `.write().save(path)` on each stage.

You can extend `DefaultParamsWritable` to get the `DefaultParamsWritable.write` method that saves metadata for your class and params in the right format. If you have custom serialization logic you need to implement, I would look at that and `DefaultParamsWriter` as a starting point.

Ok, so finally, you have a pretty simple transformer that extends `Params` and all your parameters are stored in the typical `Params` fashion:
Now we can use it:
Result: