I am building a simple web service where a user can construct a Spark ML pipeline in the UI and persist it, so that the user can later retrieve the saved pipeline and start training it.
Here is the idea:
- User can select a model and evaluator and fill in the parameters to form a pipeline in the UI
- User can specify the output directory to save this pipeline
- User hits save and the pipeline will be created under the specified directory
After brainstorming, I came up with the following implementation idea:
- Spin up two servers: a web server and a Spark cluster
- When the user hits the save button, export the user-defined pipeline metadata in JSON format and send it to the Spark cluster
- The Spark cluster takes the JSON and instantiates the pipeline in a SparkContext
- Save the instantiated pipeline to the specified directory by using Spark ML Persistence
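To make the second step concrete, here is a minimal sketch of the metadata payload the web server could send. The field names (`outputDir`, `stages`, `className`, `params`) are my own assumptions for illustration, not an established schema:

```python
import json

# Hypothetical metadata describing a user-defined pipeline.
# Field names here are assumptions, not a standard Spark format.
pipeline_metadata = {
    "outputDir": "/models/my-pipeline",
    "stages": [
        {
            "className": "org.apache.spark.ml.classification.LogisticRegression",
            "params": {"maxIter": 10, "regParam": 0.01},
        },
        {
            "className": "org.apache.spark.ml.evaluation.BinaryClassificationEvaluator",
            "params": {"metricName": "areaUnderROC"},
        },
    ],
}

# Serialize for the request body sent from the web server to the Spark cluster.
payload = json.dumps(pipeline_metadata)

# The Spark-side service parses it back before instantiating the stages.
parsed = json.loads(payload)
print(parsed["stages"][0]["className"])
```

The point is just that a flat "class name plus parameter map" per stage is enough information for the Spark side to reconstruct each stage.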
The challenge I am facing now is how to convert and export the pipeline metadata to JSON, and consequently how to parse that JSON and instantiate a pipeline in Spark (steps 2 and 3).
I believe I could write a simple converter and parser myself, but I am wondering if there are any libraries or frameworks I can use to get started.
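For the hand-written parser option, the Spark-side half can be a small reflection helper: read the class name from the JSON and instantiate it with its params as keyword arguments (PySpark ML estimators accept their params that way). The `{"className": ..., "params": ...}` schema is my own assumption; the demo below uses a standard-library class so the snippet runs without a Spark installation:

```python
import importlib
import json

def instantiate_stage(stage_json: str):
    """Instantiate an object from JSON metadata by reflection.

    Expects {"className": "module.path.ClassName", "params": {...}};
    this schema is an assumption for illustration, not a standard format.
    """
    meta = json.loads(stage_json)
    module_path, class_name = meta["className"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    # PySpark ML classes such as pyspark.ml.classification.LogisticRegression
    # take their params as keyword arguments, so the same call shape applies.
    return cls(**meta["params"])

# Demonstrated with a standard-library class so it runs without Spark:
obj = instantiate_stage(
    '{"className": "collections.OrderedDict", "params": {"a": 1}}'
)
print(obj)
```

With the real class names, the instantiated stages would then be passed to `Pipeline(stages=[...])` and saved with Spark ML persistence.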
Update
Because there is no Spark code involved on the front end, I cannot use Spark's ML persistence or MLeap.
If you emit JSON in Spark ML's own persistence format from the web server, you can simply load it on the cluster and that will create the pipeline. Looking at the serialized JSON and the code that generates it, it seems straightforward to do.
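To illustrate: Spark ML persistence stores each saved object's metadata as a single JSON line under `<path>/metadata/part-00000`, with fields like `class`, `timestamp`, `sparkVersion`, `uid`, and `paramMap`. The sketch below writes that layout by hand; the field set mirrors what I understand Spark's `DefaultParamsWriter` emits, but verify it against the Spark version you deploy before relying on it (the `uid` value here is hypothetical):

```python
import json
import os
import tempfile
import time

def write_stage_metadata(path, class_name, uid, param_map, spark_version="2.4.0"):
    """Write metadata in the shape Spark ML persistence uses.

    Sketch only: mirrors the fields DefaultParamsWriter appears to emit;
    double-check against your Spark version's serialized output.
    """
    metadata = {
        "class": class_name,
        "timestamp": int(time.time() * 1000),
        "sparkVersion": spark_version,
        "uid": uid,
        "paramMap": param_map,
    }
    meta_dir = os.path.join(path, "metadata")
    os.makedirs(meta_dir, exist_ok=True)
    # Spark stores the metadata as one JSON line in part-00000.
    with open(os.path.join(meta_dir, "part-00000"), "w") as f:
        f.write(json.dumps(metadata))
    return metadata

out_dir = tempfile.mkdtemp()
meta = write_stage_metadata(
    out_dir,
    "org.apache.spark.ml.classification.LogisticRegression",
    "logreg_example",  # hypothetical uid
    {"maxIter": 10, "regParam": 0.01},
)
```

The easiest way to confirm the exact format is to save a small pipeline with `pipeline.save(path)` once and inspect the files it produces, then generate matching JSON from the web server.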