How to Dynamically Create a Spark ML Pipeline


I am building a simple web service where a user can easily construct a Spark ML pipeline in the UI and persist it, so that the user can later retrieve the saved pipeline and start training it.

Here is the idea:

  1. User can select a model and evaluator and fill in the parameters to form a pipeline in the UI
  2. User can specify the output directory to save this pipeline
  3. User hits save and the pipeline will be created under the specified directory

After brainstorming, I came up with the following implementation idea:

  1. Spin up two servers: a web server and a Spark cluster
  2. When the user hits the save button, export the user-defined pipeline metadata in JSON format and send it to the Spark cluster (a sketch of such a payload follows this list)
  3. The Spark cluster takes the JSON and instantiates the pipeline in the SparkContext
  4. Save the instantiated pipeline under the specified directory using Spark ML persistence
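
For step 2, there is no off-the-shelf schema for pipeline metadata, so the payload format has to be invented. Below is a minimal sketch assuming a made-up schema (`stages`, `className`, `params`, `outputDir`), parsed with json4s, which ships with Spark:

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Hypothetical payload the web UI could send on "save"; the schema is invented.
val payload =
  """{
    |  "stages": [
    |    {"className": "org.apache.spark.ml.feature.Tokenizer",
    |     "params": {"inputCol": "text", "outputCol": "words"}},
    |    {"className": "org.apache.spark.ml.classification.LogisticRegression",
    |     "params": {"maxIter": 10, "regParam": 0.01}}
    |  ],
    |  "outputDir": "/models/my-pipeline"
    |}""".stripMargin

case class StageMeta(className: String, params: Map[String, Any])
case class PipelineMeta(stages: List[StageMeta], outputDir: String)

implicit val formats: Formats = DefaultFormats

// Parse the incoming JSON into plain case classes on the Spark side.
val meta = parse(payload).extract[PipelineMeta]
```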

The challenge I am facing now is how to convert and export the pipeline metadata into JSON, and consequently how to parse that JSON and instantiate a pipeline from it in Spark (steps 2 and 3).

I believe I could write a simple converter and parser myself, but I am wondering if there are any libraries or frameworks I can use to get started.
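
I am not aware of a library for exactly this, but a hand-rolled version of steps 3 and 4 can stay fairly small: reflection to instantiate each stage plus Spark's public `Param` API to apply the values. This sketch builds on the hypothetical `StageMeta`/`PipelineMeta` above and glosses over type coercion (json4s, for example, yields `BigInt` for JSON integers, while `maxIter` is an `Int` param):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.param.ParamMap

def buildPipeline(meta: PipelineMeta): Pipeline = {
  val stages = meta.stages.map { s =>
    // Every Spark ML stage has a no-arg constructor, so reflection suffices.
    val stage = Class.forName(s.className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[PipelineStage]
    // Look each param up by name and apply it via the public copy(ParamMap);
    // real code must coerce the JSON values to the declared param types first.
    val paramMap = s.params.foldLeft(ParamMap.empty) { case (pm, (name, value)) =>
      pm.put(stage.getParam(name), value)
    }
    stage.copy(paramMap)
  }
  new Pipeline().setStages(stages.toArray)
}

// Step 4: persist the untrained pipeline with Spark ML persistence
// (requires an active SparkSession).
buildPipeline(meta).write.overwrite().save(meta.outputDir)
```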


Update

Because there is no code involved on the front end, I cannot use Spark's ML persistence or MLeap.

There are two answers below.

Answer 1

If you use Spark ML's own persistence format when saving the JSON from the web server, you can just load it and that will create the pipeline. Looking at the serialized JSON and the code that generates it, doing so seems straightforward.
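
To illustrate the suggestion (a minimal sketch with made-up paths and stages): saving an untrained Pipeline via Spark ML persistence writes a directory containing the pipeline metadata as JSON plus one entry per stage, and loading it back is a one-liner. If the web server can emit JSON in that on-disk layout, the Spark side reduces to `Pipeline.load`:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF

// Build and save an untrained pipeline (assumes an active SparkSession).
val pipeline = new Pipeline().setStages(Array(
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10)
))
pipeline.write.overwrite().save("/tmp/demo-pipeline")

// The save call lays out a `metadata` folder holding the pipeline JSON and a
// `stages` folder with one sub-directory per stage. If the web server writes
// JSON in exactly this layout, reconstruction is a single call:
val restored = Pipeline.load("/tmp/demo-pipeline")
```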

Answer 2

Take a look at MLeap; it supports most Spark ML pipeline feature transformers and estimators. You have the option to serialize to JSON, or to Protobuf for really large models (e.g., a random forest).
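
For reference, a hedged sketch of MLeap's Spark serialization, pieced together from its documentation; the exact imports and builder methods vary across MLeap versions, `model` and `trainingDf` are assumed to already exist, and the bundle path is made up. Note that MLeap serializes a fitted PipelineModel:

```scala
import ml.combust.bundle.BundleFile
import ml.combust.bundle.serializer.SerializationFormat
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._ // scala-arm's `managed`, used in MLeap's own examples

// `model` is a fitted PipelineModel; the bundle context captures the output
// schema from a transformed DataFrame.
implicit val context: SparkBundleContext =
  SparkBundleContext().withDataset(model.transform(trainingDf))

for (bundle <- managed(BundleFile("jar:file:/tmp/pipeline.zip"))) {
  // Pick SerializationFormat.Json, or SerializationFormat.Protobuf for very
  // large models such as a big random forest.
  model.writeBundle.format(SerializationFormat.Json).save(bundle).get
}
```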