How can I embed H2O in a Java application?

I am trying to start an embedded H2O instance in a Java application and train a model. However, I don't understand what exactly the documentation explains (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/java.html). Can anyone help me by providing an example?

Thanks,

There is 1 best solution below

The critical thing to understand here is whether you really want to train a model in your application or just want to score one. Most people will initially just want to score a model.

SCORING

Scoring is easy and natural. See the MOJO and POJO Javadoc API here:

Follow the pattern shown in the javadoc to use the Easy API. A snippet of the relevant code is included below:

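// These classes come from the h2o-genmodel library.
import hex.genmodel.MojoModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;
EasyPredictModelWrapper model = new EasyPredictModelWrapper(MojoModel.load("GBM_model.zip"));  // exported MOJO file
RowData row = new RowData();
row.put("AGE", "68");  // column name -> value for the row being scored
...
BinomialModelPrediction p = model.predictBinomial(row);  // use this call for a binomial model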

SCORING AND SAVING FOR DEFERRED TRAINING

What many people will do is score in their live application and also save new data (somewhere) for deferred training. They then train models offline and push them into production again for scoring. This is a pretty typical model lifecycle which is easy to understand and manage.
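A minimal sketch of that lifecycle on the application side, using only the Java standard library. The class name `ScoreAndStore`, the CSV layout, and the `Function` standing in for the real model are all hypothetical; in a real application the scorer would wrap something like the `EasyPredictModelWrapper` call shown above, and the "somewhere" might be a database or a message queue rather than a file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Scores a row, then appends the row and its score to a CSV file for offline training. */
class ScoreAndStore {
    // Hypothetical stand-in for the real model (for example a MOJO behind the Easy API).
    private final Function<Map<String, String>, Double> scorer;
    private final Path store;
    private final List<String> columns;

    ScoreAndStore(Function<Map<String, String>, Double> scorer, Path store, List<String> columns) {
        this.scorer = scorer;
        this.store = store;
        this.columns = columns;
    }

    double scoreAndSave(Map<String, String> row) throws IOException {
        double score = scorer.apply(row);

        // One CSV line per scored row: the columns in a fixed order, then the score.
        // Missing values are written as empty fields. No CSV escaping is done here.
        StringBuilder line = new StringBuilder();
        for (String column : columns) {
            line.append(row.getOrDefault(column, "")).append(',');
        }
        line.append(score).append(System.lineSeparator());

        Files.write(store, line.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return score;
    }
}
```

The saved file only holds what the application knew at scoring time. The true outcome for each row usually arrives later and has to be joined in before the data can be used for retraining.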

TRAINING

Embedding H2O inside your application for actual training is more involved.

If I were going to embed H2O, I would do it one of two ways:

Well-supported Option 1. Start an H2O instance as a separate process (or set of processes in the distributed case) and communicate with it using R or Python.

The well documented APIs for H2O are the R API and the Python API. (There is also a REST API with lots of generated documentation, but I would not consider that particularly easy to use.)
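If the Java application itself should be the one to start that separate process, a rough sketch with `ProcessBuilder` could look like this. It assumes an `h2o.jar` has been downloaded to a known path and uses H2O's `-port` and `-name` command-line options; the class name `H2OLauncher` and the values passed in are placeholders, not anything H2O ships.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Builds and starts a standalone H2O node as a child process (sketch only). */
class H2OLauncher {

    // Assembles the equivalent of: java -jar h2o.jar -port <port> -name <cloudName>
    static List<String> buildCommand(String jarPath, int port, String cloudName) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.add("-port");
        command.add(Integer.toString(port));
        command.add("-name");
        command.add(cloudName);
        return command;
    }

    // Starts the process and forwards H2O's console output to this application's console.
    static Process start(String jarPath, int port, String cloudName) throws IOException {
        return new ProcessBuilder(buildCommand(jarPath, port, cloudName))
                .inheritIO()
                .start();
    }

    public static void main(String[] args) {
        // Only prints the command; call start(...) to actually launch H2O.
        System.out.println(String.join(" ", buildCommand("h2o.jar", 54321, "embedded-cloud")));
    }
}
```

Once the node is up, an R or Python client can connect to it on the chosen port, and the Java application only has to manage the process lifetime (for example by calling `destroy()` on the returned `Process` at shutdown).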

You will find lots of documentation and examples at:

Well-supported Option 2. Write a Spark application and use Sparkling Water and Scala or PySparkling and Python.

This doesn't require much Spark, since the H2O embedded inside Sparkling Water doesn't rely on the Spark side at all. The Scala and Python APIs for Sparkling Water are well documented. The Sparkling Water User Guide is a good place to start for this:

... And then here are other options which are harder:

(Harder) Option 3. You can include H2O as a Maven dependency and call it directly from Java.

The biggest problem here is that the Java API is not well documented, and you won't find friendly examples of how to use it. The best documentation for the Java API is the source code itself and the unit tests (search for 'test' directories) inside the h2o-3 project on GitHub here:

(Harder) Option 4. Some people have called H2O directly from the REST API.

I wouldn't recommend this because it's difficult, but if you want to try, the best way to learn how to use the REST API is to turn on logging from R and look at the message payloads between the R client and H2O:

# R program.
library(h2o)
h2o.init()
h2o.startLogging()
h2o.importFile("test.csv")
...
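Once the logged payloads show which endpoints the R client calls, the same requests can be issued from Java. The sketch below only queries the cluster status, assuming an H2O node is already listening on `localhost:54321` and exposes the `/3/Cloud` endpoint; the class name `H2ORestClient` is made up for illustration, and `java.net.http` requires Java 11 or later. Training a model involves a longer sequence of calls (import, parse, build, poll the job) that is not shown here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Minimal REST call against a running H2O node (sketch only). */
class H2ORestClient {

    // Builds a GET request for the cluster status endpoint (assumed here to be /3/Cloud).
    static HttpRequest cloudStatusRequest(String host, int port) {
        URI uri = URI.create("http://" + host + ":" + port + "/3/Cloud");
        return HttpRequest.newBuilder(uri).GET().build();
    }

    // Sends the request and returns the raw JSON body as a string.
    static String fetchCloudStatus(String host, int port) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(cloudStatusRequest(host, port), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The response is plain JSON, so any JSON library can parse it. The hard part of this option is working out the right sequence of requests and parameters, which is what the logged R traffic helps with.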