LightGBM model dumped in PMML format gives different predictions from the original model


I want to use a scikit-learn model in PySpark.

I trained a LightGBM model (through its scikit-learn API) that outputs probabilities for a propensity problem. I then converted this model to PMML format like this:

from sklearn2pmml import sklearn2pmml
sklearn2pmml(trained_model, 'prod_trained_model.pmml')

Then, in PySpark, I load the PMML model like this:

from pypmml_spark import ScoreModel
model_pipeline = ScoreModel.fromFile('prod_trained_model.pmml')

Then I make predictions like this ('features_df' is a PySpark DataFrame):

predictions_df = model_pipeline.transform(features_df)

The problem is that the PMML model's predictions do not match those of the original model. The predicted probabilities are shifted by 5% to 10%.

Also, for around 5% of the rows in the input DataFrame, the PMML model outputs NaN probabilities, whereas the original model predicts fine for exactly the same rows.


There is 1 best solution below


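This is a common problem with missing (null / NaN) values.

LightGBM handles missing values natively, but a PMML scoring engine does not necessarily process nulls the same way as your original model. That can show up as shifted probabilities or as NaN outputs for rows that contain missing values.
This is why you should impute missing values explicitly with .fillna(), using the same fill values before training and before scoring, so that both models see identical inputs.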
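
As a rough illustration of the imputation step, here is a minimal pandas sketch. The column names and fill values are hypothetical and are not taken from the question. It only shows how to fill missing values from one shared dictionary and how to flag the rows that had missing inputs, which are the rows worth comparing first between the two models.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and fill values, chosen only for illustration.
FILL_VALUES = {"age": -1.0, "income": 0.0}


def impute(features: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy with missing values in the listed columns replaced."""
    return features.fillna(value=fill_values)


# Toy training features with a few missing entries.
train_features = pd.DataFrame({
    "age": [34.0, np.nan, 51.0],
    "income": [52000.0, 61000.0, np.nan],
})

# Flag rows that contain at least one missing value before imputing.
rows_with_missing = train_features.isna().any(axis=1)

# Impute with the shared dictionary; train the model on this frame.
imputed = impute(train_features, FILL_VALUES)

print(rows_with_missing.tolist())
print(imputed)
```

On the Spark side, the same dictionary could be passed to the DataFrame's own fillna method before calling transform, for example features_df.fillna(FILL_VALUES), assuming the column names match. The model would need to be retrained on the imputed data and exported again, otherwise the original model and the PMML model are still scoring different inputs.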